### **Model Comparison & Feature Engineering**

This notebook walks through a machine learning workflow for predicting product price on Databricks.

It covers four tasks: training several regression models, comparing their metrics in MLflow, building a Spark ML pipeline, and selecting the best-performing model.

### Task-1 Train 3 different models

In [0]:
# load spark table
from pyspark.sql import functions as F

events = spark.table("workspace.default.silver_ecommerce_events_event_type_part")


In [0]:
# Convert column data types and derive event_date from event_time
from pyspark.sql import functions as F

events = events.withColumn("product_id", F.col("product_id").cast("long")) \
               .withColumn("user_id", F.col("user_id").cast("long")) \
               .withColumn("price", F.col("price").cast("double")) \
               .withColumn("category_id", F.col("category_id").cast("long")) \
               .withColumn("event_time", F.col("event_time").cast("timestamp")) \
               .withColumn("event_date", F.to_date("event_time"))

In [0]:
# Select the required columns and convert the Spark table to a pandas DataFrame for model building
selected_df = events.select("product_id", "user_id", "price")
display(selected_df.limit(10))

pandas_df = selected_df.toPandas()

product_id,user_id,price
1004858,522850155,131.53
1004864,515027636,115.81
1004767,550404843,245.28
1004768,557311182,245.26
1004838,564297430,141.29
1004992,512430659,231.64
1004766,563561200,244.32
1004741,540014734,189.97
1004870,547109676,281.11
1004836,533886616,223.76


In [0]:
# Select feature and label columns, split into train and test sets, then standardize the features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = pandas_df[["product_id", "user_id"]]
y = pandas_df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
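
Standardization rescales each feature to zero mean and unit variance. The following is a minimal pure-Python sketch of that computation, assuming the statistics are taken from the training rows only and then reused for the test rows. The helper names and the tiny sample of product IDs are illustrative and not part of the notebook.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) for a training column, using the population std (ddof=0)."""
    mean = fmean(column)
    std = pstdev(column)
    # A constant column has std == 0; fall back to 1.0 to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(x - mean) / std for x in column]


# Illustrative values only (a few product IDs from the sample output above).
train_product_ids = [1004858, 1004864, 1004767, 1004768]
test_product_ids = [1004838, 1004992]

mean, std = fit_standardizer(train_product_ids)
train_scaled = standardize(train_product_ids, mean, std)
test_scaled = standardize(test_product_ids, mean, std)  # reuses the training statistics
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected because it is transformed with the training statistics.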


In [0]:
# Define the models to compare
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
# from sklearn.ensemble import RandomForestRegressor  # optional third model, disabled in this run

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    # "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}


In [0]:
# Train each model and score it on the test set (score() returns R² for regressors)
results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)
    results[name] = r2
    print(f"{name} R2 score: {r2:.4f}")


linear_regression R2 score: 0.0544
decision_tree R2 score: 0.4741
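
For scikit-learn regressors, `model.score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. As a rough sketch of what that number means, the snippet below computes it by hand; the `r2_score` helper and the four sample prices are illustrative only. A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price, so the linear regression score of 0.0544 above is only slightly better than that baseline.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative prices taken from the sample rows shown earlier.
prices = [131.53, 115.81, 245.28, 245.26]
mean_price = sum(prices) / len(prices)

perfect = r2_score(prices, prices)                          # predictions equal the labels
baseline = r2_score(prices, [mean_price] * len(prices))     # always predict the mean
```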


In [0]:
# Use MLflow experiment
import mlflow
import mlflow.sklearn

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        mlflow.log_param("model_type", name)
        mlflow.log_param("features", "product_id, user_id")

        model.fit(X_train, y_train)

        r2 = model.score(X_test, y_test)

        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name=name,
            signature=mlflow.models.signature.infer_signature(X_train, model.predict(X_train))
        )

        mlflow.log_metric("r2_score", r2)
        mlflow.set_tag("model_type", name)
        mlflow.set_tag("input_sample", X_train[5])

        print(f"{name} logged with R2: {r2:.4f}")


Registered model 'linear_regression' already exists. Creating a new version of this model...
Created version '3' of model 'workspace.default.linear_regression'.


linear_regression logged with R2: 0.0544


Registered model 'decision_tree' already exists. Creating a new version of this model...
Created version '3' of model 'workspace.default.decision_tree'.


decision_tree logged with R2: 0.4741


### Task-2 Compare metrics in MLflow

Open the experiment in the MLflow tracking UI to compare the runs' metrics, parameters, and logged models side by side.

### Task-3 Build Spark ML pipeline

In [0]:
# Load spark table and select required columns
spark_df = spark.table("workspace.default.silver_ecommerce_events_event_type_part")
display(spark_df.select("product_id", "user_id", "price").limit(10))

product_id,user_id,price
1004858,522850155,131.53
1004864,515027636,115.81
1004767,550404843,245.28
1004768,557311182,245.26
1004838,564297430,141.29
1004992,512430659,231.64
1004766,563561200,244.32
1004741,540014734,189.97
1004870,547109676,281.11
1004836,533886616,223.76


In [0]:
# Convert datatypes to numeric for required columns
spark_df = spark_df.withColumn("product_id", F.col("product_id").cast("long")) \
                   .withColumn("user_id", F.col("user_id").cast("long")) \
                   .withColumn("price", F.col("price").cast("double"))

In [0]:
# Create feature vector using VectorAssembler
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["product_id", "user_id"],
    outputCol="features")

In [0]:
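# Define the Spark linear regression estimator
# (this import shadows the scikit-learn class of the same name)
from pyspark.ml.regression import LinearRegression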

lr = LinearRegression(
    featuresCol="features",
    labelCol="price")

In [0]:
# Build Spark ML Pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, lr])

In [0]:
# Split data into train and test sets
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)

# Fit pipeline on training data
pipeline_model = pipeline.fit(train_df)

# Make predictions on test data
predictions = pipeline_model.transform(test_df)

# Show the first 10 predictions next to the actual prices
display(predictions.select("product_id", "user_id", "price", "prediction").limit(10))

product_id,user_id,price,prediction
1005135,535871217,1747.79,357.3883602446577
1005073,543427258,1207.71,356.7930357673162
10900026,514080443,40.8,289.4286870265142
4802639,514808401,218.77,332.3078321524162
10800132,539194858,22.91,288.15195724562045
1005072,555447922,1044.09,355.84526557638407
1005143,517062545,1541.61,358.87128613985936
50600000,514933060,102.42,9.80243445393046
10800025,539194858,56.6,288.152710717548
1005101,539538500,450.46,357.0994503030662


In [0]:
# calculate R2 value
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="price",
    predictionCol="prediction",
    metricName="r2"
)

r2_value = evaluator.evaluate(predictions)
display({"r2_value": r2_value})

{'r2_value': 0.054417169334161186}

In [0]:
# Log the Spark linear regression pipeline with mlflow.spark.log_model;
# passing registered_model_name also registers it in the MLflow Model Registry
import mlflow
import mlflow.spark
from mlflow.models.signature import infer_signature

# Infer the signature from the pipeline's input feature columns and its predictions
signature = infer_signature(test_df.select("product_id", "user_id"), predictions.select("prediction"))

with mlflow.start_run(run_name="spark_linear_regression"):
    mlflow.log_param("model_type", "spark_linear_regression")
    mlflow.log_param("features", "product_id, user_id")
    mlflow.spark.log_model(
        spark_model=pipeline_model,
        artifact_path="model",
        registered_model_name="spark_linear_regression",
        dfs_tmpdir="/Volumes/workspace/ecommerce/ecommerce_data",  # Replace with your actual UC volume path
        signature=signature
    )
    mlflow.log_metric("r2_score", r2_value)
    mlflow.set_tag("model_type", "spark_linear_regression")
    print(f"Spark Linear Regression model logged with R2: {r2_value:.4f}")

Registered model 'spark_linear_regression' already exists. Creating a new version of this model...


Spark Linear Regression model logged with R2: 0.0544


Created version '2' of model 'workspace.default.spark_linear_regression'.


Next, build a Spark decision tree regressor for comparison.

In [0]:
# train decision tree model using spark
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml import Pipeline

# Create DecisionTreeRegressor
dt = DecisionTreeRegressor(
    featuresCol="features",
    labelCol="price",
    maxDepth=5
)

# Build pipeline with assembler and decision tree
dt_pipeline = Pipeline(stages=[assembler, dt])

# Fit pipeline on training data
dt_model = dt_pipeline.fit(train_df)

# Make predictions on test data
dt_predictions = dt_model.transform(test_df)

display(dt_predictions.select("product_id", "user_id", "price", "prediction").limit(10))

# Calculate R2 value
dt_r2 = evaluator.evaluate(dt_predictions)
display({"decision_tree_r2": dt_r2})

product_id,user_id,price,prediction
1005135,535871217,1747.79,1247.9223399252578
1005073,543427258,1207.71,356.80787772283
10900026,514080443,40.8,133.9266503171796
4802639,514808401,218.77,164.61008266537212
10800132,539194858,22.91,133.9266503171796
1005072,555447922,1044.09,356.80787772283
1005143,517062545,1541.61,1252.1573979085813
50600000,514933060,102.42,263.42773152243905
10800025,539194858,56.6,133.9266503171796
1005101,539538500,450.46,356.80787772283


{'decision_tree_r2': 0.38181019200492583}

In [0]:
# Log the Spark decision tree pipeline with mlflow.spark.log_model;
# passing registered_model_name also registers it in the MLflow Model Registry
import mlflow
import mlflow.spark
from mlflow.models.signature import infer_signature

# Infer the signature from the pipeline's input feature columns and its predictions
dt_signature = infer_signature(test_df.select("product_id", "user_id"), dt_predictions.select("prediction"))

with mlflow.start_run(run_name="spark_decision_tree_regressor"):
    mlflow.log_param("model_type", "spark_decision_tree_regressor")
    mlflow.log_param("features", "product_id, user_id")
    mlflow.spark.log_model(
        spark_model=dt_model,
        artifact_path="model",
        registered_model_name="spark_decision_tree_regressor",
        dfs_tmpdir="/Volumes/workspace/ecommerce/ecommerce_data",
        signature=dt_signature
    )
    mlflow.log_metric("r2_score", dt_r2)
    mlflow.set_tag("model_type", "spark_decision_tree_regressor")
    print(f"Spark Decision Tree Regressor model logged with R2: {dt_r2:.4f}")

Registered model 'spark_decision_tree_regressor' already exists. Creating a new version of this model...


Spark Decision Tree Regressor model logged with R2: 0.3818


Created version '2' of model 'workspace.default.spark_decision_tree_regressor'.


### Task-4 Select best model

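**Why this is the correct choice:**

The scikit-learn decision tree is the best-performing model: it achieved the highest R² score on the test set (0.4741), ahead of the Spark decision tree (0.3818) and both linear regression models (0.0544).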
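
The selection can also be made explicit in code. The snippet below is a small sketch that ranks the four runs by the R² values printed earlier in this notebook; the dictionary is typed in by hand for illustration rather than read back from MLflow.

```python
# R² scores as logged in the runs above (rounded to four decimals).
r2_by_model = {
    "linear_regression": 0.0544,
    "decision_tree": 0.4741,
    "spark_linear_regression": 0.0544,
    "spark_decision_tree_regressor": 0.3818,
}

# Sort from highest to lowest R² and take the top entry as the best model.
ranked = sorted(r2_by_model.items(), key=lambda item: item[1], reverse=True)
best_name, best_r2 = ranked[0]

for name, r2 in ranked:
    print(f"{name:<32} {r2:.4f}")
print(f"Best model: {best_name} (R2 = {best_r2:.4f})")
```

The scikit-learn and Spark models were evaluated on different random splits (`train_test_split` versus `randomSplit`), so the ranking is indicative rather than a strict like-for-like comparison.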


**The Spark pipeline provides:**

- Spark-native execution
- Feature consistency between training and inference
- Reproducibility
- Production readiness
- Easy scheduling as a Databricks Job
- Scalability as data grows (today thousands → tomorrow millions)