A baseline model pairs a simple algorithm with minimal preprocessing, so that later experiments have a reference point to beat. This notebook assumes the following:
- **Categorical Encoding**: One-hot encoding (OHE)
- **Scaling**: None
- **Feature Selection**: None
- **Imputation**: Mean (numerical) and mode (categorical)
- **Algorithm**: Linear Regression, compared against several other regressors with default hyperparameters (the target, `listingPrice`, is continuous, so this is a regression task)
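
The baseline configuration listed above can also be sketched as a single scikit-learn pipeline. This is a minimal illustration rather than the notebook's own code: the toy DataFrame, its column names (`odometer`, `fuelType`, `price`), and the helper `build_baseline` are hypothetical, and `handle_unknown="ignore"` is an added assumption so that categories unseen during fitting do not raise an error. One benefit of wrapping the steps in a pipeline is that the imputers and the encoder are fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_baseline(numerical_cols, categorical_cols):
    """Hypothetical helper: mean/mode imputation, one-hot encoding, linear regression."""
    numeric = SimpleImputer(strategy="mean")
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numerical_cols),
        ("cat", categorical, categorical_cols),
    ])
    # No scaling and no feature selection, matching the baseline assumptions
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Toy data, made up for illustration, with a missing value in each feature
toy = pd.DataFrame({
    "odometer": [10000, 52000, np.nan, 31000, 78000, 45000, 12000, 60000],
    "fuelType": ["Petrol", "Diesel", "Petrol", np.nan, "Diesel", "Petrol", "CNG", "Diesel"],
    "price": [650000, 480000, 560000, 530000, 390000, 500000, 620000, 450000],
})
X = toy[["odometer", "fuelType"]]
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

baseline = build_baseline(["odometer"], ["fuelType"])
baseline.fit(X_train, y_train)          # imputers and encoder see training rows only
predictions = baseline.predict(X_test)  # test rows are transformed with training statistics
print(predictions.shape)
```

Swapping `LinearRegression()` for any other estimator in the final step keeps the preprocessing identical, which is what makes the comparison across models fair.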

In [14]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import mlflow
import mlflow.sklearn
import dagshub

In [2]:
df = pd.read_excel(r"D:\campusx_dsmp2\9. MLOps revisited\cars24_mlops_project\experiment\cars24_v3.xlsx")

In [3]:
dagshub.init(repo_owner='iamprashantjain', repo_name='MLOps_UsedCarPricePrediction', mlflow=True)
mlflow.set_tracking_uri("https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow")
mlflow.set_experiment('BaseLine Model')

<Experiment: artifact_location='mlflow-artifacts:/7ff86105bf8e4e4588ac9923aa613f86', creation_time=1745938738264, experiment_id='1', last_update_time=1745938738264, lifecycle_stage='active', name='BaseLine Model', tags={}>

In [15]:
# Define available models
regression_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "DecisionTreeRegressor": DecisionTreeRegressor(),
    "RandomForestRegressor": RandomForestRegressor(),
    "GradientBoostingRegressor": GradientBoostingRegressor(),
    "SVR": SVR()
}

# --- Preprocessing ---
# Assume df is already loaded
numerical_cols = ['listingPrice', 'odometer', 'fitnessAge', 'cashDownPayment', 'emiStartingValue',
                  'emiEndingValue', 'roiMinDiscounted', 'roiMaxDiscounted', 'roiMinOriginal',
                  'roiMaxOriginal', 'emiOriginalStartingValue', 'emiOriginalEndingValue', 'featureCount', 'avgEmi']

categorical_cols = ['make', 'model', 'variant', 'year', 'transmissionType', 'bodyType', 'fuelType',
                    'ownership', 'color', '360DegreeCamera', 'AlloyWheels', 'AppleCarplayAndroidAuto',
                    'Bluetooth', 'CruiseControl', 'GpsNavigation', 'InfotainmentSystem', 'LeatherSeats',
                    'ParkingAssist', 'PushButtonStart', 'RearAc', 'SpecialRegNo', 'Sunroof/Moonroof',
                    'TopModel', 'Tpms', 'VentilatedSeats']

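# Imputation
# Note: the imputers are fitted on the full dataset, before the train/test split,
# so test-set statistics influence the imputed values. numerical_cols also includes
# the target (listingPrice), so any missing prices are mean-imputed as well.
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])

categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])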

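# Encoding
# scikit-learn >= 1.2 names this argument `sparse_output`; the older `sparse`
# keyword was removed in 1.4. Use `sparse=False` only on scikit-learn < 1.2.
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_categorical = encoder.fit_transform(df[categorical_cols])
# Reuse df's index so the concat below aligns rows even if the index is not 0..n-1
encoded_df = pd.DataFrame(encoded_categorical,
                          columns=encoder.get_feature_names_out(categorical_cols),
                          index=df.index)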

# Combine features
df_encoded = pd.concat([df[numerical_cols], encoded_df], axis=1)

# Split
X = df_encoded.drop(columns=["listingPrice"])
y = df_encoded["listingPrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Main MLflow run with nested runs for each model ---
with mlflow.start_run(run_name="All_Regression_Models") as parent_run:
    mlflow.set_tag("experiment_type", "regression_comparison")
    mlflow.log_param("test_size", 0.2)
    mlflow.log_param("imputer_numerical", "mean")
    mlflow.log_param("imputer_categorical", "most_frequent")
    mlflow.log_param("encoder", "OneHotEncoder(drop='first')")

    for model_name, model in regression_models.items():
        with mlflow.start_run(run_name=model_name, nested=True):
            mlflow.log_param("model", model_name)

            # Train
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            # Evaluate
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            # Log metrics
            mlflow.log_metric("mae", mae)
            mlflow.log_metric("mse", mse)
            mlflow.log_metric("r2_score", r2)

            # Log model
            mlflow.sklearn.log_model(model, f"{model_name}_model")

            print(f"Model: {model_name}")
            print(f"  MAE: {mae:.2f}")
            print(f"  MSE: {mse:.2f}")
            print(f"  R2:  {r2:.2f}")
            print("-" * 40)



Model: LinearRegression
  MAE: 273660976.15
  MSE: 1522465778255408384.00
  R2:  -4931519.64
----------------------------------------


2025/04/29 21:24:05 INFO mlflow.tracking._tracking_service.client: 🏃 View run LinearRegression at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/d6e3917de5294cbbb08a2f88a92847b2.
2025/04/29 21:24:05 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1.


Model: Ridge
  MAE: 45748.79
  MSE: 3988014745.88
  R2:  0.99
----------------------------------------


2025/04/29 21:24:21 INFO mlflow.tracking._tracking_service.client: 🏃 View run Ridge at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/c684af42f8ac48a68679c6eb98245214.


Model: Lasso
  MAE: 88298.72
  MSE: 51654462347.84
  R2:  0.83
----------------------------------------


2025/04/29 21:24:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run Lasso at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/31ea7fc55c7c473dbf9b8e65076f8d36.


Model: ElasticNet
  MAE: 107200.91
  MSE: 44570036104.87
  R2:  0.86
----------------------------------------


2025/04/29 21:25:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run ElasticNet at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/76d2df2445074002b0a35e0cf72bd0c9.


Model: DecisionTreeRegressor
  MAE: 27147.66
  MSE: 48030197550.26
  R2:  0.84
----------------------------------------


2025/04/29 21:25:48 INFO mlflow.tracking._tracking_service.client: 🏃 View run DecisionTreeRegressor at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/23aa848619fa46fcab3bb4efcfb9971a.


Model: RandomForestRegressor
  MAE: 23698.16
  MSE: 17767087544.26
  R2:  0.94
----------------------------------------


2025/04/29 21:27:23 INFO mlflow.tracking._tracking_service.client: 🏃 View run RandomForestRegressor at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/c0e0a9ebebab4c55a47664b87fbdfb66.


Model: GradientBoostingRegressor
  MAE: 55549.68
  MSE: 26776026141.63
  R2:  0.91
----------------------------------------


2025/04/29 21:28:00 INFO mlflow.tracking._tracking_service.client: 🏃 View run GradientBoostingRegressor at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/2b78bb99edeb432b8f3ba3ae509fad3a.


Model: SVR
  MAE: 321347.51
  MSE: 326415385284.65
  R2:  -0.06
----------------------------------------


2025/04/29 21:29:52 INFO mlflow.tracking._tracking_service.client: 🏃 View run SVR at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/8e59271a5c424d2b979457b206f4d4c5.
2025/04/29 21:29:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run All_Regression_Models at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1/runs/a223e322b3914636860cc0182e08002d.
2025/04/29 21:29:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/iamprashantjain/MLOps_UsedCarPricePrediction.mlflow/#/experiments/1.
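
To help read the numbers above, the sketch below recomputes the three logged metrics by hand with NumPy on a small made-up example. It illustrates why R² can be negative (the model's squared error exceeds that of always predicting the mean of the true values) and how taking the square root of MSE gives an error in the same units as the price. The arrays `y_true`, `good_pred`, and `bad_pred` and the helper `regression_metrics` are hypothetical; the `reported_mse` values are copied from the output above.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    r2 = 1.0 - ss_res / ss_tot                       # negative when ss_res > ss_tot
    return mae, mse, r2


# Made-up prices and predictions, for illustration only
y_true = [400000, 500000, 600000]
good_pred = [410000, 490000, 605000]
bad_pred = [900000, 100000, 1200000]

for name, pred in [("good", good_pred), ("bad", bad_pred)]:
    mae, mse, r2 = regression_metrics(y_true, pred)
    print(f"{name}: MAE={mae:.2f}  RMSE={np.sqrt(mse):.2f}  R2={r2:.4f}")

# RMSE for two of the MSE values reported in the run output
reported_mse = {"Ridge": 3988014745.88, "RandomForestRegressor": 17767087544.26}
for model_name, mse in reported_mse.items():
    print(f"{model_name}: RMSE = {np.sqrt(mse):.2f}")
```

Because MSE squares each residual, a handful of very large errors can dominate it, which is one possible reason a model can rank well on MAE and poorly on MSE or R² at the same time.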
