# Task 2 — PyCaret Regression Pipeline (Used Car Prices)

We use the **cleaned dataset** produced in Task 1 (`02_Used_Car_Prices_Cleaned.xlsx`).  
Cleaning already done:
- Standardized column names and **renamed target** to `Price`
- Extracted numerics from `Mileage`, `Engine`, `Power` (removed units like *kmpl*, *CC*, *bhp*)
- Cast `Seats`, `Year`, `Kilometers_Driven` to numeric
- Normalized `Location` (strip + title case), cast all categoricals
- Added `Brand` from `Brand_Model`
- Median imputation for numeric, mode imputation for categoricals
- Saved as `data/02_Used_Car_Prices_Cleaned.xlsx`

Now, we’ll:
1) Engineer a few helpful features (e.g., **Age**),  
2) Initialize **PyCaret** with strong preprocessing,  
3) Compare, **tune**, and **ensemble** top models via 10-fold CV,  
4) Evaluate on the unseen 20% hold-out split,  
5) **Save the full pipeline** for Task 3, and  
6) Ensure **MLflow** logging is enabled (as required).  

In [25]:
import warnings; warnings.filterwarnings("ignore")

import json
import os
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd

# PyCaret regression API (setup, compare_models, tune_model, ...)
from pycaret.regression import *

## Load cleaned dataset + confirmation

In [2]:
# Load the cleaned file from Task 1
df = pd.read_excel("../data/02_Used_Car_Prices_Cleaned.xlsx")

print("Shape:", df.shape)
display(df.head())

# Quick type check (should already be clean)
df.info()

Shape: (6019, 13)


Unnamed: 0,Brand_Model,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Brand
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6,998,58.16,5,175000.0,Maruti
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67,1582,126.2,5,1250000.0,Hyundai
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2,1199,88.7,5,450000.0,Honda
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77,1248,88.76,7,600000.0,Maruti
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2,1968,140.8,5,1774000.0,Audi


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Brand_Model        6019 non-null   object 
 1   Location           6019 non-null   object 
 2   Year               6019 non-null   int64  
 3   Kilometers_Driven  6019 non-null   int64  
 4   Fuel_Type          6019 non-null   object 
 5   Transmission       6019 non-null   object 
 6   Owner_Type         6019 non-null   object 
 7   Mileage            6019 non-null   float64
 8   Engine             6019 non-null   int64  
 9   Power              6019 non-null   float64
 10  Seats              6019 non-null   int64  
 11  Price              6019 non-null   float64
 12  Brand              6019 non-null   object 
dtypes: float64(3), int64(4), object(6)
memory usage: 611.4+ KB


## Extra feature engineering

In [3]:
# Age of car = current year - Year (floor at 0)
this_year = datetime.now().year
df["Age"] = (this_year - df["Year"]).clip(lower=0)

# Optional: log transform helper column for km (kept as numeric feature)
df["Log_Km"] = np.log1p(df["Kilometers_Driven"])

# (Optional) target sanity check
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
df = df.dropna(subset=["Price"]).reset_index(drop=True)

df[["Price","Year","Age","Kilometers_Driven","Log_Km","Mileage","Engine","Power","Seats"]].describe()

Unnamed: 0,Price,Year,Age,Kilometers_Driven,Log_Km,Mileage,Engine,Power,Seats
count,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0
mean,947946.8,2013.358199,11.641801,58738.38,10.758812,18.134966,1620.509221,112.883539,5.27679
std,1118792.0,3.269742,3.269742,91268.84,0.715736,4.581528,599.635458,53.283701,0.806346
min,44000.0,1998.0,6.0,171.0,5.147494,0.0,72.0,34.2,0.0
25%,350000.0,2011.0,9.0,34000.0,10.434145,15.17,1198.0,78.0,5.0
50%,564000.0,2014.0,11.0,53000.0,10.878066,18.15,1493.0,97.7,5.0
75%,995000.0,2016.0,14.0,73000.0,11.198228,21.1,1969.0,138.03,5.0
max,16000000.0,2019.0,27.0,6500000.0,15.687313,33.54,5998.0,560.0,10.0
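
Because `Age` is derived from `datetime.now().year`, the same car gets a different `Age` depending on when the notebook runs (the minimum `Age` of 6 above reflects the run year). A minimal sketch of one way to pin this down for Task 3, assuming a hypothetical helper `add_features` and a hypothetical `reference_year` argument, neither of which is part of the original notebook:

```python
import numpy as np
import pandas as pd


def add_features(frame: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Return a copy of `frame` with Age and Log_Km added (hypothetical helper)."""
    out = frame.copy()
    # Age relative to a fixed year, floored at 0 as in the cell above
    out["Age"] = (reference_year - out["Year"]).clip(lower=0)
    # log1p compresses the long right tail of the odometer readings
    out["Log_Km"] = np.log1p(out["Kilometers_Driven"])
    return out


# Made-up rows, for illustration only
demo = pd.DataFrame({"Year": [2010, 2030], "Kilometers_Driven": [0, 72000]})
features = add_features(demo, reference_year=2025)
print(features)
```

Passing the year explicitly keeps training and inference consistent, as long as the same value is stored with the model.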


# MLflow
- Data: `../data/02_Used_Car_Prices_Cleaned.xlsx` (from Task 1)
- We’ll engineer `Age` & `Log_Km`, build a PyCaret regression pipeline,
  compare/tune/ensemble models, and log artifacts/metrics to MLflow **manually**
  (works with MLflow 3.x).
- Experiment name: **usedcar_prices**

In [26]:
# Reproducibility
np.random.seed(42)

## MLflow tracking (local folder)

In [16]:
# Use a project-local mlruns folder
project_root = Path.cwd().resolve()
tracking_dir = (project_root / "../mlruns").resolve()
tracking_dir.mkdir(parents=True, exist_ok=True)

mlflow_uri = tracking_dir.as_uri()     # file:///C:/.../mlruns (Windows-safe)
os.environ["MLFLOW_TRACKING_URI"] = mlflow_uri
mlflow.set_tracking_uri(mlflow_uri)

EXP_NAME = "usedcar_prices"
if mlflow.get_experiment_by_name(EXP_NAME) is None:
    mlflow.create_experiment(EXP_NAME)
mlflow.set_experiment(EXP_NAME)

print("Tracking URI:", mlflow.get_tracking_uri())
print("Experiment:", EXP_NAME)

Tracking URI: file:///C:/Users/keaga_0idqj9o/OneDrive/Desktop/School/Y3T2/MLOps/Assignment/usedcar_project/mlruns
Experiment: usedcar_prices
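
The cell above builds the tracking URI with `Path.as_uri()` rather than by string concatenation. A small sketch of what that call does, using a hypothetical folder name (`mlruns_demo`) that is not part of the project:

```python
from pathlib import Path

# Hypothetical folder name; resolve() makes the path absolute,
# which as_uri() requires (a relative path raises ValueError)
tracking_dir = Path("mlruns_demo").resolve()
uri = tracking_dir.as_uri()

# The result is a file:// URI; the exact value depends on the machine
print(uri)
```

On Windows this yields the `file:///C:/...` form shown in the output above, so the same code works across operating systems.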


## PyCaret setup (no internal MLflow logging)

In [18]:
numeric_features = [
    "Year","Age","Kilometers_Driven","Log_Km",
    "Mileage","Engine","Power","Seats"
]
categorical_features = [
    "Fuel_Type","Transmission","Owner_Type",
    "Location","Brand_Model","Brand"
]

reg_exp = setup(
    data=df,
    target="Price",
    session_id=42,
    train_size=0.8,
    fold=10,
    fold_shuffle=True,

    # 1) Robust preprocessing for skew/outliers
    normalize=True,
    normalize_method="robust",
    transformation=True,                 # transform X (not y)
    transformation_method="quantile",

    # 2) Outliers & multicollinearity
    remove_outliers=True,
    outliers_method="iforest",
    outliers_threshold=0.02,             # drop ~2% of training rows flagged by Isolation Forest
    remove_multicollinearity=True,
    multicollinearity_threshold=0.90,

    # 3) Feature selection (model-based)
    feature_selection=True,
    feature_selection_method="classic",
    feature_selection_estimator="lightgbm",
    n_features_to_select=0.6,            # keep ~60% best features

    # 4) Encoding controls
    max_encoding_ohe=50,                 # cap OHE width
    rare_to_value=0.01,                  # merge levels <1% freq
    rare_value="Other",

    # 5) Helpful binning for non-linearities
    bin_numeric_features=["Age", "Kilometers_Driven"],

    # 6) Imputation (explicit)
    numeric_imputation="median",
    categorical_imputation="mode",

    # 7) Lock schema
    numeric_features=numeric_features,
    categorical_features=categorical_features,

    # We log to MLflow manually (avoid PyCaret↔MLflow conflict)
    log_experiment=False,
    verbose=True,
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000154 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 950
[LightGBM] [Info] Number of data points in the train set: 4718, number of used features: 42
[LightGBM] [Info] Start training from score 940764.518864


Unnamed: 0,Description,Value
0,Session id,42
1,Target,Price
2,Target type,Regression
3,Original data shape,"(6019, 15)"
4,Transformed data shape,"(5922, 9)"
5,Transformed train set shape,"(4718, 9)"
6,Transformed test set shape,"(1204, 9)"
7,Numeric features,8
8,Categorical features,6
9,Preprocess,True
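
The row counts in this summary can be reproduced by hand. The sketch below assumes the 20% hold-out share is rounded up to a whole row, which matches the reported shapes; the variable names are ours, not PyCaret's:

```python
import math

n_rows = 6019                    # original data shape
transformed_train_rows = 4718    # transformed train set shape
transformed_test_rows = 1204     # transformed test set shape

# 80/20 split, assuming the hold-out share is rounded up
test_rows = math.ceil(0.2 * n_rows)
train_rows = n_rows - test_rows

# Rows dropped from the training split by outlier removal
removed = train_rows - transformed_train_rows
removed_share = removed / train_rows

print(train_rows, test_rows, removed, round(removed_share, 4))
```

The hold-out split keeps all of its rows, which suggests that outlier removal is applied to the training rows only.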


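- Robust scaling plus a quantile transform is applied to the features (not the target), which suits skewed price drivers such as `Kilometers_Driven`.
- Isolation Forest removes about 2% of the training rows (the summary shows 6,019 rows shrinking to 5,922), so only the most anomalous listings are dropped.
- The multicollinearity threshold of 0.90 is stricter than 0.95 and targets correlated features such as `Engine`, `Power` and `Log_Km`.
- Model-based feature selection with LightGBM keeps at most 60% of the starting features; the transformed shape above has 9 columns, i.e. 8 features plus the target.
- Binning `Age` and `Kilometers_Driven` turns them into discrete intervals, so models see stepwise effects.
- Rare-category merging (levels under 1% frequency become `Other`) keeps the encoded width manageable for columns such as `Brand_Model` and `Location`.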
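
To make the rare-category step concrete, here is a pure-Python illustration of the idea behind `rare_to_value=0.01`. It is a sketch of the concept, not PyCaret's implementation, and `merge_rare` is a hypothetical helper:

```python
from collections import Counter


def merge_rare(values, min_share=0.01, rare_value="Other"):
    """Replace levels whose relative frequency is below min_share (illustrative only)."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_share else rare_value for v in values]


# Made-up column: two common levels and one level seen once in 200 rows (0.5%)
column = ["Maruti Swift"] * 150 + ["Honda City"] * 49 + ["Bentley Flying Spur"]
merged = merge_rare(column)
print(Counter(merged))
```

Only the level below the 1% threshold is rewritten; the common levels pass through unchanged.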

## Compare Models

In [19]:
top_models = compare_models(n_select=5, sort="R2")
leaderboard_compare = pull()
leaderboard_compare

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
catboost,CatBoost Regressor,153922.0448,129602150192.8826,343869.076,0.9022,0.2542,0.1967,0.758
xgboost,Extreme Gradient Boosting,156893.4031,140751117516.8,360050.4406,0.8932,0.2608,0.1999,0.249
lightgbm,Light Gradient Boosting Machine,162653.6012,139448973222.7112,360547.7546,0.8932,0.272,0.2034,0.271
et,Extra Trees Regressor,161997.187,144667446008.3153,364541.1528,0.8905,0.2555,0.2042,0.275
rf,Random Forest Regressor,160659.3745,144907284522.9275,366650.5263,0.8896,0.2532,0.2004,0.318
gbr,Gradient Boosting Regressor,185636.2605,151380688186.2406,379467.6363,0.8826,0.2894,0.2391,0.252
knn,K Neighbors Regressor,204706.1422,211435830476.8,449364.3344,0.8371,0.2962,0.2442,0.205
dt,Decision Tree Regressor,202847.2071,252239485625.6871,492084.891,0.7991,0.3223,0.2526,0.197
ada,AdaBoost Regressor,546613.0085,461077176723.5908,669890.9037,0.6313,0.8779,1.3784,0.234
ridge,Ridge Regression,428383.2267,487254263478.5092,692851.4226,0.6151,0.8139,0.8209,0.199


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
catboost,CatBoost Regressor,153922.0448,129602200000.0,343869.1,0.9022,0.2542,0.1967,0.758
xgboost,Extreme Gradient Boosting,156893.4031,140751100000.0,360050.4,0.8932,0.2608,0.1999,0.249
lightgbm,Light Gradient Boosting Machine,162653.6012,139449000000.0,360547.8,0.8932,0.272,0.2034,0.271
et,Extra Trees Regressor,161997.187,144667400000.0,364541.2,0.8905,0.2555,0.2042,0.275
rf,Random Forest Regressor,160659.3745,144907300000.0,366650.5,0.8896,0.2532,0.2004,0.318
gbr,Gradient Boosting Regressor,185636.2605,151380700000.0,379467.6,0.8826,0.2894,0.2391,0.252
knn,K Neighbors Regressor,204706.1422,211435800000.0,449364.3,0.8371,0.2962,0.2442,0.205
dt,Decision Tree Regressor,202847.2071,252239500000.0,492084.9,0.7991,0.3223,0.2526,0.197
ada,AdaBoost Regressor,546613.0085,461077200000.0,669890.9,0.6313,0.8779,1.3784,0.234
ridge,Ridge Regression,428383.2267,487254300000.0,692851.4,0.6151,0.8139,0.8209,0.199


### Model Comparison Insights

From the model leaderboard, we can draw several key observations:

1. **Top Performing Models**  
   - **CatBoost Regressor** achieved the **best performance** with  
     - **R² = 0.90** (explains ~90% of the variance in car prices)  
     - **MAE ≈ 154k INR**, meaning predictions miss the actual selling price by about 1.5 lakh on average.  
   - **XGBoost** and **LightGBM** follow closely with R² around **0.89**, so the gap to CatBoost under cross-validation is small.  
   - These gradient-boosting tree ensembles are well-suited for tabular data with mixed numeric and categorical features.

2. **Strong Baseline Models**  
   - **Extra Trees** and **Random Forest** also perform competitively (R² ≈ 0.89), showing that tree ensembles in general capture the key price-driving patterns.  
   - **Gradient Boosting Regressor (sklearn)** lags slightly (R² ≈ 0.88), but still provides a good baseline.

3. **Weaker / Linear Models**  
   - Linear-based models (Ridge, Lasso, Elastic Net, Bayesian Ridge, Linear Regression) plateau around **R² = 0.61**; Ridge, the only one within the ten rows shown above, scores 0.6151. This suggests that **used car pricing is highly non-linear** and cannot be captured well with simple linear assumptions.  
   - These models also show much higher MAE (> 400k INR; Ridge: ≈ 428k), making them less reliable.

4. **Poor Performers**  
   - **AdaBoost** has the worst MAE in the table (≈ 547k INR) and a MAPE above 100%, even though its R² (0.63) is on par with Ridge. **Huber, Passive Aggressive, OMP** and the **Dummy baseline** rank below the ten rows shown. All of these have R² < 0.65 and are unsuitable for the task.

5. **Interpretation**  
   - The best models (CatBoost, XGBoost, LightGBM) are tree-based ensemble learners that:  
     - Handle complex interactions between features (e.g., brand × engine power × location).  
     - Are robust to outliers and non-linear relationships.  
     - Can handle categorical variables natively (especially CatBoost), although in this pipeline PyCaret encodes the categoricals before the model sees them.  

---
**Conclusion**: CatBoost is the current champion and will likely be selected as the final model. However, XGBoost and LightGBM remain strong contenders and should be considered for blending or stacking in the final pipeline.
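
To put the CatBoost MAE in context, it helps to relate it to typical prices in the dataset. The arithmetic below uses only figures already reported above (the leaderboard MAE and the `describe()` mean and median of `Price`); the variable names are ours:

```python
mae = 153_922.0448         # CatBoost CV MAE from the leaderboard (INR)
mean_price = 947_946.8     # mean Price from describe()
median_price = 564_000.0   # median Price from describe()

mae_lakh = mae / 100_000   # 1 lakh = 100,000 INR
share_of_mean = mae / mean_price
share_of_median = mae / median_price

print(f"{mae_lakh:.2f} lakh, {share_of_mean:.1%} of mean, {share_of_median:.1%} of median")
```

Relative to the median car, the average miss is noticeably larger than it looks against the mean, because a few expensive cars pull the mean price up.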


## Tuning the top models

In [20]:
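# Tune each of the top-5 models; tune_model returns the original model when tuning does not improve CV R2
tuned_models = [tune_model(m, optimize="R2") for m in top_models]
# Note: pull() only returns the CV grid of the last model tuned, not a combined leaderboard
leaderboard_tune = pull()
leaderboard_tune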

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,187981.347,122386002152.6833,349837.108,0.8938,0.3108,0.2552
1,198324.0789,235505352852.6586,485288.9375,0.8401,0.2869,0.2238
2,194882.3836,109834600725.6721,331413.0364,0.9208,0.303,0.2428
3,199803.6119,138585574456.7012,372270.8348,0.8448,0.415,0.2679
4,179223.6075,98024871439.4442,313089.2388,0.9288,0.2889,0.2349
5,186429.5551,110697849556.3288,332712.8635,0.8967,0.3363,0.2621
6,198553.0226,323072675425.52,568394.8235,0.8063,0.3361,0.2668
7,207833.2433,185195512352.9718,430343.4818,0.8546,0.3811,0.2616
8,186639.8431,138770995308.8971,372519.7918,0.8574,0.3273,0.2594
9,197565.4206,104707726447.1791,323585.7328,0.9179,0.3654,0.2676


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,156920.5312,87108452352.0,295141.4062,0.9244,0.2569,0.2052
1,172217.5312,254284627968.0,504266.4375,0.8273,0.2334,0.179
2,185673.4062,119069851648.0,345065.0,0.9142,0.2783,0.2287
3,157690.25,91052736512.0,301749.4688,0.898,0.2626,0.2136
4,152331.4844,95415402496.0,308893.8438,0.9307,0.247,0.1966
5,154037.3281,92287188992.0,303788.0625,0.9139,0.2564,0.1946
6,173632.9531,329153085440.0,573718.625,0.8027,0.2673,0.2116
7,175320.3125,117970796544.0,343468.7812,0.9074,0.2554,0.2099
8,149012.7344,71807451136.0,267969.125,0.9262,0.2667,0.2178
9,165293.2344,92996362240.0,304953.0625,0.927,0.306,0.2346


Fitting 10 folds for each of 10 candidates, totalling 100 fits


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,147505.8449,85654437050.5715,292667.793,0.9257,0.2435,0.1898
1,178318.7008,276986122644.221,526294.7108,0.8119,0.2437,0.1748
2,183692.0797,124032074213.0112,352181.8766,0.9106,0.26,0.2142
3,161070.93,114303394646.4894,338087.8505,0.872,0.271,0.2103
4,145419.2683,78801334075.3254,280715.7532,0.9428,0.2362,0.1877
5,162636.7859,94669468325.292,307684.0398,0.9116,0.2633,0.1991
6,175697.4059,329089234618.6413,573662.9974,0.8027,0.2809,0.2348
7,168798.1369,128676203968.6,358714.6554,0.899,0.2737,0.1948
8,144655.9689,67911899144.859,260599.1158,0.9302,0.2658,0.2098
9,160501.6917,94526815481.4672,307452.1353,0.9258,0.3346,0.2163


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,263025.2851,249310717042.2098,499310.2413,0.7836,0.4026,0.3812
1,242869.7996,362962061638.0772,602463.328,0.7536,0.3459,0.302
2,275898.4085,283166825158.728,532134.2172,0.7959,0.3856,0.3616
3,207886.3924,117230906135.0284,342389.9913,0.8687,0.3434,0.3125
4,252479.2425,274001431240.4878,523451.4603,0.8011,0.3627,0.3297
5,243365.2662,214939963310.7125,463616.181,0.7994,0.39,0.3461
6,259291.0947,468803618156.5866,684692.3529,0.7189,0.3806,0.3578
7,282378.1879,248716195528.2556,498714.5431,0.8048,0.4087,0.3911
8,220510.4652,126130008587.3736,355147.8686,0.8704,0.3785,0.3595
9,264344.3721,226935017991.2459,476376.9705,0.822,0.4179,0.3931


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,156700.2396,102778491050.3046,320590.8468,0.9108,0.2596,0.2066
1,175837.4062,297075947195.7447,545046.7385,0.7983,0.237,0.1789
2,195648.4271,178460089721.8668,422445.3689,0.8714,0.2585,0.2134
3,160295.1841,95037306147.7933,308281.2128,0.8936,0.2555,0.2076
4,152476.1454,91399549687.0635,302323.5844,0.9336,0.243,0.1971
5,161236.2069,105482454479.3359,324780.6252,0.9016,0.2614,0.2006
6,179271.0559,350434641162.3024,591975.2032,0.7899,0.2744,0.2228
7,176696.5709,133044133555.9455,364752.1536,0.8956,0.2704,0.2203
8,143317.7217,62770724396.9971,250540.8637,0.9355,0.2616,0.2108
9,176224.8215,106350454690.2349,326114.1743,0.9166,0.284,0.2358


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,156700.2396,102778500000.0,320590.8468,0.9108,0.2596,0.2066
1,175837.4062,297075900000.0,545046.7385,0.7983,0.237,0.1789
2,195648.4271,178460100000.0,422445.3689,0.8714,0.2585,0.2134
3,160295.1841,95037310000.0,308281.2128,0.8936,0.2555,0.2076
4,152476.1454,91399550000.0,302323.5844,0.9336,0.243,0.1971
5,161236.2069,105482500000.0,324780.6252,0.9016,0.2614,0.2006
6,179271.0559,350434600000.0,591975.2032,0.7899,0.2744,0.2228
7,176696.5709,133044100000.0,364752.1536,0.8956,0.2704,0.2203
8,143317.7217,62770720000.0,250540.8637,0.9355,0.2616,0.2108
9,176224.8215,106350500000.0,326114.1743,0.9166,0.284,0.2358
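
The "Original model was better than the tuned model" messages can be checked against the fold tables. The sketch below averages the per-fold R² of the first two tables and compares the result with the untuned leaderboard. It assumes the tables appear in leaderboard order (CatBoost first, XGBoost second), which the notebook output does not state explicitly:

```python
from statistics import mean

# Untuned CV R2 from the compare_models leaderboard
untuned_r2 = {"catboost": 0.9022, "xgboost": 0.8932}

# Per-fold R2 of the tuned candidates, copied from the first two fold tables
tuned_fold_r2 = {
    "catboost": [0.8938, 0.8401, 0.9208, 0.8448, 0.9288,
                 0.8967, 0.8063, 0.8546, 0.8574, 0.9179],
    "xgboost": [0.9244, 0.8273, 0.9142, 0.898, 0.9307,
                0.9139, 0.8027, 0.9074, 0.9262, 0.927],
}

kept = {}
for name, folds in tuned_fold_r2.items():
    tuned_mean = mean(folds)
    # Keep whichever version has the higher mean CV R2
    kept[name] = "tuned" if tuned_mean > untuned_r2[name] else "original"
    print(f"{name}: tuned {tuned_mean:.4f} vs untuned {untuned_r2[name]:.4f} -> {kept[name]} kept")
```

Under that assumption, tuning lowered CatBoost's mean R², so the untuned model is what ends up in `tuned_models`, while XGBoost improved slightly and keeps its tuned parameters.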


## Blend & Stack Models

In [21]:
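# Blend: combine the tuned models by averaging their predictions (a voting regressor)
blended = blend_models(estimator_list=tuned_models, optimize="R2")
# Stack: a LightGBM meta-model learns from the base models' predictions
meta = create_model("lightgbm")
stacked = stack_models(estimator_list=tuned_models, meta_model=meta, optimize="R2")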


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,143016.1947,80836340715.4829,284317.324,0.9298,0.235,0.1848
1,163090.5989,259478792406.7036,509390.6089,0.8238,0.215,0.1611
2,168851.1498,104104832137.2697,322652.8043,0.925,0.2387,0.1949
3,146929.8761,82958579291.8495,288025.3102,0.9071,0.2343,0.1872
4,139884.4058,71590856061.0532,267564.6764,0.948,0.2202,0.1775
5,149582.8011,92861808725.2686,304732.3559,0.9133,0.2409,0.1804
6,164254.9594,328947176493.2176,573539.1674,0.8028,0.2733,0.2357
7,164123.7131,125776477521.0735,354649.7956,0.9013,0.2484,0.1969
8,132188.293,57255734031.9766,239281.7043,0.9412,0.239,0.1898
9,151040.1387,76898588189.9876,277305.9469,0.9397,0.255,0.2039


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,154179.4086,92176079923.718,303605.1382,0.92,0.2539,0.2011
1,174550.8938,272861620694.91,522361.5804,0.8147,0.231,0.1743
2,181546.8902,126764781718.9275,356040.4215,0.9086,0.2614,0.2111
3,157835.5283,99848871189.3838,315988.72,0.8882,0.3387,0.2039
4,146246.9552,74197804528.94,272392.7395,0.9461,0.2422,0.194
5,159033.4351,104229357169.0507,322845.7173,0.9027,0.2677,0.1905
6,175368.1609,321740144696.6518,567221.4248,0.8071,0.2706,0.2202
7,169099.3824,135569933410.5112,368198.2257,0.8936,0.2625,0.2088
8,139512.0688,72467011185.2536,269196.9747,0.9256,0.2557,0.2028
9,169163.2892,94634127709.766,307626.6044,0.9258,0.3359,0.2275


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,147978.4257,84421497362.5574,290553.7771,0.9267,0.2332,0.1853
1,173972.0395,256503128357.5767,506461.3789,0.8258,0.2181,0.1636
2,178836.9719,122905694844.3458,350579.085,0.9114,0.2479,0.1975
3,160638.6966,111404773527.5676,333773.5363,0.8752,0.2366,0.188
4,143640.8523,80075506117.43,282976.1582,0.9419,0.2192,0.1773
5,156939.9362,103275120012.0924,321364.466,0.9036,0.245,0.182
6,167754.135,314650704484.7026,560937.3445,0.8113,0.2778,0.246
7,170510.432,136195015683.0136,369046.0888,0.8931,0.2533,0.1985
8,144799.6246,92452185264.1786,304059.5094,0.905,0.24,0.1909
9,159384.3266,78480730247.8323,280144.1241,0.9384,0.2604,0.21
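
Blending works by averaging the base models' predictions. The toy example below shows why that can help. All numbers are made up for illustration and the model names are hypothetical:

```python
def mae(y_true, y_pred):
    """Mean absolute error of two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions from three hypothetical models
y_true = [100.0, 200.0, 300.0]
preds = {
    "model_a": [110.0, 190.0, 320.0],
    "model_b": [90.0, 215.0, 290.0],
    "model_c": [105.0, 195.0, 280.0],
}

# Blend = per-row average of the individual predictions
blend = [sum(row) / len(row) for row in zip(*preds.values())]

individual_mae = {name: mae(y_true, p) for name, p in preds.items()}
blend_mae = mae(y_true, blend)
print(individual_mae, round(blend_mae, 3))
```

The blend wins here because the three models err in different directions. When the base models make similar mistakes, as boosted-tree models trained on the same features often do, averaging gains much less.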


## Hold-out Evaluation Table

In [22]:
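# Score every candidate on the hold-out split and collect one metrics row per model
candidates = tuned_models + [blended, stacked]
rows = []
for m in candidates:
    _ = predict_model(m)                  # no data argument -> scores the hold-out set
    metrics = pull().iloc[-1].to_dict()   # metrics row that predict_model just displayed
    rows.append({"model": str(m), **metrics})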

holdout_df = pd.DataFrame(rows).sort_values("R2", ascending=False)
holdout_df


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,CatBoost Regressor,162293.9037,166033167385.4427,407471.6768,0.8651,0.2844,0.2371


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extreme Gradient Boosting,172713.7188,172596035584.0,415446.7812,0.8597,0.2908,0.2504


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,169493.2104,158944869512.2779,398678.9053,0.8708,0.2903,0.2453


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extra Trees Regressor,173159.5543,211037540771.4932,459388.2245,0.8285,0.2834,0.2412


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Random Forest Regressor,171501.1951,203185137171.5814,450760.6207,0.8349,0.2869,0.2453


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Voting Regressor,160698.8858,168096878707.3926,409996.1935,0.8634,0.2701,0.2323


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Stacking Regressor,164227.6258,136734854717.4399,369776.7634,0.8889,0.2686,0.2215


Unnamed: 0,model,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
6,"StackingRegressor(cv=5,\n est...",Stacking Regressor,164227.6258,136734900000.0,369776.7634,0.8889,0.2686,0.2215
2,"LGBMRegressor(n_jobs=-1, random_state=42)",Light Gradient Boosting Machine,169493.2104,158944900000.0,398678.9053,0.8708,0.2903,0.2453
0,<catboost.core.CatBoostRegressor object at 0x0...,CatBoost Regressor,162293.9037,166033200000.0,407471.6768,0.8651,0.2844,0.2371
5,VotingRegressor(estimators=[('CatBoost Regress...,Voting Regressor,160698.8858,168096900000.0,409996.1935,0.8634,0.2701,0.2323
1,"XGBRegressor(base_score=None, booster='gbtree'...",Extreme Gradient Boosting,172713.71875,172596000000.0,415446.78125,0.8597,0.2908,0.2504
4,"RandomForestRegressor(n_jobs=-1, random_state=42)",Random Forest Regressor,171501.1951,203185100000.0,450760.6207,0.8349,0.2869,0.2453
3,"ExtraTreesRegressor(n_jobs=-1, random_state=42)",Extra Trees Regressor,173159.5543,211037500000.0,459388.2245,0.8285,0.2834,0.2412
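
One detail to keep in mind when comparing these hold-out rows with the cross-validation leaderboard: on the hold-out set RMSE equals the square root of MSE, but in the leaderboard it does not. This is consistent with each leaderboard column being averaged over the folds separately. A quick check using the CatBoost figures reported above:

```python
import math

# CatBoost, compare_models leaderboard (10-fold CV)
cv_mse, cv_rmse = 129_602_150_192.8826, 343_869.076
# CatBoost, hold-out evaluation
holdout_mse, holdout_rmse = 166_033_167_385.4427, 407_471.6768

holdout_gap = math.sqrt(holdout_mse) - holdout_rmse   # essentially zero
cv_gap = math.sqrt(cv_mse) - cv_rmse                  # roughly 16,000 INR

print(round(holdout_gap, 2), round(cv_gap))
```

The mean of the fold RMSEs can never exceed the square root of the mean MSE, so CV RMSE values look slightly more optimistic than `sqrt(MSE)` would. Compare like with like when judging the CV-to-hold-out drop.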


## Finalize, Save, and Log the Champion Model
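- Hold-out leader: **Stacking Regressor** (highest R² on hold-out, 0.8889). Note that the cell below finalizes and saves the **CatBoost** model from `tuned_models` instead (hold-out R² = 0.8651), the model that topped the cross-validated leaderboard.
- Action:
  1) `finalize_model()` to refit on all available data (training plus hold-out rows) inside the PyCaret session.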
  2) Save the full pipeline to `../models/usedcar_price_model.pkl`.
  3) Log artifacts and metrics to MLflow (same experiment).


In [29]:
# --- pick the tuned CatBoost from your tuned_models list ---
cat_list = [m for m in tuned_models if "CatBoost" in str(m)]
if cat_list:
    best_cat = cat_list[0]
else:
    best_cat = tune_model(create_model("catboost"), optimize="R2")

# Finalize on full data (train on all rows within the PyCaret session)
final_champion = finalize_model(best_cat)

# ---- Pull metrics WITHOUT calling predict_model (avoid CatBoost name-order issue) ----
# Prefer metrics from the hold-out table; they describe the model before finalization

if 'holdout_df' in globals():
    row = (holdout_df[holdout_df['Model'].str.contains('CatBoost', na=False)]
           .head(1))
    if not row.empty:
        best_metrics = row.iloc[0][['MAE','MSE','RMSE','R2','RMSLE','MAPE']].to_dict()
    else:
        best_metrics = {}   # fallback: no metrics available
else:
    best_metrics = {}

# ---- Save model pipeline ----
models_dir = Path("../models"); models_dir.mkdir(parents=True, exist_ok=True)
save_path = models_dir / "usedcar_price_model"
save_model(final_champion, str(save_path))   # -> ../models/usedcar_price_model.pkl

# ---- Log to MLflow ----
with mlflow.start_run(run_name="champion_finalize_catboost"):
    # Log metrics if we have them
    for k, v in best_metrics.items():
        try:
            if isinstance(v, (int, float, np.floating)):
                mlflow.log_metric(k, float(v))
        except Exception:
            pass

    # Optional leaderboards: logged only if these CSVs were exported beforehand
    for p in [models_dir / "compare_leaderboard.csv",
              models_dir / "tune_leaderboard.csv",
              models_dir / "holdout_results.csv"]:
        if p.exists():
            mlflow.log_artifact(str(p), artifact_path="tables")

    # Log the model + input schema
    mlflow.log_artifact(f"{save_path}.pkl", artifact_path="model")
    schema = {
        "numeric_features": ["Year","Age","Kilometers_Driven","Log_Km","Mileage","Engine","Power","Seats"],
        "categorical_features": ["Fuel_Type","Transmission","Owner_Type","Location","Brand_Model","Brand"],
        "target": "Price"
    }
    schema_path = models_dir / "input_schema.json"
    with open(schema_path, "w") as f:
        json.dump(schema, f, indent=2)
    mlflow.log_artifact(str(schema_path), artifact_path="model")

print("Champion (CatBoost) saved at:", f"{save_path}.pkl")
print("Metrics (from holdout) logged to MLflow under run: champion_finalize_catboost")
# The relative path below assumes the command is run from the project root, where mlruns/ lives
print("View with:")
print("   mlflow ui --backend-store-uri .\\mlruns --host 127.0.0.1 --port 5001")
print("   then open http://127.0.0.1:5001")


Transformation Pipeline and Model Successfully Saved
Champion (CatBoost) saved at: ..\models\usedcar_price_model.pkl
Metrics (from holdout) logged to MLflow under run: champion_finalize_catboost
View with:
   mlflow ui --backend-store-uri .\mlruns --host 127.0.0.1 --port 5001
   then open http://127.0.0.1:5001
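
Since the saved pipeline expects the engineered columns as well as the raw ones, Task 3 can use `input_schema.json` to validate incoming records before prediction. A minimal sketch, assuming a hypothetical helper `missing_features`; the record is the first row of the cleaned dataset, deliberately left without `Age` and `Log_Km`:

```python
import json

# Same structure as the input_schema.json written in the cell above
schema = json.loads("""
{
  "numeric_features": ["Year", "Age", "Kilometers_Driven", "Log_Km",
                       "Mileage", "Engine", "Power", "Seats"],
  "categorical_features": ["Fuel_Type", "Transmission", "Owner_Type",
                           "Location", "Brand_Model", "Brand"],
  "target": "Price"
}
""")


def missing_features(record: dict, schema: dict) -> list:
    """Return the schema features absent from `record` (hypothetical helper)."""
    required = schema["numeric_features"] + schema["categorical_features"]
    return [name for name in required if name not in record]


# First row of the cleaned data, before feature engineering
record = {
    "Brand_Model": "Maruti Wagon R LXI CNG", "Location": "Mumbai", "Year": 2010,
    "Kilometers_Driven": 72000, "Fuel_Type": "CNG", "Transmission": "Manual",
    "Owner_Type": "First", "Mileage": 26.6, "Engine": 998, "Power": 58.16,
    "Seats": 5, "Brand": "Maruti",
}
print(missing_features(record, schema))
```

A check like this fails fast when a caller sends raw data without first applying the same feature engineering used in training.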
