# Introduction

**In this project we will train multiple models, focusing on the quality of the predictions, the speed of prediction, and the time required for training. The output generated by the models will be analyzed and conclusions will be drawn.**

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Good introduction that reflects the core goals. Well done!  
    
</div>

In [1]:
import time
import warnings

import lightgbm as lgb
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Data preparation

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
display(df.columns)

Index(['DateCrawled', 'Price', 'VehicleType', 'RegistrationYear', 'Gearbox',
       'Power', 'Model', 'Mileage', 'RegistrationMonth', 'FuelType', 'Brand',
       'NotRepaired', 'DateCreated', 'NumberOfPictures', 'PostalCode',
       'LastSeen'],
      dtype='object')

In [4]:
df = df.rename(columns={'DateCrawled': 'date_crawled', 'Price': 'price', 'VehicleType': 'vehicle_type', 'RegistrationYear': 'registration_year', 
                        'Gearbox': 'gearbox', 'Power': 'power', 'Model': 'model', 'Mileage': 'mileage', 'RegistrationMonth': 'registration_month',
                       'FuelType': 'fuel_type', 'Brand': 'brand', 'NotRepaired': 'not_repaired', 'DateCreated': 'date_created',
                       'NumberOfPictures': 'number_of_pictures', 'PostalCode': 'postal_code', 'LastSeen': 'last_seen'})

In [5]:
df.describe(include='all').round(2)

| | date_crawled | price | vehicle_type | registration_year | gearbox | power | model | mileage | registration_month | fuel_type | brand | not_repaired | date_created | number_of_pictures | postal_code | last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 354369 | 354369.0 | 316879 | 354369.0 | 334536 | 354369.0 | 334664 | 354369.0 | 354369.0 | 321474 | 354369 | 283215 | 354369 | 354369.0 | 354369.0 | 354369 |
| unique | 15470 | | 8 | | 2 | | 250 | | | 7 | 40 | 2 | 109 | | | 18592 |
| top | 05/03/2016 14:25 | | sedan | | manual | | golf | | | petrol | volkswagen | no | 03/04/2016 00:00 | | | 07/04/2016 07:16 |
| freq | 66 | | 91457 | | 268251 | | 29232 | | | 216352 | 77013 | 247161 | 13719 | | | 654 |
| mean | | 4416.66 | | 2004.23 | | 110.09 | | 128211.17 | 5.71 | | | | | 0.0 | 50508.69 | |
| std | | 4514.16 | | 90.23 | | 189.85 | | 37905.34 | 3.73 | | | | | 0.0 | 25783.1 | |
| min | | 0.0 | | 1000.0 | | 0.0 | | 5000.0 | 0.0 | | | | | 0.0 | 1067.0 | |
| 25% | | 1050.0 | | 1999.0 | | 69.0 | | 125000.0 | 3.0 | | | | | 0.0 | 30165.0 | |
| 50% | | 2700.0 | | 2003.0 | | 105.0 | | 150000.0 | 6.0 | | | | | 0.0 | 49413.0 | |
| 75% | | 6400.0 | | 2008.0 | | 143.0 | | 150000.0 | 9.0 | | | | | 0.0 | 71083.0 | |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes:

In [7]:
df = df.drop_duplicates()

In [8]:
df.duplicated().sum()

0

In [9]:
df.isnull().sum().sum()

181049

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
The data was successfully read, well done! 
    
</div>

**Identify excess variables to remove or drop.**

In [10]:
display(df['registration_year'].describe())

count    354107.000000
mean       2004.235355
std          90.261168
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

In [11]:
df = df[(df['registration_year'] >= 1900) & (df['registration_year'] <= 2016)]

In [12]:
# Use sine and cosine transformations to encode the cyclical pattern of the registration month.
df['reg_month_sin'] = np.sin(2 * np.pi * df['registration_month'] / 12)
df['reg_month_cos'] = np.cos(2 * np.pi * df['registration_month'] / 12)
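
The effect of the sine/cosine encoding can be checked on a few month values. The sketch below is illustrative only: the helper names `encode_month` and `circle_distance` are hypothetical and not part of the notebook. It shows that December and January end up as close as any two neighbouring months, while June and December sit on opposite sides of the circle.

```python
import numpy as np

def encode_month(month):
    """Map a month number onto the unit circle (period of 12)."""
    angle = 2 * np.pi * month / 12
    return np.sin(angle), np.cos(angle)

def circle_distance(m1, m2):
    """Euclidean distance between two encoded months."""
    s1, c1 = encode_month(m1)
    s2, c2 = encode_month(m2)
    return float(np.hypot(s1 - s2, c1 - c2))

dec_jan = circle_distance(12, 1)   # year boundary
jan_feb = circle_distance(1, 2)    # ordinary neighbours
jun_dec = circle_distance(6, 12)   # half a year apart
zero_vs_dec = circle_distance(0, 12)
```

Note that a month value of 0, which occurs in this dataset (`min` of `registration_month` is 0), lands on the same point as month 12 under this encoding, so the two cannot be told apart afterwards.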

In [13]:
# 'postal_code' has very high cardinality (a large number of unique values).
# Convert postal_code to string and keep the first 2 characters as a coarse region.
df['postal_region'] = df['postal_code'].astype(str).str[:2]
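
One detail worth checking is how integer postal codes behave when sliced as strings. The sketch below uses toy values and assumes the codes are meant to be five digits long with leading zeros lost when the column was read as integers. That is an assumption about the data, not something the notebook confirms (the smallest code shown by `describe()` is 1067).

```python
import pandas as pd

codes = pd.Series([1067, 50508, 99998])

# Slicing the raw string: a four-digit code yields '10'.
naive_region = codes.astype(str).str[:2]

# Padding to five characters first (assumed width) yields '01' for the same code.
padded_region = codes.astype(str).str.zfill(5).str[:2]
```

Whether the padded or the unpadded prefix is the better region key depends on the true postal code format, so this is only a check to run, not a required change.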

In [14]:
# Check for any correlation between 'date_created' and 'price' before deciding whether to drop it.
df['date_created'] = pd.to_datetime(df['date_created'], errors='coerce')

In [15]:
df['date_created_numeric'] = df['date_created'].astype('int64') // 1e9

In [16]:
correlation = df[['date_created_numeric', 'price']].corr().iloc[0, 1]
display(f"Correlation: {correlation:.4f}")

'Correlation: -0.0090'
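
The date-to-number conversion and the correlation check can be sketched on toy data. The version below is an alternative formulation, not the notebook's exact code: it converts dates to whole seconds since 1970-01-01 by dividing a timedelta, which does not depend on the internal resolution of the datetime column. The toy dates and prices are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2016-03-01", "2016-03-02", "2016-03-03", "2016-03-04"]))

# Seconds elapsed since the Unix epoch, as integers.
epoch_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

toy = pd.DataFrame({"created": epoch_seconds, "price": [1000, 1200, 1400, 1600]})

# Pearson correlation; the toy prices rise linearly with the date, so it is 1.
toy_corr = toy["created"].corr(toy["price"])
```

A value near 0, like the -0.0090 reported above, means there is almost no linear relationship between listing date and price.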

In [17]:
# 'date_crawled' and 'last_seen' can lead to huge feature sets (more memory use, slower training, overfitting).
# 'number_of_pictures' is 0 in every row, as shown by describe(), so it provides no useful information for prediction.
# 'registration_month' is cyclical, which is hard for models to interpret; its sine/cosine encoding replaces it.
# 'postal_code' is replaced by 'postal_region'.
# 'date_created' shows almost no correlation with 'price' (-0.0090).
df.drop(columns=['date_crawled', 'last_seen', 'number_of_pictures', 'registration_month', 
                 'postal_code', 'date_created'], inplace=True)

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Agreed! We don't need these columns. 
    
</div>


In [18]:
df = df.drop_duplicates()

In [19]:
df.duplicated().sum()

0

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment 2 </h2>
    
Well done! 
    
</div>

In [20]:
display(df['price'].describe())
display(df['price'].value_counts().sort_index())

count    329317.000000
mean       4449.925516
std        4539.379201
min           0.000000
25%        1050.000000
50%        2750.000000
75%        6499.000000
max       20000.000000
Name: price, dtype: float64

0        9765
1        1096
2          11
3           6
4           1
         ... 
19995       9
19997       1
19998       5
19999     264
20000     245
Name: price, Length: 3694, dtype: int64

In [21]:
# Keep only rows with a price between 100 and 20,000.
df = df[(df['price'] >= 100) & (df['price'] <= 20000)]

In [22]:
# Count the rows that are still outside the allowed range after filtering.
# A row is out of range if its price is below 100 OR above 20,000.
filtered_count = df[(df['price'] < 100) | (df['price'] > 20000)].shape[0]
display(f"Number of rows with price below 100 or above 20,000: {filtered_count}")

'Number of rows with price below 100 or above 20,000: 0'
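
The range filter and the follow-up check can be sketched on toy prices. This is a minimal illustration with made-up values; it uses `Series.between`, which is inclusive on both ends by default, as a shorthand for the two comparisons.

```python
import pandas as pd

toy = pd.DataFrame({"price": [0, 50, 100, 5000, 20000, 25000]})

# Inclusive range filter, equivalent to (price >= 100) & (price <= 20000).
in_range = toy["price"].between(100, 20000)
filtered = toy[in_range]

# A row is out of range if it is below the lower bound OR above the upper bound.
out_of_range = (filtered["price"] < 100) | (filtered["price"] > 20000)
remaining_outliers = int(out_of_range.sum())
```

Combining the two out-of-range conditions with `&` instead of `|` would always give 0, because no price can be below 100 and above 20,000 at the same time.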

In [23]:
display(df['power'].describe())
display(df['power'].value_counts().sort_index())

count    317147.000000
mean        112.152557
std         186.962846
min           0.000000
25%          71.000000
50%         105.000000
75%         143.000000
max       20000.000000
Name: power, dtype: float64

0        31027
1           23
2            9
3            8
4           30
         ...  
17932        1
19208        1
19211        1
19312        1
20000        1
Name: power, Length: 683, dtype: int64

In [24]:
# Keep only rows with power between 50 and 20,000.
df = df[(df['power'] >= 50) & (df['power'] <= 20000)]

In [25]:
# Count the rows that are still outside the allowed range after filtering.
# A row is out of range if its power is below 50 OR above 20,000.
filtered_count = df[(df['power'] < 50) | (df['power'] > 20000)].shape[0]
display(f"Number of rows with power below 50 or above 20,000: {filtered_count}")

'Number of rows with power below 50 or above 20,000: 0'

In [26]:
display(df['mileage'].describe())

count    280842.000000
mean     128552.584727
std       36525.298570
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: mileage, dtype: float64

In [27]:
df = df[df['mileage'] >= 5000]

In [28]:
display(df['brand'].describe())

count         280842
unique            40
top       volkswagen
freq           60555
Name: brand, dtype: object

In [29]:
display(df['brand'].head(10))

1           audi
2           jeep
3     volkswagen
4          skoda
5            bmw
6        peugeot
8           ford
9     volkswagen
10         mazda
11    volkswagen
Name: brand, dtype: object

In [30]:
df.shape

(280842, 14)

In [31]:
display(df.isnull().sum()) 

price                       0
vehicle_type            10365
registration_year           0
gearbox                  4833
power                       0
model                   10400
mileage                     0
fuel_type               14624
brand                       0
not_repaired            40622
reg_month_sin               0
reg_month_cos               0
postal_region               0
date_created_numeric        0
dtype: int64

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment 2 </h2>
    
Great! 
    
</div>


In [32]:
# Fill missing categorical values with an explicit 'unknown' category.
for column in ['vehicle_type', 'gearbox', 'model', 'fuel_type','not_repaired']:
    df[column] = df[column].fillna('unknown')

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment  </h2>
    
Yes, it's better than deleting them. Moreover, it is normal that sellers sometimes do not specify some information. The model should "know" about such cases. We should not even use the median or the mode. Even though the median does not skew the distribution, we have too many missing values to fill in, so there is a risk of biasing the data.   
    
</div>

In [34]:
display(df.isnull().sum())

price                   0
vehicle_type            0
registration_year       0
gearbox                 0
power                   0
model                   0
mileage                 0
fuel_type               0
brand                   0
not_repaired            0
reg_month_sin           0
reg_month_cos           0
postal_region           0
date_created_numeric    0
dtype: int64

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment  </h2>
    
Good job here!     
    
</div>

## Model training

**Class classSGB_LightGBM trains LightGBM, compares it with Random Forest, and includes a linear regression sanity check.**

In [35]:
warnings.filterwarnings("ignore", category=UserWarning)

In [36]:
class classSGB_LightGBM:
    def __init__(self, epochs=50, categorical_columns=None):
        self.epochs = epochs
        self.categorical_columns = categorical_columns
        self.label_encoders = {}
        self.models = {}
        self.best_model = None
        self.best_rmse = float('inf')
        self.performance_summary = {}
        self.training_times = {}
        self.prediction_times = {}
        self.features_val = None  # Holds validation and test data for future evaluation.
        self.target_val = None
        self.features_test = None
        self.target_test = None

    def _preprocess_data(self, df, target_column='price'):
        df = df.copy()  # Avoid modifying the original df.
        if not self.categorical_columns:  # Detect categorical columns automatically when none are given.
            self.categorical_columns = df.select_dtypes(include='object').columns.tolist()

        for column in self.categorical_columns:
            lab_enc = LabelEncoder()
            df[column] = lab_enc.fit_transform(df[column].astype(str)).astype(int)
            self.label_encoders[column] = lab_enc

        return df.drop(columns=[target_column]), df[target_column]

    def _compute_rmse(self, true_target, pred_target):
        return np.sqrt(mean_squared_error(true_target, pred_target))
    
    
    # Model tracking
    def _update_best_model(self, model_name, model, rmse, training_time=None, prediction_time=None):
        # Store the trained model with its name and validation RMSE.
        self.models[model_name] = model
        self.performance_summary[model_name] = rmse
        # Record training and prediction times when provided (0.0 is a valid time).
        if training_time is not None:
            self.training_times[model_name] = training_time
        if prediction_time is not None:
            self.prediction_times[model_name] = prediction_time
        # Keep track of the best model based on RMSE.
        if rmse < self.best_rmse:
            self.best_rmse = rmse
            self.best_model = model

    # LightGBM training method
    def _train_lightgbm(self, features_train, target_train, features_val, target_val, categorical_features):
        # Train one LightGBM model per parameter set (2 sets).
        display("Training LightGBM with tuning")
        param_grid = [
            {
                'boosting_type': 'gbdt',
                'objective': 'regression',
                'metric': 'rmse',
                'num_leaves': 30,
                'learning_rate': 0.05,
                'feature_fraction': 0.9,
                'bagging_fraction': 0.8,
                'bagging_freq': 5,
                'verbose': -1
            },
            {
                'boosting_type': 'gbdt',
                'objective': 'regression',
                'metric': 'rmse',
                'num_leaves': 60,
                'learning_rate': 0.1,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.7,
                'bagging_freq': 3,
                'verbose': -1
            }
        ]
        
        # Train a LightGBM model for each parameter set.
        for i, params in enumerate(param_grid):
            train_data = lgb.Dataset(features_train, label=target_train,
                                   categorical_feature=categorical_features, free_raw_data=False)
            val_data = lgb.Dataset(features_val, label=target_val,
                                 categorical_feature=categorical_features, free_raw_data=False)

            start_time = time.time()
            model = lgb.train(params, train_data, num_boost_round=self.epochs,
                            valid_sets=[val_data], valid_names=['valid'],
                            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)])
            training_time = time.time() - start_time
            
            start_pred_time = time.time()
            preds = model.predict(features_val, num_iteration=model.best_iteration)
            prediction_time = time.time() - start_pred_time
            
            rmse = self._compute_rmse(target_val, preds)
            display(f"LightGBM Set {i+1} RMSE: {rmse:.4f} | Training time: {training_time:.2f}s | Prediction time: {prediction_time:.2f}s")
            self._update_best_model(f"LightGBM (Set {i+1})", model, rmse, training_time, prediction_time)

    def _train_random_forest(self, features_train, target_train, features_val, target_val):
        display("Training Random Forest with hyperparameter tuning")
        rf = RandomForestRegressor(
            n_estimators=100,
            max_depth=15,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=12345,
            n_jobs=-1
        )
        
        start_time = time.time()
        rf.fit(features_train, target_train)
        training_time = time.time() - start_time
        
        start_pred_time = time.time()
        preds = rf.predict(features_val)
        prediction_time = time.time() - start_pred_time
        
        rmse = self._compute_rmse(target_val, preds)
        display(f"Random Forest RMSE: {rmse:.4f} | Training time: {training_time:.2f}s | Prediction time: {prediction_time:.2f}s")
        self._update_best_model("Random Forest", rf, rmse, training_time, prediction_time)

    def fit(self, df, target_column='price'):
        features, target = self._preprocess_data(df, target_column)

        # Split into train and temp (validation + test)
        features_train, features_temp, target_train, target_temp = train_test_split(
            features, target, test_size=0.4, random_state=12345)

        # Split temp into validation and test
        features_val, features_test, target_val, target_test = train_test_split(
            features_temp, target_temp, test_size=0.5, random_state=12345)

        self.features_val = features_val
        self.target_val = target_val
        self.features_test = features_test
        self.target_test = target_test

        categorical_features = [
            features.columns.get_loc(col)
            for col in self.categorical_columns if col in features.columns
        ]

        # Train models on training set and evaluate on validation set
        self._train_lightgbm(features_train, target_train, features_val, target_val, categorical_features)
        self._train_random_forest(features_train, target_train, features_val, target_val)

        # Linear regression as baseline
        display("Training Linear Regression")
        lr = LinearRegression()
        
        start_train_time = time.time()
        lr.fit(features_train, target_train)
        training_time = time.time() - start_train_time
        
        start_pred_time = time.time()
        pred = lr.predict(features_val)
        prediction_time = time.time() - start_pred_time
        
        rmse = self._compute_rmse(target_val, pred)
        display(f"Linear Regression RMSE: {rmse:.4f} | Training time: {training_time:.2f}s | Prediction time: {prediction_time:.2f}s")
        self._update_best_model("Linear Regression", lr, rmse, training_time, prediction_time)

        # Display performance summary
        display("Model Performance Summary:")
        perf_table = pd.DataFrame({
            'RMSE': self.performance_summary,
            'Training Time (s)': self.training_times,
            'Prediction Time (s)': self.prediction_times
        }).sort_values(by='RMSE')
        display(perf_table)

        return self
        
    def evaluate_on_test(self):
        display("Final Evaluation on Test Set (only done once at the end)")
        start_time = time.time()
        if hasattr(self.best_model, 'best_iteration'):
            test_preds = self.best_model.predict(self.features_test, num_iteration=self.best_model.best_iteration)
        else:
            test_preds = self.best_model.predict(self.features_test)
        prediction_time = time.time() - start_time
        
        test_rmse = self._compute_rmse(self.target_test, test_preds)
        #display(f"Best Model: {self.best_model}")
        display(f"Best Model: {self.best_model.__class__.__name__}")
        display(f"Test RMSE: {test_rmse:.4f} | Prediction time: {prediction_time:.2f}s")
        
        return test_rmse
    
    
   

    def predict(self, features):
        if self.best_model is None:
            raise ValueError("Model not trained yet")

        features = features.copy()
        for column in self.categorical_columns:
            # Encode only raw (non-numeric) columns; the held-out sets stored by fit()
            # are already label-encoded and must not be encoded a second time.
            if column in features.columns and not pd.api.types.is_numeric_dtype(features[column]):
                lab_enc = self.label_encoders[column]
                mapping = {label: code for code, label in enumerate(lab_enc.classes_)}
                # Categories not seen during fit() are mapped to -1.
                features[column] = features[column].astype(str).map(mapping).fillna(-1).astype(int)

        start_time = time.time()
        predictions = self.best_model.predict(features)
        prediction_time = time.time() - start_time
        display(f"Prediction completed in {prediction_time:.2f} seconds")

        return predictions
    
model = classSGB_LightGBM(
    epochs=50,
    categorical_columns=['vehicle_type', 'gearbox', 'model', 'fuel_type', 'not_repaired', 'brand', 'postal_region']
)
model.fit(df)  # fit() holds out the test set internally
predictions = model.predict(model.features_test)  # best model's predictions on the held-out test set
test_rmse = model.evaluate_on_test()  # final RMSE on the test set

'Training LightGBM with tuning'

'LightGBM Set 1 RMSE: 1836.6051 | Training time: 2.67s | Prediction time: 0.20s'

'LightGBM Set 2 RMSE: 1632.6453 | Training time: 3.09s | Prediction time: 0.22s'

'Training Random Forest with hyperparameter tuning'

'Random Forest RMSE: 1656.7389 | Training time: 43.69s | Prediction time: 0.70s'

'Training Linear Regression'

'Linear Regression RMSE: 3368.2956 | Training time: 0.03s | Prediction time: 0.06s'

'Model Performance Summary:'

| | RMSE | Training Time (s) | Prediction Time (s) |
|---|---|---|---|
| LightGBM (Set 2) | 1632.645348 | 3.093858 | 0.21575 |
| Random Forest | 1656.738858 | 43.691014 | 0.698676 |
| LightGBM (Set 1) | 1836.605096 | 2.666669 | 0.196634 |
| Linear Regression | 3368.295551 | 0.029903 | 0.058012 |


'Prediction completed in 0.08 seconds'

'Final Evaluation on Test Set (only done once at the end)'

'Best Model: Booster'

'Test RMSE: 1636.8813 | Prediction time: 0.28s'

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment 3 </h2>
    
Great!     
    
</div>


## Model analysis

**Performance Evaluation**

**We measure model error using the Root Mean Squared Error (RMSE) of the predicted price. This is done for the following models: LightGBM with two parameter sets, Random Forest, and Linear Regression. LightGBM (Set 2) has the lowest RMSE of all: 1632.65 on the validation set and 1636.88 on the test set.**
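
RMSE is the square root of the mean squared difference between true and predicted prices, so it is expressed in the same units as the price. A minimal sketch with made-up numbers (the function name `rmse` is illustrative; the notebook computes the same quantity in `_compute_rmse`):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 100, -100 and 200, so the mean squared error is 20000.
toy_rmse = rmse([1000, 2000, 3000], [1100, 1900, 3200])
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.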

**Comparison Across Models.**

**Here we store the RMSE of each model in self.performance_summary and display it as a table. Random Forest came next, about 24 points worse than the best LightGBM model (1656.74 vs. 1632.65), followed by LightGBM (Set 1) at 1836.61. Linear Regression had the worst performance (3368.30).**

**Best Model Selection.**

**The model with the lowest validation RMSE is automatically stored as self.best_model. Here that is LightGBM (Set 2), with a validation RMSE of 1632.65.**
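
The selection rule amounts to taking the minimum over a name-to-RMSE mapping. The sketch below uses the validation RMSE values reported in the output above; the variable names are illustrative and not taken from the class.

```python
validation_rmse = {
    "LightGBM (Set 1)": 1836.6051,
    "LightGBM (Set 2)": 1632.6453,
    "Random Forest": 1656.7389,
    "Linear Regression": 3368.2956,
}

# The best model is the key with the smallest RMSE.
best_name = min(validation_rmse, key=validation_rmse.get)

# Full ranking from best to worst.
ranking = sorted(validation_rmse, key=validation_rmse.get)
```

Selecting on validation RMSE and reporting test RMSE only for the chosen model keeps the test set out of the selection step.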

**Sanity Check.**

**Here we do a quick comparison between the simplest and the best models, which shows how much improvement was gained from using more advanced models. Linear Regression acts as a good baseline and sanity check: its performance is worse than that of all ensemble methods.**
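
The size of the gain over the baseline can be expressed as a relative reduction in RMSE. A small sketch using the validation values reported above (Linear Regression as the baseline, LightGBM Set 2 as the best model):

```python
baseline_rmse = 3368.2956  # Linear Regression, validation set
best_rmse = 1632.6453      # LightGBM (Set 2), validation set

# Fraction of the baseline error removed by the best model.
relative_reduction = (baseline_rmse - best_rmse) / baseline_rmse
```

With these numbers the best model removes roughly half of the baseline error.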


**Class CatBoostModel trains a CatBoost model, compares it with Random Forest, and includes a linear regression sanity check.**

In [37]:
class CatBoostModel:
    def __init__(self, iterations=500, learning_rate=0.1, depth=6, categorical_columns=None):
        self.iterations = iterations
        self.learning_rate = learning_rate
        self.depth = depth
        self.categorical_columns = categorical_columns or []
        self.label_encoders = {}
        self.models = {}
        self.validation_rmse = {}
        self.training_times = {}
        self.prediction_times = {}
        self.best_model = None
        self.best_model_name = None
        self.features_val = None
        self.target_val = None
        self.features_test = None
        self.target_test = None

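    def _encode_features(self, df):
        # Label-encode the categorical columns for models that need numeric input.
        # Caution: every call refits the encoders, so codes produced by separate calls
        # (e.g. train vs. validation) are not guaranteed to match.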
        df = df.copy()
        for column in self.categorical_columns:
            if column in df.columns:
                le = LabelEncoder()
                df[column] = le.fit_transform(df[column].astype(str))
                self.label_encoders[column] = le
        return df

    def _compute_rmse(self, true_target, pred_target):
        return np.sqrt(mean_squared_error(true_target, pred_target))

    def _predict(self, model, model_name, features):
        features = features.copy()
        if model_name in ["Random Forest", "Linear Regression"]:
            for column in self.categorical_columns:
                if column in features.columns:
                    le = self.label_encoders[column]
                    features[column] = features[column].map(
                        lambda x: le.transform([str(x)])[0] if str(x) in le.classes_ else -1
                    )
        return model.predict(features)

    def fit(self, df, target_column='price'):
        df = df.copy()
        if not self.categorical_columns:
            self.categorical_columns = df.select_dtypes(include='object').columns.tolist()

        features = df.drop(columns=[target_column])
        target = df[target_column]

        
        features_train, features_temp, target_train, target_temp = train_test_split(
            features, target, test_size=0.4, random_state=12345)

        features_val, features_test, target_val, target_test = train_test_split(
            features_temp, target_temp, test_size=0.5, random_state=12345)

        self.features_val = features_val
        self.target_val = target_val
        self.features_test = features_test
        self.target_test = target_test

        # CatBoost 
        display("Training CatBoost Regressor")
        train_pool = Pool(features_train, label=target_train, cat_features=self.categorical_columns)
        val_pool = Pool(features_val, label=target_val, cat_features=self.categorical_columns)

        start_time = time.time()
        catboost_model = CatBoostRegressor(
            iterations=self.iterations,
            learning_rate=self.learning_rate,
            depth=self.depth,
            loss_function='RMSE',
            eval_metric='RMSE',
            verbose=0,
            early_stopping_rounds=20
        )
        catboost_model.fit(train_pool, eval_set=val_pool)
        training_time = time.time() - start_time

        start_pred_time = time.time()
        val_preds = catboost_model.predict(features_val)
        prediction_time = time.time() - start_pred_time

        val_rmse = self._compute_rmse(target_val, val_preds)

        self.models["CatBoost"] = catboost_model
        self.validation_rmse["CatBoost"] = val_rmse
        self.training_times["CatBoost"] = training_time
        self.prediction_times["CatBoost"] = prediction_time

        # Random Forest
        display("Training Random Forest")
        encoded_train = self._encode_features(features_train)
        encoded_val = self._encode_features(features_val)

        start_time = time.time()
        rf = RandomForestRegressor(
            n_estimators=100,
            max_depth=15,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=12345,
            n_jobs=-1
        )
        rf.fit(encoded_train, target_train)
        training_time = time.time() - start_time

        start_pred_time = time.time()
        val_preds = rf.predict(encoded_val)
        prediction_time = time.time() - start_pred_time

        val_rmse = self._compute_rmse(target_val, val_preds)

        self.models["Random Forest"] = rf
        self.validation_rmse["Random Forest"] = val_rmse
        self.training_times["Random Forest"] = training_time
        self.prediction_times["Random Forest"] = prediction_time

        # Linear Regression 
        display("Training Linear Regression")
        start_time = time.time()
        lr = LinearRegression()
        lr.fit(encoded_train, target_train)
        training_time = time.time() - start_time

        start_pred_time = time.time()
        val_preds = lr.predict(encoded_val)
        prediction_time = time.time() - start_pred_time

        val_rmse = self._compute_rmse(target_val, val_preds)

        self.models["Linear Regression"] = lr
        self.validation_rmse["Linear Regression"] = val_rmse
        self.training_times["Linear Regression"] = training_time
        self.prediction_times["Linear Regression"] = prediction_time

        # Best model based on validation RMSE
        self.best_model_name = min(self.validation_rmse, key=self.validation_rmse.get)
        self.best_model = self.models[self.best_model_name]

        # Table
        display("Model Validation Performance Summary:")
        perf_table = pd.DataFrame({
            'Validation RMSE': self.validation_rmse,
            'Training Time (s)': self.training_times,
            'Prediction Time (s)': self.prediction_times
        }).sort_values(by='Validation RMSE')
        display(perf_table)

        display(f" Best Model Selected (based on validation RMSE): {self.best_model_name}")
        return self

    def evaluate_on_test(self):
        display(f"Evaluating best model on test set: {self.best_model_name}")
        preds = self._predict(self.best_model, self.best_model_name, self.features_test)
        test_rmse = self._compute_rmse(self.target_test, preds)
        display(f"Test RMSE (Best Model: {self.best_model_name}): {test_rmse:.4f}")
        return test_rmse

    def predict(self, features):
        if self.best_model is None:
            raise ValueError("Model not trained yet.")
        start_time = time.time()
        predictions = self._predict(self.best_model, self.best_model_name, features)
        prediction_time = time.time() - start_time
        display(f"Prediction using {self.best_model_name} completed in {prediction_time:.2f} seconds")
        return predictions
    
model = CatBoostModel(
    iterations=300,
    learning_rate=0.05,
    depth=8,
    categorical_columns=['vehicle_type', 'gearbox', 'model', 'fuel_type', 'not_repaired', 'brand', 'postal_region']
)

model.fit(df) 
predictions = model.predict(model.features_test)
#predictions = model.predict(df)    
test_rmse = model.evaluate_on_test()   # Evaluates only best model on test set

'Training CatBoost Regressor'

'Training Random Forest'

'Training Linear Regression'

'Model Validation Performance Summary:'

| | Validation RMSE | Training Time (s) | Prediction Time (s) |
|---|---|---|---|
| CatBoost | 1648.854673 | 81.755875 | 0.170393 |
| Random Forest | 1733.289898 | 43.146114 | 0.680131 |
| Linear Regression | 3367.032198 | 0.030923 | 0.022759 |


' Best Model Selected (based on validation RMSE): CatBoost'

'Prediction using CatBoost completed in 0.18 seconds'

'Evaluating best model on test set: CatBoost'

'Test RMSE (Best Model: CatBoost): 1644.5471'

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment 2 </h2>
    
Very good!     
</div>

## Model analysis

**Performance Evaluation**

**We measure model error using the Root Mean Squared Error (RMSE) of the predicted price. This is done for the following models: CatBoost, Random Forest, and Linear Regression. CatBoost reaches an RMSE of 1648.85 on the validation set and 1644.55 on the test set.**

**Comparison Across Models.**

**Here we store the validation RMSE of each model in self.validation_rmse and display it as a table. Random Forest came next, about 84 points worse than CatBoost (1733.29 vs. 1648.85). Linear Regression had the worst performance (3367.03).**
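
Random Forest and Linear Regression need numeric inputs, so the categorical columns are label-encoded for them. For the comparison to be fair, the same category-to-code mapping has to be used on the training and validation sets. The sketch below shows one way to guarantee that, using `pd.Categorical` with categories learned from the training data only; this is a swapped-in technique for illustration, not the `LabelEncoder` approach used in the class, and the toy values are made up.

```python
import pandas as pd

train = pd.Series(["sedan", "wagon", "sedan", "coupe"])
val = pd.Series(["wagon", "suv", "coupe"])

# Learn the category list from the training data only.
categories = sorted(train.unique())

# Reuse the same list for both sets; values outside it get the code -1.
train_codes = pd.Categorical(train, categories=categories).codes.tolist()
val_codes = pd.Categorical(val, categories=categories).codes.tolist()
```

Here "wagon" receives the same code in both sets, and "suv", which never appears in training, is marked with -1 instead of silently shifting the other codes.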

**Best Model Selection.**

**The model with the lowest validation RMSE is automatically stored as self.best_model. Here that is CatBoost, with a validation RMSE of 1648.85 and a test RMSE of 1644.55.**

**Sanity Check.**

**Here we do a quick comparison between the simplest and the best models, which shows how much improvement was gained from using more advanced models. Linear Regression acts as a good baseline and sanity check: its performance is worse than that of all ensemble methods.**


## Conclusion

**Based on the output of all trained models, gradient boosting models (LightGBM, CatBoost) outperform Random Forest and Linear Regression when properly tuned. LightGBM stands out as the best choice for this problem based on RMSE (test RMSE of 1636.88 vs. 1644.55 for CatBoost). Hyperparameter tuning is crucial: poor settings degrade performance, while good settings yield significant improvements (LightGBM Set 1 reached a validation RMSE of 1836.61, Set 2 reached 1632.65; see the performance table above). As for computational efficiency, LightGBM and CatBoost were close on test-set prediction time (0.28 s vs. 0.18 s), but LightGBM trained far faster (about 3 s vs. about 82 s for CatBoost).**
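
The training and prediction times compared above are wall-clock measurements taken around each call. A minimal sketch of such a measurement, assuming a small hypothetical helper named `timed` (the notebook itself uses `time.time()` inline); `time.perf_counter()` is used here because it is intended for measuring short durations:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a simple built-in call.
total, seconds = timed(sum, range(1_000_000))
```

Single measurements like these vary from run to run, so differences of a few hundredths of a second between models should be read with caution.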

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>

    
Great conclusion! This is a solid final summary with comparison across models. 
    
    
</div>    
 


# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed