# Numerical Methods Project <a id='intro'></a>

## Project Description

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Project instructions
1) Download and look at the data.
2) Train different models with various hyperparameters. Compare gradient boosting methods with random forest, decision tree, and linear regression.
3) Analyze the speed and quality of the models.

## Data description
The dataset is stored in the file `/datasets/car_data.csv`.

### Features
- **DateCrawled** — date profile was downloaded from the database
- **VehicleType** — vehicle body type
- **RegistrationYear** — vehicle registration year
- **Gearbox** — gearbox type
- **Power** — power (hp)
- **Model** — vehicle model
- **Mileage** — mileage (measured in km due to dataset's regional specifics)
- **RegistrationMonth** — vehicle registration month
- **FuelType** — fuel type
- **Brand** — vehicle brand
- **NotRepaired** — vehicle repaired or not
- **DateCreated** — date of profile creation
- **NumberOfPictures** — number of vehicle pictures
- **PostalCode** — postal code of profile owner (user)
- **LastSeen** — date of the last activity of the user

###  Target
- **Price** — price (Euro)

## Initialization

In [1]:
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor


## Load Data

In [2]:
def load_data(file_name, local_path, server_path):
    try:
        data = pd.read_csv(local_path + file_name)
        print(f"'{file_name}' file successfully read from the local path.")

    except FileNotFoundError:
        try:
            data = pd.read_csv(server_path + file_name)
            print(f"'{file_name}' file successfully read from the server path.")

        except FileNotFoundError:
            print(f"'{file_name}' file not found. Please check the file paths.")
            data = None
            
    return data

file_name = 'car_data.csv'
local_path = '/Users/benjaminstephen/Documents/TripleTen/Sprint_12/Numerical_Methods_Project/datasets/'
server_path = '/datasets/'

df = load_data(file_name, local_path, server_path)

'car_data.csv' file successfully read from the local path.


## Data Preparation

### Basic EDA

To start, we will conduct basic exploratory data analysis and clean the data to ensure it is optimally prepared for model training.

In [3]:
display(df)

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354364 | 21/03/2016 09:50 | 0 | | 2005 | manual | 0 | colt | 150000 | 7 | petrol | mitsubishi | yes | 21/03/2016 00:00 | 0 | 2694 | 21/03/2016 10:42 |
| 354365 | 14/03/2016 17:48 | 2200 | | 2005 | | 0 | | 20000 | 1 | | sonstige_autos | | 14/03/2016 00:00 | 0 | 39576 | 06/04/2016 00:46 |
| 354366 | 05/03/2016 19:56 | 1199 | convertible | 2000 | auto | 101 | fortwo | 125000 | 3 | petrol | smart | no | 05/03/2016 00:00 | 0 | 26135 | 11/03/2016 18:17 |
| 354367 | 19/03/2016 18:57 | 9200 | bus | 1996 | manual | 102 | transporter | 150000 | 3 | gasoline | volkswagen | no | 19/03/2016 00:00 | 0 | 87439 | 07/04/2016 07:15 |


In [4]:
# df.info() prints its summary directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)


This DataFrame contains 354,369 vehicle listings with attributes such as price, vehicle type, and mileage. Five columns have missing values: NotRepaired (about 20% missing), VehicleType (about 11%), FuelType (about 9%), and Gearbox and Model (about 6% each). The large gap in NotRepaired suggests that repair status is frequently undisclosed, which may affect assessments of vehicle condition. The Price column has no nulls, so every row has a target value, although the preview above also shows listings with a price of 0, so non-null does not necessarily mean valid. These gaps need to be handled before model training, and the way they are handled may influence the models' performance.
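The missing share per column can be computed directly rather than read off the `info()` output. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real data); on the real data the same expression would presumably be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are illustrative only).
toy = pd.DataFrame({
    "VehicleType": ["coupe", None, "suv", "small"],
    "Gearbox": ["manual", "manual", None, "auto"],
    "Price": [480, 18300, 9800, 1500],
})

# isna().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting the result puts the columns with the largest gaps first, which makes it easier to decide where dropping or filling values matters most.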

Let's begin the preprocessing by checking for and handling any duplicate rows in the dataset. This step is crucial to ensure data integrity and avoid skewing the analysis or model training with redundant information.

In [5]:
print("NUMBER OF DUPLICATED ROWS:", df.duplicated().sum())

NUMBER OF DUPLICATED ROWS: 262


There are 262 duplicated rows in the dataset. We will handle this by removing them.

In [6]:
new_df = df.drop_duplicates()
print("NUMBER OF DUPLICATED ROWS:", new_df.duplicated().sum())

NUMBER OF DUPLICATED ROWS: 0


In [7]:
# info() prints its summary directly and returns None, so no print() is needed
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354107 non-null  object
 1   Price              354107 non-null  int64 
 2   VehicleType        316623 non-null  object
 3   RegistrationYear   354107 non-null  int64 
 4   Gearbox            334277 non-null  object
 5   Power              354107 non-null  int64 
 6   Model              334406 non-null  object
 7   Mileage            354107 non-null  int64 
 8   RegistrationMonth  354107 non-null  int64 
 9   FuelType           321218 non-null  object
 10  Brand              354107 non-null  object
 11  NotRepaired        282962 non-null  object
 12  DateCreated        354107 non-null  object
 13  NumberOfPictures   354107 non-null  int64 
 14  PostalCode         354107 non-null  int64 
 15  LastSeen           354107 non-null  object
dtypes: int64(7), object(9)

Removing duplicate rows does not remove the null values: five columns still contain them. To keep preprocessing simple, we drop every row that contains at least one null value.

In [8]:
new_df = new_df.dropna() 

print("PERCENTAGE OF NULL VALUES:")
print("--------------------------")
# Multiply the null fraction by 100 so the values match the "percentage" label
print(new_df.isnull().mean() * 100)

PERCENTAGE OF NULL VALUES:
--------------------------
DateCrawled          0.0
Price                0.0
VehicleType          0.0
RegistrationYear     0.0
Gearbox              0.0
Power                0.0
Model                0.0
Mileage              0.0
RegistrationMonth    0.0
FuelType             0.0
Brand                0.0
NotRepaired          0.0
DateCreated          0.0
NumberOfPictures     0.0
PostalCode           0.0
LastSeen             0.0
dtype: float64


With all null values removed, 245,567 of the 354,107 deduplicated rows remain (about 69%). The dataset is now ready to be divided into training, validation, and test sets.
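Because dropping rows discards data, it can help to measure how much is lost and to compare against an alternative. The sketch below is illustrative only: `toy` is a made-up frame, and filling missing categories with an `"unknown"` placeholder is an assumption shown for comparison, not the approach taken in this notebook.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are illustrative only).
toy = pd.DataFrame({
    "VehicleType": ["coupe", None, "suv", "small"],
    "NotRepaired": ["yes", None, None, "no"],
    "Price": [18300, 480, 9800, 1500],
})

# Approach used in the notebook: drop every row that has at least one null.
dropped = toy.dropna()
rows_lost = len(toy) - len(dropped)
print(f"dropna keeps {len(dropped)} of {len(toy)} rows ({rows_lost} lost)")

# Alternative (an assumption, not the notebook's method): keep all rows and
# mark missing categories with an explicit placeholder value.
categorical_cols = ["VehicleType", "NotRepaired"]
filled = toy.copy()
filled[categorical_cols] = filled[categorical_cols].fillna("unknown")
print(filled)
```

Whether the placeholder approach would improve the models here is untested; it simply keeps rows that `dropna()` would remove.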

### Data Splitting

We will now split the data. 'Price' is separated out as the target, and the features 'DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', and 'LastSeen' are dropped because they are deemed irrelevant to the vehicle's selling price. The remaining columns form the feature set used for training, validation, and testing.

In [9]:
# Extract the feature variables
features = new_df.drop(['DateCrawled', 'Price', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)

# Extract the target variable 'Price'
target = new_df['Price']

# Split data: 60% for training, 20% for validation, and 20% for testing
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=12345)

# Further split the remaining 40% into validation (20%) and test (20%) sets
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)
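The two-step split can be checked on synthetic data. This is a minimal sketch assuming 1,000 made-up rows (`X` and `y` are hypothetical); it only confirms that `test_size=0.4` followed by `test_size=0.5` yields a 60/20/20 split with no overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows with a single feature (illustrative only).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: hold out 40% of the rows.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=12345
)
# Step 2: split the held-out 40% in half, giving 20% validation and 20% test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=12345
)

for name, part in [("train", X_train), ("valid", X_valid), ("test", X_test)]:
    print(name, len(part), f"{len(part) / len(X):.0%}")
```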

In [10]:
categorical_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

With the data split, we will now encode the categorical variables using two methods: label (ordinal) encoding and one-hot encoding. Label encoding, which maps each category to an integer, will be used for every model except XGBoost. It is compact and works well for tree-based models, which split on thresholds, but it imposes an arbitrary order on the categories, which a linear model can misread as a numeric relationship. For XGBoost, we will use one-hot encoding, which turns each category into its own binary column. This avoids unintended ordinal relationships at the cost of a much wider feature matrix.
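The difference between the two encodings is easiest to see on a single column. The sketch below uses a made-up `fuel` series and plain pandas rather than the encoders used later in the notebook, so it illustrates the idea only.

```python
import pandas as pd

# Illustrative values only; 'fuel' is a hypothetical stand-in for FuelType.
fuel = pd.Series(["petrol", "gasoline", "lpg", "petrol"], name="FuelType")

# Ordinal (label) encoding: one integer per category, assigned alphabetically.
ordinal = fuel.astype("category").cat.codes
print(ordinal.tolist())          # [2, 0, 1, 2]

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(fuel)
print(one_hot.columns.tolist())  # ['gasoline', 'lpg', 'petrol']
```

In the ordinal version, "petrol" (2) looks numerically larger than "gasoline" (0) even though no such ordering exists. The one-hot version avoids this, using three columns instead of one.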

### One-Hot Encoding

We will apply One-Hot encoding to the categorical features in the training, validation, and test sets using the 'one_hot_encode_categrocial_features' function. This function converts categorical variables into binary columns and merges them with the remaining numeric features. After encoding, we will ensure that all sets have the same feature columns by reindexing and filling any missing columns with zeros. This step guarantees consistency across the datasets. Finally, we will print the dimensions of the one-hot encoded training, validation, and test sets to verify the changes.

In [11]:
# Function to one-hot encode specified categorical features
def one_hot_encode_categrocial_features(all_features, categorical_features):
    # Create one-hot encoded columns for the specified categorical features
    categorical_features_ohe = pd.get_dummies(all_features[categorical_features])
    # Drop the original categorical columns from the dataset
    numeric_features = all_features.drop(columns=categorical_features)
    # Concatenate the numeric features with the one-hot encoded columns
    return pd.concat([numeric_features, categorical_features_ohe], axis=1)

# Apply one-hot encoding to the training, validation, and test sets
features_train_ohe = one_hot_encode_categrocial_features(features_train, categorical_features)
features_valid_ohe = one_hot_encode_categrocial_features(features_valid, categorical_features)
features_test_ohe = one_hot_encode_categrocial_features(features_test, categorical_features)

# Ensure all datasets have the same columns by combining columns from all sets
all_columns = features_train_ohe.columns.union(features_valid_ohe.columns).union(features_test_ohe.columns)

# Reindex each dataset to include all columns, filling missing columns with zeros
features_train_ohe = features_train_ohe.reindex(columns=all_columns, fill_value=0)
features_valid_ohe = features_valid_ohe.reindex(columns=all_columns, fill_value=0)
features_test_ohe = features_test_ohe.reindex(columns=all_columns, fill_value=0)

# Print the size of each one-hot encoded feature set
print('Training Feature Set Size (One-Hot Encoded):', features_train_ohe.shape)
print()
print('Validation Feature Set Size (One-Hot Encoded):', features_valid_ohe.shape)
print()
print('Test Feature Set Size (One-Hot Encoded):', features_test_ohe.shape)

Training Feature Set Size (One-Hot Encoded): (147340, 311)

Validation Feature Set Size (One-Hot Encoded): (49113, 311)

Test Feature Set Size (One-Hot Encoded): (49114, 311)
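The column-alignment step can be illustrated on two tiny frames. This sketch assumes made-up `train` and `valid` frames in which one brand appears in only one of them; it mirrors the union-and-reindex idea used above on a much smaller scale.

```python
import pandas as pd

# Illustrative frames: 'smart' appears only in valid, 'jeep' and 'skoda' only in train.
train = pd.DataFrame({"Brand": ["audi", "jeep", "skoda"]})
valid = pd.DataFrame({"Brand": ["audi", "smart"]})

train_ohe = pd.get_dummies(train)
valid_ohe = pd.get_dummies(valid)

# Build the union of all one-hot columns, then reindex both frames to it.
# Columns a frame did not have are created and filled with zeros.
all_columns = train_ohe.columns.union(valid_ohe.columns)
train_aligned = train_ohe.reindex(columns=all_columns, fill_value=0)
valid_aligned = valid_ohe.reindex(columns=all_columns, fill_value=0)

print(train_aligned.columns.tolist())
print(train_aligned.shape, valid_aligned.shape)
```

After alignment both frames have the same columns in the same order, which is what a fitted model expects at prediction time.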


### Label Encoding


Now we will initialize an OrdinalEncoder to handle the categorical features, configured so that any category not seen during fitting is encoded as -1. We will first fit the encoder on the categorical features of the training set only, so that the encoding scheme is learned from the training data. Next, we will apply this encoding to the categorical features in the training, validation, and test sets, converting them into ordinal values. This ensures that all datasets are encoded consistently. Finally, we will print the sizes of the datasets to confirm that the encoding has been applied correctly.

In [12]:
# Initialize the OrdinalEncoder with handling for unknown values
label_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit the encoder on the categorical features from the training set
label_encoder.fit(features_train[categorical_features])

# Transform the categorical features in the training set to ordinal values
features_train[categorical_features] = label_encoder.transform(features_train[categorical_features])

# Transform the categorical features in the validation set to ordinal values
features_valid[categorical_features] = label_encoder.transform(features_valid[categorical_features])

# Transform the categorical features in the test set to ordinal values
features_test[categorical_features] = label_encoder.transform(features_test[categorical_features])

# Print the size of each dataset after label encoding
print('Training Feature Set Size (Label Encoded):', features_train.shape)
print()
print('Validation Feature Set Size (Label Encoded):', features_valid.shape)
print()
print('Test Feature Set Size (Label Encoded):', features_test.shape)

Training Feature Set Size (Label Encoded): (147340, 10)

Validation Feature Set Size (Label Encoded): (49113, 10)

Test Feature Set Size (Label Encoded): (49114, 10)
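The effect of `unknown_value=-1` can be seen on a toy example. The sketch below assumes made-up `train` and `valid` frames; the encoder settings are the same as in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative frames: 'smart' is not present in the training data.
train = pd.DataFrame({"Brand": ["audi", "jeep", "skoda"]})
valid = pd.DataFrame({"Brand": ["jeep", "smart"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)  # learns audi -> 0, jeep -> 1, skoda -> 2

# Known categories keep their learned code; unseen ones become -1.
encoded = encoder.transform(valid)
print(encoded.ravel().tolist())  # [1.0, -1.0]
```

Without `handle_unknown="use_encoded_value"`, transforming a category that was not seen during fitting raises an error by default.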


With the data successfully preprocessed, we can now proceed to use it for training various machine learning models.

## Model Training

For model training, our focus will be on evaluating Root Mean Squared Error (RMSE), training time, and prediction time across various machine learning models, both with and without hyperparameter tuning. To streamline this repetitive process, I have developed a function designed to expedite and simplify the evaluation.

The function below assesses a machine learning model by calculating key performance metrics. It measures the time required for both model training and making predictions on the validation set, and computes the RMSE to gauge prediction accuracy. By printing and returning these metrics, the function aids in analyzing the efficiency and effectiveness of the model.

In [13]:
def model_eval(model, features_train=features_train, features_valid=features_valid, target_train=target_train, target_valid=target_valid):
    # Record the start time for model training
    training_start_time = time.time()
    # Train the model on the training data
    model.fit(features_train, target_train)
    # Calculate the time taken to train the model
    training_time = time.time() - training_start_time

    # Record the start time for making predictions
    prediction_start_time = time.time()
    # Generate predictions on the validation set
    predictions = model.predict(features_valid)
    # Calculate the time taken to make predictions
    prediction_time = time.time() - prediction_start_time

    # Compute the total time taken for training and prediction
    total_time = training_time + prediction_time

    # Calculate the Root Mean Squared Error (RMSE) of the predictions
    rmse = np.sqrt(mean_squared_error(target_valid, predictions))

    # Print out the RMSE, training time, prediction time, and total time
    print("RMSE:", rmse)
    print("Training Time:", training_time)
    print("Prediction Time:", prediction_time)
    print("Total Time:", total_time)

    # Return RMSE, training time, prediction time, and total time
    return rmse, training_time, prediction_time, total_time
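As a reminder of what the RMSE value means, the sketch below computes it by hand with NumPy on three made-up prices. The arrays are hypothetical; the result should match `np.sqrt(mean_squared_error(actual, predicted))` as used in `model_eval`.

```python
import numpy as np

# Made-up prices in euros (illustrative only).
actual = np.array([1500.0, 3600.0, 9800.0])
predicted = np.array([1700.0, 3300.0, 9200.0])

# RMSE: square the errors, average them, then take the square root.
errors = predicted - actual            # [200, -300, -600]
rmse = np.sqrt(np.mean(errors ** 2))
print(round(rmse, 2))
```

Because the errors are squared before averaging, the single 600-euro miss dominates the result. RMSE is expressed in the same unit as the target, so here it reads as euros.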

### Linear Regression

We will start with a Linear Regression model. Plain linear regression has no hyperparameters to tune, so it serves as a fixed benchmark and sanity check for the other methods. For example, if gradient boosting performs worse than linear regression, something is probably wrong with how the gradient boosting model was implemented or configured.

In [14]:
lr_model = LinearRegression()

lr_rmse, lr_training_time, lr_prediction_time, lr_total_time = model_eval(lr_model)

RMSE: 3374.7387690525256
Training Time: 0.027150869369506836
Prediction Time: 0.0031061172485351562
Total Time: 0.030256986618041992


### Random Forest

In [15]:
rf_model = RandomForestRegressor(random_state=12345)

rf_rmse, rf_training_time, rf_prediction_time, rf_total_time = model_eval(rf_model)

RMSE: 1676.1741708358006
Training Time: 25.355262994766235
Prediction Time: 1.154905080795288
Total Time: 26.510168075561523


### Random Forest (w/ Hyperparameter Tuning)

In [16]:
tuned_rf_model = RandomForestRegressor(
    random_state=12345,
    n_estimators=100,            
    max_depth=20,                
    min_samples_split=5, 
    min_samples_leaf=4,          
)

tuned_rf_rmse, tuned_rf_training_time, tuned_rf_prediction_time, tuned_rf_total_time = model_eval(tuned_rf_model)

RMSE: 1674.8222844314166
Training Time: 20.060376167297363
Prediction Time: 0.6625418663024902
Total Time: 20.722918033599854


### LightGBM

In [17]:
lgbm_model = LGBMRegressor(random_state=12345, verbose=-1)

lgbm_rmse, lgbm_training_time, lgbm_prediction_time, lgbm_total_time = model_eval(lgbm_model)

RMSE: 1751.5396013597624
Training Time: 0.28572726249694824
Prediction Time: 0.03445887565612793
Total Time: 0.32018613815307617


### LightGBM (w/ Hyperparameter Tuning)

In [18]:
tuned_lgbm_model = LGBMRegressor(
    random_state=12345,
    n_estimators=150,           
    max_depth=15,                
    learning_rate=0.1,           
    num_leaves=31,               
    min_child_samples=20,        
    subsample=0.8,               # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    verbose=-1        
)

tuned_lgbm_rmse, tuned_lgbm_training_time, tuned_lgbm_prediction_time, tuned_lgbm_total_time = model_eval(tuned_lgbm_model)

RMSE: 1717.2707056682807
Training Time: 0.33321380615234375
Prediction Time: 0.04467415809631348
Total Time: 0.3778879642486572


### CatBoost

In [19]:
cb_model = CatBoostRegressor(random_state=12345, verbose=False)

cb_rmse, cb_training_time, cb_prediction_time, cb_total_time = model_eval(cb_model)

RMSE: 1667.2774539344996
Training Time: 3.5843520164489746
Prediction Time: 0.008485794067382812
Total Time: 3.5928378105163574


### CatBoost (w/ Hyperparameter Tuning)

In [20]:
tuned_cb_model = CatBoostRegressor(
    random_state=12345,
    depth=13,
    learning_rate=0.1,
    n_estimators=175,
    l2_leaf_reg=4,
    subsample=0.8,
    colsample_bylevel=0.8,
    verbose=0 
)

tuned_cb_rmse, tuned_cb_training_time, tuned_cb_prediction_time, tuned_cb_total_time = model_eval(tuned_cb_model)

RMSE: 1660.827068507683
Training Time: 4.325237989425659
Prediction Time: 0.004136085510253906
Total Time: 4.329374074935913


### XGBoost

Reminder: For XGBoost, we will use our one-hot encoded features to train the model.

In [21]:
xgb_model = XGBRegressor(random_state=12345)

xgb_rmse, xgb_training_time, xgb_prediction_time, xgb_total_time = model_eval(xgb_model, 
                                                                              features_train_ohe, 
                                                                              features_valid_ohe, 
                                                                              target_train, target_valid)

RMSE: 1704.8133441061075
Training Time: 13.957070112228394
Prediction Time: 0.03068399429321289
Total Time: 13.987754106521606


### XGBoost (w/ Hyperparameter Tuning)

In [22]:
tuned_xgb_model = XGBRegressor(
    random_state=12345,
    n_estimators=100,
    learning_rate=0.1,
    max_depth=15,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
)

tuned_xgb_rmse, tuned_xgb_training_time, tuned_xgb_prediction_time, tuned_xgb_total_time = model_eval(tuned_xgb_model, 
                                                                                                      features_train_ohe, 
                                                                                                      features_valid_ohe, 
                                                                                                      target_train, target_valid)

RMSE: 1597.500036904102
Training Time: 32.80623984336853
Prediction Time: 0.05894017219543457
Total Time: 32.865180015563965


## Model Analysis

Having trained and tested all our models using the validation set, we will now analyze their performance to identify which model will be most beneficial for Rusty Bargain. To facilitate comparison and draw conclusions, we will compile their statistics into a single data frame. This will make it easier to evaluate and compare the models' effectiveness.

In [23]:
model_analysis = pd.DataFrame({'Model': ['Linear Regression', 
                                         'Random Forest', 
                                         'Random Forest (Tuned)',
                                         'LightGBM', 
                                         'LightGBM (Tuned)',
                                         'CatBoost',
                                         'CatBoost (Tuned)',
                                         'XGBoost',
                                         'XGBoost (Tuned)'],

                                'RMSE': [lr_rmse, 
                                         rf_rmse, 
                                         tuned_rf_rmse,
                                         lgbm_rmse, 
                                         tuned_lgbm_rmse,
                                         cb_rmse, 
                                         tuned_cb_rmse,
                                         xgb_rmse,
                                         tuned_xgb_rmse],
                                
                                'Training Time': [lr_training_time, 
                                                  rf_training_time,
                                                  tuned_rf_training_time, 
                                                  lgbm_training_time,
                                                  tuned_lgbm_training_time, 
                                                  cb_training_time, 
                                                  tuned_cb_training_time,
                                                  xgb_training_time,
                                                  tuned_xgb_training_time],
                                
                                'Prediction Time': [lr_prediction_time, 
                                                    rf_prediction_time, 
                                                    tuned_rf_prediction_time,
                                                    lgbm_prediction_time, 
                                                    tuned_lgbm_prediction_time,
                                                    cb_prediction_time, 
                                                    tuned_cb_prediction_time,
                                                    xgb_prediction_time,
                                                    tuned_xgb_prediction_time],
                                
                                'Total Time': [lr_total_time, 
                                               rf_total_time, 
                                               tuned_rf_total_time,
                                               lgbm_total_time,
                                               tuned_lgbm_total_time, 
                                               cb_total_time, 
                                               tuned_cb_total_time,
                                               xgb_total_time,
                                               tuned_xgb_total_time]})

display(model_analysis)

| | Model | RMSE | Training Time | Prediction Time | Total Time |
|---|---|---|---|---|---|
| 0 | Linear Regression | 3374.738769 | 0.027151 | 0.003106 | 0.030257 |
| 1 | Random Forest | 1676.174171 | 25.355263 | 1.154905 | 26.510168 |
| 2 | Random Forest (Tuned) | 1674.822284 | 20.060376 | 0.662542 | 20.722918 |
| 3 | LightGBM | 1751.539601 | 0.285727 | 0.034459 | 0.320186 |
| 4 | LightGBM (Tuned) | 1717.270706 | 0.333214 | 0.044674 | 0.377888 |
| 5 | CatBoost | 1667.277454 | 3.584352 | 0.008486 | 3.592838 |
| 6 | CatBoost (Tuned) | 1660.827069 | 4.325238 | 0.004136 | 4.329374 |
| 7 | XGBoost | 1704.813344 | 13.95707 | 0.030684 | 13.987754 |
| 8 | XGBoost (Tuned) | 1597.500037 | 32.80624 | 0.05894 | 32.86518 |


The table summarizes RMSE, training time, prediction time, and total time (all times in seconds) for each model. Linear Regression has by far the highest RMSE (3374.74) but is the fastest in both training and prediction. Random Forest reaches an RMSE of about 1675 with or without tuning, but takes 20 to 25 seconds to train and has the slowest predictions of all models. LightGBM trains in about a third of a second, with an RMSE of 1751.54 untuned and 1717.27 tuned, making it a strong contender on speed. CatBoost (Tuned) reaches an RMSE of 1660.83 with about 4 seconds of training and the fastest prediction among the tree-based models. XGBoost (Tuned) achieves the lowest RMSE overall (1597.50) but also has the highest total time, at about 33 seconds.
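Sorting the results makes the ranking explicit. The sketch below re-enters the validation figures from the table above (rounded) into a hypothetical `results` frame and adds an illustrative column showing how far each RMSE is from the best one. In the notebook itself the same steps could presumably be applied to `model_analysis`.

```python
import pandas as pd

# Validation results copied from the table above (rounded).
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "Random Forest (Tuned)",
              "LightGBM", "LightGBM (Tuned)", "CatBoost", "CatBoost (Tuned)",
              "XGBoost", "XGBoost (Tuned)"],
    "RMSE": [3374.74, 1676.17, 1674.82, 1751.54, 1717.27,
             1667.28, 1660.83, 1704.81, 1597.50],
    "Training Time": [0.027, 25.355, 20.060, 0.286, 0.333,
                      3.584, 4.325, 13.957, 32.806],
})

# Rank by RMSE (lower is better) and express each RMSE relative to the best.
ranked = results.sort_values("RMSE").reset_index(drop=True)
ranked["RMSE vs best (%)"] = (ranked["RMSE"] / ranked["RMSE"].iloc[0] - 1) * 100
print(ranked.round(2))
```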

Based on this analysis, LightGBM (Tuned), CatBoost (Tuned), and XGBoost (Tuned) appear to be the top contenders for Rusty Bargain. To determine the best model, we will evaluate their performance on the test set.

In [24]:
tuned_lgbm_final_prediction_start_time = time.time()
tuned_lgbm_final_predictions = tuned_lgbm_model.predict(features_test)
tuned_lgbm_final_prediction_time = time.time() - tuned_lgbm_final_prediction_start_time

tuned_lgbm_final_rmse = np.sqrt(mean_squared_error(target_test, tuned_lgbm_final_predictions))

print("LightGBM Final RMSE:", tuned_lgbm_final_rmse)
print("LightGBM Final Training Time:", tuned_lgbm_training_time)
print("LightGBM Final Prediction Time:", tuned_lgbm_final_prediction_time)
print("LightGBM Final Total Time:", tuned_lgbm_training_time + tuned_lgbm_final_prediction_time)

LightGBM Final RMSE: 1712.0534912835524
LightGBM Final Training Time: 0.33321380615234375
LightGBM Final Prediction Time: 0.04717421531677246
LightGBM Final Total Time: 0.3803880214691162


In [25]:
tuned_cb_final_prediction_start_time = time.time()
tuned_cb_final_predictions = tuned_cb_model.predict(features_test)
tuned_cb_final_prediction_time = time.time() - tuned_cb_final_prediction_start_time

tuned_cb_final_rmse = np.sqrt(mean_squared_error(target_test, tuned_cb_final_predictions))

print("CatBoost Final RMSE:", tuned_cb_final_rmse)
print("CatBoost Final Training Time:", tuned_cb_training_time)
print("CatBoost Final Prediction Time:", tuned_cb_final_prediction_time)
print("CatBoost Final Total Time:", tuned_cb_training_time + tuned_cb_final_prediction_time)

CatBoost Final RMSE: 1659.6126984532825
CatBoost Final Training Time: 4.325237989425659
CatBoost Final Prediction Time: 0.004703044891357422
CatBoost Final Total Time: 4.329941034317017


In [26]:
tuned_xgb_final_prediction_start_time = time.time()
tuned_xgb_final_predictions = tuned_xgb_model.predict(features_test_ohe)
tuned_xgb_final_prediction_time = time.time() - tuned_xgb_final_prediction_start_time

tuned_xgb_final_rmse = np.sqrt(mean_squared_error(target_test, tuned_xgb_final_predictions))

print("XGBoost Final RMSE:", tuned_xgb_final_rmse)
print("XGBoost Final Training Time:", tuned_xgb_training_time)
print("XGBoost Final Prediction Time:", tuned_xgb_final_prediction_time)
print("XGBoost Final Total Time:", tuned_xgb_training_time + tuned_xgb_final_prediction_time)

XGBoost Final RMSE: 1596.005244658428
XGBoost Final Training Time: 32.80623984336853
XGBoost Final Prediction Time: 0.06579709053039551
XGBoost Final Total Time: 32.872036933898926


The final evaluation on the test set gives the following results; the training times are those recorded when each model was fitted earlier. LightGBM achieves an RMSE of 1712.05 with a training time of 0.33 seconds and a prediction time of 0.047 seconds, for a total of 0.38 seconds. CatBoost delivers a lower RMSE of 1659.61 but takes longer to train, at 4.33 seconds, with a prediction time of 0.005 seconds and a total of 4.33 seconds. XGBoost has the lowest RMSE, 1596.01, but the longest training time, 32.81 seconds, and a prediction time of 0.066 seconds, for a total of 32.87 seconds.

Given these results, XGBoost provides the lowest RMSE, but its training time of roughly 33 seconds makes it the least practical of the three. CatBoost offers a competitive RMSE with a reasonable total time, but still takes more than ten times longer to train than LightGBM. LightGBM has the highest RMSE of the three finalists, about 3% above CatBoost and 7% above XGBoost, yet it trains in well under a second. Therefore, Rusty Bargain should consider using LightGBM, as it provides a good compromise between accuracy and computational efficiency.
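The trade-off can be put into numbers using the test-set figures reported above. This is a rough sketch: the `finalists` dictionary and the choice of LightGBM as the reference point are illustrative assumptions, and timings will differ from machine to machine.

```python
# Test-set results reported above (RMSE in euros, training time in seconds).
finalists = {
    "LightGBM (Tuned)": {"rmse": 1712.05, "train_s": 0.333},
    "CatBoost (Tuned)": {"rmse": 1659.61, "train_s": 4.325},
    "XGBoost (Tuned)": {"rmse": 1596.01, "train_s": 32.806},
}

# Compare every finalist against LightGBM: how much RMSE is gained,
# and how many times longer training takes.
reference = finalists["LightGBM (Tuned)"]
tradeoff = {}
for name, stats in finalists.items():
    rmse_gain_pct = (reference["rmse"] - stats["rmse"]) / reference["rmse"] * 100
    slowdown = stats["train_s"] / reference["train_s"]
    tradeoff[name] = (rmse_gain_pct, slowdown)
    print(f"{name}: RMSE {rmse_gain_pct:.1f}% lower, training {slowdown:.1f}x longer")
```

Whether a few percent of RMSE is worth a training time that is one or two orders of magnitude longer depends on how often Rusty Bargain needs to retrain, which the notebook does not specify.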

# Checklist

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed