# Rusty Bargain's App to Attract New Customers

## Introduction


Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data Preprocessing

In [1]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import re
from matplotlib import pyplot as plt
from scipy import stats as st

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

## Data preparation

In [2]:
# Load the datasets
car_data = pd.read_csv('datasets/car_data.csv')

# Display the first few rows of the dataset
car_data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
# Convert camelCase column names to snake_case
car_data.columns = [re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower() for col in car_data.columns]
print(car_data.columns)

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


In [4]:
# Display the number of rows and columns in the DataFrame
n_rows, n_cols = car_data.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns")

The DataFrame has 354369 rows and 16 columns


In [5]:
# Display informative summary of the datasets
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes:

In [6]:
# Display the descriptive statistics of the DataFrame
car_data.describe()

Unnamed: 0,price,registration_year,power,mileage,registration_month,number_of_pictures,postal_code
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [7]:
# Display the number of duplicate rows in the DataFrame
display(car_data.duplicated().sum())

# Remove duplicate rows from the DataFrame
car_data.drop_duplicates(inplace=True)

# Checking the number of duplicate rows in the DataFrame after removing duplicate rows
car_data.duplicated().sum()

262

0

In [8]:
# Display the number of missing values in the DataFrame
car_data.isna().sum() 

date_crawled              0
price                     0
vehicle_type          37484
registration_year         0
gearbox               19830
power                     0
model                 19701
mileage                   0
registration_month        0
fuel_type             32889
brand                     0
not_repaired          71145
date_created              0
number_of_pictures        0
postal_code               0
last_seen                 0
dtype: int64

In [9]:
# Fill missing values in the 'vehicle_type', 'gearbox', 'model', 'fuel_type' columns with 'unknown'
car_data[['vehicle_type', 'gearbox', 'model', 'fuel_type']] = car_data[['vehicle_type', 'gearbox', 'model', 'fuel_type']].fillna('unknown')

# Fill missing values in column 'not_repaired' with 'no info'
car_data['not_repaired'] = car_data['not_repaired'].fillna('no info')

In [10]:
# Check for missing values after filling missing values in all columns in the DataFrame.
car_data.isna().sum() 

date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
number_of_pictures    0
postal_code           0
last_seen             0
dtype: int64

The dataset contained 262 duplicate rows and five columns with missing values. The 262 duplicates were dropped. Missing values in the 'vehicle_type', 'gearbox', 'model', and 'fuel_type' columns were filled with 'unknown' using fillna, and missing values in the 'not_repaired' column were filled with 'no info'.
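
The cleaning steps described here can be sketched on a small made-up frame. The toy values below are assumptions for illustration only, not rows from the actual dataset:

```python
import pandas as pd

# Toy frame (hypothetical values) with one duplicate row and missing categories
toy = pd.DataFrame({
    "vehicle_type": ["suv", None, "suv", "small"],
    "not_repaired": ["no", None, "no", None],
    "price": [9800, 480, 9800, 1500],
})

# Drop exact duplicate rows, keeping the first occurrence
toy = toy.drop_duplicates().reset_index(drop=True)

# Use explicit placeholder categories instead of dropping incomplete rows
toy["vehicle_type"] = toy["vehicle_type"].fillna("unknown")
toy["not_repaired"] = toy["not_repaired"].fillna("no info")

print(toy)
```

Filling with a placeholder keeps every listing in the data and lets the model treat "missing" as its own category, at the cost of one extra level per column.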

## Model training

In [11]:
# Drop irrelevant columns to clean the dataset
car_data = car_data.drop(
    [
        'date_crawled',
        'date_created',
        'last_seen',
        'number_of_pictures',
        'postal_code'
        ],
        axis=1)

In [12]:
# Use one-hot encoder to convert categorical columns to numerical columns
car_data = pd.get_dummies(car_data, columns=['vehicle_type', 
                                             'gearbox', 
                                             'model', 
                                             'fuel_type', 
                                             'brand', 
                                             'not_repaired'], drop_first=True)
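
For reference, here is a small sketch of what one-hot encoding with `drop_first=True` does to a single categorical column. The toy frame is hypothetical and only mirrors the column naming used in this notebook:

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column
toy = pd.DataFrame({
    "price": [480, 9800, 1500],
    "gearbox": ["manual", "auto", "unknown"],
})

# Each category becomes its own indicator column; drop_first=True removes the
# first category in sorted order ('auto'), which is then implied when all others are 0
encoded = pd.get_dummies(toy, columns=["gearbox"], drop_first=True)

print(list(encoded.columns))
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters mostly for the linear model.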

### Splitting the dataset

In [13]:
# Split data into training set (60%) and temporary set (40%)
car_data_train, car_data_temp = train_test_split(car_data, 
                                                 test_size=0.40, 
                                                 random_state=7)

In [14]:
# Remove the target column 'price' to get training features
features_train = car_data_train.drop(['price'], axis=1)

# Print the shape of training features
features_train.shape

(212464, 312)

In [15]:
# Extract the target variable 'price' for training
target_train = car_data_train['price']

# Print the shape of training target
target_train.shape

(212464,)

In [16]:
# Split the remaining data into validation and test sets (each 20% of original data)
car_data_valid, car_data_test = train_test_split(car_data_temp, 
                                                 test_size=0.50, 
                                                 random_state=7)

In [17]:
# Remove the target column 'price' to get validation features
features_valid = car_data_valid.drop(['price'], axis=1)

# Print the shape of validation features
features_valid.shape

(70821, 312)

In [18]:
# Prepare validation target by selecting the 'price' column
target_valid = car_data_valid['price']

# Print the shape of validation target
print(target_valid.shape)

(70821,)


In [19]:
# Remove the target column 'price' to get test features
features_test = car_data_test.drop(['price'], axis=1)

# Print the shape of test features
print(features_test.shape)

(70822, 312)


In [20]:
# Prepare test target by selecting the 'price' column
target_test = car_data_test['price']

# Print the shape of test target
print(target_test.shape)

(70822,)


In [21]:
# Scale numerical feature columns using StandardScaler
num_cols = ['power', 'mileage', 'registration_month']
scaler = StandardScaler()

# Fit the scaler on the training features only, then apply it to all three feature sets.
# The feature frames are scaled directly because these are what the models receive.
features_train[num_cols] = scaler.fit_transform(features_train[num_cols])
features_valid[num_cols] = scaler.transform(features_valid[num_cols])
features_test[num_cols] = scaler.transform(features_test[num_cols])

### Linear Regression

In [22]:
%%time
# Linear Regression model for Sanity Check
# Initialize a Linear Regression model
model = LinearRegression() 

# Fit the model on the training dataset
model.fit(features_train, target_train)

CPU times: total: 26.6 s
Wall time: 4.51 s


In [23]:
%%time
# Generate predictions on the validation dataset
predictions_valid = model.predict(features_valid)

CPU times: total: 344 ms
Wall time: 246 ms


In [24]:
# Calculate Root Mean Squared Error (RMSE) to evaluate prediction accuracy
rmse = mean_squared_error(target_valid, predictions_valid) ** 0.5 
print("RMSE of the linear regression model on the validation set:", rmse)

RMSE of the linear regression model on the validation set: 3169.213327435372


Linear Regression serves as a sanity check for this project. Its RMSE on the validation set is 3169.21. RMSE measures how far predictions fall from the actual values, so a lower RMSE means better model performance. The mean of the target column 'price' is about 4417, so an RMSE of 3169 is roughly 72% of the mean price: the predictions deviate substantially and are not very reliable. Fitting the model takes 4.51 s of wall time and predicting on the validation set takes 246 ms. Linear Regression is fast compared with more complex models such as Random Forest, and training being slower than prediction is normal.
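
As a worked illustration of the metric, the sketch below computes RMSE by hand on three made-up prices (the `actual` and `predicted` lists are hypothetical) and then relates the reported validation RMSE to the mean price from the descriptive statistics:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices and predictions, for illustration only
actual = [1000, 3000, 5000]
predicted = [1500, 2500, 6000]
print(rmse(actual, predicted))

# Ratio of the validation RMSE to the mean price, using the figures reported in this notebook
relative_error = 3169.21 / 4416.66
print(round(relative_error, 2))
```

Because the errors are squared before averaging, the single 1000 miss pulls the RMSE (about 707) above the mean absolute error of the same predictions (about 667).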

### Random Forest Regression

In [25]:
# Define the hyperparameter grid for GridSearchCV to tune the Random Forest Regressor. 
param_grid_rf = {
    "n_estimators": [10, 20, 50], 
    "max_depth": [5, 10],  
    "min_samples_split": [2, 5],  
    "min_samples_leaf": [1, 2] 
}

# Initialize a Random Forest Regressor with a fixed random seed for reproducibility
rf = RandomForestRegressor(random_state=7)  

# Perform hyperparameter tuning using GridSearchCV
grid_search_rf = GridSearchCV(
    rf, 
    param_grid_rf, 
    cv=4, 
    scoring='neg_root_mean_squared_error', 
    verbose=1
    )

In [26]:
%%time
# Fit the GridSearchCV to the training data
grid_search_rf.fit(features_train, target_train)

Fitting 4 folds for each of 24 candidates, totalling 96 fits
CPU times: total: 1h 42min 40s
Wall time: 1h 46min 14s


In [27]:
# Extract the best hyperparameters found during grid search
best_params_rf = grid_search_rf.best_params_

# Retrieve the best estimator (model) from the grid search
best_model_rf = grid_search_rf.best_estimator_

# Print the best hyperparameters
print(f"Best Hyperparameters: {best_params_rf}")

Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}


In [28]:
%%time

# Train a Random Forest Regressor with the best hyperparameters found from tuning
rf = RandomForestRegressor(random_state=7, 
                           max_depth=best_params_rf['max_depth'],
                           min_samples_leaf=best_params_rf['min_samples_leaf'],
                           min_samples_split=best_params_rf['min_samples_split'],
                           n_estimators=best_params_rf['n_estimators']
                           ) 
# Fit the Random Forest model on the training dataset              
rf.fit(features_train, target_train)

CPU times: total: 3min 38s
Wall time: 3min 45s


In [29]:
%%time
# Predict target values for the validation set using the trained Random Forest model
predictions_valid_rf = rf.predict(features_valid)

CPU times: total: 1.2 s
Wall time: 1.22 s


In [30]:
# Calculate RMSE for the Random Forest predictions on the validation set,
mse_rf = mean_squared_error(target_valid, predictions_valid_rf)
rmse_rf = mse_rf ** 0.5

# Print calculated RMSE along with best hyperparameters used for the model
print("Best Parameters:", best_params_rf)
print("Root Mean Squared Error:", rmse_rf)

Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Root Mean Squared Error: 2018.2058209696345


With the best parameters found by the grid search (max_depth: 10, min_samples_leaf: 2, min_samples_split: 2, n_estimators: 50), the Random Forest Regressor reaches an RMSE of 2018.21 on the validation set, meaning its predictions typically deviate from the actual values by about 2018. Fitting the model takes 3 min 45 s of wall time and predicting takes 1.22 s. Random Forest trains much more slowly because it grows many independent decision trees, but it is still the best model based on RMSE. Its prediction time is also the slowest, because it has to aggregate the results from many trees.
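
To make the aggregation step concrete, the following sketch fits a very small forest on synthetic data (the data and the tiny hyperparameters are assumptions chosen for speed, not the notebook's settings) and checks that the forest's prediction equals the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=7)
forest.fit(X, y)

# Predict with every fitted tree separately, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction is this same average
print(np.allclose(manual_average, forest.predict(X)))
```

Every prediction therefore costs one pass through each tree, which is why prediction time grows with `n_estimators`.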

### CatBoost Regression

In [31]:
# Define the hyperparameter grid for tuning the CatBoost Regressor with GridSearchCV
param_grid_cbr = {
    'learning_rate': [0.01, 0.05, 0.1],  
    'iterations': [10, 20, 30],         
    'depth': [1, 2, 5, 10]              
}

# Initialize a CatBoost Regressor with a fixed random seed for reproducibility
cbr = CatBoostRegressor(random_seed=7, 
                        loss_function="RMSE")  

# Perform hyperparameter tuning using Grid search with 4-fold cross-validation
grid_search_cbr = GridSearchCV(cbr, 
                               param_grid_cbr, 
                               cv=4, 
                               verbose=1)

In [32]:
%%time
# Fit the GridSearchCV to the training data
grid_search_cbr.fit(features_train, target_train)

Fitting 4 folds for each of 36 candidates, totalling 144 fits
0:	learn: 4496.7995412	total: 193ms	remaining: 1.74s
1:	learn: 4483.8656752	total: 212ms	remaining: 848ms
2:	learn: 4471.1528770	total: 228ms	remaining: 532ms
3:	learn: 4458.6578797	total: 245ms	remaining: 368ms
4:	learn: 4446.4228049	total: 263ms	remaining: 263ms
5:	learn: 4434.3233669	total: 280ms	remaining: 187ms
6:	learn: 4422.4325757	total: 298ms	remaining: 128ms
7:	learn: 4410.7472953	total: 317ms	remaining: 79.1ms
8:	learn: 4399.2644239	total: 334ms	remaining: 37.1ms
9:	learn: 4387.9396261	total: 350ms	remaining: 0us
0:	learn: 4502.4907229	total: 12.4ms	remaining: 111ms
1:	learn: 4489.6618206	total: 23.5ms	remaining: 94ms
2:	learn: 4476.8750349	total: 33.9ms	remaining: 79.1ms
3:	learn: 4464.3071569	total: 44.5ms	remaining: 66.7ms
4:	learn: 4452.0685296	total: 56ms	remaining: 56ms
5:	learn: 4439.8978986	total: 66.3ms	remaining: 44.2ms
6:	learn: 4428.0491866	total: 76.3ms	remaining: 32.7ms
7:	learn: 4416.2649495	total: 

In [33]:
# Extract the best hyperparameters found during grid search
best_params_cbr = grid_search_cbr.best_params_

# Retrieve the best estimator (model) from the grid search
best_model_cbr = grid_search_cbr.best_estimator_

# Print the best hyperparameters
print(f"Best Hyperparameters: {best_params_cbr}")

Best Hyperparameters: {'depth': 10, 'iterations': 30, 'learning_rate': 0.1}


In [34]:
%%time
# Train a CatBoost Regressor with the best hyperparameters found from tuning       
cbr = CatBoostRegressor(random_state=7, 
                           depth=best_params_cbr['depth'],
                           iterations=best_params_cbr['iterations'],
                           learning_rate=best_params_cbr['learning_rate'],
                           ) 

CPU times: total: 0 ns
Wall time: 0 ns


In [35]:
# Train a CatBoost Regressor using the best hyperparameters from grid search
cbr.fit(features_train, target_train)

0:	learn: 4211.7152438	total: 48.1ms	remaining: 1.4s
1:	learn: 3941.7759894	total: 94.3ms	remaining: 1.32s
2:	learn: 3707.2109597	total: 136ms	remaining: 1.22s
3:	learn: 3497.5854695	total: 178ms	remaining: 1.16s
4:	learn: 3314.0626982	total: 218ms	remaining: 1.09s
5:	learn: 3154.5394502	total: 264ms	remaining: 1.05s
6:	learn: 3015.7884832	total: 306ms	remaining: 1.01s
7:	learn: 2892.9457524	total: 356ms	remaining: 980ms
8:	learn: 2784.2964570	total: 403ms	remaining: 940ms
9:	learn: 2693.6994791	total: 449ms	remaining: 897ms
10:	learn: 2615.7797763	total: 498ms	remaining: 860ms
11:	learn: 2545.8229780	total: 543ms	remaining: 814ms
12:	learn: 2486.5895514	total: 585ms	remaining: 764ms
13:	learn: 2434.5090788	total: 623ms	remaining: 713ms
14:	learn: 2380.5856573	total: 665ms	remaining: 665ms
15:	learn: 2338.6535522	total: 709ms	remaining: 620ms
16:	learn: 2300.7303525	total: 754ms	remaining: 577ms
17:	learn: 2268.8852334	total: 800ms	remaining: 534ms
18:	learn: 2237.8347192	total: 845ms	

<catboost.core.CatBoostRegressor at 0x201343cdc70>

In [36]:
%%time
# Predict target values for the validation set using the trained CatBoost model    
predictions_valid_cbr = cbr.predict(features_valid)

CPU times: total: 78.1 ms
Wall time: 35 ms


In [37]:
# Calculate RMSE for the CatBoost predictions on the validation set
mse_cbr = mean_squared_error(target_valid, predictions_valid_cbr)
rmse_cbr = mse_cbr ** 0.5

# Print calculated RMSE along with best hyperparameters used for the model
print("Best Parameters:", best_params_cbr)
print("Root Mean Squared Error:", rmse_cbr)

Best Parameters: {'depth': 10, 'iterations': 30, 'learning_rate': 0.1}
Root Mean Squared Error: 2034.4336189856476


With the best parameters found by the grid search (depth: 10, iterations: 30, learning_rate: 0.1), the CatBoost Regressor reaches an RMSE of 2034.43 on the validation set, meaning its predictions typically deviate from the actual values by about 2034. Model fitting takes about 2.33 s and predicting takes 35 ms of wall time. Its prediction time is the shortest of the models tested, and it trains far faster than Random Forest.
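
Outside a notebook, where the `%%time` magic is not available, fitting and prediction can be timed separately with a small helper. This is a minimal sketch: `timed` is a hypothetical helper name, and the `sum` call is only a stand-in workload for calls such as `cbr.fit(...)` or `cbr.predict(...)`:

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload (hypothetical); replace with a model's fit or predict call
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, wall time={seconds:.4f} s")
```

Wrapping the call that does the work, rather than the constructor, is what makes the measurement meaningful: creating an estimator object is nearly instantaneous.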

### LightGBM Regression

In [38]:
# Define the hyperparameter grid for tuning the LGBMRegressor with GridSearchCV
param_grid_lgbm = {
    'learning_rate': [0.01, 0.05, 0.1],  
    'n_estimators': [10, 20, 30],         
    'max_depth': [1, 2, 5, 10]              
}

In [39]:
# Initialize a LGBMRegressor with a fixed random seed for reproducibility
lgbm = LGBMRegressor(random_state=7)  

# Perform hyperparameter tuning using Grid search with 4-fold cross-validation
grid_search_lgbm = GridSearchCV(estimator=lgbm, param_grid=param_grid_lgbm, cv=4, scoring='neg_root_mean_squared_error', verbose=1)

In [40]:
%%time
# Fit the GridSearchCV to the training data
grid_search_lgbm.fit(features_train, target_train)

# Extract the best hyperparameters found during grid search
best_params_lgbm = grid_search_lgbm.best_params_

# Retrieve the best estimator (model) from the grid search
best_model_lgbm = grid_search_lgbm.best_estimator_

# Print the best hyperparameters
print(f"Best Hyperparameters: {best_params_lgbm}")

Fitting 4 folds for each of 36 candidates, totalling 144 fits


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007416 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 951
[LightGBM] [Info] Number of data points in the train set: 159348, number of used features: 290
[LightGBM] [Info] Start training from score 4413.717825
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005553 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 946
[LightGBM] [Info] Number of data points in the train set: 159348, number of used features: 288
[LightGBM] [Info] Start training from score 4418.146842
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005549 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is n

In [41]:
%%time
# Train a LGBMRegressor with the best hyperparameters found from tuning  
lgbm = LGBMRegressor(random_state=7, 
                    learning_rate = best_params_lgbm['learning_rate'],
                    n_estimators = best_params_lgbm['n_estimators'],
                    max_depth = best_params_lgbm['max_depth']
                    )

CPU times: total: 0 ns
Wall time: 0 ns


In [42]:
# Train a LGBM Regressor using the best hyperparameters from grid search
lgbm.fit(features_train,target_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011420 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 962
[LightGBM] [Info] Number of data points in the train set: 212464, number of used features: 294
[LightGBM] [Info] Start training from score 4418.712177


In [43]:
%%time
# Predict target values for the validation set using the trained LGBM model   
predictions_valid_lgbm = lgbm.predict(features_valid)

CPU times: total: 891 ms
Wall time: 167 ms


In [44]:
# Calculate RMSE for the LGBM predictions on the validation set
mse_lgbm = mean_squared_error(target_valid, predictions_valid_lgbm)
rmse_lgbm = mse_lgbm ** 0.5

# Print calculated RMSE along with best hyperparameters used for the model
print("Best Parameters:", best_params_lgbm)
print("Root Mean Squared Error:", rmse_lgbm)

Best Parameters: {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 30}
Root Mean Squared Error: 2028.2235056876787


With the best parameters found by the grid search (learning_rate: 0.1, max_depth: 10, n_estimators: 30), the LGBMRegressor reaches an RMSE of 2028.22 on the validation set, meaning its predictions typically deviate from the actual values by about 2028. Model fitting takes about 2.19 s and predicting takes 167 ms of wall time. This model trains in a time comparable to CatBoost but predicts more slowly.

### Test Set

In [45]:
# Make predictions on the test set
predictions_test = rf.predict(features_test)

# Evaluate model performance using RMSE
mse_test = mean_squared_error(target_test, predictions_test)
rmse_test = mse_test ** 0.5

# Print calculated RMSE value
print("Root Mean Squared Error on Test Set:", rmse_test)

Root Mean Squared Error on Test Set: 2039.588475157083


Based on validation RMSE, RandomForestRegressor is the best model, so I chose it for evaluation on the test set. The test RMSE is 2039.588475157083, meaning the model's predictions typically deviate from the actual values by about 2040. The test RMSE is close to the validation RMSE (2018), which suggests no major overfitting and shows that the model performs well on unseen data.
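
How close the two scores are can be quantified with a short calculation. The two RMSE values are copied (rounded) from the Random Forest outputs in this notebook:

```python
# Random Forest RMSE values reported in this notebook, rounded to two decimals
rmse_valid = 2018.21
rmse_test = 2039.59

# Relative increase from validation to test, in percent
gap_pct = (rmse_test - rmse_valid) / rmse_valid * 100
print(f"Test RMSE is {gap_pct:.2f}% higher than validation RMSE")
```

A gap of about one percent is small relative to the spread between models, which is consistent with the reading that the model is not badly overfit.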

## Model analysis

The goal of this project was to build a model that determines the market value (price) of a car. Beyond that, Rusty Bargain wants a model with good prediction quality, fast prediction, and a short training time.

After cleaning the dataset, multiple regression models were trained and evaluated on how accurately they predict the target variable (price). To obtain a model with good prediction quality, the goal was to minimize RMSE. A Linear Regression model was used as a sanity check (baseline), and progressively more complex models (Random Forest, CatBoost, and LightGBM) were trained to improve performance.

The Linear Regression model is fast but has poor predictive performance, confirming the need for more complex models. The Random Forest Regressor improves RMSE substantially. CatBoostRegressor and LGBMRegressor were then trained to see whether gradient boosting could do better.

Across all models, the Random Forest Regressor has the lowest validation RMSE (2018) with a training time of 3 min 45 s, followed by LightGBM (RMSE 2028, about 2.19 s) and CatBoost (RMSE 2034, about 2.33 s). Both accuracy and training speed matter when choosing a model: the boosting models train far faster, but their RMSE is slightly higher. Judged on RMSE, the Random Forest Regressor is the best model, so it was used to compute the RMSE on the test set. The test RMSE is 2039.588475157083, meaning the model's predictions typically deviate from the actual values by about 2040. Because the test RMSE is close to the validation RMSE, there is no sign of major overfitting, and the model generalizes well to unseen data.
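
The comparison above can be summarized with a short ranking script. The RMSE values are copied (rounded) from the validation outputs in this notebook; the dictionary name is only illustrative:

```python
# Validation RMSE values copied from the notebook outputs, rounded to two decimals
validation_rmse = {
    "LinearRegression": 3169.21,
    "RandomForestRegressor": 2018.21,
    "CatBoostRegressor": 2034.43,
    "LGBMRegressor": 2028.22,
}

# Rank models from lowest (best) to highest RMSE
ranked = sorted(validation_rmse.items(), key=lambda item: item[1])
best_name, best_rmse = ranked[0]

for name, value in ranked:
    gap = (value - best_rmse) / best_rmse * 100
    print(f"{name:<22} {value:>8.2f}  (+{gap:.1f}% vs best)")
```

Seen this way, the three tree-based models sit within about one percent of each other, while the linear baseline is far behind.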