Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

In [1]:
pip install lightgbm 

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import pandas as pd
import math
import seaborn as sns
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.model_selection import train_test_split
from IPython.display import display
from sklearn.impute import SimpleImputer
import plotly.express as px
from sklearn.preprocessing import OrdinalEncoder
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import lightgbm 
from sklearn.linear_model import LinearRegression

In [3]:
data = pd.read_csv('/datasets/car_data.csv')

In [4]:
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [5]:
# We will change the column names to lowercase letters.
data.columns= data.columns.str.lower()

In [6]:
data.head()

Unnamed: 0,datecrawled,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired,datecreated,numberofpictures,postalcode,lastseen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [7]:
# Let's check if there are any missing values, and if so we will handle them.
pd.isnull(data).sum()

datecrawled              0
price                    0
vehicletype          37490
registrationyear         0
gearbox              19833
power                    0
model                19705
mileage                  0
registrationmonth        0
fueltype             32895
brand                    0
notrepaired          71154
datecreated              0
numberofpictures         0
postalcode               0
lastseen                 0
dtype: int64

In [8]:
non_nomeric = data.iloc[:,[2,4,6,9,10,11]]
missing_categorical = data[non_nomeric.columns.to_list()].isna().sum()
list_of_missing_categorical = missing_categorical[missing_categorical>0].index.to_list()

In [9]:
#In this case, in my opinion, it would be best to fill in the missing values with the most frequent value.
simple_imputer = SimpleImputer(strategy = 'most_frequent')

In [10]:
def impute_values_in_columns(data:pd.DataFrame, column_name, my_imputer)-> pd.DataFrame:
    data[ column_name] = my_imputer.fit_transform(data[column_name].values.reshape(-1,1))
    return data

In [11]:
for column_name in list_of_missing_categorical:
    data = impute_values_in_columns(data,column_name, simple_imputer)

In [12]:
pd.isnull(data).sum()

datecrawled          0
price                0
vehicletype          0
registrationyear     0
gearbox              0
power                0
model                0
mileage              0
registrationmonth    0
fueltype             0
brand                0
notrepaired          0
datecreated          0
numberofpictures     0
postalcode           0
lastseen             0
dtype: int64

In [13]:
# We will check if there are duplicates.
data.duplicated().sum()

299

In [14]:
# We will delete the duplicates.
data=data.drop_duplicates()

In [15]:
#In preparation for the prediction, 
#I want to use only the year of manufacture and not the exact date so as not to confuse the model too much, 
#so we will make the date a year only.
data['datecreated']=pd.to_datetime(data['datecreated'])
data['datecrawled']=pd.to_datetime(data['datecreated'])

In [16]:
data['datecreated'] = data['datecreated'].dt.year

In [17]:
#Now we will check if we have unusual values in the power column.
data['power'].value_counts()

0        40218
75       24001
60       15877
150      14568
101      13283
         ...  
323          1
3454         1
1056         1
13636        1
1158         1
Name: power, Length: 712, dtype: int64

In [18]:
# We have a lot of values of 0, since it doesn't make sense we will replace them with the average.
mean_power = data['power'].mean()
data.loc[data.power == 0, 'power'] = mean_power

In [19]:
data['power'].value_counts()

110.087697     40218
75.000000      24001
60.000000      15877
150.000000     14568
101.000000     13283
               ...  
1055.000000        1
8500.000000        1
574.000000         1
1054.000000        1
9010.000000        1
Name: power, Length: 712, dtype: int64

In [20]:
# lets encode the categoriel values.
categorial = ['vehicletype', 'gearbox', 'model', 'fueltype','notrepaired', 'brand']
data_ohe = pd.get_dummies(data, columns=categorial)

In [21]:
data_ohe.head()

Unnamed: 0,datecrawled,price,registrationyear,power,mileage,registrationmonth,datecreated,numberofpictures,postalcode,lastseen,...,brand_seat,brand_skoda,brand_smart,brand_sonstige_autos,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo
0,2016-03-24,480,1993,110.087697,150000,0,2016,0,70435,07/04/2016 03:16,...,0,0,0,0,0,0,0,0,1,0
1,2016-03-24,18300,2011,190.0,125000,5,2016,0,66954,07/04/2016 01:46,...,0,0,0,0,0,0,0,0,0,0
2,2016-03-14,9800,2004,163.0,125000,8,2016,0,90480,05/04/2016 12:47,...,0,0,0,0,0,0,0,0,0,0
3,2016-03-17,1500,2001,75.0,150000,6,2016,0,91074,17/03/2016 17:40,...,0,0,0,0,0,0,0,0,1,0
4,2016-03-31,3600,2008,69.0,90000,7,2016,0,60437,06/04/2016 10:17,...,0,1,0,0,0,0,0,0,0,0


In [22]:
#We will divide the data set into training, validation and test sets.
# set aside 20% of train and test data for evaluation
df_train, df_test = train_test_split(data_ohe, test_size=0.2, random_state=12345)

# Use the same function above for the validation set
df_train, df_valid = train_test_split(data_ohe, test_size=0.25, random_state=12345) # 0.25 x 0.8 = 0.2

In [23]:
# Lets set targets and targets.
features_train = df_train.drop(['datecrawled','price', 'registrationmonth',
'datecreated', 'numberofpictures', 'postalcode', 'lastseen'], axis=1)
target_train = df_train['price']
features_valid = df_valid.drop(['datecrawled','price', 'registrationmonth',
'datecreated', 'numberofpictures', 'postalcode', 'lastseen'], axis=1)
target_valid = df_valid['price']
features_test = df_test.drop(['datecrawled','price', 'registrationmonth',
'datecreated', 'numberofpictures', 'postalcode', 'lastseen'], axis=1)
target_test = df_test['price']

Our dataset is ready to use.

## Model training

In [24]:
%%time
best_model = None
best_result = 10000
best_est = 0
best_depth = 0
for est in range(10, 15):
    for depth in range (1, 8):
        model = GradientBoostingRegressor(n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid) 
        result = mean_squared_error(target_valid, predictions_valid)**0.5
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

print("RMSE of the best model on the validation set:", best_result, "n_estimators:", best_est, "best_depth:", depth)

RMSE of the best model on the validation set: 2339.8607810192734 n_estimators: 14 best_depth: 7
CPU times: user 13min 47s, sys: 13.2 s, total: 14min
Wall time: 14min


In [25]:
%%time
model = GradientBoostingRegressor(n_estimators=14, max_depth=7)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid) 
result = mean_squared_error(target_valid, predictions_valid)**0.5

CPU times: user 50.9 s, sys: 457 ms, total: 51.4 s
Wall time: 51.4 s


In [None]:
%%time
# We will continue with the lightgbm model.
best_model = None
best_result = 10000
best_est = 0
best_depth = 0
for est in range(10, 21):
    for depth in range (1, 11):
        model = lightgbm.LGBMRegressor(n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid) 
        result = mean_squared_error(target_valid, predictions_valid)**0.5
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

print("RMSE of the best model on the validation set:", best_result, "n_estimators:", best_est, "best_depth:", depth)

In [None]:
%%time
model = lightgbm.LGBMRegressor(n_estimators=20, max_depth=10)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid) 
result = mean_squared_error(target_valid, predictions_valid)**0.5

In [None]:
%%time
#We will use linear regression to perform a sanity check.
model = LinearRegression()
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid) 

result = mean_squared_error(target_valid, predictions_valid)**0.5
print("RMSE of the linear regression model on the validation set:", result)

In [26]:
%%time
#final estimate generalization performance.
model = GradientBoostingRegressor(n_estimators=14, max_depth=7)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test) 
result = mean_squared_error(target_test, predictions_test)**0.5
print("RMSE of the final model on the test set:", result)

RMSE of the final model on the test set: 2342.0691444933523
CPU times: user 50.1 s, sys: 368 ms, total: 50.4 s
Wall time: 50.5 s


## Model analysis

The quality of the models according to the RMSE score and the best hyperparameters: 

sklearn GradientBoostingRegressor:  
RMSE of the best model on the validation set: 2339.8 n_estimators: 14 best_depth: 7

lightgbm:  
RMSE of the best model on the validation set: 2238.3 n_estimators: 20 best_depth: 10

linear regression:  
RMSE of the linear regression model on the validation set: 3243.7

final model:  
RMSE of the final model (sklearn GradientBoostingRegressor-n_estimators: 14 best_depth: 7 ) on the test set: 2342.0

Running time of tuning the models: 

sklearn GradientBoostingRegressor - 14min.

lightgbm -  9min 15s.

Running time of the models:

sklearn GradientBoostingRegressor - 51.4 s.

lightgbm -  1min 44s.

linear regression - 23.4 s.

final model - 50.5 s.

final conclusion:

The two models we trained passed our sanity test (which is actually the linear regression) and got much better results than it.
Our two models gave almost identical results, with Skillen's model being more accurate.
In terms of hyperparameter tuning time, lightgbm's model was faster, but what is really important is the running time of the model, in which Skillran's model was much faster, which is why it was chosen.
Skillran's model was subjected to a final evaluation on our test set and gave a similar result to our evaluation set - which confirms the model's validity.