Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

In [1]:
import numpy as np
import pandas as pd

import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor
import catboost

In [2]:
data = pd.read_csv('/datasets/car_data.csv')
display(data.head())

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
print(data.info())
print(data.groupby('NotRepaired')['NotRepaired'].count())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
data_dropped = data.dropna()
print(data_dropped.info())
data_filled = data.fillna('Na')
print(data_filled.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245814 entries, 3 to 354367
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        245814 non-null  object
 1   Price              245814 non-null  int64 
 2   VehicleType        245814 non-null  object
 3   RegistrationYear   245814 non-null  int64 
 4   Gearbox            245814 non-null  object
 5   Power              245814 non-null  int64 
 6   Model              245814 non-null  object
 7   Mileage            245814 non-null  int64 
 8   RegistrationMonth  245814 non-null  int64 
 9   FuelType           245814 non-null  object
 10  Brand              245814 non-null  object
 11  NotRepaired        245814 non-null  object
 12  DateCreated        245814 non-null  object
 13  NumberOfPictures   245814 non-null  int64 
 14  PostalCode         245814 non-null  int64 
 15  LastSeen           245814 non-null  object
dtypes: int64(7), object(

I noticed the columns with integers were filled out, in order to keep the most amount of data for the models I filled the null values in the categorical columns with a filler ('Na'). 

## Feature preparation

In [5]:
columns_obj = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
data_train, data_test = train_test_split(data_filled, test_size = 0.3, random_state = 12345)
data_train_f = data_train.drop(['Price', 'DateCrawled','DateCreated', 'LastSeen'] , axis = 1)
data_train_t = data_train['Price']
data_test_f = data_test.drop(['Price', 'DateCrawled','DateCreated', 'LastSeen'], axis = 1)
data_test_t = data_test['Price']
print(data_train_f.shape)
print(data_train_t.shape)
print(data_test_f.shape)
print(data_test_t.shape)

(248058, 12)
(248058,)
(106311, 12)
(106311,)


### Ordinal Encoder

In [6]:
encoder = OrdinalEncoder()
encoder.fit(data_train_f)
train_ordinal = pd.DataFrame(encoder.transform(data_train_f), columns=data_train_f.columns)
encoder.fit(data_test_f)
test_ordinal = pd.DataFrame(encoder.transform(data_test_f), columns=data_test_f.columns)
display(train_ordinal.head())
display(test_ordinal.head())

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,NumberOfPictures,PostalCode
0,2.0,89.0,2.0,61.0,107.0,10.0,7.0,7.0,32.0,1.0,0.0,4707.0
1,6.0,87.0,2.0,0.0,76.0,12.0,0.0,7.0,27.0,1.0,0.0,2714.0
2,1.0,90.0,2.0,0.0,246.0,12.0,6.0,7.0,36.0,1.0,0.0,684.0
3,8.0,100.0,1.0,177.0,32.0,12.0,8.0,3.0,1.0,1.0,0.0,1959.0
4,5.0,98.0,2.0,122.0,117.0,11.0,11.0,7.0,38.0,1.0,0.0,5882.0


Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,NumberOfPictures,PostalCode
0,1.0,81.0,1.0,149.0,235.0,12.0,4.0,3.0,20.0,1.0,0.0,4055.0
1,4.0,82.0,2.0,142.0,6.0,4.0,5.0,3.0,2.0,1.0,0.0,2364.0
2,5.0,75.0,2.0,124.0,43.0,12.0,12.0,7.0,24.0,1.0,0.0,772.0
3,5.0,78.0,1.0,223.0,60.0,12.0,9.0,3.0,20.0,1.0,0.0,3949.0
4,4.0,65.0,2.0,67.0,223.0,12.0,9.0,3.0,38.0,1.0,0.0,1518.0


### OHE

In [7]:
data_ohe = pd.get_dummies(data_filled[columns_obj], drop_first=True)
data_ohe_filled = data_filled.drop(columns_obj, axis = 1)
print(data_ohe.columns)
data_ohe_fill = pd.concat([data_ohe_filled, data_ohe], axis = 1)
print(data_ohe_filled.shape)

Index(['VehicleType_bus', 'VehicleType_convertible', 'VehicleType_coupe',
       'VehicleType_other', 'VehicleType_sedan', 'VehicleType_small',
       'VehicleType_suv', 'VehicleType_wagon', 'Gearbox_auto',
       'Gearbox_manual',
       ...
       'Brand_smart', 'Brand_sonstige_autos', 'Brand_subaru', 'Brand_suzuki',
       'Brand_toyota', 'Brand_trabant', 'Brand_volkswagen', 'Brand_volvo',
       'NotRepaired_no', 'NotRepaired_yes'],
      dtype='object', length=308)
(354369, 10)


In [8]:
data_train_ohe, data_test_ohe = train_test_split(data_ohe_fill, test_size = 0.3, random_state = 12345)
data_train_f_ohe = data_train_ohe.drop(['Price', 'DateCrawled','DateCreated', 'LastSeen'] , axis = 1)
data_train_t_ohe = data_train_ohe['Price']
data_test_f_ohe = data_test_ohe.drop(['Price', 'DateCrawled','DateCreated', 'LastSeen'], axis = 1)
data_test_t_ohe = data_test_ohe['Price']
print(data_train_f_ohe.shape)
print(data_train_t_ohe.shape)
print(data_test_f_ohe.shape)
print(data_test_t_ohe.shape)

(248058, 314)
(248058,)
(106311, 314)
(106311,)


## Model training

### Random Forest Classifier

In [9]:
best_depth = 0
best_score = 0
best_est = 0

In [10]:
%%time
best_score = 100000000
best_est = 0
for est in range(1, 51, 10):
    model_ran_for = RandomForestRegressor(random_state=54321, n_estimators=est) 
    model_ran_for.fit(train_ordinal, data_train_t)
    predicts = model_ran_for.predict(test_ordinal)
    score = mean_squared_error(predicts, data_test_t)
    if score < best_score:
        best_est = est
        best_score = score


CPU times: user 3min, sys: 964 ms, total: 3min 1s
Wall time: 3min 1s


In [11]:
print(best_score)
print(best_est)

22928134.58540871
41


In [12]:
%%time
model_ran_for = RandomForestRegressor(random_state = 54321, n_estimators = 41)
model_ran_for.fit(train_ordinal, data_train_t)
predicts = model_ran_for.predict(test_ordinal)

CPU times: user 1min 10s, sys: 492 ms, total: 1min 10s
Wall time: 1min 10s


### Linear Regression

In [13]:
%%time
lin_reg = LinearRegression()
lin_reg.fit(data_train_f_ohe, data_train_t_ohe)
lin_predict = lin_reg.predict(data_test_f_ohe)

CPU times: user 17.5 s, sys: 3.22 s, total: 20.7 s
Wall time: 20.7 s


In [14]:
lin_mse = mean_squared_error(lin_predict, data_test_t_ohe)
print(lin_mse)

9991450.213467574


### LGB model

In [15]:
%%time

lgb_score = 100000000
best_depth = 0
for depth in range(1, 51, 10):
    lgb_model = LGBMRegressor(random_state = 54321, max_depth = depth) 
    lgb_model.fit(data_train_f_ohe, data_train_t_ohe)
    lgb_predict = lgb_model.predict(data_test_f_ohe)
    score = mean_squared_error(lgb_predict, data_test_t)
    if score < best_score:
        best_depth = depth
        lgb_score = score
print(lgb_score)
print(best_depth)

3383262.6614660257
41
CPU times: user 41.6 s, sys: 1.5 s, total: 43.1 s
Wall time: 43.3 s


In [16]:
%%time
lgb_score = 100000000
best_bins = 0
for bins in range(3000, 3500, 50):
    lgb_model = LGBMRegressor(random_state = 54321, max_depth = 41, max_bins = bins) 
    lgb_model.fit(data_train_f_ohe, data_train_t_ohe)
    lgb_predict = lgb_model.predict(data_test_f_ohe)
    score = mean_squared_error(lgb_predict, data_test_t)
    if score < best_score:
        best_bins = bins
        lgb_score = score
print(lgb_score)
print(best_bins)

3382292.364033883
3450
CPU times: user 1min 56s, sys: 2.86 s, total: 1min 59s
Wall time: 2min


In [17]:
%%time
lgb_model = LGBMRegressor(random_state = 54321, max_depth = 41, max_bins = 3450) 
lgb_model.fit(data_train_f_ohe, data_train_t_ohe)
lgb_predict = lgb_model.predict(data_test_f_ohe)

CPU times: user 10.6 s, sys: 281 ms, total: 10.9 s
Wall time: 11 s


### Cat Boost

In [18]:
%%time
cb_model = catboost.CatBoostRegressor()
cb_model.fit(data_train_f_ohe, data_train_t_ohe)

Learning rate set to 0.097841
0:	learn: 4250.5054540	total: 105ms	remaining: 1m 44s
1:	learn: 4017.5391150	total: 158ms	remaining: 1m 19s
2:	learn: 3797.4724965	total: 213ms	remaining: 1m 10s
3:	learn: 3614.2829547	total: 271ms	remaining: 1m 7s
4:	learn: 3445.0530515	total: 324ms	remaining: 1m 4s
5:	learn: 3297.8334748	total: 381ms	remaining: 1m 3s
6:	learn: 3170.0936046	total: 436ms	remaining: 1m 1s
7:	learn: 3062.6114380	total: 496ms	remaining: 1m 1s
8:	learn: 2967.4282712	total: 551ms	remaining: 1m
9:	learn: 2879.6429971	total: 607ms	remaining: 1m
10:	learn: 2804.2722631	total: 665ms	remaining: 59.7s
11:	learn: 2732.9740958	total: 717ms	remaining: 59s
12:	learn: 2675.7765435	total: 776ms	remaining: 58.9s
13:	learn: 2619.7803938	total: 829ms	remaining: 58.4s
14:	learn: 2574.5384137	total: 881ms	remaining: 57.8s
15:	learn: 2534.8395543	total: 934ms	remaining: 57.4s
16:	learn: 2495.7388611	total: 986ms	remaining: 57s
17:	learn: 2463.5959950	total: 1.04s	remaining: 56.5s
18:	learn: 2431

<catboost.core.CatBoostRegressor at 0x7f1ff4cb0040>

In [19]:
%%time
cb_predict = cb_model.predict(data_test_f_ohe)
cb_mse = mean_squared_error(cb_predict, data_test_t_ohe)
print(cb_mse)

3058870.6987755955
CPU times: user 240 ms, sys: 7.98 ms, total: 248 ms
Wall time: 248 ms


## Model analysis

In [20]:
random_forest = best_score
lin_reg = lin_mse
lgb_model = lgb_score
cat_boost = cb_mse
print('Random Forest: ', random_forest)
print('Linear Regression: ', lin_reg)
print('LGB_model: ', lgb_model)
print('CatBoost:' , cat_boost)

Random Forest:  22.92813458540871
Linear Regression:  9.991450213467573
LGB_model:  3.382292364033883
CatBoost: 3.0588706987755954


The time to train and predict for the random forest model was 1 min 12s, linear regression was 21s, LGB model was 10.7s, and catboost was 49.7s. 

Once the hyperameters were set the LGB model had a faster run time. However the CatBoost model had slightly lower MSE score. Overall, I would recomend LGB to be the better model because it is the fastest model to fit and predict the data and the error is not compromised with the speed.   
