Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# 1. Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


In [5]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [6]:
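# Keep rows with missing categorical values by labelling them "unknown".
df['NotRepaired'] = df['NotRepaired'].fillna('unknown')
df['FuelType'] = df['FuelType'].fillna('unknown')
df['VehicleType'] = df['VehicleType'].fillna('unknown')
df['Model'] = df['Model'].fillna('unknown')
# Gearbox is the only column that still contains nulls; drop those rows.
df.dropna(inplace=True)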

In [7]:
useless = ['DateCrawled', 'DateCreated', 'PostalCode', 'LastSeen', 'NumberOfPictures']
df.drop(useless, axis=1, inplace=True)

In [8]:
X = df.drop('Price', axis=1)
y = df['Price']

Rather than dropping rows with missing data, the best approach here is to fill the gaps with "unknown". The reason is clearest for the NotRepaired column: without knowing whether a car has been repaired, there is no way to fill in missing values accurately. And since about 20% of its values are null, cutting those rows out would cause an unreasonable loss of data.

Given how much cars vary, it is also difficult to infer Model, FuelType, or VehicleType in any automated fashion. With that in mind, filling each of these with "unknown" could lead to better results than simply dropping the data. If I find later that the model is not performing as well as hoped, I may revisit this and drop the nulls instead, but losing that much data is not preferred. The only rows dropped are those with a missing Gearbox value (about 6% of the data).

Finally, I dropped the columns that are unnecessary as features and split the data into X and y (features and target) for model training.
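To make the trade-off between filling and dropping concrete, here is a minimal sketch on a made-up three-row frame. The column names match the dataset, but the values and the helper names (`toy`, `categorical_with_gaps`, `missing_share`) are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; the values are made up.
toy = pd.DataFrame({
    "VehicleType": ["coupe", None, "suv"],
    "FuelType": ["petrol", "gasoline", None],
    "Model": [None, "golf", "grand"],
    "NotRepaired": [None, "yes", "no"],
    "Price": [480, 18300, 9800],
})

categorical_with_gaps = ["VehicleType", "FuelType", "Model", "NotRepaired"]

# Share of missing values per column, to judge how much data dropna() would cost.
missing_share = toy[categorical_with_gaps].isna().mean()

# Fill every gap with an explicit "unknown" category instead of dropping rows.
filled = toy.copy()
filled[categorical_with_gaps] = filled[categorical_with_gaps].fillna("unknown")

rows_kept_by_fill = len(filled)
rows_kept_by_drop = len(toy.dropna())
print(missing_share)
print("rows kept when filling:", rows_kept_by_fill)
print("rows kept when dropping:", rows_kept_by_drop)
```

On this toy frame every row has at least one gap, so dropping would discard all of them while filling keeps all three.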

# 2. Model training

In [9]:
from sklearn.preprocessing import OrdinalEncoder
X2 = X
encoder = OrdinalEncoder()
encoder.fit(X2)
data_ordinal = encoder.transform(X2)

In [10]:
%%time
model = LinearRegression()
score = cross_val_score(model, data_ordinal, y, cv=5)

CPU times: user 1.1 s, sys: 623 ms, total: 1.73 s
Wall time: 1.7 s


In [11]:
print('Average Model Score:', score.mean())

Average Model Score: 0.5148508013412083


In [12]:
X_train, X_test, y_train, y_test = train_test_split(data_ordinal, y, test_size=0.25, random_state=47)

In [13]:
%%time
model.fit(X_train, y_train)

CPU times: user 100 ms, sys: 15.8 ms, total: 116 ms
Wall time: 136 ms


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [14]:
%%time
predict = model.predict(X_test)

CPU times: user 28.5 ms, sys: 33.5 ms, total: 62 ms
Wall time: 88.7 ms


In [15]:
rmse = mean_squared_error(y_test, predict) ** 0.5
rmse

3170.8779260043275

In [16]:
%%time
model = RandomForestRegressor()
score = cross_val_score(model, data_ordinal, y, cv=5)



CPU times: user 1min 4s, sys: 640 ms, total: 1min 4s
Wall time: 1min 5s


In [17]:
print('Average Model Score:', score.mean())

Average Model Score: 0.8464434755619253


In [18]:
%%time
model.fit(X_train, y_train)



CPU times: user 10.8 s, sys: 156 ms, total: 11 s
Wall time: 11 s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [19]:
%%time
predict = model.predict(X_test)
rmse = mean_squared_error(y_test, predict) ** 0.5
print("RMSE:", rmse)

RMSE: 1812.5574508909638
CPU times: user 480 ms, sys: 83 µs, total: 480 ms
Wall time: 490 ms


In [20]:
%%time
for estims in range(50, 101, 10):
    model = RandomForestRegressor(random_state=47, n_estimators=estims)
    score = cross_val_score(model, data_ordinal, y, cv=5, n_jobs=-1)
    model.fit(X_train, y_train)
    predict = model.predict(X_test)
    rmse = mean_squared_error(y_test, predict) ** 0.5
    print("n_estimators =", estims, ":", rmse, "cross_val_score:", score.mean())

n_estimators = 50 : 1758.507408287913 cross_val_score: 0.8556229260274224
n_estimators = 60 : 1756.8463763566556 cross_val_score: 0.8561371063720357
n_estimators = 70 : 1753.701416345681 cross_val_score: 0.8565141115808028
n_estimators = 80 : 1752.254240099608 cross_val_score: 0.8567392983275539
n_estimators = 90 : 1751.210969959759 cross_val_score: 0.8568680145109789
n_estimators = 100 : 1751.1072075470656 cross_val_score: 0.8569945307911786
CPU times: user 55min 57s, sys: 30.3 s, total: 56min 28s
Wall time: 56min 35s


<strong>Saving some server time:</strong>
To save some processing time, I halved the ranges tested in these loops. Previously I used range(10, 101, 10), but since I'm now adding cross-validation as instructed, I shortened the range to cut back on processing time.

I'll cut all future loops as well to make this uniform and report on time spent accordingly. 
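One further way to save time, sketched here under assumptions rather than taken from the runs above: `cross_validate` can return R², RMSE, and fit times from a single pass, so the separate `fit`/`predict` per loop step is not needed. The data is synthetic (`make_regression`), the tiny `n_estimators` values are chosen only to keep the sketch fast, and the RMSE is a cross-validated estimate, not the held-out test RMSE reported in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the notebook itself would pass data_ordinal and y.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=47
)

results = {}
for estims in (10, 20):
    model = RandomForestRegressor(n_estimators=estims, random_state=47)
    cv = cross_validate(
        model, X_demo, y_demo, cv=3,
        scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
    )
    results[estims] = {
        "r2": cv["test_r2"].mean(),
        # scikit-learn reports the negated RMSE, so flip the sign back.
        "rmse": -cv["test_rmse"].mean(),
        "fit_seconds": cv["fit_time"].sum(),
    }

for estims, row in results.items():
    print("n_estimators =", estims, row)
```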

In [21]:
%%time
for depth in range(10, 51, 10):
    model = RandomForestRegressor(random_state=47, n_estimators=100, max_depth=depth)
    score = cross_val_score(model, data_ordinal, y, cv=5, n_jobs=-1)
    model.fit(X_train, y_train)
    predict = model.predict(X_test)
    rmse = mean_squared_error(y_test, predict) ** 0.5
    print("max_depth=", depth, ":", rmse, "cross_val_score:", score.mean())

max_depth= 10 : 2021.8464652764878 cross_val_score: 0.8062097161017145
max_depth= 20 : 1735.421750860292 cross_val_score: 0.8594758134244804
max_depth= 30 : 1749.6142593486973 cross_val_score: 0.8570749135665074
max_depth= 40 : 1750.8121739120375 cross_val_score: 0.8569580713596269
max_depth= 50 : 1751.1072075470656 cross_val_score: 0.8569945291444683
CPU times: user 52min 9s, sys: 16.5 s, total: 52min 25s
Wall time: 52min 48s


In [22]:
%%time
model = RandomForestRegressor(random_state=47, n_estimators=100, max_depth=30)
score = cross_val_score(model, data_ordinal, y, cv=5)
print('Average Model Score:', score.mean())

Average Model Score: 0.8570749135665074
CPU times: user 10min 3s, sys: 2.54 s, total: 10min 6s
Wall time: 10min 6s


In [23]:
%%time
model.fit(X_train, y_train)

CPU times: user 1min 44s, sys: 452 ms, total: 1min 44s
Wall time: 1min 44s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=30,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=47, verbose=0,
                      warm_start=False)

In [24]:
%%time
predict = model.predict(X_test)
rmse = mean_squared_error(y_test, predict) ** 0.5
print("RMSE:", rmse)

RMSE: 1749.6142593486973
CPU times: user 4.77 s, sys: 70 µs, total: 4.77 s
Wall time: 4.79 s


While Random Forest has proven better than Linear Regression, the RMSE is still high. Even with some hyperparameter tuning, we definitely want to look at other models to achieve better results.
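For reference, the two quality numbers compared so far can be computed by hand. This is a minimal sketch with made-up prices and predictions; the helper names `rmse_by_hand` and `r2_by_hand` are hypothetical. RMSE is in the same units as the price, while R² (the default score returned by `cross_val_score` for regressors) is unitless.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up prices and predictions, only to exercise the two helpers.
prices = [1000, 2000, 3000, 4000]
preds = [1100, 1900, 3200, 3800]

demo_rmse = rmse_by_hand(prices, preds)
demo_r2 = r2_by_hand(prices, preds)
print("RMSE:", demo_rmse, "R2:", demo_r2)
```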

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=47)

In [34]:
import warnings
warnings.filterwarnings('ignore')

categorical = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
for col in categorical:
    X_train[col] = X_train[col].astype('category')
    X_test[col] = X_test[col].astype('category')
    X[col] = X[col].astype('category')

In [28]:
from sklearn.model_selection import RandomizedSearchCV

In [31]:
params = {
        'num_leaves': range(100, 201, 10),
        'max_depth': range(10, 51, 10)
    }

In [41]:
%%time 
lgb = LGBMRegressor()
lgbm_random = RandomizedSearchCV(estimator = lgb, param_distributions = params, 
                         n_iter = 20, cv = 3, verbose=10, random_state=47)
lgbm_random.fit(X, y, categorical_feature='auto')

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] num_leaves=110, max_depth=50 ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ........ num_leaves=110, max_depth=50, score=0.863, total=  17.1s
[CV] num_leaves=110, max_depth=50 ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.1s remaining:    0.0s


[CV] ........ num_leaves=110, max_depth=50, score=0.862, total=  17.2s
[CV] num_leaves=110, max_depth=50 ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   34.3s remaining:    0.0s


[CV] ........ num_leaves=110, max_depth=50, score=0.864, total=  38.6s
[CV] num_leaves=170, max_depth=40 ....................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.2min remaining:    0.0s


[CV] ........ num_leaves=170, max_depth=40, score=0.865, total=  19.8s
[CV] num_leaves=170, max_depth=40 ....................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.5min remaining:    0.0s


[CV] ........ num_leaves=170, max_depth=40, score=0.865, total=  19.8s
[CV] num_leaves=170, max_depth=40 ....................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.9min remaining:    0.0s


[CV] ........ num_leaves=170, max_depth=40, score=0.866, total=  21.3s
[CV] num_leaves=200, max_depth=30 ....................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  2.2min remaining:    0.0s


[CV] ........ num_leaves=200, max_depth=30, score=0.866, total=  21.5s
[CV] num_leaves=200, max_depth=30 ....................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  2.6min remaining:    0.0s


[CV] ........ num_leaves=200, max_depth=30, score=0.866, total=  22.0s
[CV] num_leaves=200, max_depth=30 ....................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  3.0min remaining:    0.0s


[CV] ........ num_leaves=200, max_depth=30, score=0.867, total=  21.7s
[CV] num_leaves=120, max_depth=40 ....................................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  3.3min remaining:    0.0s


[CV] ........ num_leaves=120, max_depth=40, score=0.863, total=  18.7s
[CV] num_leaves=120, max_depth=40 ....................................
[CV] ........ num_leaves=120, max_depth=40, score=0.863, total=  16.8s
[CV] num_leaves=120, max_depth=40 ....................................
[CV] ........ num_leaves=120, max_depth=40, score=0.864, total=  18.8s
[CV] num_leaves=130, max_depth=50 ....................................
[CV] ........ num_leaves=130, max_depth=50, score=0.864, total=  18.6s
[CV] num_leaves=130, max_depth=50 ....................................
[CV] ........ num_leaves=130, max_depth=50, score=0.864, total=  18.7s
[CV] num_leaves=130, max_depth=50 ....................................
[CV] ........ num_leaves=130, max_depth=50, score=0.865, total=  17.8s
[CV] num_leaves=180, max_depth=40 ....................................
[CV] ........ num_leaves=180, max_depth=40, score=0.865, total=  20.9s
[CV] num_leaves=180, max_depth=40 ....................................
[CV] .

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 20.7min finished


CPU times: user 20min 47s, sys: 7.36 s, total: 20min 54s
Wall time: 21min 7s


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=LGBMRegressor(boosting_type='gbdt',
                                           class_weight=None,
                                           colsample_bytree=1.0,
                                           importance_type='split',
                                           learning_rate=0.1, max_depth=-1,
                                           min_child_samples=20,
                                           min_child_weight=0.001,
                                           min_split_gain=0.0, n_estimators=100,
                                           n_jobs=-1, num_leaves=31,
                                           objective=None, random_state=None,
                                           reg_alpha=0.0, reg_lambda=0.0,
                                           silent=True, subsample=1.0,
                                           subsample_for_bin=200000,
                                

In [42]:
%%time
lgb = LGBMRegressor(random_state=47, num_leaves=180, max_depth=30)
lgb.fit(X_train, y_train, categorical_feature='auto')

CPU times: user 19.9 s, sys: 120 ms, total: 20 s
Wall time: 20.4 s


LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=30,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=180, objective=None,
              random_state=47, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [43]:
%%time
predict = lgb.predict(X_test)

CPU times: user 2.55 s, sys: 7.77 ms, total: 2.56 s
Wall time: 2.46 s


In [44]:
print('RMSE:', mean_squared_error(y_test, predict) ** 0.5)

RMSE: 1694.4045196247666


We see here that the base LGBM Regressor does not do any better than a Random Forest. With that in mind, we'll want to try some hyperparameter tuning.

By tuning both the number of leaves and max depth of the LGBM Regressor, we can improve performance, but not by much. With some more time, we could look at tuning additional parameters. Instead, I would prefer to explore CatBoost. 
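As a rough sketch of how the tuning step can be read back, the snippet below runs a small randomized search and then inspects `best_params_` and `best_score_`. It rests on several assumptions: the data is synthetic, scikit-learn's `GradientBoostingRegressor` stands in for `LGBMRegressor` (with `max_leaf_nodes` playing the role of `num_leaves`), and the search is fitted on the training split only so the test split stays unseen.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data, not the car dataset.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=6, noise=15.0, random_state=47
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=47
)

# Small illustrative ranges; the notebook searched num_leaves and max_depth.
param_distributions = {
    "max_leaf_nodes": list(range(4, 17, 4)),
    "max_depth": list(range(2, 7, 2)),
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=47),
    param_distributions=param_distributions,
    n_iter=5, cv=3, random_state=47,
)
# Fit on the training split only, so the test split is not used for tuning.
search.fit(X_tr, y_tr)

best_params = search.best_params_
best_cv_r2 = search.best_score_
test_r2 = search.best_estimator_.score(X_te, y_te)
print(best_params, best_cv_r2, test_r2)
```

Reading `best_params_` directly avoids copying values out of the verbose log by hand.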

In [48]:
%%time
model = CatBoostRegressor()
model.fit(X_train, y_train, cat_features=categorical)
predict = model.predict(X_test)
rmse = mean_squared_error(y_test, predict) ** 0.5
print(rmse)

0:	learn: 4464.4408376	total: 938ms	remaining: 15m 37s
1:	learn: 4379.6713831	total: 1.84s	remaining: 15m 17s
2:	learn: 4298.4704952	total: 2.82s	remaining: 15m 38s
3:	learn: 4219.5489163	total: 3.53s	remaining: 14m 38s
4:	learn: 4143.6048682	total: 4.23s	remaining: 14m 1s
5:	learn: 4070.6965703	total: 5.03s	remaining: 13m 53s
6:	learn: 4001.1528382	total: 5.82s	remaining: 13m 46s
7:	learn: 3934.4175398	total: 6.53s	remaining: 13m 29s
8:	learn: 3870.3856141	total: 7.32s	remaining: 13m 26s
9:	learn: 3809.9134385	total: 8.12s	remaining: 13m 24s
10:	learn: 3749.7410635	total: 8.93s	remaining: 13m 22s
11:	learn: 3692.7706487	total: 9.63s	remaining: 13m 12s
12:	learn: 3635.0990493	total: 10.4s	remaining: 13m 11s
13:	learn: 3581.1875857	total: 11.2s	remaining: 13m 10s
14:	learn: 3529.5560317	total: 12s	remaining: 13m 8s
15:	learn: 3478.1719445	total: 12.7s	remaining: 13m 2s
16:	learn: 3430.2633445	total: 13.5s	remaining: 13m 2s
17:	learn: 3382.7286120	total: 14.4s	remaining: 13m 6s
18:	learn

# 3. Model analysis

While we could likely try some more parameters with our CatBoost model, I would recommend sticking with the LGBMRegressor evaluated above (180 leaves and a max depth of 30), which produced the lowest RMSE.

If we want to further reduce RMSE, we can also explore further parameter tuning. 

To summarize findings, here is a simple table that shows the effectiveness of various approaches (based on using best parameters).

<table>
    <tr>
        <td><strong>Type of Model</strong></td>
        <td><strong>Time to Complete</strong></td>
        <td><strong>RMSE (test set)</strong></td>
        <td><strong>Model Score (R², cross-validated)</strong></td>
        <td><strong>Total Time Tuning</strong></td>
    </tr>
    <tr>
        <td>Linear Regression</td>
        <td>A few seconds</td>
        <td>3170.88</td>
        <td>0.51</td>
        <td>N/A</td>
    </tr>
    <tr>
        <td>Random Forest</td>
        <td>About 1 minute</td>
        <td>1735.42</td>
        <td>0.86</td>
        <td>Over 2 Hours</td>
    </tr>
    <tr>
        <td>LGBM</td>
        <td>Less than 30 seconds</td>
        <td>1694.40</td>
        <td>0.866</td>
        <td>About 21 minutes</td>
    </tr>
    <tr>
        <td>CatBoost</td>
        <td>11 Minutes, 42 Seconds</td>
        <td>1813.83</td>
        <td>N/A</td>
        <td>11 Minutes, 42 Seconds</td>
    </tr>
</table>
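The timings in the table come from `%%time` cells. Outside a notebook, the three things Rusty Bargain cares about (training time, prediction time, and quality) could be collected in one place with a small helper along these lines. This is only a sketch: the `benchmark` function is hypothetical, and the data is randomly generated, not the car dataset.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def benchmark(model, X_train, y_train, X_test, y_test):
    """Time fit and predict separately and report the test RMSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return {
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
        "rmse": rmse,
    }

# Randomly generated linear data with a little noise, for illustration only.
rng = np.random.default_rng(47)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

report = benchmark(
    LinearRegression(), X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:]
)
print(report)
```

Any estimator with `fit` and `predict` can be passed in, so the same helper would produce comparable rows for each model.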