# Numerical Methods

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

### Data Loading

In [25]:
# Importing Libraries
import time
import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor

In [26]:
# Loading Data
data = pd.read_csv('car_data.csv')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


The data contains 354,369 entries with 16 columns each. The columns fall into two groups:

**Features:**
- `DateCrawled` — date profile was downloaded from the database
- `VehicleType` — vehicle body type
- `RegistrationYear` — vehicle registration year
- `Gearbox` — gearbox type
- `Power` — power (hp)
- `Model` — vehicle model
- `Mileage` — odometer reading (measured in km due to dataset's regional specifics)
- `RegistrationMonth` — vehicle registration month
- `FuelType` — fuel type
- `Brand` — vehicle brand
- `NotRepaired` — vehicle repaired or not
- `DateCreated` — date of profile creation
- `NumberOfPictures` — number of vehicle pictures
- `PostalCode` — postal code of profile owner (user)
- `LastSeen` — date of the last activity of the user

**Target:**
- `Price` — price (in Euro)

Column names will be renamed for readability, then datatypes will be corrected:

In [27]:
# Renaming columns
data = data.rename(columns={'DateCrawled':'date_crawled','Price':'price','VehicleType':'vehicle_type','RegistrationYear':'registration_year','Gearbox':'gearbox','Power':'power','Model':'model','Mileage':'mileage','RegistrationMonth':'registration_month','FuelType':'fuel_type','Brand':'brand','NotRepaired':'not_repaired','DateCreated':'date_created','NumberOfPictures':'num_of_pictures','PostalCode':'postal_code','LastSeen':'last_seen'})
print(data.columns)

# Converting to datetime
data['date_created'] = pd.to_datetime(data['date_created'], dayfirst=True)
data.dtypes

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'num_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


date_crawled                  object
price                          int64
vehicle_type                  object
registration_year              int64
gearbox                       object
power                          int64
model                         object
mileage                        int64
registration_month             int64
fuel_type                     object
brand                         object
not_repaired                  object
date_created          datetime64[ns]
num_of_pictures                int64
postal_code                    int64
last_seen                     object
dtype: object
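
The rename dictionary used above can also be generated programmatically. The following is a minimal sketch, assuming a hypothetical helper named `camel_to_snake` that is not part of the original notebook. Note that it would produce `number_of_pictures` rather than the hand-picked `num_of_pictures`.

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a CamelCase column name to snake_case (hypothetical helper)."""
    # Insert an underscore before every capital letter that is not at the start,
    # then lowercase the whole string.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# A few of the original column names, used here only as sample input.
columns = ['DateCrawled', 'VehicleType', 'NotRepaired', 'NumberOfPictures']
renamed = {col: camel_to_snake(col) for col in columns}
print(renamed)
```

Because `DataFrame.rename` accepts a callable, the helper could be passed directly as `data.rename(columns=camel_to_snake)`.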

### Missing & Duplicate Values

In [28]:
print(data.isna().sum())
missing_cols = ['vehicle_type', 'model', 'fuel_type', 'gearbox', 'not_repaired']

date_crawled              0
price                     0
vehicle_type          37490
registration_year         0
gearbox               19833
power                     0
model                 19705
mileage                   0
registration_month        0
fuel_type             32895
brand                     0
not_repaired          71154
date_created              0
num_of_pictures           0
postal_code               0
last_seen                 0
dtype: int64


All of the columns with missing values have the `object` dtype, so their categories need a closer look before the gaps are filled.

In [29]:
for col in missing_cols:
    print()
    print(data[col].value_counts())


vehicle_type
sedan          91457
small          79831
wagon          65166
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: count, dtype: int64

model
golf                  29232
other                 24421
3er                   19761
polo                  13066
corsa                 12570
                      ...  
i3                        8
serie_3                   4
rangerover                4
range_rover_evoque        2
serie_1                   2
Name: count, Length: 250, dtype: int64

fuel_type
petrol      216352
gasoline     98720
lpg           5310
cng            565
hybrid         233
other          204
electric        90
Name: count, dtype: int64

gearbox
manual    268251
auto       66285
Name: count, dtype: int64

not_repaired
no     247161
yes     36054
Name: count, dtype: int64


The `vehicle_type`, `model` and `fuel_type` columns already contain an `other` category, so their missing values will be filled with `other`. The remaining columns, `gearbox` and `not_repaired`, have no such category, so their missing values will be filled with `unknown`.

In [30]:
# Filling missing data
data[missing_cols[:3]] = data[missing_cols[:3]].fillna('other')
data[missing_cols[3:]] = data[missing_cols[3:]].fillna('unknown')
del missing_cols

data.isna().sum()

date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
num_of_pictures       0
postal_code           0
last_seen             0
dtype: int64

In [31]:
print('Number of duplicate rows:', data.duplicated().sum())

Number of duplicate rows: 263


In [32]:
data.drop_duplicates(inplace=True)
print('Number of duplicate rows:', data.duplicated().sum())

Number of duplicate rows: 0


In [33]:
data.describe()

Unnamed: 0,price,registration_year,power,mileage,registration_month,date_created,num_of_pictures,postal_code
count,354106.0,354106.0,354106.0,354106.0,354106.0,354106,354106.0,354106.0
mean,4416.443785,2004.235379,110.08975,128211.750154,5.71417,2016-03-20 19:11:16.129746944,0.0,50507.195148
min,0.0,1000.0,0.0,5000.0,0.0,2014-03-10 00:00:00,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,2016-03-13 00:00:00,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,2016-03-21 00:00:00,0.0,49406.0
75%,6400.0,2008.0,143.0,150000.0,9.0,2016-03-29 00:00:00,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,0.0,99998.0
std,4514.340636,90.261295,189.915231,37906.625942,3.72668,,0.0,25784.231254


A few oddities stand out:
- `registration_year` has a min value of 1000 and a max of 9999.
- `power` has a min value of 0.
- `registration_month` has a min value of 0, but a max of 12, making 13 months.
- `num_of_pictures` contains only 0 values.
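
Before removing rows, it can help to count how many fall outside a plausible range for each column. The sketch below uses made-up toy rows. The lower bounds and the year limit mirror the values used in this notebook, while the upper bound of 1000 hp for `power` is an assumption added for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    'registration_year': [1000, 1999, 2008, 9999],
    'power': [0, 75, 20000, 140],
    'registration_month': [0, 6, 12, 3],
})

# Plausibility bounds; the 1000 hp ceiling is a hypothetical choice.
bounds = {
    'registration_year': (1900, 2016),
    'power': (50, 1000),
    'registration_month': (1, 12),
}

# Count the rows outside each inclusive range before deciding how to treat them.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Counting first makes it easier to judge whether dropping rows or imputing values is the less costly option for a given column.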

In [34]:
# Getting the year data was collected
current_year = data['date_created'].dt.year.max()

# Remove unrealistic values
data = data[(data['registration_year'] >= 1900) & (data['registration_year'] <= current_year)]
data = data[data['power'] >= 50].copy()

# Treat month 0 as missing and fill it with the most common month
data['registration_month'] = data['registration_month'].replace(0, np.nan)
data['registration_month'] = data['registration_month'].fillna(data['registration_month'].mode()[0])

# Drop non-informative columns
data = data.drop(columns=['num_of_pictures'])

display(data.describe())


Unnamed: 0,price,registration_year,power,mileage,registration_month,date_created,postal_code
count,297347.0,297347.0,297347.0,297347.0,297347.0,297347,297347.0
mean,4806.474728,2002.903083,125.980464,128578.142709,6.166563,2016-03-20 19:17:03.470894080,51199.648882
min,0.0,1910.0,50.0,5000.0,1.0,2015-03-20 00:00:00,1067.0
25%,1299.0,1999.0,80.0,125000.0,3.0,2016-03-13 00:00:00,30916.0
50%,3100.0,2003.0,115.0,150000.0,6.0,2016-03-21 00:00:00,50226.0
75%,6990.0,2007.0,150.0,150000.0,9.0,2016-03-29 00:00:00,72076.0
max,20000.0,2016.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,99998.0
std,4626.584568,6.369567,195.113059,36633.453719,3.342634,,25806.939738


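The data now looks much cleaner, although `power` still has a maximum of 20000 and `price` a minimum of 0. Next, a new column named `vehicle_age` will be created as the difference between the year the data was collected and the registration year.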

In [35]:
# Creating 'vehicle_age' column
data['vehicle_age'] = current_year - data['registration_year']
del current_year
display(data.describe())

Unnamed: 0,price,registration_year,power,mileage,registration_month,date_created,postal_code,vehicle_age
count,297347.0,297347.0,297347.0,297347.0,297347.0,297347,297347.0,297347.0
mean,4806.474728,2002.903083,125.980464,128578.142709,6.166563,2016-03-20 19:17:03.470894080,51199.648882,13.096917
min,0.0,1910.0,50.0,5000.0,1.0,2015-03-20 00:00:00,1067.0,0.0
25%,1299.0,1999.0,80.0,125000.0,3.0,2016-03-13 00:00:00,30916.0,9.0
50%,3100.0,2003.0,115.0,150000.0,6.0,2016-03-21 00:00:00,50226.0,13.0
75%,6990.0,2007.0,150.0,150000.0,9.0,2016-03-29 00:00:00,72076.0,17.0
max,20000.0,2016.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,99998.0,106.0
std,4626.584568,6.369567,195.113059,36633.453719,3.342634,,25806.939738,6.369567


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Very good preprocessing!
</div>

### Data Encoding

To train the models, `object` data needs to be converted to numeric dtypes. One-hot encoding (OHE) will be used for the columns with a small number of unique values: `gearbox`, `fuel_type` and `not_repaired`.

In [36]:
# Creating a list of string columns to encode
columns_to_encode = ['gearbox', 'fuel_type', 'not_repaired', 'brand', 'vehicle_type', 'model']

# Encoding data using OHE for Linear Regression
encoded_data = pd.get_dummies(data[columns_to_encode[:3]])
encoded_data = data.drop(columns_to_encode[:3], axis=1).merge(encoded_data, on=data.index).drop('key_0', axis=1)
encoded_data.head()

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,power,model,mileage,registration_month,brand,date_created,...,fuel_type_cng,fuel_type_electric,fuel_type_gasoline,fuel_type_hybrid,fuel_type_lpg,fuel_type_other,fuel_type_petrol,not_repaired_no,not_repaired_unknown,not_repaired_yes
0,24/03/2016 10:58,18300,coupe,2011,190,other,125000,5.0,audi,2016-03-24,...,False,False,True,False,False,False,False,False,False,True
1,14/03/2016 12:52,9800,suv,2004,163,grand,125000,8.0,jeep,2016-03-14,...,False,False,True,False,False,False,False,False,True,False
2,17/03/2016 16:54,1500,small,2001,75,golf,150000,6.0,volkswagen,2016-03-17,...,False,False,False,False,False,False,True,True,False,False
3,31/03/2016 17:25,3600,small,2008,69,fabia,90000,7.0,skoda,2016-03-31,...,False,False,True,False,False,False,False,True,False,False
4,04/04/2016 17:36,650,sedan,1995,102,3er,150000,10.0,bmw,2016-04-04,...,False,False,False,False,False,False,True,False,False,True


The remaining features will be encoded via Ordinal Encoding.

In [37]:
# Preparing features using ordinal encoding for other simple algorithms 
encoder = OrdinalEncoder()
encoder.fit(data[columns_to_encode[3:]])
encoded_columns = pd.DataFrame(encoder.transform(data[columns_to_encode[3:]]), columns=data[columns_to_encode[3:]].columns)

# Creating data_ordinal
data_ordinal = encoded_data.drop(columns_to_encode[3:], axis=1).merge(encoded_columns, on=data.index)

data_ordinal.head(5)

Unnamed: 0,key_0,date_crawled,price,registration_year,power,mileage,registration_month,date_created,postal_code,last_seen,...,fuel_type_hybrid,fuel_type_lpg,fuel_type_other,fuel_type_petrol,not_repaired_no,not_repaired_unknown,not_repaired_yes,brand,vehicle_type,model
0,1,24/03/2016 10:58,18300,2011,190,125000,5.0,2016-03-24,66954,07/04/2016 01:46,...,False,False,False,False,False,False,True,1.0,2.0,166.0
1,2,14/03/2016 12:52,9800,2004,163,125000,8.0,2016-03-14,90480,05/04/2016 12:47,...,False,False,False,False,False,True,False,14.0,6.0,117.0
2,3,17/03/2016 16:54,1500,2001,75,150000,6.0,2016-03-17,91074,17/03/2016 17:40,...,False,False,False,True,True,False,False,38.0,5.0,116.0
3,4,31/03/2016 17:25,3600,2008,69,90000,7.0,2016-03-31,60437,06/04/2016 10:17,...,False,False,False,False,True,False,False,31.0,5.0,101.0
4,5,04/04/2016 17:36,650,1995,102,150000,10.0,2016-04-04,33775,06/04/2016 19:17,...,False,False,False,True,False,False,True,2.0,4.0,11.0


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
    
Pro tip: it's better to fit encoders only on train dataframe to avoid leaks.
</div>
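
The reviewer's tip can be illustrated with a small sketch. It assumes toy data in place of the real categorical columns, splits before encoding, and uses `OrdinalEncoder` with `handle_unknown='use_encoded_value'` so that categories appearing only in the test set do not raise an error. The choice of `-1` as the placeholder is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the categorical columns; values are illustrative.
toy = pd.DataFrame({
    'brand': ['audi', 'bmw', 'audi', 'skoda', 'bmw', 'jeep'],
    'vehicle_type': ['coupe', 'sedan', 'sedan', 'small', 'wagon', 'suv'],
})

# Split first, so the encoder never sees the test rows.
train, test = train_test_split(toy, test_size=0.33, random_state=0)

# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_encoded = encoder.fit_transform(train)  # learn categories from train only
test_encoded = encoder.transform(test)        # reuse the fitted mapping

print(train_encoded.shape, test_encoded.shape)
```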

### Splitting data

To train the models properly, the data will be split into training and test sets at a ratio of $4:1$. The models will be trained using cross-validation, which evaluates each model on data it has not seen during fitting and so gives a more reliable estimate of its quality.

In [38]:
# Setting ordinal_features and ordinal_target
ordinal_features = data_ordinal.drop(['price', 'date_crawled', 'date_created', 'last_seen', 'key_0'], axis=1)
ordinal_target = data_ordinal['price']

# Standardizing features (zero mean, unit variance)
scaler = StandardScaler()
ordinal_features = scaler.fit_transform(ordinal_features)

# Splitting data_ordinal into training and test sets
ordinal_features_train, ordinal_features_test, ordinal_target_train, ordinal_target_test = train_test_split(ordinal_features, ordinal_target, test_size=0.2, random_state=1492)
print(f'Data Training Ratio: {len(ordinal_features_train)/len(data_ordinal):.0%}')
print(f'Data Test Ratio: {len(ordinal_features_test)/len(data_ordinal):.0%}')

Data Training Ratio: 80%
Data Test Ratio: 20%
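
The cell above fits `StandardScaler` on all rows before the split, so statistics from the test rows influence the scaling. One possible alternative, sketched here on synthetic data rather than the car dataset, is to split first and wrap the scaler and model in a pipeline. Cross-validation then refits the scaler on each training fold only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; a stand-in for the encoded car features.
rng = np.random.default_rng(17)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=500)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1492
)

# Inside cross-validation the pipeline refits the scaler on each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=17)
scores = cross_val_score(
    model, X_train, y_train, cv=cv, scoring='neg_root_mean_squared_error'
)
print(f'Mean CV RMSE: {-scores.mean():.3f}')
```

The same pipeline object can be passed to `GridSearchCV` in place of a bare estimator.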


## Model training

Five different ML models will be used in this project. The target value is the price of the vehicle, which makes this a regression task.

The models include:

- `LinearRegression`
- `DecisionTreeRegressor`
- `RandomForestRegressor`
- `LGBMRegressor`
- `CatBoostRegressor`

Each model will be tuned with `GridSearchCV` using 5-fold cross-validation, so every training row is used for both fitting and validation. The candidate hyperparameters are passed through the `param_grid` parameter, and `scoring='neg_root_mean_squared_error'` selects the combination with the lowest RMSE.

`LinearRegression` has no hyperparameters to tune here, so it serves as a baseline sanity check: the other models should achieve a noticeably lower RMSE.
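
The `'neg_root_mean_squared_error'` scorer returns the RMSE with its sign flipped, because `GridSearchCV` always maximizes the score. That is why the cells below print `-best_score_`. The sketch below computes the metric by hand on made-up prices to show the relationship. The `rmse` helper is illustrative and not part of the notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices in euros, purely illustrative.
actual = [1500, 3600, 9800]
predicted = [1000, 4000, 9000]

error = rmse(actual, predicted)
# A "neg" scorer reports the negated error, so larger (closer to zero) is better.
neg_score = -error
print(f'RMSE: {error:.2f}, scorer value: {neg_score:.2f}')
```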

In [39]:
# Creating KFold instance
cross_validator = KFold(n_splits=5, shuffle=True, random_state=17)

# Creating LinearRegression param grid
line_params = {
    'n_jobs':[-1]
}

# Initializing LinearRegression and its GridSearchCV
linreg = LinearRegression()
line_grid = GridSearchCV(linreg, param_grid=line_params, scoring='neg_root_mean_squared_error', cv=cross_validator)

# Training model
start_time = time.time()
line_grid.fit(ordinal_features_train, ordinal_target_train)
linreg_time = time.time() - start_time

print(line_grid.best_estimator_, '\n')
print(f'RMSE for Linear Regression model: {-line_grid.best_score_:.3f}', '\n')
print(f'Time to train: {linreg_time:.2f} secs')

LinearRegression(n_jobs=-1) 

RMSE for Linear Regression model: 3409.355 

Time to train: 1.04 secs


In [None]:
#  Creating DecisionTree param grid
param_grid_tree = {'max_depth':[depth for depth in range(1,16)]}

# Initializing DecisionTree and its GridSearchCV
tree_model = DecisionTreeRegressor(random_state=17)
tree_grid = GridSearchCV(tree_model, param_grid=param_grid_tree, scoring='neg_root_mean_squared_error', cv=cross_validator)

# Training model
start_time = time.time()
tree_grid.fit(ordinal_features_train, ordinal_target_train)
tree_time = time.time() - start_time

print(tree_grid.best_estimator_, '\n')
print(f'Best Decision Tree RMSE: {-tree_grid.best_score_:.3f}', '\n')
print(f'Time to train: {tree_time:.2f} secs')

In [17]:
# Creating RandomForest param grid
param_grid_forest = {
    'max_depth':[9, 12, 15],
    'n_estimators':[10, 40, 80],
    'min_samples_leaf':[2, 4]
                    }

# Initializing RandomForest and its GridSearchCV
forest_model = RandomForestRegressor(random_state=17)
forest_grid = GridSearchCV(forest_model, param_grid=param_grid_forest, scoring='neg_root_mean_squared_error', cv=cross_validator)

# Training model
start_time = time.time()
forest_grid.fit(ordinal_features_train, ordinal_target_train)
forest_time = time.time() - start_time

print(forest_grid.best_estimator_, '\n')
print(f'Best Random Forest RMSE: {-forest_grid.best_score_:.3f}', '\n')
print(f'Time to train: {forest_time:.2f} secs')

RandomForestRegressor(max_depth=15, min_samples_leaf=2, n_estimators=80,
                      random_state=17) 

Best Random Forest RMSE: 1733.956 

Time to train: 2949.21 secs


In [18]:
# Creating LGBM param grid
param_grid_light = {
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.1, 0.01],
    'n_estimators': [10, 40, 80]
}

# Initializing LGBM and its GridSearchCV
light_model = lgb.LGBMRegressor(random_state=17, n_jobs=-1)
light_grid = GridSearchCV(light_model, param_grid=param_grid_light, scoring='neg_root_mean_squared_error', cv=cross_validator)

# Training model
start_time = time.time()
light_grid.fit(ordinal_features_train, ordinal_target_train)
light_time = time.time() - start_time

print(light_grid.best_estimator_, '\n')
print(f'Best LGBM RMSE: {-light_grid.best_score_:.3f}', '\n')
print(f'Time to train: {light_time:.2f} secs')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008041 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1010
[LightGBM] [Info] Number of data points in the train set: 190301, number of used features: 22
[LightGBM] [Info] Start training from score 4810.056295
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006517 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1013
[LightGBM] [Info] Number of data points in the train set: 190301, number of used features: 22
[LightGBM] [Info] Start training from score 4813.838051
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007274 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is n

In [19]:
# Creating CatBoost param grid
cat_params = {
    'iterations':[1000]
}

# Initializing CatBoost and its GridSearchCV
cat_model = CatBoostRegressor(loss_function='RMSE', random_state=17, verbose=500)
cat_grid = GridSearchCV(cat_model, param_grid=cat_params, scoring='neg_root_mean_squared_error', cv=cross_validator)

# Training model
start_time = time.time()
cat_grid.fit(ordinal_features_train, ordinal_target_train)
cat_time = time.time() - start_time

print(cat_grid.best_estimator_, '\n')
print(f'Best CatBoost RMSE: {-cat_grid.best_score_:.3f}', '\n')
print(f'Time to train: {cat_time:.2f} secs')

Learning rate set to 0.093828
0:	learn: 4351.7260445	total: 162ms	remaining: 2m 41s
500:	learn: 1678.5813511	total: 6s	remaining: 5.97s
999:	learn: 1592.0506142	total: 11.6s	remaining: 0us
Learning rate set to 0.093828
0:	learn: 4354.0205799	total: 14.8ms	remaining: 14.7s
500:	learn: 1683.3462057	total: 5.58s	remaining: 5.56s
999:	learn: 1597.9354070	total: 11.4s	remaining: 0us
Learning rate set to 0.093828
0:	learn: 4353.3491783	total: 13.6ms	remaining: 13.6s
500:	learn: 1683.1731376	total: 5.48s	remaining: 5.46s
999:	learn: 1598.3281363	total: 11.2s	remaining: 0us
Learning rate set to 0.093828
0:	learn: 4353.3803065	total: 13.7ms	remaining: 13.7s
500:	learn: 1685.6962523	total: 5.4s	remaining: 5.38s
999:	learn: 1598.6548402	total: 11.1s	remaining: 0us
Learning rate set to 0.093828
0:	learn: 4353.9877769	total: 15ms	remaining: 15s
500:	learn: 1682.7707205	total: 5.54s	remaining: 5.52s
999:	learn: 1599.5017189	total: 10.9s	remaining: 0us
Learning rate set to 0.097195
0:	learn: 4342.766

|        |**DecisionTree**|**RandomForest**|**LGBMRegressor**|**CatBoost**|
|--------|----------------|----------------|-----------------|------------|
|**Time**|38.32 secs      |1794.64 secs    |189.72 secs      |143.49 secs |
|**RMSE**|2011.305        |1733.956        |1711.783         |1692.477    |


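Of the models in the table, `DecisionTree` trained the fastest but had the highest RMSE.

`CatBoost` had the lowest RMSE and trained faster than `RandomForest` and `LGBMRegressor`.

The reported times cover the whole grid search, so they also reflect how many hyperparameter combinations each grid contains.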
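
To separate the cost of training one model from the cost of the whole search, `GridSearchCV` exposes per-candidate fit times in `cv_results_` and the final refit time in `refit_time_`. The following is a minimal sketch on synthetic data. The small grid and the data are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the training set.
rng = np.random.default_rng(17)
X = rng.normal(size=(300, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=17),
    param_grid={'max_depth': [2, 4, 6]},
    scoring='neg_root_mean_squared_error',
    cv=KFold(n_splits=5, shuffle=True, random_state=17),
)
grid.fit(X, y)

# Mean time of a single fit for the winning candidate, averaged over folds.
best_fit_time = grid.cv_results_['mean_fit_time'][grid.best_index_]
# Time spent refitting the best model on the whole training set.
refit_time = grid.refit_time_

print(f'Per-fold fit: {best_fit_time:.4f} s, final refit: {refit_time:.4f} s')
```

Comparing `refit_time_` across models would give a training-time figure that does not depend on grid size.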

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Excellent!
</div>

## Model analysis

With every model trained, the next step is to compare how quickly and how accurately each one predicts the test dataset.

In [20]:
%%time
line_test = line_grid.predict(ordinal_features_test)
line_test_score = mean_squared_error(ordinal_target_test, line_test) ** 0.5
f'RMSE: {line_test_score}'

CPU times: total: 0 ns
Wall time: 15 ms


'RMSE: 3374.8298767341275'

In [21]:
%%time
treetest = tree_grid.predict(ordinal_features_test)
treetest_score = mean_squared_error(ordinal_target_test, treetest) ** 0.5
f'RMSE: {treetest_score}'

CPU times: total: 46.9 ms
Wall time: 17 ms


'RMSE: 1975.8399503039295'

In [22]:
%%time
foresttest = forest_grid.predict(ordinal_features_test)
foresttest_score = mean_squared_error(ordinal_target_test, foresttest) ** 0.5
f'RMSE: {foresttest_score}'

CPU times: total: 891 ms
Wall time: 889 ms


'RMSE: 1695.17817689717'

In [23]:
%%time
light_model_test = light_grid.predict(ordinal_features_test)
light_model_test_score = mean_squared_error(ordinal_target_test, light_model_test) ** 0.5
f'RMSE: {light_model_test_score}'

CPU times: total: 578 ms
Wall time: 93 ms


'RMSE: 1694.1160604834008'

In [24]:
%%time
cat_test = cat_grid.predict(ordinal_features_test)
cat_test_score = mean_squared_error(ordinal_target_test, cat_test) ** 0.5
f'RMSE: {cat_test_score}'

CPU times: total: 594 ms
Wall time: 473 ms


'RMSE: 1664.1257529167658'

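`CatBoost` achieved the lowest test RMSE (1664.13) but was not the fastest at prediction: its wall time of 473 ms was longer than that of `LinearRegression` (15 ms), `DecisionTree` (17 ms) and `LGBMRegressor` (93 ms). Only `RandomForest` (889 ms) was slower. `LGBMRegressor` offers a middle ground, with a test RMSE of 1694.12 at roughly a fifth of `CatBoost`'s prediction time.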
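
Single `%%time` readings in the millisecond range can be noisy, so repeating the call and taking the median may give a steadier comparison. The sketch below assumes a hypothetical helper `time_call` and uses a stand-in function where a model's `predict` method would go.

```python
import statistics
import time

def time_call(func, *args, repeats=5):
    """Return the median wall-clock time in seconds of func(*args) over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in for a model's predict method; any callable works.
def fake_predict(rows):
    return [row * 2 for row in rows]

median_seconds = time_call(fake_predict, list(range(10_000)))
print(f'Median prediction time: {median_seconds * 1000:.3f} ms')
```

With the fitted grids from this notebook, a call such as `time_call(cat_grid.predict, ordinal_features_test)` would follow the same pattern.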

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good conclusion!
</div>

<div class="alert alert-block alert-success">
<b>Overall reviewer's comment</b> <a class="tocSkip"></a>

Thank you for sending your project. You've done a really good job on it!
    
Especially impressed:
    
- high code level

- good project structure
    
- good hyperparameters tuning  
    
Thank you for in-depth analysis and logical conclusions!
    
I'm glad to say that your project has been accepted. Keep up the good work and good luck on the next sprint!
</div>

# Checklist

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed