# Determining car prices

The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. You have historical data at your disposal: technical specifications, trim levels, and car prices. Your task is to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

## Data preparation

In [1]:
# Required libraries and functions
import pandas as pd
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
import warnings
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
import time

In [2]:
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('/datasets/autos.csv')
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [4]:
df.dtypes

DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Kilometer             int64
RegistrationMonth     int64
FuelType             object
Brand                object
NotRepaired          object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

Missing value analysis

In [5]:
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

Missing values in the NotRepaired column most likely mean that the car has not been repaired, so we fill them with 'no'.

In [6]:
df['NotRepaired'] = df['NotRepaired'].fillna('no')

In [7]:
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired              0
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [8]:
df.loc[df['Model'] == 'golf']

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,no,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
9,2016-03-17 10:53:50,999,small,1998,manual,101,golf,150000,0,,volkswagen,no,2016-03-17 00:00:00,0,27472,2016-03-31 17:17:06
32,2016-03-15 20:59:01,245,sedan,1994,,0,golf,150000,2,petrol,volkswagen,no,2016-03-15 00:00:00,0,44145,2016-03-17 18:17:43
35,2016-03-08 07:54:46,350,,2016,manual,75,golf,150000,4,petrol,volkswagen,no,2016-03-08 00:00:00,0,19386,2016-03-08 09:44:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354315,2016-03-29 13:53:42,1300,convertible,1998,manual,90,golf,150000,4,petrol,volkswagen,no,2016-03-29 00:00:00,0,38518,2016-03-31 07:15:48
354320,2016-03-19 19:39:53,1500,sedan,1999,manual,75,golf,150000,4,petrol,volkswagen,no,2016-03-19 00:00:00,0,37339,2016-04-07 07:15:21
354348,2016-03-20 18:47:59,5900,sedan,2006,manual,105,golf,150000,9,gasoline,volkswagen,no,2016-03-20 00:00:00,0,1217,2016-04-07 02:44:27
354359,2016-03-28 13:48:07,7900,sedan,2010,manual,140,golf,150000,7,gasoline,volkswagen,no,2016-03-28 00:00:00,0,75223,2016-04-02 18:16:20


In [9]:
df.loc[df['Model'] == 'golf', 'VehicleType'] = df.loc[df['Model'] == 'golf', 'VehicleType'].fillna('small')

In [10]:
df['VehicleType'] = df['VehicleType'].fillna('unknown')
df['Gearbox'] = df['Gearbox'].fillna('unknown')
df['Model'] = df['Model'].fillna('unknown')
df['FuelType'] = df['FuelType'].fillna('unknown')

In [11]:
df.isna().sum()

DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Kilometer            0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

Removing redundant features

In [12]:
df = df.drop(['PostalCode', 'NumberOfPictures'], axis=1)

Examining outliers and invalid values

In [13]:
print('Minimum price:', df['Price'].min())
print('Minimum power:', df['Power'].min())

Minimum price: 0
Minimum power: 0


Power and price cannot be zero, so these values must be removed. In addition, we drop the 5% cheapest and the 5% most expensive cars, as well as the 5% most powerful and the 5% least powerful cars.

In [14]:
df = df.loc[(df['Price'].quantile(0.05) < df['Price']) & (df['Price'] < df['Price'].quantile(0.95)) & 
            (df['Power'].quantile(0.05) < df['Power']) & (df['Power'] < df['Power'].quantile(0.95))]
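
The quantile filter above can be illustrated on a small made-up frame. This is only a sketch: the `toy` data and the names `low`, `high`, and `trimmed` are assumptions for illustration and are not taken from autos.csv.

```python
import pandas as pd

# Hypothetical toy prices, not taken from the real dataset
toy = pd.DataFrame({'Price': [0, 100, 500, 1500, 3000, 4500, 6000, 9000, 15000, 20000]})

low = toy['Price'].quantile(0.05)
high = toy['Price'].quantile(0.95)

# Strict inequalities drop every value at or beyond either quantile
trimmed = toy.loc[(toy['Price'] > low) & (toy['Price'] < high)]
removed_share = 1 - len(trimmed) / len(toy)
print(low, high, len(trimmed), removed_share)
```

Note that if more than 5% of a column equals zero, the 5% quantile is itself zero and the strict comparison removes every zero row, which can be more than 5% of the data.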

Examining pairwise correlation

In [15]:
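# Restrict to numeric columns explicitly; recent pandas versions do not drop text columns silently
df.select_dtypes('number').corr()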

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth
Price,1.0,0.100081,0.405525,-0.373562,0.06875
RegistrationYear,0.100081,1.0,0.017324,-0.072969,-0.001344
Power,0.405525,0.017324,1.0,0.164498,0.031904
Kilometer,-0.373562,-0.072969,0.164498,1.0,-0.014824
RegistrationMonth,0.06875,-0.001344,0.031904,-0.014824,1.0


The only noticeable correlations involve the target: Price correlates moderately with Power (0.41) and with Kilometer (-0.37). The remaining features correlate only weakly with one another.

In [16]:
data = df.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1)

In [17]:
data = pd.get_dummies(data, drop_first=True)
data

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_yes
2,9800,2004,163,125000,8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1500,2001,75,150000,6,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,3600,2008,69,90000,7,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
5,650,1995,102,150000,10,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
6,2200,2004,109,150000,8,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354360,3999,2005,3,150000,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
354361,5250,2016,150,150000,12,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
354366,1199,2000,101,125000,3,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
354367,9200,1996,102,150000,3,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
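
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame. The `toy` data is an assumption for illustration; only `pd.get_dummies` and its `drop_first` argument come from the notebook.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({'Gearbox': ['manual', 'auto', 'unknown', 'manual'],
                    'Power': [75, 163, 69, 102]})

encoded = pd.get_dummies(toy, drop_first=True)

# Categories are ordered alphabetically (auto, manual, unknown);
# the first one, 'auto', is dropped and becomes the implicit baseline
columns = list(encoded.columns)
print(columns)
```

A row with zeros in both `Gearbox_manual` and `Gearbox_unknown` therefore stands for an automatic gearbox.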


Splitting into training and test sets

In [18]:
df_features = df.drop('Price', axis=1)
df_target = df['Price']

data_features = data.drop('Price', axis=1)
data_target = data['Price']

In [19]:
df_features_train, df_features_test, df_target_train, df_target_test = train_test_split(df_features, df_target, test_size=0.25,
                                                                                       random_state=12345)

data_features_train, data_features_test, data_target_train, data_target_test = train_test_split(
data_features, data_target, test_size=0.25, random_state=12345)

## Model training

In [20]:
def rmse_func(y_true, y_pred):
    value = mean_squared_error(y_true, y_pred)
    return sqrt(value)

In [21]:
rmse = make_scorer(rmse_func, greater_is_better=False)
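
Because the scorer is created with `greater_is_better=False`, scikit-learn negates its values so that higher is always better, which is why the cells below wrap the scores in `abs()`. A minimal sketch on synthetic data follows; the arrays `X` and `y` and the noise level are assumptions for illustration.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score


def rmse_func(y_true, y_pred):
    # Root of the mean squared error, same definition as in the notebook
    return sqrt(mean_squared_error(y_true, y_pred))


rmse = make_scorer(rmse_func, greater_is_better=False)

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_val_score(LinearRegression(), X, y, cv=4, scoring=rmse)
mean_rmse = abs(scores.mean())  # flip the sign back to a regular RMSE
print(scores, mean_rmse)
```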

In [22]:
cat_features = list(df.dtypes[df.dtypes == 'object'].index)
cat_features

['DateCrawled',
 'VehicleType',
 'Gearbox',
 'Model',
 'FuelType',
 'Brand',
 'NotRepaired',
 'DateCreated',
 'LastSeen']
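
The list above also contains the three timestamp columns (DateCrawled, DateCreated, LastSeen), which the one-hot encoded `data` frame drops. If one wanted CatBoost to see the same feature set, a possible sketch is shown here. The toy frame and the names `date_columns` and `cat_features_no_dates` are hypothetical, and this is not what the notebook itself does.

```python
import pandas as pd

# Hypothetical one-row frame that mimics a few columns of the dataset
toy = pd.DataFrame({
    'DateCrawled': ['2016-03-24 11:52:17'],
    'Price': [480],
    'Gearbox': ['manual'],
    'Brand': ['volkswagen'],
    'LastSeen': ['2016-04-07 03:16:57'],
})

date_columns = ['DateCrawled', 'DateCreated', 'LastSeen']

# Keep non-numeric columns, but skip the timestamp strings
cat_features_no_dates = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) and col not in date_columns
]
print(cat_features_no_dates)
```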

Metric scores

Test for CatBoostRegressor

In [23]:
cat = CatBoostRegressor(random_state=12345)
score = round(abs(
    cross_val_score(cat, df_features, df_target, cv=4, scoring=rmse, fit_params={'cat_features':cat_features}).mean()), 2)
print('Mean cross-validated RMSE for CatBoostRegressor:', score)

Learning rate set to 0.094784
0:	learn: 3209.7144284	total: 613ms	remaining: 10m 12s
1:	learn: 3030.4273415	total: 1.22s	remaining: 10m 10s
2:	learn: 2870.6717325	total: 1.74s	remaining: 9m 38s
3:	learn: 2721.1614218	total: 2.27s	remaining: 9m 25s
4:	learn: 2595.6196492	total: 2.76s	remaining: 9m 10s
5:	learn: 2483.2875398	total: 3.17s	remaining: 8m 46s
6:	learn: 2379.5163614	total: 3.72s	remaining: 8m 48s
7:	learn: 2291.9940202	total: 4.12s	remaining: 8m 31s
8:	learn: 2212.0809647	total: 4.65s	remaining: 8m 32s
9:	learn: 2135.9008609	total: 5.22s	remaining: 8m 36s
10:	learn: 2071.9286633	total: 5.72s	remaining: 8m 34s
11:	learn: 2015.2465339	total: 6.13s	remaining: 8m 24s
12:	learn: 1962.1595042	total: 6.63s	remaining: 8m 23s
13:	learn: 1914.4207505	total: 7.18s	remaining: 8m 26s
14:	learn: 1873.4041006	total: 7.49s	remaining: 8m 11s
15:	learn: 1836.0660126	total: 7.94s	remaining: 8m 8s
16:	learn: 1802.4234075	total: 8.4s	remaining: 8m 5s
17:	learn: 1772.0500604	total: 8.79s	remaining

Test for LGBMRegressor

In [24]:
parametres = {'n_estimators':[100, 200]}
lgbm = LGBMRegressor(random_state=12345)
grid = GridSearchCV(lgbm, parametres, scoring=rmse, cv=4)
grid.fit(data_features, data_target)

GridSearchCV(cv=4, estimator=LGBMRegressor(random_state=12345),
             param_grid={'n_estimators': [100, 200]},
             scoring=make_scorer(rmse_func, greater_is_better=False))

In [25]:
print('Best RMSE for LGBMRegressor:', round(abs(grid.best_score_), 2))

Best RMSE for LGBMRegressor: 1312.12


Best hyperparameters

In [26]:
grid.best_params_

{'n_estimators': 200}
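
Since training and prediction time matter to the customer, it may help to know that a fitted `GridSearchCV` also records timings per parameter set in `cv_results_`. The sketch below uses scikit-learn's `GradientBoostingRegressor` on synthetic data as a stand-in for `LGBMRegressor`; the data, the `grid_demo` and `summary` names, and the small parameter grid are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=300)

grid_demo = GridSearchCV(GradientBoostingRegressor(random_state=12345),
                         {'n_estimators': [20, 40]},
                         scoring='neg_root_mean_squared_error', cv=4)
grid_demo.fit(X, y)

# One row per parameter set: score plus average fit and scoring time in seconds
summary = pd.DataFrame(grid_demo.cv_results_)[
    ['param_n_estimators', 'mean_test_score', 'mean_fit_time', 'mean_score_time']]
print(summary)
```

Comparing `mean_fit_time` across rows shows what a larger `n_estimators` costs in training time alongside its gain in score.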

Test for linear regression

In [27]:
log = LinearRegression()
log_score = round(abs(cross_val_score(log, data_features, data_target, scoring=rmse, cv=4).mean()), 2)
print('RMSE for linear regression:', log_score)

RMSE for linear regression: 2261.65


## Model analysis

Timing test for CatBoostRegressor

In [28]:
cat = CatBoostRegressor(random_state=12345)

Training time

In [29]:
%%time
cat_fit_start = time.time()
cat.fit(df_features_train, df_target_train, cat_features=cat_features)
cat_fit_finish = time.time()
cat_fit_time = round(cat_fit_finish - cat_fit_start,2)
print('CatBoostRegressor training time (s):', cat_fit_time)

Learning rate set to 0.094784
0:	learn: 3207.3557778	total: 451ms	remaining: 7m 30s
1:	learn: 3029.3345502	total: 960ms	remaining: 7m 59s
2:	learn: 2870.0210983	total: 1.38s	remaining: 7m 38s
3:	learn: 2728.1632670	total: 1.76s	remaining: 7m 19s
4:	learn: 2598.9487655	total: 2.12s	remaining: 7m 1s
5:	learn: 2481.2819422	total: 2.56s	remaining: 7m 3s
6:	learn: 2378.9195164	total: 3.02s	remaining: 7m 9s
7:	learn: 2289.0150153	total: 3.4s	remaining: 7m 1s
8:	learn: 2210.5300047	total: 3.91s	remaining: 7m 10s
9:	learn: 2137.2358663	total: 4.31s	remaining: 7m 6s
10:	learn: 2070.9777787	total: 4.71s	remaining: 7m 3s
11:	learn: 2014.0394142	total: 5.18s	remaining: 7m 6s
12:	learn: 1962.9239328	total: 5.63s	remaining: 7m 7s
13:	learn: 1916.5862916	total: 6.22s	remaining: 7m 18s
14:	learn: 1870.8501803	total: 6.61s	remaining: 7m 14s
15:	learn: 1833.4100581	total: 6.96s	remaining: 7m 8s
16:	learn: 1797.4112693	total: 7.33s	remaining: 7m 3s
17:	learn: 1768.7606208	total: 7.69s	remaining: 6m 59s
1

Prediction time

In [30]:
%%time
cat_start = time.time()
cat.predict(df_features_test)
cat_finish = time.time()
cat_predict_time = round(cat_finish - cat_start, 2)
print('CatBoostRegressor prediction time (s):', cat_predict_time)

CatBoostRegressor prediction time (s): 2.32
CPU times: user 2.3 s, sys: 7.89 ms, total: 2.31 s
Wall time: 2.32 s


Timing test for LGBMRegressor

In [31]:
lgbm = LGBMRegressor(random_state=12345, n_estimators=200)

Training time

In [32]:
%%time
lgbm_fit_start = time.time()
lgbm.fit(data_features_train, data_target_train)
lgbm_fit_finish = time.time()
lgbm_fit_time = round(lgbm_fit_finish - lgbm_fit_start, 2)
print('LGBMRegressor training time (s):', lgbm_fit_time)

LGBMRegressor training time (s): 157.88
CPU times: user 2min 35s, sys: 1.34 s, total: 2min 36s
Wall time: 2min 37s


Prediction time

In [33]:
%%time
lgbm_predict_start = time.time()
lgbm.predict(data_features_test)
lgbm_predict_finish = time.time()
lgbm_predict_time = round(lgbm_predict_finish - lgbm_predict_start, 2)
print('LGBMRegressor prediction time (s):', lgbm_predict_time)

LGBMRegressor prediction time (s): 1.3
CPU times: user 1.2 s, sys: 68.3 ms, total: 1.27 s
Wall time: 1.3 s


Timing test for linear regression

In [34]:
reg = LinearRegression()

Training time

In [35]:
%%time
reg_fit_start = time.time()
reg.fit(data_features_train, data_target_train)
reg_fit_finish = time.time()
reg_fit_time = round(reg_fit_finish - reg_fit_start, 2)
print('Linear regression training time (s):', reg_fit_time)

Linear regression training time (s): 24.66
CPU times: user 18.5 s, sys: 6.11 s, total: 24.6 s
Wall time: 24.7 s


Prediction time

In [36]:
%%time
reg_predict_start = time.time()
reg.predict(data_features_test)
reg_predict_finish = time.time()
reg_predict_time = round(reg_predict_finish - reg_predict_start, 2)
print('Linear regression prediction time (s):', reg_predict_time)

Linear regression prediction time (s): 0.21
CPU times: user 132 ms, sys: 129 ms, total: 261 ms
Wall time: 213 ms


Summary table with the metric scores and the approximate training and prediction times

In [37]:
space = {'model':['CatBoostRegressor', 'LGBMRegressor', 'LinearRegression'],
         'rmse':[1268.86, 1312.12, 2261.65],
         'fit_time_seconds':[cat_fit_time, lgbm_fit_time, reg_fit_time],
         'predict_time_seconds':[cat_predict_time, lgbm_predict_time, reg_predict_time]
        }
res_data = pd.DataFrame(data=space)
res_data

Unnamed: 0,model,rmse,fit_time_seconds,predict_time_seconds
0,CatBoostRegressor,1729.6,453.18,2.32
1,LGBMRegressor,1792.95,157.88,1.3
2,LinearRegression,3203.78,24.66,0.21


From these results, linear regression is the fastest to train and to predict but clearly loses to the other models on RMSE. CatBoostRegressor and LGBMRegressor both achieve good scores: CatBoost has the lower RMSE, while LGBMRegressor trains and predicts faster. The final model should therefore be chosen according to whether prediction quality or speed has the higher priority.
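
The per-model timing cells above repeat the same pattern, so one could wrap it in a small helper. The sketch below is an assumption-laden illustration: `evaluate` is a hypothetical function, the data is synthetic, scikit-learn models stand in for CatBoost and LightGBM, and RMSE is computed on a hold-out split rather than by cross-validation as in the notebook.

```python
import time
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit and predict once, returning RMSE and wall-clock times in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    pred = model.predict(X_test)
    predict_time = time.perf_counter() - start

    return {'model': type(model).__name__,
            'rmse': sqrt(mean_squared_error(y_test, pred)),
            'fit_time_seconds': fit_time,
            'predict_time_seconds': predict_time}


# Synthetic data (illustrative only)
rng = np.random.default_rng(12345)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

models = [LinearRegression(), DecisionTreeRegressor(random_state=12345)]
res = pd.DataFrame([evaluate(m, X_train, y_train, X_test, y_test) for m in models])
print(res)
```

Building the table from measured values in this way avoids hard-coding scores that can drift from what the cells actually produced.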