# Determining car prices

### Project description

The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. We have historical data: technical specifications, trim levels, and car prices. The task is to build a model that determines the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.
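Two of the three criteria above are wall-clock measurements, so it helps to measure them the same way for every model. The following is a minimal sketch, not part of the original notebook: the helper name `timed` is hypothetical, and `sum` stands in for a real `model.fit` or `model.predict` call.

```python
import time

def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; with a real model this would be
# timed(model.fit, X, y) and timed(model.predict, X).
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.4f}s")
```

Measuring training and prediction separately keeps the three criteria comparable across models.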

## Data preparation

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.dummy import DummyRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

In [2]:
STATE = np.random.RandomState(12345)

In [3]:
data = pd.read_csv('/datasets/autos.csv')

In [4]:
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [5]:
# Check whether the data contains duplicates
data.duplicated().sum()

4

Only 4 duplicates out of 354,369 rows; that few should not affect the model, so we leave them for now (they are dropped later in `clean_dataset`).
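For reference, counting and dropping full duplicates can be sketched on a toy frame as below. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["audi", "audi", "bmw"],
    "Price": [1000, 1000, 2000],
})

# duplicated() flags rows that are identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```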

In [6]:
# Look at the correlation between features
corr = data.corr()
corr[(corr > 0.8) & (corr < 0.9999)].unstack().dropna().sort_values(ascending=False)

Series([], dtype: float64)

No strong correlation (above 0.8) was found between the numeric features.
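The check above only looks at positive correlations, and on pandas 2.0 or newer `DataFrame.corr()` raises an error on text columns unless `numeric_only=True` is passed. The sketch below is one possible variant under those assumptions (the helper name `high_corr_pairs` and the toy data are hypothetical; `numeric_only` requires pandas 1.5 or newer). It uses absolute values and reports each pair once.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],    # almost linear in "a"
    "c": [5, 1, 4, 2, 3],     # weakly related to both
    "label": list("xyzxy"),   # non-numeric, ignored by numeric_only=True
})
pairs = high_corr_pairs(toy)
print(pairs)
```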

In [7]:
# Look at the missing values in the dataset
data.isna().sum().sort_values(ascending=False)

Repaired             71154
VehicleType          37490
FuelType             32895
Gearbox              19833
Model                19705
DateCrawled              0
Price                    0
RegistrationYear         0
Power                    0
Kilometer                0
RegistrationMonth        0
Brand                    0
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [8]:
data['PostalCode'].value_counts()

10115    819
65428    613
66333    343
32257    317
44145    317
        ... 
21782      1
9517       1
29367      1
38325      1
82404      1
Name: PostalCode, Length: 8143, dtype: int64

In [10]:
print(sorted(data['RegistrationYear'].unique()))

[1000, 1001, 1039, 1111, 1200, 1234, 1253, 1255, 1300, 1400, 1500, 1600, 1602, 1688, 1800, 1910, 1915, 1919, 1920, 1923, 1925, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2066, 2200, 2222, 2290, 2500, 2800, 2900, 3000, 3200, 3500, 3700, 3800, 4000, 4100, 4500, 4800, 5000, 5300, 5555, 5600, 5900, 5911, 6000, 6500, 7000, 7100, 7500, 7800, 8000, 8200, 8455, 8500, 8888, 9000, 9229, 9450, 9996, 9999]


In [11]:
data['RegistrationMonth'].unique()

array([ 0,  5,  8,  6,  7, 10, 12, 11,  2,  3,  1,  4,  9])

Missing values in the Repaired column can be replaced with 'no': if this information is omitted, the car most likely has not been repaired.

In [14]:
data['Repaired'] = data['Repaired'].fillna('no')

In [15]:
data['Repaired'].isna().sum()

0
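Filling with 'no' is an assumption about why the field was left empty. As a hedged alternative, not what this notebook does, the gaps could be kept distinguishable as a separate category. The toy frame and the column name `Repaired_alt` below are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Repaired": ["yes", np.nan, "no", np.nan]})

# Share of each value, counting NaN as its own bucket
print(toy["Repaired"].value_counts(dropna=False, normalize=True))

# Alternative imputation: keep "not stated" distinguishable from "no"
toy["Repaired_alt"] = toy["Repaired"].fillna("unknown")
print(toy)
```

Which variant works better is an empirical question that can be settled by comparing validation scores.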

In [25]:
import warnings
warnings.filterwarnings("ignore")

In [26]:
data['LastSeen'] = pd.to_datetime(data['LastSeen'])

In [27]:
MIN_ADEQUATE_POWER = 10.0
MAX_ADEQUATE_POWER = 1000.0

MIN_ADEQUATE_YEAR = 1950

LOW_ADEQUATE_PRICE = 10

def clean_dataset(dataset):
    """Clean the dataset of the problems found during data analysis and split it into train and test."""

    print('1 Removing full duplicates...')
    dataset = dataset.drop(dataset[dataset.duplicated()].index)

    print(f'2 Removing listings with an implausible price (below {LOW_ADEQUATE_PRICE})')
    dataset = dataset.drop(dataset[dataset.Price < LOW_ADEQUATE_PRICE].index)

    max_adequate_year = dataset['LastSeen'].dt.year.max()
    print(f'3 Removing cars registered before {MIN_ADEQUATE_YEAR} or after {max_adequate_year}...')
    dataset = dataset[dataset['RegistrationYear'].between(
        MIN_ADEQUATE_YEAR,
        max_adequate_year
    )].copy()

    print('4 Cleaning the registration month...')
    # Month 0 is clipped to 1
    dataset['RegistrationMonth'] = dataset['RegistrationMonth'].clip(lower=1, upper=12)

    print('5 Dropping unneeded columns')
    dataset = dataset.drop(columns=[
        'DateCrawled', 'DateCreated', 'PostalCode', 'LastSeen', 'NumberOfPictures'
    ])

    print('6 Splitting into train and test sets')
    train, test = train_test_split(
        dataset, test_size=0.2, random_state=STATE
    )

    print('7 Removing cars with implausible power from the training set...')
    train = train.drop(train[~train['Power'].between(
        MIN_ADEQUATE_POWER, MAX_ADEQUATE_POWER
    )].index)

    print('8 Marking cars with implausible power in the test set...')
    # Out-of-range values become NaN and are imputed later
    test['Power'] = test['Power'].where(test['Power'] >= MIN_ADEQUATE_POWER)
    test['Power'] = test['Power'].where(test['Power'] <= MAX_ADEQUATE_POWER)

    return train, test

In [28]:
train, test = clean_dataset(data)

1 Removing full duplicates...
2 Removing listings with an implausible price (below 10)
3 Removing cars registered before 1950 or after 2016...
4 Cleaning the registration month...
5 Dropping unneeded columns
6 Splitting into train and test sets
7 Removing cars with implausible power from the training set...
8 Marking cars with implausible power in the test set...
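The cleaning steps treat out-of-range power differently in the two samples: rows are removed from the training set but only marked as missing in the test set. The sketch below shows the three pandas operations involved on a made-up frame. `MIN_POWER` and `MAX_POWER` are local stand-ins for the bounds defined in the cell above.

```python
import pandas as pd

MIN_POWER, MAX_POWER = 10.0, 1000.0  # stand-ins for the notebook's bounds

toy = pd.DataFrame({
    "Power": [0, 75, 163, 12000],
    "RegistrationMonth": [0, 5, 12, 7],
})

in_range = toy["Power"].between(MIN_POWER, MAX_POWER)  # inclusive on both ends

dropped = toy[in_range]                # training-style handling: remove the rows
marked = toy["Power"].where(in_range)  # test-style handling: keep rows, set NaN

clipped_month = toy["RegistrationMonth"].clip(lower=1, upper=12)  # 0 becomes 1
print(dropped, marked, clipped_month, sep="\n\n")
```

Keeping every test row means the model is scored on all listings it would see in production, including those with a bad power value.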


In [29]:
train.isna().sum()

Price                    0
VehicleType           8764
RegistrationYear         0
Gearbox               4218
Power                    0
Model                 8730
Kilometer                0
RegistrationMonth        0
FuelType             12317
Brand                    0
Repaired                 0
dtype: int64

In [30]:
test.isna().sum()

Price                   0
VehicleType          3905
RegistrationYear        0
Gearbox              3099
Power                6521
Model                3145
Kilometer               0
RegistrationMonth       0
FuelType             4774
Brand                   0
Repaired                0
dtype: int64

In [31]:
def fill_na_values(dataset, train):
    """Fill missing values in `dataset` using statistics computed on `train`."""
    
    def get_fill_value(data, key):
        key = tuple(key) if key.shape[0] > 1 else key[0]
        return data.loc[key] if data.index.isin([key]).any() else np.nan

    def apply_agg_func(x):
        # idxmax raises ValueError for a group with no non-null values
        try:
            return agg_func(x)
        except ValueError:
            return np.nan
    
    group_list = ['Brand', 'RegistrationYear']
    
    def fill_na(fill_column, empty_idx, group_list):
        # Modifies `dataset` in place
        dataset.loc[empty_idx, fill_column] = dataset.loc[empty_idx][group_list].apply(
            lambda x: get_fill_value(train_fill_values, x), axis=1, raw=False
        )
    
    # 'Power' is filled with the group median;
    # 'Model', 'VehicleType', 'FuelType', 'Gearbox' are filled with the group mode
    for fill_column in ['Power', 'Model', 'VehicleType', 'FuelType', 'Gearbox']:
        agg_func = lambda x: x.value_counts().idxmax()
        if fill_column == 'Power':
            train_fill_values = train.groupby(group_list)[fill_column].agg('median')
        else:
            train_fill_values = train.groupby(group_list)[fill_column].agg(apply_agg_func)
        empty_idx = dataset[dataset[fill_column].isna()].index
        fill_na(fill_column, empty_idx, group_list)
        
        # if NaN values remain after the group-level fill
        if dataset.loc[empty_idx, fill_column].isna().sum() > 0:
            if fill_column in ['Model', 'VehicleType']:
                dataset[fill_column] = dataset[fill_column].fillna('unknown')
            # global fallback: the most frequent value in train (also for 'Power')
            dataset[fill_column] = dataset[fill_column].fillna(train[fill_column].agg(agg_func))

    return dataset

In [32]:
train = fill_na_values(train, train)

In [33]:
test = fill_na_values(test, train)
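The idea behind the imputation is to take statistics from the training data only and apply them to both samples, falling back to a global value when a group has none. The sketch below is a simplified version of that idea rather than the function above: it groups by `Brand` alone, and the helper `group_mode` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def group_mode(series):
    """Most frequent non-null value, or NaN if the group has none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else np.nan

train_toy = pd.DataFrame({
    "Brand": ["audi", "audi", "audi", "bmw"],
    "Gearbox": ["manual", "manual", "auto", np.nan],
})
test_toy = pd.DataFrame({
    "Brand": ["audi", "bmw", "audi"],
    "Gearbox": [np.nan, np.nan, "auto"],
})

# Statistics come from the training data only
per_brand = train_toy.groupby("Brand")["Gearbox"].agg(group_mode)
overall = group_mode(train_toy["Gearbox"])

filled = (
    test_toy["Gearbox"]
    .fillna(test_toy["Brand"].map(per_brand))  # first try the brand-level mode
    .fillna(overall)                           # then fall back to the global mode
)
print(filled.tolist())
```

Using `map` on a precomputed lookup avoids a row-wise `apply`, which tends to be slow on a few hundred thousand rows.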

In [34]:
train.isna().sum()

Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Kilometer            0
RegistrationMonth    0
FuelType             0
Brand                0
Repaired             0
dtype: int64

In [35]:
test.isna().sum()

Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Kilometer            0
RegistrationMonth    0
FuelType             0
Brand                0
Repaired             0
dtype: int64

In [34]:
features_train = train.drop(['Price'], axis=1)
target_train = train['Price']
features_test = test.drop(['Price'], axis=1)
target_test = test['Price']

In [35]:
features_train.shape

(236535, 10)

In [36]:
target_train.shape

(236535,)

In [37]:
# Apply one-hot encoding (OHE): fit the encoder on the training set and wrap the transform in a function
features_list = ['Repaired', 'VehicleType', 'FuelType', 'Gearbox', 'Model', 'Brand']
# Note: `sparse=False` was renamed to `sparse_output=False` in scikit-learn 1.2
ohe_train = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe_train.fit(features_train.loc[:,features_list])

def ohe(features, target):
    ohe_features = ohe_train.transform(features.loc[:,features_list])
    ohe_feature_array = pd.DataFrame(ohe_features)
    features = features.drop(features_list, axis=1).reset_index(drop=True)
    new_features = pd.concat([features, ohe_feature_array], axis=1)
    features_train_new, target_train_new = shuffle(new_features, target, random_state=STATE)
    return (features_train_new, target_train_new)

features_train_new, target_train_new = ohe(features_train, target_train)

In [38]:
# Apply OHE to the test set

features_test_new, target_test_new = ohe(features_test, target_test)

In [40]:
#features_train_new.shape
features_train_new.head()

Unnamed: 0,RegistrationYear,Power,Kilometer,RegistrationMonth,0,1,2,3,4,5,...,300,301,302,303,304,305,306,307,308,309
236231,2004,163.0,70000,5,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93281,2006,77.0,100000,2,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
194939,2001,125.0,150000,3,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151310,2013,135.0,60000,1,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
235957,2004,97.0,125000,12,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [41]:
#features_test_new.shape
features_test_new.head()

Unnamed: 0,RegistrationYear,Power,Kilometer,RegistrationMonth,0,1,2,3,4,5,...,300,301,302,303,304,305,306,307,308,309
20564,2006,75.0,150000,5,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3071,1996,88.0,125000,2,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16315,1999,54.0,150000,1,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6114,2008,170.0,150000,6,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
43306,1999,286.0,150000,5,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have explored and prepared the data for further work, and prepared the samples for model training.

## Model training

In [42]:
%%time
# Random forest model
forest = RandomForestRegressor(random_state=STATE)
param_grid={'n_estimators':range(10, 50, 10), 'max_depth': range(5, 15)}
grid_forest = GridSearchCV(forest, param_grid, verbose=10, cv=3, n_jobs=-1,
                          scoring='neg_root_mean_squared_error')
grid_forest.fit(features_train_new, target_train_new)
print(grid_forest.best_score_)

Fitting 3 folds for each of 40 candidates, totalling 120 fits
[CV 1/3; 1/40] START max_depth=5, n_estimators=10...............................
[CV 1/3; 1/40] END .............max_depth=5, n_estimators=10; total time=   9.4s
[CV 2/3; 1/40] START max_depth=5, n_estimators=10...............................
[CV 2/3; 1/40] END .............max_depth=5, n_estimators=10; total time=   9.4s
[CV 3/3; 1/40] START max_depth=5, n_estimators=10...............................
[CV 3/3; 1/40] END .............max_depth=5, n_estimators=10; total time=   9.3s
[CV 1/3; 2/40] START max_depth=5, n_estimators=20...............................
[CV 1/3; 2/40] END .............max_depth=5, n_estimators=20; total time=  18.3s
[CV 2/3; 2/40] START max_depth=5, n_estimators=20...............................
[CV 2/3; 2/40] END .............max_depth=5, n_estimators=20; total time=  18.3s
[CV 3/3; 2/40] START max_depth=5, n_estimators=20...............................
[CV 3/3; 2/40] END .............max_depth=5, n_

In [43]:
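# Note: this RMSE is computed on the same training data the search was fitted on,
# so it is an optimistic estimate, not a validation score
predictions_valid = grid_forest.predict(features_train_new)
mse = mean_squared_error(target_train_new, predictions_valid)

print('rmse=',mse ** 0.5)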

rmse= 1483.039999438121
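The RMSE above comes from predicting on the data the model was trained on. With `scoring='neg_root_mean_squared_error'`, the cross-validated estimate is already available as `-best_score_`. The sketch below contrasts the two numbers on synthetic data. The data, the decision tree and the grid are illustrative assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 3 * X[:, 0] + rng.normal(scale=2.0, size=300)  # synthetic noisy target

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8, None]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

cv_rmse = -search.best_score_  # mean RMSE over the held-out folds
train_rmse = mean_squared_error(y, search.predict(X)) ** 0.5  # on data already seen

print(f"cross-validated RMSE: {cv_rmse:.2f}, training RMSE: {train_rmse:.2f}")
```

The cross-validated figure is the one to compare across models; the training figure mostly shows how closely the model fits data it has already seen.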


In [45]:
%%time
# CatBoost model

catboost = CatBoostRegressor(loss_function='RMSE', silent=True, random_seed=0) #random_state=STATE)
param_grid={'n_estimators':range(1, 50, 10), 'depth':range(1,16), 'learning_rate':[0.1, 0.5, 0.8]}
# No `scoring` is passed, so best_score_ is the estimator's default score (R^2), not RMSE
grid_catboost = GridSearchCV(catboost, param_grid, verbose=10, cv=3)

CPU times: user 132 µs, sys: 2 µs, total: 134 µs
Wall time: 137 µs
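When `GridSearchCV` gets no `scoring` argument, it ranks candidates by the estimator's own `score` method, which for scikit-learn style regressors is R². The sketch below shows the difference on synthetic data, with `Ridge` as a lightweight stand-in for the boosting models. The data and the grid are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

grid = {"alpha": [0.1, 1.0]}

# Default scoring: the estimator's score method (R^2 for regressors)
default_search = GridSearchCV(Ridge(), grid, cv=3).fit(X, y)

# Explicit RMSE scoring: values are negated so that higher is still better
rmse_search = GridSearchCV(
    Ridge(), grid, cv=3, scoring="neg_root_mean_squared_error"
).fit(X, y)

print("default (R^2):", default_search.best_score_)  # at most 1
print("RMSE:", -rmse_search.best_score_)             # in target units
```

Passing the same `scoring` to every search makes the `best_score_` values of different models directly comparable.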


In [46]:
%%time
grid_catboost.fit(features_train_new, target_train_new)
print(grid_catboost.best_score_)

Fitting 3 folds for each of 225 candidates, totalling 675 fits
[CV 1/3; 1/225] START depth=1, learning_rate=0.1, n_estimators=1................
[CV 1/3; 1/225] END depth=1, learning_rate=0.1, n_estimators=1; total time=   1.9s
[CV 2/3; 1/225] START depth=1, learning_rate=0.1, n_estimators=1................
[CV 2/3; 1/225] END depth=1, learning_rate=0.1, n_estimators=1; total time=   1.7s
[CV 3/3; 1/225] START depth=1, learning_rate=0.1, n_estimators=1................
[CV 3/3; 1/225] END depth=1, learning_rate=0.1, n_estimators=1; total time=   1.7s
[CV 1/3; 2/225] START depth=1, learning_rate=0.1, n_estimators=11...............
[CV 1/3; 2/225] END depth=1, learning_rate=0.1, n_estimators=11; total time=   1.8s
[CV 2/3; 2/225] START depth=1, learning_rate=0.1, n_estimators=11...............
[CV 2/3; 2/225] END depth=1, learning_rate=0.1, n_estimators=11; total time=   1.8s
[CV 3/3; 2/225] START depth=1, learning_rate=0.1, n_estimators=11...............
[CV 3/3; 2/225] END depth=1, learn

In [47]:
%%time

# Note: as with the random forest, this RMSE is measured on the training data
predictions_valid = grid_catboost.predict(features_train_new)
mse = mean_squared_error(target_train_new, predictions_valid)

print('rmse=',mse ** 0.5)

rmse= 1454.538425758973
CPU times: user 141 ms, sys: 79 µs, total: 141 ms
Wall time: 140 ms


In [48]:
%%time
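# LightGBM model
lgbm = LGBMRegressor(boosting_type='gbdt', random_state=STATE, n_estimators=20)
# Note: range(1, 30, 10) includes num_leaves=1, which LightGBM rejects (num_leaves must be > 1)
param_grid={'max_depth':range(1, 20, 5), 'num_leaves':range(1, 30, 10), 'learning_rate':[0.01, 0.1, 0.5]}
# With `scoring` commented out, best_score_ is the default R^2, not RMSE
grid_lgbm = GridSearchCV(lgbm, param_grid, verbose=10, cv=3, n_jobs=-1)
                          #scoring='neg_root_mean_squared_error')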

CPU times: user 94 µs, sys: 0 ns, total: 94 µs
Wall time: 100 µs
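The grid above contains combinations that cannot work as intended: LightGBM requires `num_leaves > 1`, and a binary tree of depth `d` has at most `2**d` leaves, so larger `num_leaves` values add nothing for that depth. The sketch below filters the same ranges down to meaningful pairs in plain Python. The helper `is_valid` and the list-of-dicts grid are illustrative assumptions.

```python
from itertools import product

max_depths = range(1, 20, 5)           # 1, 6, 11, 16
num_leaves_options = range(1, 30, 10)  # 1, 11, 21

def is_valid(max_depth, num_leaves):
    """num_leaves must exceed 1 and cannot exceed 2**max_depth in a binary tree."""
    return 1 < num_leaves <= 2 ** max_depth

valid = [
    (d, n)
    for d, n in product(max_depths, num_leaves_options)
    if is_valid(d, n)
]

# GridSearchCV accepts a list of dicts, each describing its own sub-grid
param_grid = [
    {"max_depth": [d], "num_leaves": [n], "learning_rate": [0.01, 0.1, 0.5]}
    for d, n in valid
]
print(valid)
```

Filtering the grid up front avoids spending fits on candidates that fail or duplicate one another.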


In [49]:
%%time
grid_lgbm.fit(features_train_new, target_train_new)
print(grid_lgbm.best_score_)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV 1/3; 1/36] START learning_rate=0.01, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 1/36] END learning_rate=0.01, max_depth=1, num_leaves=1; total time=   0.7s
[CV 2/3; 1/36] START learning_rate=0.01, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 1/36] END learning_rate=0.01, max_depth=1, num_leaves=1; total time=   0.6s
[CV 3/3; 1/36] START learning_rate=0.01, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 1/36] END learning_rate=0.01, max_depth=1, num_leaves=1; total time=   0.6s
[CV 1/3; 2/36] START learning_rate=0.01, max_depth=1, num_leaves=11.............
[CV 1/3; 2/36] END learning_rate=0.01, max_depth=1, num_leaves=11; total time=  18.9s
[CV 2/3; 2/36] START learning_rate=0.01, max_depth=1, num_leaves=11.............
[CV 2/3; 2/36] END learning_rate=0.01, max_depth=1, num_leaves=11; total time=  18.4s
[CV 3/3; 2/36] START learning_rate=0.01, max_depth=1, num_leaves=11.............
[CV 3/3; 2/36] END learning_rate=0.01, max_depth=1, num_leaves=11; total time=  18.8s
[CV 1/3; 3/36] START learning_rate=0.01, max_depth=1, num_leaves=21.............
[CV 1/3; 3/36] END learning_rate=0.01, max_depth=1, num_leaves=21; total time=  18.8s
[CV 2/3; 3/36] START learning_rate=0.01, max_depth=1, num_leaves=21.............
[CV 2/3; 3/36] END learning_rate=0.01, max_depth=1, num_leaves=21; total time=  19.1s
[CV 3/3; 3/36] START learning_rate=0.01, max_depth=1, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 4/36] END learning_rate=0.01, max_depth=6, num_leaves=1; total time=   0.8s
[CV 2/3; 4/36] START learning_rate=0.01, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 4/36] END learning_rate=0.01, max_depth=6, num_leaves=1; total time=   0.6s
[CV 3/3; 4/36] START learning_rate=0.01, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 4/36] END learning_rate=0.01, max_depth=6, num_leaves=1; total time=   0.7s
[CV 1/3; 5/36] START learning_rate=0.01, max_depth=6, num_leaves=11.............
[CV 1/3; 5/36] END learning_rate=0.01, max_depth=6, num_leaves=11; total time= 1.3min
[CV 2/3; 5/36] START learning_rate=0.01, max_depth=6, num_leaves=11.............
[CV 2/3; 5/36] END learning_rate=0.01, max_depth=6, num_leaves=11; total time= 1.3min
[CV 3/3; 5/36] START learning_rate=0.01, max_depth=6, num_leaves=11.............
[CV 3/3; 5/36] END learning_rate=0.01, max_depth=6, num_leaves=11; total time= 1.3min
[CV 1/3; 6/36] START learning_rate=0.01, max_depth=6, num_leaves=21.............
[CV 1/3; 6/36] END learning_rate=0.01, max_depth=6, num_leaves=21; total time= 2.6min
[CV 2/3; 6/36] START learning_rate=0.01, max_depth=6, num_leaves=21.............
[CV 2/3; 6/36] END learning_rate=0.01, max_depth=6, num_leaves=21; total time= 2.6min
[CV 3/3; 6/36] START learning_rate=0.01, max_depth=6, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 7/36] END learning_rate=0.01, max_depth=11, num_leaves=1; total time=   0.6s
[CV 2/3; 7/36] START learning_rate=0.01, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 7/36] END learning_rate=0.01, max_depth=11, num_leaves=1; total time=   0.6s
[CV 3/3; 7/36] START learning_rate=0.01, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 7/36] END learning_rate=0.01, max_depth=11, num_leaves=1; total time=   0.6s
[CV 1/3; 8/36] START learning_rate=0.01, max_depth=11, num_leaves=11............
[CV 1/3; 8/36] END learning_rate=0.01, max_depth=11, num_leaves=11; total time= 1.4min
[CV 2/3; 8/36] START learning_rate=0.01, max_depth=11, num_leaves=11............
[CV 2/3; 8/36] END learning_rate=0.01, max_depth=11, num_leaves=11; total time= 1.5min
[CV 3/3; 8/36] START learning_rate=0.01, max_depth=11, num_leaves=11............
[CV 3/3; 8/36] END learning_rate=0.01, max_depth=11, num_leaves=11; total time= 1.5min
[CV 1/3; 9/36] START learning_rate=0.01, max_depth=11, num_leaves=21............
[CV 1/3; 9/36] END learning_rate=0.01, max_depth=11, num_leaves=21; total time= 2.7min
[CV 2/3; 9/36] START learning_rate=0.01, max_depth=11, num_leaves=21............
[CV 2/3; 9/36] END learning_rate=0.01, max_depth=11, num_leaves=21; total time= 2.6min
[CV 3/3; 9/36] START learning_rate=0.01, max_depth=11, num_leaves=21......

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 10/36] END learning_rate=0.01, max_depth=16, num_leaves=1; total time=   0.7s
[CV 2/3; 10/36] START learning_rate=0.01, max_depth=16, num_leaves=1............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 10/36] END learning_rate=0.01, max_depth=16, num_leaves=1; total time=   0.7s
[CV 3/3; 10/36] START learning_rate=0.01, max_depth=16, num_leaves=1............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 10/36] END learning_rate=0.01, max_depth=16, num_leaves=1; total time=   0.7s
[CV 1/3; 11/36] START learning_rate=0.01, max_depth=16, num_leaves=11...........
[CV 1/3; 11/36] END learning_rate=0.01, max_depth=16, num_leaves=11; total time= 1.4min
[CV 2/3; 11/36] START learning_rate=0.01, max_depth=16, num_leaves=11...........
[CV 2/3; 11/36] END learning_rate=0.01, max_depth=16, num_leaves=11; total time= 1.5min
[CV 3/3; 11/36] START learning_rate=0.01, max_depth=16, num_leaves=11...........
[CV 3/3; 11/36] END learning_rate=0.01, max_depth=16, num_leaves=11; total time= 1.5min
[CV 1/3; 12/36] START learning_rate=0.01, max_depth=16, num_leaves=21...........
[CV 1/3; 12/36] END learning_rate=0.01, max_depth=16, num_leaves=21; total time= 2.6min
[CV 2/3; 12/36] START learning_rate=0.01, max_depth=16, num_leaves=21...........
[CV 2/3; 12/36] END learning_rate=0.01, max_depth=16, num_leaves=21; total time= 2.7min
[CV 3/3; 12/36] START learning_rate=0.01, max_depth=16, num_leaves=2

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 13/36] END learning_rate=0.1, max_depth=1, num_leaves=1; total time=   0.7s
[CV 2/3; 13/36] START learning_rate=0.1, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 13/36] END learning_rate=0.1, max_depth=1, num_leaves=1; total time=   0.6s
[CV 3/3; 13/36] START learning_rate=0.1, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 13/36] END learning_rate=0.1, max_depth=1, num_leaves=1; total time=   0.6s
[CV 1/3; 14/36] START learning_rate=0.1, max_depth=1, num_leaves=11.............
[CV 1/3; 14/36] END learning_rate=0.1, max_depth=1, num_leaves=11; total time=  18.2s
[CV 2/3; 14/36] START learning_rate=0.1, max_depth=1, num_leaves=11.............
[CV 2/3; 14/36] END learning_rate=0.1, max_depth=1, num_leaves=11; total time=  18.9s
[CV 3/3; 14/36] START learning_rate=0.1, max_depth=1, num_leaves=11.............
[CV 3/3; 14/36] END learning_rate=0.1, max_depth=1, num_leaves=11; total time=  14.8s
[CV 1/3; 15/36] START learning_rate=0.1, max_depth=1, num_leaves=21.............
[CV 1/3; 15/36] END learning_rate=0.1, max_depth=1, num_leaves=21; total time=  20.6s
[CV 2/3; 15/36] START learning_rate=0.1, max_depth=1, num_leaves=21.............
[CV 2/3; 15/36] END learning_rate=0.1, max_depth=1, num_leaves=21; total time=  12.6s
[CV 3/3; 15/36] START learning_rate=0.1, max_depth=1, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 16/36] END learning_rate=0.1, max_depth=6, num_leaves=1; total time=   0.7s
[CV 2/3; 16/36] START learning_rate=0.1, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 16/36] END learning_rate=0.1, max_depth=6, num_leaves=1; total time=   0.6s
[CV 3/3; 16/36] START learning_rate=0.1, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 16/36] END learning_rate=0.1, max_depth=6, num_leaves=1; total time=   0.6s
[CV 1/3; 17/36] START learning_rate=0.1, max_depth=6, num_leaves=11.............
[CV 1/3; 17/36] END learning_rate=0.1, max_depth=6, num_leaves=11; total time= 1.2min
[CV 2/3; 17/36] START learning_rate=0.1, max_depth=6, num_leaves=11.............
[CV 2/3; 17/36] END learning_rate=0.1, max_depth=6, num_leaves=11; total time= 1.2min
[CV 3/3; 17/36] START learning_rate=0.1, max_depth=6, num_leaves=11.............
[CV 3/3; 17/36] END learning_rate=0.1, max_depth=6, num_leaves=11; total time=  48.8s
[CV 1/3; 18/36] START learning_rate=0.1, max_depth=6, num_leaves=21.............
[CV 1/3; 18/36] END learning_rate=0.1, max_depth=6, num_leaves=21; total time= 1.2min
[CV 2/3; 18/36] START learning_rate=0.1, max_depth=6, num_leaves=21.............
[CV 2/3; 18/36] END learning_rate=0.1, max_depth=6, num_leaves=21; total time= 1.1min
[CV 3/3; 18/36] START learning_rate=0.1, max_depth=6, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 19/36] END learning_rate=0.1, max_depth=11, num_leaves=1; total time=   0.7s
[CV 2/3; 19/36] START learning_rate=0.1, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 19/36] END learning_rate=0.1, max_depth=11, num_leaves=1; total time=   0.6s
[CV 3/3; 19/36] START learning_rate=0.1, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 19/36] END learning_rate=0.1, max_depth=11, num_leaves=1; total time=   0.6s
[CV 1/3; 20/36] START learning_rate=0.1, max_depth=11, num_leaves=11............
[CV 1/3; 20/36] END learning_rate=0.1, max_depth=11, num_leaves=11; total time= 1.4min
[CV 2/3; 20/36] START learning_rate=0.1, max_depth=11, num_leaves=11............
[CV 2/3; 20/36] END learning_rate=0.1, max_depth=11, num_leaves=11; total time=  44.1s
[CV 3/3; 20/36] START learning_rate=0.1, max_depth=11, num_leaves=11............
[CV 3/3; 20/36] END learning_rate=0.1, max_depth=11, num_leaves=11; total time=  28.1s
[CV 1/3; 21/36] START learning_rate=0.1, max_depth=11, num_leaves=21............
[CV 1/3; 21/36] END learning_rate=0.1, max_depth=11, num_leaves=21; total time= 1.5min
[CV 2/3; 21/36] START learning_rate=0.1, max_depth=11, num_leaves=21............
[CV 2/3; 21/36] END learning_rate=0.1, max_depth=11, num_leaves=21; total time= 1.6min
[CV 3/3; 21/36] START learning_rate=0.1, max_depth=11, num_leaves=21......

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 22/36] END learning_rate=0.1, max_depth=16, num_leaves=1; total time=   0.6s
[CV 2/3; 22/36] START learning_rate=0.1, max_depth=16, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 22/36] END learning_rate=0.1, max_depth=16, num_leaves=1; total time=   0.6s
[CV 3/3; 22/36] START learning_rate=0.1, max_depth=16, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 22/36] END learning_rate=0.1, max_depth=16, num_leaves=1; total time=   0.6s
[CV 1/3; 23/36] START learning_rate=0.1, max_depth=16, num_leaves=11............
[CV 1/3; 23/36] END learning_rate=0.1, max_depth=16, num_leaves=11; total time=   3.2s
[CV 2/3; 23/36] START learning_rate=0.1, max_depth=16, num_leaves=11............
[CV 2/3; 23/36] END learning_rate=0.1, max_depth=16, num_leaves=11; total time=   3.4s
[CV 3/3; 23/36] START learning_rate=0.1, max_depth=16, num_leaves=11............
[CV 3/3; 23/36] END learning_rate=0.1, max_depth=16, num_leaves=11; total time=   3.1s
[CV 1/3; 24/36] START learning_rate=0.1, max_depth=16, num_leaves=21............
[CV 1/3; 24/36] END learning_rate=0.1, max_depth=16, num_leaves=21; total time=   3.4s
[CV 2/3; 24/36] START learning_rate=0.1, max_depth=16, num_leaves=21............
[CV 2/3; 24/36] END learning_rate=0.1, max_depth=16, num_leaves=21; total time=   3.5s
[CV 3/3; 24/36] START learning_rate=0.1, max_depth=16, num_leaves=21......

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 25/36] END learning_rate=0.5, max_depth=1, num_leaves=1; total time=   0.6s
[CV 2/3; 25/36] START learning_rate=0.5, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 25/36] END learning_rate=0.5, max_depth=1, num_leaves=1; total time=   0.6s
[CV 3/3; 25/36] START learning_rate=0.5, max_depth=1, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 25/36] END learning_rate=0.5, max_depth=1, num_leaves=1; total time=   0.6s
[CV 1/3; 26/36] START learning_rate=0.5, max_depth=1, num_leaves=11.............
[CV 1/3; 26/36] END learning_rate=0.5, max_depth=1, num_leaves=11; total time=   2.9s
[CV 2/3; 26/36] START learning_rate=0.5, max_depth=1, num_leaves=11.............
[CV 2/3; 26/36] END learning_rate=0.5, max_depth=1, num_leaves=11; total time=   5.7s
[CV 3/3; 26/36] START learning_rate=0.5, max_depth=1, num_leaves=11.............
[CV 3/3; 26/36] END learning_rate=0.5, max_depth=1, num_leaves=11; total time=   5.1s
[CV 1/3; 27/36] START learning_rate=0.5, max_depth=1, num_leaves=21.............
[CV 1/3; 27/36] END learning_rate=0.5, max_depth=1, num_leaves=21; total time=   4.6s
[CV 2/3; 27/36] START learning_rate=0.5, max_depth=1, num_leaves=21.............
[CV 2/3; 27/36] END learning_rate=0.5, max_depth=1, num_leaves=21; total time=   7.0s
[CV 3/3; 27/36] START learning_rate=0.5, max_depth=1, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 28/36] END learning_rate=0.5, max_depth=6, num_leaves=1; total time=   0.6s
[CV 2/3; 28/36] START learning_rate=0.5, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 28/36] END learning_rate=0.5, max_depth=6, num_leaves=1; total time=   0.6s
[CV 3/3; 28/36] START learning_rate=0.5, max_depth=6, num_leaves=1..............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 28/36] END learning_rate=0.5, max_depth=6, num_leaves=1; total time=   0.6s
[CV 1/3; 29/36] START learning_rate=0.5, max_depth=6, num_leaves=11.............
[CV 1/3; 29/36] END learning_rate=0.5, max_depth=6, num_leaves=11; total time=   5.4s
[CV 2/3; 29/36] START learning_rate=0.5, max_depth=6, num_leaves=11.............
[CV 2/3; 29/36] END learning_rate=0.5, max_depth=6, num_leaves=11; total time=   3.1s
[CV 3/3; 29/36] START learning_rate=0.5, max_depth=6, num_leaves=11.............
[CV 3/3; 29/36] END learning_rate=0.5, max_depth=6, num_leaves=11; total time=   3.2s
[CV 1/3; 30/36] START learning_rate=0.5, max_depth=6, num_leaves=21.............
[CV 1/3; 30/36] END learning_rate=0.5, max_depth=6, num_leaves=21; total time=   3.2s
[CV 2/3; 30/36] START learning_rate=0.5, max_depth=6, num_leaves=21.............
[CV 2/3; 30/36] END learning_rate=0.5, max_depth=6, num_leaves=21; total time=   3.2s
[CV 3/3; 30/36] START learning_rate=0.5, max_depth=6, num_leaves=21.............

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 31/36] END learning_rate=0.5, max_depth=11, num_leaves=1; total time=   0.6s
[CV 2/3; 31/36] START learning_rate=0.5, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 31/36] END learning_rate=0.5, max_depth=11, num_leaves=1; total time=   0.6s
[CV 3/3; 31/36] START learning_rate=0.5, max_depth=11, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 31/36] END learning_rate=0.5, max_depth=11, num_leaves=1; total time=   0.6s
[CV 1/3; 32/36] START learning_rate=0.5, max_depth=11, num_leaves=11............
[CV 1/3; 32/36] END learning_rate=0.5, max_depth=11, num_leaves=11; total time=   3.0s
[CV 2/3; 32/36] START learning_rate=0.5, max_depth=11, num_leaves=11............
[CV 2/3; 32/36] END learning_rate=0.5, max_depth=11, num_leaves=11; total time=   3.3s
[CV 3/3; 32/36] START learning_rate=0.5, max_depth=11, num_leaves=11............
[CV 3/3; 32/36] END learning_rate=0.5, max_depth=11, num_leaves=11; total time=   3.2s
[CV 1/3; 33/36] START learning_rate=0.5, max_depth=11, num_leaves=21............
[CV 1/3; 33/36] END learning_rate=0.5, max_depth=11, num_leaves=21; total time=   3.3s
[CV 2/3; 33/36] START learning_rate=0.5, max_depth=11, num_leaves=21............
[CV 2/3; 33/36] END learning_rate=0.5, max_depth=11, num_leaves=21; total time=   3.3s
[CV 3/3; 33/36] START learning_rate=0.5, max_depth=11, num_leaves=21......

[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 1/3; 34/36] END learning_rate=0.5, max_depth=16, num_leaves=1; total time=   0.7s
[CV 2/3; 34/36] START learning_rate=0.5, max_depth=16, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 2/3; 34/36] END learning_rate=0.5, max_depth=16, num_leaves=1; total time=   0.6s
[CV 3/3; 34/36] START learning_rate=0.5, max_depth=16, num_leaves=1.............


[LightGBM] [Fatal] Check failed: (num_leaves) > (1) at /__w/1/s/python-package/compile/src/io/config_auto.cpp, line 334 .



[CV 3/3; 34/36] END learning_rate=0.5, max_depth=16, num_leaves=1; total time=   0.6s
[CV 1/3; 35/36] START learning_rate=0.5, max_depth=16, num_leaves=11............
[CV 1/3; 35/36] END learning_rate=0.5, max_depth=16, num_leaves=11; total time=   3.4s
[CV 2/3; 35/36] START learning_rate=0.5, max_depth=16, num_leaves=11............
[CV 2/3; 35/36] END learning_rate=0.5, max_depth=16, num_leaves=11; total time=   3.3s
[CV 3/3; 35/36] START learning_rate=0.5, max_depth=16, num_leaves=11............
[CV 3/3; 35/36] END learning_rate=0.5, max_depth=16, num_leaves=11; total time=   3.3s
[CV 1/3; 36/36] START learning_rate=0.5, max_depth=16, num_leaves=21............
[CV 1/3; 36/36] END learning_rate=0.5, max_depth=16, num_leaves=21; total time=   3.5s
[CV 2/3; 36/36] START learning_rate=0.5, max_depth=16, num_leaves=21............
[CV 2/3; 36/36] END learning_rate=0.5, max_depth=16, num_leaves=21; total time=   3.4s
[CV 3/3; 36/36] START learning_rate=0.5, max_depth=16, num_leaves=21......
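
The repeated `[LightGBM] [Fatal] Check failed: (num_leaves) > (1)` messages in the log above come from grid points with `num_leaves=1`, which LightGBM rejects. A minimal sketch of one way to avoid spending fits on such points is to filter the grid before the search. The helper name `build_lgbm_grid` is hypothetical, and the value lists are only those visible in this log excerpt, not necessarily the full grid used in the notebook.

```python
from itertools import product


def build_lgbm_grid(learning_rates, max_depths, num_leaves_options):
    """Return a list of single-point grids, skipping invalid num_leaves values."""
    grid = []
    for lr, depth, leaves in product(learning_rates, max_depths, num_leaves_options):
        if leaves <= 1:
            # LightGBM requires num_leaves > 1 (see the Fatal message in the log)
            continue
        grid.append(
            {"learning_rate": [lr], "max_depth": [depth], "num_leaves": [leaves]}
        )
    return grid


# Values below are illustrative: they are the ones visible in the log excerpt.
param_grid = build_lgbm_grid(
    learning_rates=[0.1, 0.5],
    max_depths=[1, 6, 11, 16],
    num_leaves_options=[1, 11, 21],
)
print(len(param_grid))
```

Because `GridSearchCV` accepts a list of dictionaries as `param_grid`, a list like this could be passed to it directly in place of a single dictionary.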

In [50]:
%%time

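# Note: the prediction is made on the training features,
# so this RMSE is a training-set score, not a hold-out estimate
predictions_train = grid_lgbm.predict(features_train_new)
mse = mean_squared_error(target_train_new, predictions_train)

print('rmse=', mse ** 0.5)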

rmse= 1751.3112316029394
CPU times: user 1.09 s, sys: 242 ms, total: 1.34 s
Wall time: 1.33 s
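
The cell above obtains RMSE as the square root of `mean_squared_error`. As a small sketch of the same calculation, the helper below computes RMSE directly with NumPy. The function name `rmse` and the toy price arrays are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy prices, purely illustrative
toy_true = [1000, 2000, 3000]
toy_pred = [1100, 1900, 3000]
print('rmse=', rmse(toy_true, toy_pred))
```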


## Model analysis

I trained three models: random forest, CatBoost, and LightGBM. The results were as follows:
1. The random forest model took 1 hour and 15 minutes to train and reached RMSE = 1483.
2. The CatBoost model took more than 40 minutes to train and reached RMSE = 1454.
3. LightGBM turned out to be the most difficult model. Training took 55 minutes, and RMSE = 1751.

I therefore conclude that CatBoost gives the best result in terms of both RMSE and training time.
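
The comparison above can be sketched as a small lookup over the reported numbers. The figures are the ones quoted in the list. Since CatBoost's training time is stated only as "more than 40 minutes", 40 is used here as a lower bound, which is an assumption.

```python
# Reported results; train_minutes for CatBoost is a lower bound ("more than 40 minutes")
results = [
    {"model": "RandomForest", "train_minutes": 75, "rmse": 1483},
    {"model": "CatBoost", "train_minutes": 40, "rmse": 1454},
    {"model": "LightGBM", "train_minutes": 55, "rmse": 1751},
]

# Lowest error and shortest training time, picked independently
best_by_rmse = min(results, key=lambda r: r["rmse"])
fastest = min(results, key=lambda r: r["train_minutes"])

print('best RMSE:', best_by_rmse["model"])
print('fastest training:', fastest["model"])
```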

In [51]:
%%time
# Evaluate the selected model on the test set

predictions_test = grid_catboost.predict(features_test_new)
mse = mean_squared_error(target_test_new, predictions_test)

print('rmse=', mse ** 0.5)

rmse= 1727.4227195490844
CPU times: user 46.1 ms, sys: 7.44 ms, total: 53.5 ms
Wall time: 51.9 ms


The RMSE on the test set is 1727. This is higher than the 1454 reported for CatBoost during model comparison, but it still meets the requirement of an RMSE no higher than 2500.
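
As a quick arithmetic check of that statement, the sketch below compares the test RMSE with the 2500 limit and with the RMSE reported for CatBoost during model comparison. The variable names are illustrative. The numbers are taken from this notebook.

```python
RMSE_LIMIT = 2500                      # requirement stated in the conclusions
selection_rmse = 1454                  # CatBoost RMSE reported in the model analysis
test_rmse = 1727.4227195490844         # output of the test-set cell

meets_requirement = test_rmse <= RMSE_LIMIT
# Relative increase of the test error over the error seen during model comparison
relative_gap = (test_rmse - selection_rmse) / selection_rmse

print('meets requirement:', meets_requirement)
print('relative gap: {:.1%}'.format(relative_gap))
```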

CONCLUSIONS: I explored, cleaned, and prepared the data for training. I then chose three models and, by searching over different hyperparameters, found the best one. The required RMSE of no more than 2500 was achieved and was confirmed on the test set.