## Homework

In [1]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1OKFSv2GpuUFDphO0r8LdM7bl6MAWwBfX' -O data.csv


In this homework you will predict house prices from their characteristics.

Quality metric: `RMSE`

Grading:
* Baseline - 2 points
* Feature Engineering - 2 points
* Model Selection - 3 points
* Ensemble v.1 - 3 points
* (*) Ensemble v.2 - optional, 2 points
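
As a reminder of what the `RMSE` metric measures, here is a minimal sketch. The helper name `rmse` and the toy numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices: two predictions are off by 10, one is exact.
toy_score = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(toy_score)
```

Because the errors are squared before averaging, a few large misses on expensive properties can dominate the score.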

### Dataset description

Short description of the data:
```
price: sale price (this is the target variable)
id: transaction id
timestamp: date of transaction
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
floor: for apartments, floor of the building
max_floor: number of floors in the building
material: wall material
build_year: year built
num_room: number of living rooms
kitch_sq: kitchen area
state: apartment condition
product_type: owner-occupier purchase or investment
sub_area: name of the district

The dataset also includes a collection of features about each property's surrounding neighbourhood, and some features that are constant across each sub area (known as a Raion). Most of the feature names are self explanatory, with the following notes. See below for a complete list.

full_all: subarea population
male_f, female_f: subarea population by gender
young_*: population younger than working age
work_*: working-age population
ekder_*: retirement-age population
n_m_{all|male|female}: population between n and m years old
build_count_*: buildings in the subarea by construction type or year
x_count_500: the number of x within 500m of the property
x_part_500: the share of x within 500m of the property
_sqm_: square meters
cafe_count_d_price_p: number of cafes within d meters of the property that have an average bill under p RUB
trc_: shopping malls
prom_: industrial zones
green_: green zones
metro_: subway
_avto_: distances by car
mkad_: Moscow Circle Auto Road
ttk_: Third Transport Ring
sadovoe_: Garden Ring
bulvar_ring_: Boulevard Ring
kremlin_: City center
zd_vokzaly_: Train station
oil_chemistry_: Dirty industry
ts_: Power plant
```

### Setup

In [26]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import copy
from collections import Counter
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.experimental import enable_halving_search_cv # required for HalvingGridSearchCV
from sklearn.model_selection import GridSearchCV, HalvingGridSearchCV
from catboost import CatBoostRegressor
from sklearn import linear_model
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

In [14]:
df = pd.read_csv("data.csv", parse_dates=["timestamp"])

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Columns: 292 entries, id to price
dtypes: datetime64[ns](1), float64(119), int64(157), object(15)
memory usage: 44.6+ MB


Split the data into training and test sets. Use the first 80% of the data as the training set and the last 20% as the test set.
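
A positional split like the one described can be sketched as follows. This assumes the rows are already in the intended order; the helper `head_tail_split` and the toy frame are hypothetical.

```python
import pandas as pd


def head_tail_split(frame, train_fraction=0.8):
    """Return (first train_fraction of rows, remaining rows) without shuffling."""
    split_idx = int(len(frame) * train_fraction)
    return frame.iloc[:split_idx], frame.iloc[split_idx:]


# Toy data standing in for the real table.
toy = pd.DataFrame({"full_sq": range(10), "price": range(100, 110)})
train_part, test_part = head_tail_split(toy)
print(len(train_part), len(test_part))
```

Unlike a random `train_test_split`, this keeps every test row after every training row, which matters when the rows are ordered by time.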

Let's see what share of the values is missing in each column (as a percentage).

In [16]:
x = df.isna().mean() * 100  # share of missing values per column, in percent
x.nlargest(10)

hospital_beds_raion           47.020
state                         44.535
build_year                    44.525
cafe_avg_price_500            43.890
cafe_sum_500_max_price_avg    43.890
cafe_sum_500_min_price_avg    43.890
max_floor                     31.515
material                      31.515
num_room                      31.515
kitch_sq                      31.515
dtype: float64

Some columns have too many missing values, so we will drop every column with more than 40% missing (a bit later).

The nearly empty columns are not dropped yet; that has to happen later. For now, let's fill the gaps in the remaining columns and handle the categorical data.
First, a look at the categorical columns.

In [17]:
categorical_columns = [c for c in df.columns if df[c].dtype.name == 'object']
numeric_columns = [d for d in df.columns if df[d].dtype.name == 'int64' or df[d].dtype.name == 'float64']

for cat in categorical_columns:
    print(cat, df[cat].value_counts(),end = "\n\n")

product_type Investment       12840
OwnerOccupier     7160
Name: product_type, dtype: int64

sub_area Poselenie Sosenskoe               1126
Nekrasovka                        1074
Poselenie Vnukovskoe               859
Poselenie Moskovskij               603
Poselenie Voskresenskoe            464
                                  ... 
Molzhaninovskoe                      2
Poselenie Shhapovskoe                2
Poselenie Kievskij                   2
Poselenie Mihajlovo-Jarcevskoe       1
Poselenie Klenovskoe                 1
Name: sub_area, Length: 146, dtype: int64

culture_objects_top_25 no     18748
yes     1252
Name: culture_objects_top_25, dtype: int64

thermal_power_plant_raion no     18912
yes     1088
Name: thermal_power_plant_raion, dtype: int64

incineration_raion no     18452
yes     1548
Name: incineration_raion, dtype: int64

oil_chemistry_raion no     19809
yes      191
Name: oil_chemistry_raion, dtype: int64

radiation_raion no     12767
yes     7233
Name: radiation_raio

Almost all of these columns are binary (yes/no), and only `ecology` stands out from the group, so one-hot encoding will do the job.
Before that, fill the missing values in the table: the most frequent value for categorical columns, the mean for numeric ones.

In [18]:
from sklearn.impute import SimpleImputer

In [19]:
df_1 = pd.DataFrame(df, copy=True)
df_1[numeric_columns] = SimpleImputer(strategy="mean").fit_transform(df_1[numeric_columns])
df_1[categorical_columns] = SimpleImputer(strategy="most_frequent").fit_transform(df_1[categorical_columns])

In [20]:
df_onehot = pd.get_dummies(df_1)
df_onehot

Unnamed: 0,id,timestamp,full_sq,life_sq,floor,max_floor,material,build_year,num_room,kitch_sq,...,water_1line_yes,big_road1_1line_no,big_road1_1line_yes,railroad_1line_no,railroad_1line_yes,ecology_excellent,ecology_good,ecology_no data,ecology_poor,ecology_satisfactory
0,0.0,2014-12-26,1.0,1.000000,1.0,1.000000,1.000000,1.000000,1.00000,1.000000,...,0,1,0,1,0,0,0,0,1,0
1,1.0,2012-10-04,64.0,64.000000,16.0,12.603928,1.838505,3687.417575,1.91071,6.454917,...,0,1,0,1,0,0,0,1,0,0
2,2.0,2014-02-05,83.0,44.000000,9.0,17.000000,1.000000,1985.000000,3.00000,10.000000,...,0,1,0,1,0,0,1,0,0,0
3,3.0,2012-07-26,71.0,49.000000,2.0,12.603928,1.838505,3687.417575,1.91071,6.454917,...,0,1,0,1,0,0,0,1,0,0
4,4.0,2014-10-29,60.0,42.000000,9.0,9.000000,1.000000,1970.000000,3.00000,6.000000,...,1,1,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19995.0,2014-09-22,45.0,27.000000,2.0,9.000000,1.000000,1978.000000,2.00000,5.000000,...,1,1,0,1,0,0,1,0,0,0
19996,19996.0,2013-12-06,38.0,34.032333,4.0,17.000000,1.000000,3687.417575,2.00000,1.000000,...,0,1,0,1,0,0,0,1,0,0
19997,19997.0,2014-06-26,35.0,14.000000,9.0,22.000000,1.000000,2001.000000,1.00000,10.000000,...,0,1,0,1,0,1,0,0,0,0
19998,19998.0,2014-01-21,51.0,30.000000,8.0,17.000000,1.000000,2011.000000,2.00000,9.000000,...,0,1,0,1,0,0,0,1,0,0


Now drop the mostly empty columns identified at the start, since imputing half of a column made little sense.

In [21]:
x[x>40].index.to_list()
df_2 = df_onehot.drop(columns=x[x>40].index.to_list())
df_2

Unnamed: 0,id,timestamp,full_sq,life_sq,floor,max_floor,material,num_room,kitch_sq,area_m,...,water_1line_yes,big_road1_1line_no,big_road1_1line_yes,railroad_1line_no,railroad_1line_yes,ecology_excellent,ecology_good,ecology_no data,ecology_poor,ecology_satisfactory
0,0.0,2014-12-26,1.0,1.000000,1.0,1.000000,1.000000,1.00000,1.000000,5.293465e+06,...,0,1,0,1,0,0,0,0,1,0
1,1.0,2012-10-04,64.0,64.000000,16.0,12.603928,1.838505,1.91071,6.454917,6.677245e+07,...,0,1,0,1,0,0,0,1,0,0
2,2.0,2014-02-05,83.0,44.000000,9.0,17.000000,1.000000,3.00000,10.000000,1.216448e+07,...,0,1,0,1,0,0,1,0,0,0
3,3.0,2012-07-26,71.0,49.000000,2.0,12.603928,1.838505,1.91071,6.454917,4.708040e+06,...,0,1,0,1,0,0,0,1,0,0
4,4.0,2014-10-29,60.0,42.000000,9.0,9.000000,1.000000,3.00000,6.000000,1.428699e+07,...,1,1,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19995.0,2014-09-22,45.0,27.000000,2.0,9.000000,1.000000,2.00000,5.000000,2.481385e+07,...,1,1,0,1,0,0,1,0,0,0
19996,19996.0,2013-12-06,38.0,34.032333,4.0,17.000000,1.000000,2.00000,1.000000,6.677245e+07,...,0,1,0,1,0,0,0,1,0,0
19997,19997.0,2014-06-26,35.0,14.000000,9.0,22.000000,1.000000,1.00000,10.000000,1.004686e+07,...,0,1,0,1,0,1,0,0,0,0
19998,19998.0,2014-01-21,51.0,30.000000,8.0,17.000000,1.000000,2.00000,9.000000,4.036700e+07,...,0,1,0,1,0,0,0,1,0,0


Overall, the preprocessing seems to have turned out reasonably well.

In [22]:
X = df_2.drop(columns=["price", "timestamp"])
y = df_2["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

Your models may require you to specify which columns are categorical (for example, in boosting libraries). To simplify this, we suggest splitting the columns as follows:
```
drop_columns = [
    'id',           # May leak information
    'timestamp',    # May leak information
]
cat_columns = [
    'product_type',              #
    'material',                  # Material of the wall
    'state',                     # Satisfaction level
    'sub_area',                  # District name
    'culture_objects_top_25',    #
    'thermal_power_plant_raion', #
    'incineration_raion',        #
    'oil_chemistry_raion',       #
    'radiation_raion',           #
    'railroad_terminal_raion',   #
    'big_market_raion',          #
    'nuclear_reactor_raion',     #
    'detention_facility_raion',  #
    'ID_metro',                  #
    'ID_railroad_station_walk',  #
    'ID_railroad_station_avto',  #
    'water_1line',               #
    'ID_big_road1',              #
    'big_road1_1line',           #
    'ID_big_road2',              #
    'railroad_1line',            #
    'ID_railroad_terminal',      #
    'ID_bus_terminal',           #
    'ecology',                   #
]
num_columns = list(set(df.columns).difference(set(cat_columns + drop_columns)))
```

### Baseline (2 points)

As a baseline, train a `DecisionTreeRegressor` from `sklearn`.

In [27]:
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

Check the quality on the hold-out set.

In [28]:
mean_squared_error(pred, y_test)**0.5

3857951.7470837324

### Feature Engineering (2 points)

Careful feature engineering can often improve a model.

Let's add extra features to the model:
* "How many listings appeared in this year and month"
* "How many listings appeared in this year and week"

In [29]:
month_year = (df.timestamp.dt.month + df.timestamp.dt.year * 100)
month_year_cnt_map = month_year.value_counts().to_dict()
df_2["month_year_cnt"] = month_year.map(month_year_cnt_map)

# dt.weekofyear is deprecated; dt.isocalendar().week gives the same ISO week number
week_year = (df.timestamp.dt.isocalendar().week.astype(int) + df.timestamp.dt.year * 100)
week_year_cnt_map = week_year.value_counts().to_dict()
df_2["week_year_cnt"] = week_year.map(week_year_cnt_map)


Add the following extra features:
* Month (from the `timestamp` column)
* Day of the week (from the `timestamp` column)
* The ratio "floor / number of floors in the building" (columns `floor` and `max_floor`)
* The ratio "kitchen area / apartment area" (columns `kitch_sq` and `full_sq`)

You may add other features as well.

I also added the ratio of living area to total area, in case it helps; it often does.

In [30]:
df_2["month"] = df.timestamp.dt.month
df_2["week_day"] = df.timestamp.dt.weekday
df_2["del_floor"] = np.minimum(1, df_2["floor"] / df_2["max_floor"])
df_2["sq_kitchen"] = np.minimum(1, df_2["kitch_sq"] / df_2["full_sq"])
df_2["life_to_live"] = np.minimum(1, df_2["life_sq"] / df_2["full_sq"])

df_2 = df_2.dropna(axis=0).copy()  # explicit copy, so the assignments below do not warn

# Replace any infinite ratio with the largest finite value in the same column
for col in ["del_floor", "sq_kitchen", "life_to_live"]:
    not_inf_pls = df_2.loc[df_2[col] != np.inf, col].max()
    df_2[col] = df_2[col].replace(np.inf, not_inf_pls)


Some values went to inf, apparently because of division by zero, so I replaced them. The approach is crude, but it works.
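
One possible alternative is to stop the infinities from appearing in the first place by treating a zero denominator as missing. The sketch below uses a hypothetical helper `safe_ratio` and made-up numbers; it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator, cap=1.0):
    """Divide two Series, treat a zero denominator as missing, cap the result at `cap`."""
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.clip(upper=cap)


# Toy rows: a normal case, a zero total area, and a ratio above 1.
toy = pd.DataFrame({"kitch_sq": [10.0, 5.0, 8.0], "full_sq": [50.0, 0.0, 4.0]})
toy["sq_kitchen"] = safe_ratio(toy["kitch_sq"], toy["full_sq"])
print(toy)
```

The missing ratios can then be imputed or dropped together with the other gaps, instead of being patched afterwards.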

In [31]:
df_2.isna().sum()

id              0
timestamp       0
full_sq         0
life_sq         0
floor           0
               ..
month           0
week_day        0
del_floor       0
sq_kitchen      0
life_to_live    0
Length: 455, dtype: int64

Split the data into training and test sets once more (because the extra features were created on the original data).

In [32]:
X = df_2.drop(columns=["price", "timestamp"])
y = df_2["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

### Model Selection (3 points)

See what quality different models can reach:
* `DecisionTreeRegressor` from `sklearn`
* `RandomForestRegressor` from `sklearn`
* `CatBoostRegressor`

You can also try linear models and other boosting libraries (`LightGBM` and `XGBoost`).

Almost every library offers a convenient way to tune hyperparameters: see how to do it in [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or in [catboost](https://catboost.ai/docs/concepts/python-reference_catboostregressor_grid_search.html).

Evaluate each model on the test set and pick the best one.

In [37]:
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

mean_squared_error(pred, y_test)**0.5

3692843.6950338283

As we can see, feature engineering improved the result noticeably (RMSE dropped from about 3.86M to about 3.69M). It could have been done better, but not by doing the homework on the day of the deadline.

In [40]:
tree = DecisionTreeRegressor()
parameters = {"max_depth":[1,10, 20, 30], "min_samples_leaf":[1,10,20, 50]}
new_tree = GridSearchCV(tree,
                      param_grid=parameters,
                      scoring="neg_mean_squared_error",
                      cv=5)
new_tree.fit(X_train, y_train)
new_tree.best_params_

{'max_depth': 10, 'min_samples_leaf': 50}

In [41]:
tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=50)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

mean_squared_error(pred, y_test)**0.5

3059968.8916884847

After tuning the hyperparameters the error dropped substantially (from about 3.69M to about 3.06M).
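
The search above is scored with `neg_mean_squared_error`, so `best_score_` is a negated MSE. If a cross-validated RMSE is wanted directly, scikit-learn also provides the `neg_root_mean_squared_error` scorer. A minimal sketch on synthetic data (the toy arrays and the small grid are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real features and prices.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 5], "min_samples_leaf": [1, 10]},
    scoring="neg_root_mean_squared_error",  # scores are negated so that higher is better
    cv=3,
)
search.fit(X_toy, y_toy)

cv_rmse = -search.best_score_  # flip the sign to read it as an ordinary RMSE
print(search.best_params_, cv_rmse)
```

Comparing `cv_rmse` with the hold-out RMSE gives a rough check on whether the chosen parameters generalize or merely fit the folds.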

## Random forest

In [42]:
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)
mean_squared_error(forest_pred, y_test)**0.5

2712516.4636450084


Even a random forest with default settings does very well. OK, let's try a small search.

In [45]:
forest = RandomForestRegressor(min_samples_leaf=100, random_state=42)
parameters = {"n_estimators": [50, 75], "max_depth": [3, 5], "max_features": [0.2, 0.3, "sqrt", "auto"], "max_samples": [0.1, 0.2, None]}
new_forest = GridSearchCV(forest,
                      param_grid=parameters,
                      scoring="neg_mean_squared_error",
                      cv=5)
new_forest.fit(X_train, y_train)
new_forest.best_params_

{'max_depth': 5,
 'max_features': 'auto',
 'max_samples': None,
 'n_estimators': 75}

In [47]:
forest = RandomForestRegressor(max_depth=5, max_features="auto", max_samples=None, n_estimators=75, min_samples_leaf=100)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

mean_squared_error(forest_pred, y_test)**0.5

3034738.7293428145

The error went up, which suggests the grid was poorly chosen: it only allowed shallow trees (`max_depth` of 3 or 5 with `min_samples_leaf=100`), and there was not much time to search further.
A deeper search over wider parameter ranges is needed.

To do later: find out how to speed this search up.
Verdict from these tests: random forest wins!

## Boosting

In [48]:
boost = CatBoostRegressor(loss_function="RMSE", verbose=False)
boost.fit(X_train, y_train)
boost_pred = boost.predict(X_test)

mean_squared_error(boost_pred, y_test)**0.5

2626533.364034407

Without any tuning, boosting beats all the previous models. Not bad.

In [52]:
boost = CatBoostRegressor(loss_function="RMSE", random_state=42, silent=True)
parameters = {"iterations": [10, 20, 50], "learning_rate": [0.02, 0.05], "depth": [3, 5, 7]}
result = boost.grid_search(parameters, X=X_train, y=y_train, plot=True, verbose=False)


bestTest = 7064473.315
bestIteration = 9


bestTest = 5648097.583
bestIteration = 9


bestTest = 6061259.472
bestIteration = 19


bestTest = 4159346.534
bestIteration = 19


bestTest = 4182851.421
bestIteration = 49


bestTest = 2874980.892
bestIteration = 49


bestTest = 7050603.219
bestIteration = 9


bestTest = 5581670.915
bestIteration = 9


bestTest = 6057924.442
bestIteration = 19


bestTest = 4096847.566
bestIteration = 19


bestTest = 4125270.756
bestIteration = 49


bestTest = 2784812.977
bestIteration = 49


bestTest = 7046003.597
bestIteration = 9


bestTest = 5594478.067
bestIteration = 9


bestTest = 6043637.014
bestIteration = 19


bestTest = 4085903.454
bestIteration = 19


bestTest = 4125706.564
bestIteration = 49


bestTest = 2774794.458
bestIteration = 49

Training on fold [0/3]

bestTest = 3011643.731
bestIteration = 49

Training on fold [1/3]

bestTest = 2872540.814
bestIteration = 49

Training on fold [2/3]

bestTest = 3062373.844
bestIteration = 49



In [53]:
result["params"]

{'depth': 7, 'iterations': 50, 'learning_rate': 0.05}

In [54]:
boost = CatBoostRegressor(loss_function="RMSE", depth=7, iterations=50, learning_rate=0.05, verbose=False)
boost.fit(X_train, y_train)
boost_pred = boost.predict(X_test)

mean_squared_error(boost_pred, y_test)**0.5

2974055.3234977406

The error went up again, most likely because the grid was too small: it capped `iterations` at 50, while CatBoost defaults to 1000. A proper search will take more time.

### Ensemble v.1 (3 points)

Ensembles sometimes outperform a single large model.

The `product_type` column says what kind of listing it is: `Investment` (the apartment is sold as an investment) or `OwnerOccupier` (the apartment is sold to live in). It is reasonable to assume that a separate model for each type will give better quality.

Train your best models separately on `Investment` and `OwnerOccupier` (that is, `model_invest` trained on `(invest_train_X, invest_train_Y)` and `model_owner` trained on `(owner_train_X, owner_train_Y)`) and check the quality on the hold-out set (that is, on the original `test_split`).

In [55]:
X_train['price'] = y_train

invest_train= X_train[X_train['product_type_Investment'] == 1]
invest_X_train = invest_train.drop(columns=['price'])

owner_train= X_train[X_train['product_type_Investment'] == 1]
owner_X_train = owner_train.drop(columns=['price'])

owner_y_train = owner_train['price']
invest_y_train =  invest_train['price']

In [58]:
invest_boost = CatBoostRegressor(loss_function='RMSE', random_state=42).fit(invest_X_train, invest_y_train)
owner_boost = CatBoostRegressor(loss_function='RMSE', random_state=42).fit(owner_X_train, owner_y_train)

invest_pred = invest_boost.predict(X_test)
owner_pred = owner_boost.predict(X_test)

In [60]:
mean_squared_error(y_test, invest_pred)**0.5

2880973.9768233574

In [61]:
mean_squared_error(y_test, owner_pred)**0.5

2880973.9768233574

Both per-type models scored worse than the best earlier models: the default CatBoost (about 2.63M) and the default random forest (about 2.71M).
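
In the split cell above, `owner_train` is filtered with the same condition as `invest_train` (`product_type_Investment == 1`), so both models see identical data, which is consistent with the two identical RMSE values. A minimal sketch of the intended setup follows. It assumes the one-hot column `product_type_Investment` marks investment listings, and it uses a hypothetical `MeanModel` stand-in where the notebook would use `CatBoostRegressor`.

```python
import numpy as np
import pandas as pd

FLAG = "product_type_Investment"  # one-hot column assumed to mark investment listings


class MeanModel:
    """Hypothetical stand-in for a real regressor: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def fit_per_type(X_train, y_train):
    """Fit one model on investment rows and another on owner-occupier rows."""
    is_invest = X_train[FLAG] == 1
    model_invest = MeanModel().fit(X_train[is_invest], y_train[is_invest])
    model_owner = MeanModel().fit(X_train[~is_invest], y_train[~is_invest])
    return model_invest, model_owner


def predict_per_type(model_invest, model_owner, X_test):
    """Send each test row to the model that matches its product type."""
    is_invest = (X_test[FLAG] == 1).to_numpy()
    pred = np.empty(len(X_test), dtype=float)
    pred[is_invest] = model_invest.predict(X_test[is_invest])
    pred[~is_invest] = model_owner.predict(X_test[~is_invest])
    return pred


# Toy data with made-up prices.
X_tr = pd.DataFrame({"full_sq": [30, 40, 50, 60], FLAG: [1, 1, 0, 0]})
y_tr = pd.Series([100.0, 200.0, 300.0, 500.0])
X_te = pd.DataFrame({"full_sq": [35, 55], FLAG: [1, 0]})

m_inv, m_own = fit_per_type(X_tr, y_tr)
combined = predict_per_type(m_inv, m_own, X_te)
print(combined)
```

Scoring `combined` against the full `y_test` gives one RMSE for the ensemble, rather than a separate score for each model applied to every test row.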

### (*) Ensemble v.2 (optional, 2 points)

Try building a more complex model for `Investment`: train a `CatBoostRegressor` and a `HuberRegressor` from `sklearn`, then combine their predictions with weights `w_1` and `w_2` (choose the weights yourself; they must sum to 1).
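
The weighted combination can be sketched as below. The arrays are made-up stand-ins for predictions from the two models, and the helpers `blend` and `best_weight` are hypothetical. Choosing `w_1` on a validation split rather than on the test set is an assumption about good practice, not something the assignment specifies.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def blend(pred_1, pred_2, w_1):
    """Weighted sum of two prediction vectors with w_2 = 1 - w_1."""
    return w_1 * np.asarray(pred_1) + (1.0 - w_1) * np.asarray(pred_2)


def best_weight(y_val, pred_1, pred_2, grid=None):
    """Try each w_1 on a grid and return the one with the lowest RMSE."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    scores = [rmse(y_val, blend(pred_1, pred_2, w)) for w in grid]
    best = int(np.argmin(scores))
    return float(grid[best]), scores[best]


# Made-up validation targets and predictions.
y_val = np.array([10.0, 20.0, 30.0])
pred_cat = np.array([12.0, 22.0, 32.0])    # stand-in for CatBoost, overshoots by 2
pred_huber = np.array([8.0, 18.0, 28.0])   # stand-in for Huber, undershoots by 2

w_1, score = best_weight(y_val, pred_cat, pred_huber)
print(w_1, 1.0 - w_1, score)
```

In this toy case the errors of the two models cancel exactly at equal weights; on real data the best `w_1` usually leans toward the stronger model.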