## Homework

In this homework you will predict house prices from their characteristics.

Quality metric: `RMSE`
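For reference, RMSE is the square root of the mean squared error. A minimal sketch with NumPy follows; the helper name `rmse` is an assumption for illustration and is not part of the assignment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: the errors are 0, 0 and 3, so the mean squared error is 3
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))
```

Because RMSE is expressed in the units of the target, the values reported in this notebook can be read directly as an error in the same units as `price`.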

Grading:
* Baseline - 2 points
* Feature Engineering - 2 points
* Model Selection - 3 points
* Ensemble v.1 - 3 points
* (*) Ensemble v.2 - optional, 2 points

### Dataset description

A short description of the data:
```
price: sale price (this is the target variable)
id: transaction id
timestamp: date of transaction
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
floor: for apartments, floor of the building
max_floor: number of floors in the building
material: wall material
build_year: year built
num_room: number of living rooms
kitch_sq: kitchen area
state: apartment condition
product_type: owner-occupier purchase or investment
sub_area: name of the district

The dataset also includes a collection of features about each property's surrounding neighbourhood, and some features that are constant across each sub area (known as a Raion). Most of the feature names are self explanatory, with the following notes. See below for a complete list.

full_all: subarea population
male_f, female_f: subarea population by gender
young_*: population younger than working age
work_*: working-age population
ekder_*: retirement-age population
n_m_{all|male|female}: population between n and m years old
build_count_*: buildings in the subarea by construction type or year
x_count_500: the number of x within 500m of the property
x_part_500: the share of x within 500m of the property
_sqm_: square meters
cafe_count_d_price_p: number of cafes within d meters of the property that have an average bill under p RUB
trc_: shopping malls
prom_: industrial zones
green_: green zones
metro_: subway
_avto_: distances by car
mkad_: Moscow Circle Auto Road
ttk_: Third Transport Ring
sadovoe_: Garden Ring
bulvar_ring_: Boulevard Ring
kremlin_: City center
zd_vokzaly_: Train station
oil_chemistry_: Dirty industry
ts_: Power plant
```

### Setup

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("data.csv", parse_dates=["timestamp"])

In [3]:
df.head()

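| | id | timestamp | full_sq | life_sq | floor | max_floor | material | build_year | num_room | kitch_sq | ... | cafe_count_5000_price_2500 | cafe_count_5000_price_4000 | cafe_count_5000_price_high | big_church_count_5000 | church_count_5000 | mosque_count_5000 | leisure_count_5000 | sport_count_5000 | market_count_5000 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-12-26 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 36 | 7 | 2 | 15 | 33 | 1 | 12 | 75 | 10 | 15318960 |
| 1 | 1 | 2012-10-04 | 64 | 64.0 | 16.0 | NaN | NaN | NaN | NaN | NaN | ... | 2 | 2 | 0 | 0 | 13 | 1 | 0 | 6 | 1 | 6080000 |
| 2 | 2 | 2014-02-05 | 83 | 44.0 | 9.0 | 17.0 | 1.0 | 1985.0 | 3.0 | 10.0 | ... | 13 | 6 | 1 | 8 | 18 | 0 | 1 | 52 | 0 | 17000000 |
| 3 | 3 | 2012-07-26 | 71 | 49.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 1 | 3 | 0 | 2 | 8 | 2 | 990000 |
| 4 | 4 | 2014-10-29 | 60 | 42.0 | 9.0 | 9.0 | 1.0 | 1970.0 | 3.0 | 6.0 | ... | 3 | 1 | 0 | 5 | 8 | 0 | 1 | 34 | 5 | 7900000 |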


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Columns: 292 entries, id to price
dtypes: datetime64[ns](1), float64(119), int64(157), object(15)
memory usage: 44.6+ MB


In [5]:
df.isna().any().sum()

51

In [6]:
df.isna().sum()

id                       0
timestamp                0
full_sq                  0
life_sq               4103
floor                  113
                      ... 
mosque_count_5000        0
leisure_count_5000       0
sport_count_5000         0
market_count_5000        0
price                    0
Length: 292, dtype: int64

In [89]:
X = df.drop("price", axis=1)
y = df["price"]

In [90]:
X = X.drop(['id', 'timestamp'], axis = 1)

In [91]:
X.shape

(20000, 295)

In [92]:
X = pd.get_dummies(X)

In [93]:
X.shape

(20000, 457)

In [94]:
X.fillna(X.mean(), inplace=True)

Split the available data into training and test sets. Use the first 80% of the rows as the training set and the last 20% as the test set.

In [95]:
from sklearn.model_selection import train_test_split

In [96]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
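With `shuffle=False`, `train_test_split` keeps the row order, so the result is the same as slicing the first 80% and the last 20% by position. The sketch below checks this on hypothetical toy data (`X_toy`, `y_toy` are illustrative names). Note that the rows shown by `df.head()` are not sorted by `timestamp`, so this is a positional split rather than a chronological one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y
X_toy = pd.DataFrame({"full_sq": range(10)})
y_toy = pd.Series(range(100, 110))

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

# The same split written as explicit positional slicing
n_train = int(len(X_toy) * 0.8)
X_tr_manual, X_te_manual = X_toy.iloc[:n_train], X_toy.iloc[n_train:]

print(X_tr.equals(X_tr_manual), X_te.equals(X_te_manual))
```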

Some models (for example, gradient boosting libraries) may require you to specify which columns are categorical. To keep things simple, we suggest splitting the columns as follows:
```
drop_columns = [
    'id',           # May leak information
    'timestamp',    # May leak information
]
cat_columns = [
    'product_type',              #
    'material',                  # Material of the wall
    'state',                     # Satisfaction level
    'sub_area',                  # District name
    'culture_objects_top_25',    #
    'thermal_power_plant_raion', #
    'incineration_raion',        #
    'oil_chemistry_raion',       #
    'radiation_raion',           #
    'railroad_terminal_raion',   #
    'big_market_raion',          #
    'nuclear_reactor_raion',     #
    'detention_facility_raion',  #
    'ID_metro',                  #
    'ID_railroad_station_walk',  #
    'ID_railroad_station_avto',  #
    'water_1line',               #
    'ID_big_road1',              #
    'big_road1_1line',           #
    'ID_big_road2',              #
    'railroad_1line',            #
    'ID_railroad_terminal',      #
    'ID_bus_terminal',           #
    'ecology',                   #
]
num_columns = list(set(df.columns).difference(set(cat_columns + drop_columns)))
```
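A list like `cat_columns` is typically used in one of two ways: one-hot encoding (as `pd.get_dummies` does in this notebook) or passing the column names to a boosting library, for instance through the `cat_features` parameter of `CatBoostRegressor`. In the second case the categorical values usually must not contain NaN. The sketch below shows one possible preparation step on a hypothetical toy frame; the `"missing"` label and the names `toy` and `toy_cat_columns` are assumptions.

```python
import pandas as pd

# Hypothetical toy frame with two of the categorical columns listed above
toy = pd.DataFrame({
    "product_type": ["Investment", "OwnerOccupier", None],
    "material": [1.0, None, 2.0],
    "full_sq": [64, 83, 71],
})
toy_cat_columns = ["product_type", "material"]

# Replace missing values with an explicit label and store every category as a string
prepared = toy.copy()
for col in toy_cat_columns:
    prepared[col] = prepared[col].map(lambda v: "missing" if pd.isna(v) else str(v))

print(prepared)
```

Numeric columns such as `full_sq` are left untouched, so they can still be imputed or scaled separately.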

In [15]:
drop_columns = [
    'id',           # May leak information
    'timestamp',    # May leak information
]
cat_columns = [
    'product_type',              #
    'material',                  # Material of the wall
    'state',                     # Satisfaction level
    'sub_area',                  # District name
    'culture_objects_top_25',    #
    'thermal_power_plant_raion', #
    'incineration_raion',        #
    'oil_chemistry_raion',       #
    'radiation_raion',           #
    'railroad_terminal_raion',   #
    'big_market_raion',          #
    'nuclear_reactor_raion',     #
    'detention_facility_raion',  #
    'ID_metro',                  #
    'ID_railroad_station_walk',  #
    'ID_railroad_station_avto',  #
    'water_1line',               #
    'ID_big_road1',              #
    'big_road1_1line',           #
    'ID_big_road2',              #
    'railroad_1line',            #
    'ID_railroad_terminal',      #
    'ID_bus_terminal',           #
    'ecology',                   #
]
num_columns = list(set(df.columns).difference(set(cat_columns + drop_columns)))

### Baseline (2 points)

As a baseline, train a `DecisionTreeRegressor` from `sklearn`.

In [16]:
from sklearn.tree import DecisionTreeRegressor

In [17]:
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor()

Evaluate the quality on the held-out set.

In [18]:
from sklearn.metrics import mean_squared_error as mse

In [19]:
# The sklearn signature is (y_true, y_pred); predict once and reuse the result
y_pred = model.predict(X_test)
print("mse: ", mse(y_test, y_pred))
print("rmse: ", mse(y_test, y_pred)**0.5)

mse:  14370035532470.627
rmse:  3790782.9709006855


### Feature Engineering (2 points)

Careful feature engineering can often improve a model.

Let's add the following features to the model:
* "How many listings appeared in the same year and month"
* "How many listings appeared in the same year and week"
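To make the idea of these count features concrete, here is a minimal sketch on hypothetical toy timestamps (the frame `toy` is an illustration only): each row receives the number of rows that share its year and month.

```python
import pandas as pd

# Hypothetical toy timestamps: three sales in January 2014, one in February 2014
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-03", "2014-01-15", "2014-01-28", "2014-02-05"])
})

# Encode (year, month) as a single integer key, e.g. 201401
key = toy["timestamp"].dt.year * 100 + toy["timestamp"].dt.month

# Each row receives the number of rows sharing its key
toy["month_year_cnt"] = key.map(key.value_counts())

print(toy)
```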

In [20]:
month_year = (df.timestamp.dt.month + df.timestamp.dt.year * 100)
month_year_cnt_map = month_year.value_counts().to_dict()
df["month_year_cnt"] = month_year.map(month_year_cnt_map)

# Series.dt.weekofyear is deprecated (removed in pandas 2.0); isocalendar().week gives the same ISO week number
week_year = (df.timestamp.dt.isocalendar().week.astype(int) + df.timestamp.dt.year * 100)
week_year_cnt_map = week_year.value_counts().to_dict()
df["week_year_cnt"] = week_year.map(week_year_cnt_map)


Add the following additional features:
* Month (from the `timestamp` column)
* Day of the week (from the `timestamp` column)
* The ratio "floor / maximum floor in the building" (columns `floor` and `max_floor`)
* The ratio "kitchen area / apartment area" (columns `kitch_sq` and `full_sq`)

You may add other features if you wish.

In [21]:
df['month'] = df.timestamp.dt.month

In [22]:
df['day_of_week'] = df.timestamp.dt.dayofweek

In [23]:
df['floor/max_floor'] = df.floor / df.max_floor

In [24]:
df['kitch_sq/full_sq'] = df.kitch_sq / df.full_sq
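Ratio features like the two above become `inf` whenever the denominator is 0, which is why infinite values have to be cleaned up later in this notebook. One possible way to avoid them at creation time, sketched on hypothetical toy rows, is to treat a zero denominator as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the second has max_floor == 0, the third has it missing
toy = pd.DataFrame({"floor": [9.0, 2.0, 5.0], "max_floor": [17.0, 0.0, np.nan]})

# Plain division yields inf when the denominator is 0
naive = toy["floor"] / toy["max_floor"]

# Treating a zero denominator as missing keeps the ratio finite or NaN
safe = toy["floor"] / toy["max_floor"].replace(0, np.nan)

print(naive.tolist())
print(safe.tolist())
```

With this variant a single imputation step handles both the originally missing values and the undefined ratios.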

In [25]:
df.head()

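| | id | timestamp | full_sq | life_sq | floor | max_floor | material | build_year | num_room | kitch_sq | ... | leisure_count_5000 | sport_count_5000 | market_count_5000 | price | month_year_cnt | week_year_cnt | month | day_of_week | floor/max_floor | kitch_sq/full_sq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-12-26 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 12 | 75 | 10 | 15318960 | 1122 | 112 | 12 | 4 | 1.0 | 1.0 |
| 1 | 1 | 2012-10-04 | 64 | 64.0 | 16.0 | NaN | NaN | NaN | NaN | NaN | ... | 0 | 6 | 1 | 6080000 | 327 | 68 | 10 | 3 | NaN | NaN |
| 2 | 2 | 2014-02-05 | 83 | 44.0 | 9.0 | 17.0 | 1.0 | 1985.0 | 3.0 | 10.0 | ... | 1 | 52 | 0 | 17000000 | 752 | 192 | 2 | 2 | 0.529412 | 0.120482 |
| 3 | 3 | 2012-07-26 | 71 | 49.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | 2 | 8 | 2 | 990000 | 262 | 69 | 7 | 3 | NaN | NaN |
| 4 | 4 | 2014-10-29 | 60 | 42.0 | 9.0 | 9.0 | 1.0 | 1970.0 | 3.0 | 6.0 | ... | 1 | 34 | 5 | 7900000 | 711 | 214 | 10 | 2 | 1.0 | 0.1 |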


Split the data into training and test sets again (because the additional features were created on the original, unsplit dataset).

In [59]:
X = df.drop(["price", "id", "timestamp"], axis=1)
y = df["price"]

In [60]:
X = pd.get_dummies(X)

In [61]:
X.fillna(X.mean(), inplace=True)

In [62]:
X.isna().any().sum()

0

In [63]:
X.replace([np.inf, -np.inf], np.nan, inplace=True)

In [64]:
X.isna().any().sum()

1

In [65]:
X['floor/max_floor'].isna().sum()

6645

In [66]:
X.fillna(1, inplace=True)

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
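The cells above fill missing values with column means computed over all rows, so the test rows influence the fill values. A possible alternative, sketched here on hypothetical toy data and not required by the assignment, is to compute the means on the training part only and reuse them for the test part:

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature matrix; the last row plays the role of the test set
X_toy = pd.DataFrame({"life_sq": [40.0, np.nan, 60.0, 50.0, np.nan]})
X_tr, X_te = X_toy.iloc[:4].copy(), X_toy.iloc[4:].copy()

# Fill statistics come from the training rows only
train_means = X_tr.mean()
X_tr = X_tr.fillna(train_means)
X_te = X_te.fillna(train_means)

print(X_tr["life_sq"].tolist(), X_te["life_sq"].tolist())
```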

In [35]:
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor()

In [36]:
# The sklearn signature is (y_true, y_pred); predict once and reuse the result
y_pred = model.predict(X_test)
print("mse: ", mse(y_test, y_pred))
print("rmse: ", mse(y_test, y_pred)**0.5)

mse:  14096091211989.361
rmse:  3754476.156801287


### Model Selection (3 points)

See what quality you can achieve with different models:
* `DecisionTreeRegressor` from `sklearn`
* `RandomForestRegressor` from `sklearn`
* `CatBoostRegressor`

You can also try linear models and other boosting libraries (`LightGBM` and `XGBoost`).

Almost every library offers a convenient way to tune hyperparameters: see how to do it in [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or in [catboost](https://catboost.ai/docs/concepts/python-reference_catboostregressor_grid_search.html).

Evaluate each model on the test set and choose the best one.
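As an illustration of the hyperparameter search mentioned above, here is a minimal `GridSearchCV` sketch on synthetic data. The grid values and the names `X_toy`, `y_toy` are assumptions chosen for the example, not tuned recommendations. Cross-validation runs on the training data only, which leaves the test set free for the final comparison.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for X_train, y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# Illustrative grid; the values are assumptions
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best configuration
```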

In [40]:
from sklearn.ensemble import RandomForestRegressor

In [41]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor()

In [42]:
# The sklearn signature is (y_true, y_pred); predict once and reuse the result
y_pred = model.predict(X_test)
print("mse: ", mse(y_test, y_pred))
print("rmse: ", mse(y_test, y_pred)**0.5)

mse:  7347474510030.466
rmse:  2710622.531823726


In [43]:
pip install catboost

Collecting catboost
  Downloading catboost-1.0.5-cp38-none-win_amd64.whl (73.9 MB)
Collecting graphviz
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
Installing collected packages: graphviz, catboost
Successfully installed catboost-1.0.5 graphviz-0.20
Note: you may need to restart the kernel to use updated packages.


In [44]:
from catboost import CatBoostRegressor

In [45]:
model = CatBoostRegressor()
model.fit(X_train, y_train)

Learning rate set to 0.06345
0:	learn: 4462364.1939822	total: 188ms	remaining: 3m 7s
1:	learn: 4322611.4386474	total: 208ms	remaining: 1m 43s
2:	learn: 4204585.7088874	total: 227ms	remaining: 1m 15s
3:	learn: 4088323.5457329	total: 245ms	remaining: 1m
4:	learn: 3982404.6273667	total: 263ms	remaining: 52.3s
5:	learn: 3877869.1155272	total: 280ms	remaining: 46.4s
6:	learn: 3783488.0478221	total: 297ms	remaining: 42.1s
7:	learn: 3698355.7154376	total: 314ms	remaining: 39s
8:	learn: 3620556.3155370	total: 332ms	remaining: 36.6s
9:	learn: 3550006.7844262	total: 350ms	remaining: 34.6s
10:	learn: 3480605.1146699	total: 367ms	remaining: 33s
11:	learn: 3422933.6757729	total: 399ms	remaining: 32.9s
12:	learn: 3364146.5920282	total: 430ms	remaining: 32.6s
13:	learn: 3307739.5108115	total: 447ms	remaining: 31.5s
14:	learn: 3259291.9885110	total: 466ms	remaining: 30.6s
15:	learn: 3214074.5484106	total: 486ms	remaining: 29.9s
16:	learn: 3171333.9165987	total: 505ms	remaining: 29.2s
...

144:	learn: 2292248.2456466	total: 3.65s	remaining: 21.5s
145:	learn: 2290825.2974191	total: 3.68s	remaining: 21.5s
146:	learn: 2289142.2878301	total: 3.71s	remaining: 21.5s
147:	learn: 2287653.7765711	total: 3.73s	remaining: 21.5s
148:	learn: 2284759.4201864	total: 3.75s	remaining: 21.4s
149:	learn: 2283261.0185735	total: 3.77s	remaining: 21.3s
150:	learn: 2281776.0729567	total: 3.78s	remaining: 21.3s
151:	learn: 2280539.3217394	total: 3.8s	remaining: 21.2s
152:	learn: 2279238.9826725	total: 3.83s	remaining: 21.2s
153:	learn: 2277590.3159406	total: 3.86s	remaining: 21.2s
154:	learn: 2274885.8778930	total: 3.89s	remaining: 21.2s
155:	learn: 2273205.1818890	total: 3.9s	remaining: 21.1s
156:	learn: 2271604.6528333	total: 3.92s	remaining: 21.1s
157:	learn: 2269846.3422530	total: 3.95s	remaining: 21.1s
158:	learn: 2266958.9194380	total: 3.97s	remaining: 21s
159:	learn: 2264643.5095815	total: 4s	remaining: 21s
160:	learn: 2263758.6473338	total: 4.03s	remaining: 21s
...

292:	learn: 2072723.1245072	total: 7.49s	remaining: 18.1s
293:	learn: 2071258.1648461	total: 7.52s	remaining: 18.1s
294:	learn: 2069289.1279826	total: 7.55s	remaining: 18.1s
295:	learn: 2068243.3433230	total: 7.58s	remaining: 18s
296:	learn: 2067034.9026208	total: 7.62s	remaining: 18s
297:	learn: 2066539.7544924	total: 7.63s	remaining: 18s
298:	learn: 2065000.1724819	total: 7.67s	remaining: 18s
299:	learn: 2064918.1812354	total: 7.7s	remaining: 18s
300:	learn: 2063718.7972722	total: 7.73s	remaining: 18s
301:	learn: 2062360.6826706	total: 7.75s	remaining: 17.9s
302:	learn: 2061429.6824249	total: 7.78s	remaining: 17.9s
303:	learn: 2060309.4578954	total: 7.81s	remaining: 17.9s
304:	learn: 2058627.4037924	total: 7.83s	remaining: 17.8s
305:	learn: 2057902.7299503	total: 7.85s	remaining: 17.8s
306:	learn: 2056949.2670045	total: 7.87s	remaining: 17.8s
307:	learn: 2054895.0150661	total: 7.9s	remaining: 17.8s
308:	learn: 2053316.9303448	total: 7.92s	remaining: 17.7s
...

437:	learn: 1929030.7911999	total: 11.4s	remaining: 14.6s
438:	learn: 1927287.6641887	total: 11.4s	remaining: 14.6s
439:	learn: 1926754.7799100	total: 11.4s	remaining: 14.6s
440:	learn: 1925735.2452979	total: 11.5s	remaining: 14.5s
441:	learn: 1924763.2665417	total: 11.5s	remaining: 14.5s
442:	learn: 1923938.9826850	total: 11.5s	remaining: 14.5s
443:	learn: 1922914.2361380	total: 11.6s	remaining: 14.5s
444:	learn: 1922364.2894948	total: 11.6s	remaining: 14.4s
445:	learn: 1921784.5561185	total: 11.6s	remaining: 14.4s
446:	learn: 1921121.1440079	total: 11.6s	remaining: 14.4s
447:	learn: 1919921.3678289	total: 11.7s	remaining: 14.4s
448:	learn: 1919023.5197206	total: 11.7s	remaining: 14.4s
449:	learn: 1917787.5736843	total: 11.7s	remaining: 14.3s
450:	learn: 1916899.8217652	total: 11.8s	remaining: 14.3s
451:	learn: 1916029.1171829	total: 11.8s	remaining: 14.3s
452:	learn: 1914800.7770813	total: 11.8s	remaining: 14.3s
453:	learn: 1914285.2785110	total: 11.9s	remaining: 14.3s
...

582:	learn: 1816584.6579712	total: 15.5s	remaining: 11.1s
583:	learn: 1815495.0588181	total: 15.6s	remaining: 11.1s
584:	learn: 1815050.9801679	total: 15.6s	remaining: 11.1s
585:	learn: 1814436.8910231	total: 15.6s	remaining: 11s
586:	learn: 1813378.5737062	total: 15.6s	remaining: 11s
587:	learn: 1812584.1972037	total: 15.7s	remaining: 11s
588:	learn: 1811505.7390228	total: 15.7s	remaining: 10.9s
589:	learn: 1810121.4628135	total: 15.7s	remaining: 10.9s
590:	learn: 1809298.6660544	total: 15.7s	remaining: 10.9s
591:	learn: 1808859.7814855	total: 15.8s	remaining: 10.9s
592:	learn: 1808466.8086436	total: 15.8s	remaining: 10.8s
593:	learn: 1807832.6803160	total: 15.8s	remaining: 10.8s
594:	learn: 1807264.3359670	total: 15.8s	remaining: 10.8s
595:	learn: 1806710.3717090	total: 15.9s	remaining: 10.8s
596:	learn: 1806113.4049289	total: 15.9s	remaining: 10.7s
597:	learn: 1805770.7067773	total: 15.9s	remaining: 10.7s
598:	learn: 1804678.5027859	total: 16s	remaining: 10.7s
...

732:	learn: 1710915.4750836	total: 19.6s	remaining: 7.15s
733:	learn: 1710130.8604350	total: 19.6s	remaining: 7.12s
734:	learn: 1709477.9006544	total: 19.7s	remaining: 7.09s
735:	learn: 1709097.2075817	total: 19.7s	remaining: 7.06s
736:	learn: 1708497.1128966	total: 19.7s	remaining: 7.04s
737:	learn: 1708064.6520245	total: 19.8s	remaining: 7.01s
738:	learn: 1707379.9958674	total: 19.8s	remaining: 6.99s
739:	learn: 1706775.5089157	total: 19.8s	remaining: 6.96s
740:	learn: 1706065.9360815	total: 19.8s	remaining: 6.94s
741:	learn: 1705315.1782082	total: 19.9s	remaining: 6.91s
742:	learn: 1704635.6374118	total: 19.9s	remaining: 6.88s
743:	learn: 1703893.0624359	total: 19.9s	remaining: 6.85s
744:	learn: 1703093.6897640	total: 19.9s	remaining: 6.82s
745:	learn: 1702819.5646767	total: 20s	remaining: 6.79s
746:	learn: 1701826.5252262	total: 20s	remaining: 6.77s
747:	learn: 1701648.5033441	total: 20s	remaining: 6.74s
748:	learn: 1700557.8939366	total: 20s	remaining: 6.72s
...

875:	learn: 1635994.1444936	total: 23.6s	remaining: 3.34s
876:	learn: 1635366.9100602	total: 23.6s	remaining: 3.31s
877:	learn: 1635068.3144427	total: 23.6s	remaining: 3.29s
878:	learn: 1634500.8993175	total: 23.7s	remaining: 3.26s
879:	learn: 1633940.7631110	total: 23.7s	remaining: 3.23s
880:	learn: 1632898.3107461	total: 23.7s	remaining: 3.21s
881:	learn: 1632322.4802387	total: 23.8s	remaining: 3.18s
882:	learn: 1631328.7415691	total: 23.8s	remaining: 3.15s
883:	learn: 1631069.3813694	total: 23.8s	remaining: 3.13s
884:	learn: 1630168.9616475	total: 23.8s	remaining: 3.1s
885:	learn: 1629647.5147333	total: 23.9s	remaining: 3.07s
886:	learn: 1628903.0543303	total: 23.9s	remaining: 3.05s
887:	learn: 1628810.0724582	total: 23.9s	remaining: 3.02s
888:	learn: 1628416.2341004	total: 24s	remaining: 2.99s
889:	learn: 1627712.6701348	total: 24s	remaining: 2.97s
890:	learn: 1627161.5357127	total: 24s	remaining: 2.94s
891:	learn: 1626747.2146484	total: 24.1s	remaining: 2.91s
...

<catboost.core.CatBoostRegressor at 0x2172ca25d00>

In [46]:
# The sklearn signature is (y_true, y_pred); predict once and reuse the result
y_pred = model.predict(X_test)
print("mse: ", mse(y_test, y_pred))
print("rmse: ", mse(y_test, y_pred)**0.5)

mse:  6598316257601.038
rmse:  2568718.796910444


### Ensemble v.1 (3 points)

Ensembles sometimes outperform a single large model.

The `product_type` column indicates the type of listing: `Investment` (the apartment is sold as an investment) or `OwnerOccupier` (the apartment is sold to live in). It is reasonable to assume that training a separate model for each type will improve quality.

Train your best models separately on `Investment` and `OwnerOccupier` (i.e. you will have `model_invest` trained on `(invest_train_X, invest_train_Y)` and `model_owner` trained on `(owner_train_X, owner_train_Y)`) and check the quality on the held-out set (i.e. on the original `test_split`).

In [119]:
invest_train_X = X_train[X_train['product_type_Investment'] == 1]
invest_train_y = y_train[invest_train_X.index]

In [98]:
model_invest = CatBoostRegressor()
model_invest.fit(invest_train_X, invest_train_y)

Learning rate set to 0.059131
0:	learn: 4748443.1626395	total: 20.4ms	remaining: 20.4s
1:	learn: 4630394.2957514	total: 36.2ms	remaining: 18.1s
2:	learn: 4524098.9120143	total: 65ms	remaining: 21.6s
3:	learn: 4409826.3796340	total: 80.8ms	remaining: 20.1s
4:	learn: 4303232.3952496	total: 96.7ms	remaining: 19.2s
5:	learn: 4203833.1106526	total: 113ms	remaining: 18.7s
6:	learn: 4114869.2477210	total: 129ms	remaining: 18.3s
7:	learn: 4037248.9288048	total: 145ms	remaining: 18s
8:	learn: 3965871.8208847	total: 161ms	remaining: 17.8s
9:	learn: 3888786.2255179	total: 177ms	remaining: 17.5s
10:	learn: 3824449.2052097	total: 196ms	remaining: 17.6s
11:	learn: 3766474.5821944	total: 212ms	remaining: 17.5s
12:	learn: 3711561.9910250	total: 229ms	remaining: 17.4s
13:	learn: 3660708.9949368	total: 246ms	remaining: 17.3s
14:	learn: 3606588.0002751	total: 262ms	remaining: 17.2s
15:	learn: 3562961.5619271	total: 278ms	remaining: 17.1s
16:	learn: 3527230.9266596	total: 293ms	remaining: 17s
...

143:	learn: 2598741.8870878	total: 3.06s	remaining: 18.2s
144:	learn: 2596574.9810467	total: 3.09s	remaining: 18.2s
145:	learn: 2593987.3649693	total: 3.12s	remaining: 18.3s
146:	learn: 2592768.5380068	total: 3.14s	remaining: 18.2s
147:	learn: 2591249.3132205	total: 3.16s	remaining: 18.2s
148:	learn: 2589120.8713630	total: 3.17s	remaining: 18.1s
149:	learn: 2586498.5722568	total: 3.19s	remaining: 18.1s
150:	learn: 2584435.3779475	total: 3.21s	remaining: 18.1s
151:	learn: 2582186.1301693	total: 3.23s	remaining: 18s
152:	learn: 2580804.4610903	total: 3.25s	remaining: 18s
153:	learn: 2580002.6501444	total: 3.27s	remaining: 18s
154:	learn: 2577566.6246924	total: 3.29s	remaining: 18s
155:	learn: 2575314.1875850	total: 3.32s	remaining: 18s
156:	learn: 2574281.2431052	total: 3.35s	remaining: 18s
157:	learn: 2573217.4889273	total: 3.37s	remaining: 18s
158:	learn: 2572223.6348919	total: 3.4s	remaining: 18s
159:	learn: 2570770.2557276	total: 3.43s	remaining: 18s
...

292:	learn: 2344243.6838827	total: 6.45s	remaining: 15.6s
293:	learn: 2342309.3073960	total: 6.47s	remaining: 15.6s
294:	learn: 2340874.2627905	total: 6.51s	remaining: 15.5s
295:	learn: 2340786.3105248	total: 6.52s	remaining: 15.5s
296:	learn: 2339631.1629150	total: 6.54s	remaining: 15.5s
297:	learn: 2338028.0831247	total: 6.55s	remaining: 15.4s
298:	learn: 2336604.5209169	total: 6.57s	remaining: 15.4s
299:	learn: 2334681.2251754	total: 6.6s	remaining: 15.4s
300:	learn: 2333583.9535252	total: 6.62s	remaining: 15.4s
301:	learn: 2332349.0163376	total: 6.63s	remaining: 15.3s
302:	learn: 2330975.7025946	total: 6.66s	remaining: 15.3s
303:	learn: 2330888.7278279	total: 6.69s	remaining: 15.3s
304:	learn: 2330792.5276182	total: 6.7s	remaining: 15.3s
305:	learn: 2329245.7339675	total: 6.72s	remaining: 15.2s
306:	learn: 2327000.8039379	total: 6.74s	remaining: 15.2s
307:	learn: 2325549.5085995	total: 6.75s	remaining: 15.2s
308:	learn: 2323888.5017040	total: 6.78s	remaining: 15.2s
...

441:	learn: 2167609.9675091	total: 9.65s	remaining: 12.2s
442:	learn: 2166606.7665583	total: 9.68s	remaining: 12.2s
443:	learn: 2165119.8999394	total: 9.7s	remaining: 12.1s
444:	learn: 2164066.1454376	total: 9.72s	remaining: 12.1s
445:	learn: 2163206.9805407	total: 9.73s	remaining: 12.1s
446:	learn: 2162192.2407838	total: 9.76s	remaining: 12.1s
447:	learn: 2161013.0494753	total: 9.79s	remaining: 12.1s
448:	learn: 2159834.5327195	total: 9.82s	remaining: 12.1s
449:	learn: 2159778.6573612	total: 9.85s	remaining: 12s
450:	learn: 2158460.8834857	total: 9.87s	remaining: 12s
451:	learn: 2158408.4054320	total: 9.89s	remaining: 12s
452:	learn: 2156970.1199234	total: 9.91s	remaining: 12s
453:	learn: 2155891.8954172	total: 9.93s	remaining: 11.9s
454:	learn: 2155841.9266531	total: 9.96s	remaining: 11.9s
455:	learn: 2154021.6840993	total: 9.97s	remaining: 11.9s
456:	learn: 2152829.9440201	total: 10s	remaining: 11.9s
457:	learn: 2151421.3527300	total: 10s	remaining: 11.9s
...

591:	learn: 2025260.4476626	total: 13.1s	remaining: 9.03s
592:	learn: 2024183.8956336	total: 13.1s	remaining: 9.01s
593:	learn: 2022949.7867241	total: 13.2s	remaining: 8.99s
594:	learn: 2022182.8283247	total: 13.2s	remaining: 8.96s
595:	learn: 2020692.1455185	total: 13.2s	remaining: 8.94s
596:	learn: 2019997.3973794	total: 13.2s	remaining: 8.91s
597:	learn: 2018645.1225503	total: 13.2s	remaining: 8.88s
598:	learn: 2017822.0569968	total: 13.2s	remaining: 8.86s
599:	learn: 2017374.5878237	total: 13.2s	remaining: 8.83s
600:	learn: 2016398.1463807	total: 13.3s	remaining: 8.81s
601:	learn: 2015497.1531756	total: 13.3s	remaining: 8.8s
602:	learn: 2014515.7800268	total: 13.3s	remaining: 8.78s
603:	learn: 2012523.3269338	total: 13.4s	remaining: 8.76s
604:	learn: 2011403.8483405	total: 13.4s	remaining: 8.73s
605:	learn: 2010571.2311176	total: 13.4s	remaining: 8.71s
606:	learn: 2009590.8979410	total: 13.4s	remaining: 8.69s
607:	learn: 2008820.9760371	total: 13.4s	remaining: 8.67s
...

742:	learn: 1900481.5114146	total: 16.5s	remaining: 5.72s
743:	learn: 1899682.2540848	total: 16.6s	remaining: 5.7s
744:	learn: 1897869.2051088	total: 16.6s	remaining: 5.68s
745:	learn: 1897183.0059845	total: 16.6s	remaining: 5.66s
746:	learn: 1896543.3741094	total: 16.6s	remaining: 5.64s
747:	learn: 1896062.8065369	total: 16.7s	remaining: 5.62s
748:	learn: 1895601.8338932	total: 16.7s	remaining: 5.6s
749:	learn: 1894667.9573074	total: 16.7s	remaining: 5.58s
750:	learn: 1893689.9112763	total: 16.8s	remaining: 5.55s
751:	learn: 1893206.7956536	total: 16.8s	remaining: 5.53s
752:	learn: 1892142.3187336	total: 16.8s	remaining: 5.51s
753:	learn: 1891421.4335096	total: 16.8s	remaining: 5.49s
754:	learn: 1890500.7858520	total: 16.9s	remaining: 5.47s
755:	learn: 1889763.8659540	total: 16.9s	remaining: 5.45s
756:	learn: 1889171.0522023	total: 16.9s	remaining: 5.42s
757:	learn: 1888559.6071949	total: 16.9s	remaining: 5.4s
758:	learn: 1888410.4405018	total: 16.9s	remaining: 5.38s
...

886:	learn: 1792047.7016460	total: 20s	remaining: 2.54s
887:	learn: 1791593.0203967	total: 20s	remaining: 2.52s
888:	learn: 1791191.8343383	total: 20s	remaining: 2.5s
889:	learn: 1790308.8036235	total: 20.1s	remaining: 2.48s
890:	learn: 1789680.5489209	total: 20.1s	remaining: 2.46s
891:	learn: 1789207.6328224	total: 20.1s	remaining: 2.43s
892:	learn: 1788433.9496719	total: 20.1s	remaining: 2.41s
893:	learn: 1787607.0143465	total: 20.2s	remaining: 2.39s
894:	learn: 1786964.8992425	total: 20.2s	remaining: 2.37s
895:	learn: 1786845.9953566	total: 20.2s	remaining: 2.35s
896:	learn: 1786257.3020753	total: 20.2s	remaining: 2.32s
897:	learn: 1785321.5787626	total: 20.3s	remaining: 2.3s
898:	learn: 1784366.5536853	total: 20.3s	remaining: 2.28s
899:	learn: 1784162.9264244	total: 20.3s	remaining: 2.25s
900:	learn: 1783593.1405568	total: 20.3s	remaining: 2.23s
901:	learn: 1782688.3484422	total: 20.4s	remaining: 2.21s
902:	learn: 1782064.5062122	total: 20.4s	remaining: 2.19s
...

<catboost.core.CatBoostRegressor at 0x21724877280>

In [99]:
owner_train_X = X_train[X_train['product_type_OwnerOccupier'] == 1]
owner_train_y = y_train[owner_train_X.index]

In [100]:
model_owner = CatBoostRegressor()
model_owner.fit(owner_train_X, owner_train_y)

Learning rate set to 0.05399
0:	learn: 3825022.8342964	total: 14.1ms	remaining: 14.1s
1:	learn: 3685608.5308892	total: 38.8ms	remaining: 19.4s
2:	learn: 3558463.0031828	total: 63.7ms	remaining: 21.2s
3:	learn: 3440683.9584564	total: 75ms	remaining: 18.7s
4:	learn: 3332852.2574500	total: 101ms	remaining: 20s
5:	learn: 3223395.3668028	total: 113ms	remaining: 18.7s
6:	learn: 3122662.7707338	total: 125ms	remaining: 17.8s
7:	learn: 3030158.7166958	total: 137ms	remaining: 17s
8:	learn: 2934208.6029405	total: 148ms	remaining: 16.3s
9:	learn: 2852856.7341518	total: 160ms	remaining: 15.8s
10:	learn: 2770719.9124372	total: 171ms	remaining: 15.4s
11:	learn: 2692493.3855509	total: 185ms	remaining: 15.3s
12:	learn: 2616786.3019984	total: 212ms	remaining: 16.1s
13:	learn: 2543463.1286698	total: 224ms	remaining: 15.8s
14:	learn: 2479200.9581079	total: 237ms	remaining: 15.5s
15:	learn: 2411832.8609497	total: 249ms	remaining: 15.3s
16:	learn: 2353911.9104892	total: 261ms	remaining: 15.1s
...

154:	learn: 1022879.9018632	total: 2.19s	remaining: 11.9s
155:	learn: 1020609.7218157	total: 2.21s	remaining: 12s
156:	learn: 1018708.7429158	total: 2.22s	remaining: 11.9s
157:	learn: 1017648.1587431	total: 2.24s	remaining: 11.9s
158:	learn: 1016828.2049491	total: 2.25s	remaining: 11.9s
159:	learn: 1013970.3154844	total: 2.26s	remaining: 11.9s
160:	learn: 1011214.7420427	total: 2.27s	remaining: 11.8s
161:	learn: 1009207.0148997	total: 2.28s	remaining: 11.8s
162:	learn: 1007434.0844179	total: 2.3s	remaining: 11.8s
163:	learn: 1005833.0064164	total: 2.31s	remaining: 11.8s
164:	learn: 1005442.4305471	total: 2.33s	remaining: 11.8s
165:	learn: 1005124.2175280	total: 2.35s	remaining: 11.8s
166:	learn: 1004827.3099883	total: 2.36s	remaining: 11.8s
167:	learn: 1001334.3041854	total: 2.37s	remaining: 11.7s
168:	learn: 999731.6782956	total: 2.38s	remaining: 11.7s
169:	learn: 997453.6270157	total: 2.41s	remaining: 11.8s
170:	learn: 996057.2559299	total: 2.42s	remaining: 11.7s
...

306:	learn: 828836.8413087	total: 4.37s	remaining: 9.87s
307:	learn: 827926.4918371	total: 4.38s	remaining: 9.85s
308:	learn: 826549.1954181	total: 4.4s	remaining: 9.83s
309:	learn: 825261.2690484	total: 4.41s	remaining: 9.81s
310:	learn: 825122.0580751	total: 4.42s	remaining: 9.79s
311:	learn: 824986.8410771	total: 4.43s	remaining: 9.77s
312:	learn: 823887.7279722	total: 4.44s	remaining: 9.75s
313:	learn: 822931.3385229	total: 4.47s	remaining: 9.76s
314:	learn: 820099.8447840	total: 4.48s	remaining: 9.74s
315:	learn: 818923.3431582	total: 4.49s	remaining: 9.72s
316:	learn: 817543.7001125	total: 4.5s	remaining: 9.7s
317:	learn: 817413.5847988	total: 4.51s	remaining: 9.68s
318:	learn: 816347.2726172	total: 4.53s	remaining: 9.66s
319:	learn: 815197.9899536	total: 4.54s	remaining: 9.64s
320:	learn: 815050.2091000	total: 4.55s	remaining: 9.63s
321:	learn: 814618.8543517	total: 4.57s	remaining: 9.63s
322:	learn: 811771.3746316	total: 4.59s	remaining: 9.61s
...

458:	learn: 708213.5598262	total: 6.4s	remaining: 7.55s
459:	learn: 706760.9255639	total: 6.41s	remaining: 7.53s
460:	learn: 706691.5509868	total: 6.42s	remaining: 7.51s
461:	learn: 706180.3277118	total: 6.44s	remaining: 7.5s
462:	learn: 705568.1480967	total: 6.45s	remaining: 7.48s
463:	learn: 705007.5635446	total: 6.46s	remaining: 7.46s
464:	learn: 704142.2580167	total: 6.47s	remaining: 7.45s
465:	learn: 703574.0334763	total: 6.48s	remaining: 7.43s
466:	learn: 703181.1392343	total: 6.5s	remaining: 7.41s
467:	learn: 702809.8992626	total: 6.51s	remaining: 7.4s
468:	learn: 702409.0980643	total: 6.53s	remaining: 7.39s
469:	learn: 701671.9797596	total: 6.54s	remaining: 7.38s
470:	learn: 701251.3844421	total: 6.55s	remaining: 7.36s
471:	learn: 700347.5668868	total: 6.57s	remaining: 7.34s
472:	learn: 699675.5757277	total: 6.58s	remaining: 7.33s
473:	learn: 699610.5976061	total: 6.59s	remaining: 7.31s
474:	learn: 698401.2088722	total: 6.6s	remaining: 7.3s
...

613:	learn: 616502.0360091	total: 8.61s	remaining: 5.41s
614:	learn: 615621.8837192	total: 8.63s	remaining: 5.4s
615:	learn: 615241.5166902	total: 8.64s	remaining: 5.38s
616:	learn: 614906.1816255	total: 8.65s	remaining: 5.37s
617:	learn: 614830.9955454	total: 8.65s	remaining: 5.35s
618:	learn: 614153.4957050	total: 8.66s	remaining: 5.33s
619:	learn: 613682.4808754	total: 8.68s	remaining: 5.32s
620:	learn: 612424.8821323	total: 8.71s	remaining: 5.32s
621:	learn: 611986.0412578	total: 8.73s	remaining: 5.3s
622:	learn: 611145.7412725	total: 8.74s	remaining: 5.29s
623:	learn: 610625.3843702	total: 8.75s	remaining: 5.27s
624:	learn: 610138.1182023	total: 8.77s	remaining: 5.26s
625:	learn: 609651.8182225	total: 8.79s	remaining: 5.25s
626:	learn: 609163.1930870	total: 8.8s	remaining: 5.24s
627:	learn: 608604.5333044	total: 8.82s	remaining: 5.22s
628:	learn: 608569.1405485	total: 8.83s	remaining: 5.21s
629:	learn: 607835.2964867	total: 8.84s	remaining: 5.19s
...

763:	learn: 554165.0489242	total: 10.6s	remaining: 3.29s
764:	learn: 553800.4163383	total: 10.7s	remaining: 3.27s
765:	learn: 553776.7700786	total: 10.7s	remaining: 3.26s
766:	learn: 553352.7663981	total: 10.7s	remaining: 3.24s
767:	learn: 552925.7193948	total: 10.7s	remaining: 3.23s
768:	learn: 551641.9607913	total: 10.7s	remaining: 3.21s
769:	learn: 550837.6561936	total: 10.7s	remaining: 3.2s
770:	learn: 550215.5827874	total: 10.7s	remaining: 3.19s
771:	learn: 550036.7903946	total: 10.7s	remaining: 3.17s
772:	learn: 549480.4083153	total: 10.8s	remaining: 3.16s
773:	learn: 549303.6311775	total: 10.8s	remaining: 3.14s
774:	learn: 548932.9259941	total: 10.8s	remaining: 3.13s
775:	learn: 548530.9400977	total: 10.8s	remaining: 3.12s
776:	learn: 547753.8611741	total: 10.8s	remaining: 3.11s
777:	learn: 547731.4615052	total: 10.9s	remaining: 3.1s
778:	learn: 547361.9000890	total: 10.9s	remaining: 3.08s
779:	learn: 547171.2010835	total: 10.9s	remaining: 3.07s
...

918:	learn: 498902.6153725	total: 12.6s	remaining: 1.11s
919:	learn: 497920.1586157	total: 12.7s	remaining: 1.1s
920:	learn: 497529.5710675	total: 12.7s	remaining: 1.09s
921:	learn: 497081.6117796	total: 12.7s	remaining: 1.07s
922:	learn: 496670.8565432	total: 12.7s	remaining: 1.06s
923:	learn: 496495.5755661	total: 12.7s	remaining: 1.04s
924:	learn: 496083.4664804	total: 12.7s	remaining: 1.03s
925:	learn: 495571.5351907	total: 12.7s	remaining: 1.02s
926:	learn: 495301.6255097	total: 12.7s	remaining: 1s
927:	learn: 495177.8010550	total: 12.8s	remaining: 990ms
928:	learn: 494783.2792803	total: 12.8s	remaining: 976ms
929:	learn: 494359.5016605	total: 12.8s	remaining: 962ms
930:	learn: 494286.1680441	total: 12.8s	remaining: 948ms
931:	learn: 494157.3015856	total: 12.8s	remaining: 934ms
932:	learn: 493910.0791496	total: 12.8s	remaining: 921ms
933:	learn: 493593.7686098	total: 12.8s	remaining: 907ms
934:	learn: 493549.0341135	total: 12.8s	remaining: 893ms
...

<catboost.core.CatBoostRegressor at 0x217248aebe0>

In [101]:
invest_test_X = X_test[X_test['product_type_Investment'] == 1]
invest_test_y = y_test[invest_test_X.index]

owner_test_X = X_test[X_test['product_type_OwnerOccupier'] == 1]
owner_test_y = y_test[owner_test_X.index]

In [105]:
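# Use a new name so the imported `mse` function is not overwritten
sq_err_invest = (model_invest.predict(invest_test_X) - invest_test_y) ** 2
sq_err_owner = (model_owner.predict(owner_test_X) - owner_test_y) ** 2
ensemble_mse = (sq_err_invest.sum() + sq_err_owner.sum()) / len(y_test)
print("mse: ", ensemble_mse)
print("rmse: ", ensemble_mse**0.5)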

mse:  6813986527829.94
rmse:  2610361.378780712
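An alternative way to evaluate such a two-model ensemble is to assemble one prediction vector aligned with the test rows, so that any standard metric function can be applied to it. The sketch below shows this on hypothetical toy data. `LinearRegression` is only a stand-in for the per-segment models, and `model_a`, `model_b` are illustrative names. It assumes every row belongs to exactly one of the two segments.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature plus a one-hot segment flag
X_toy = pd.DataFrame({
    "full_sq": [30.0, 40.0, 50.0, 60.0, 35.0, 45.0, 55.0, 65.0],
    "product_type_Investment": [1, 1, 1, 1, 0, 0, 0, 0],
})
y_toy = pd.Series([3.0, 4.0, 5.0, 6.0, 7.0, 9.0, 11.0, 13.0])

is_invest = X_toy["product_type_Investment"] == 1

# One model per segment (LinearRegression is only a stand-in here)
model_a = LinearRegression().fit(X_toy[is_invest], y_toy[is_invest])
model_b = LinearRegression().fit(X_toy[~is_invest], y_toy[~is_invest])

# Route each row to its segment's model and keep the original row order
pred = pd.Series(np.nan, index=X_toy.index)
pred[is_invest] = model_a.predict(X_toy[is_invest])
pred[~is_invest] = model_b.predict(X_toy[~is_invest])

rmse = float(np.sqrt(np.mean((pred - y_toy) ** 2)))
print(rmse)
```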


### (*) Ensemble v.2 (optional, 2 points)

Try building a more complex model for `Investment`: train a `CatBoostRegressor` and a `HuberRegressor` from `sklearn`, then combine their predictions with weights `w_1` and `w_2` (choose the weights yourself; they must sum to 1).
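A minimal sketch of such a weighted blend follows, under stated assumptions: the data is synthetic, `DecisionTreeRegressor` stands in for `CatBoostRegressor` so that the example depends only on scikit-learn, and the weight is picked from a coarse grid on a separate validation part rather than on the test set.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data; DecisionTreeRegressor stands in for CatBoostRegressor
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_val, y_tr, y_val = X_toy[:200], X_toy[200:], y_toy[:200], y_toy[200:]

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
huber = HuberRegressor().fit(X_tr, y_tr)
pred_tree, pred_huber = tree.predict(X_val), huber.predict(X_val)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Scan w_1 over a coarse grid; w_2 = 1 - w_1 keeps the weights summing to 1
candidates = np.linspace(0.0, 1.0, 11)
scores = [rmse(y_val, w * pred_tree + (1.0 - w) * pred_huber) for w in candidates]
w_1 = float(candidates[int(np.argmin(scores))])
w_2 = 1.0 - w_1

print(w_1, w_2, min(scores))
```

Since the grid includes `w_1 = 0` and `w_1 = 1`, the selected blend is never worse on the validation part than either model alone.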