## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homework 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

Let's load the data:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]

df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

* Apply the log transform to `price`
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1
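
As a small illustration of the log transform step, here is a sketch using `np.log1p` (the function the solution below applies) together with its inverse `np.expm1`. The price values are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical prices, including a listing with price 0
prices = np.array([0.0, 50.0, 150.0, 10000.0])

# log1p(x) = log(1 + x), so a price of 0 maps to 0 instead of -inf
log_prices = np.log1p(prices)

# expm1(y) = exp(y) - 1 undoes the transform
restored = np.expm1(log_prices)

print(log_prices.round(3))
print(restored.round(2))
```

Predictions made by a model trained on the transformed target can be mapped back to the original price scale with `np.expm1` in the same way.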

In [4]:
df = df.fillna(0)
df.price = np.log1p(df.price)

In [5]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state=1)
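
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data, and 25% of 80% is 20% of the whole. A quick arithmetic check, assuming a hypothetical dataset of 1000 rows and ignoring any rounding that `train_test_split` may apply:

```python
n_rows = 1000  # hypothetical dataset size, chosen so the shares divide evenly

# First split: hold out 20% of all rows for testing
n_test = round(n_rows * 0.2)
n_full_train = n_rows - n_test

# Second split: hold out 25% of the remaining rows for validation
n_val = round(n_full_train * 0.25)
n_train = n_full_train - n_val

# Shares relative to the whole dataset
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```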

In [6]:
df_train = df_train.reset_index(drop=True)
df_full_train = df_full_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [7]:
y_train = df_train.price
y_val = df_val.price
y_test = df_test.price
y_full_train = df_full_train.price

In [8]:
# remove the target column `price` from the feature dataframes
del df_train['price']
del df_val['price']
del df_test['price']
del df_full_train['price']

Now, use `DictVectorizer` to turn train and validation into matrices:

In [9]:
from sklearn.feature_extraction import DictVectorizer

In [10]:
df_train.dtypes

neighbourhood_group                object
latitude                          float64
longitude                         float64
room_type                          object
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [11]:
train_dict = df_train.to_dict(orient='records')
dv = DictVectorizer()
X_train = dv.fit_transform(train_dict)
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)
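
To see what `DictVectorizer` produces, here is a minimal sketch on two made-up records. The names `records`, `dv_demo` and `X_demo` are introduced only for this illustration, and `sparse=False` is used just to make the matrix easy to print.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical listings with one categorical and one numerical field
records = [
    {'room_type': 'Private room', 'minimum_nights': 2},
    {'room_type': 'Entire home/apt', 'minimum_nights': 5},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

# String values become one-hot columns named "<field>=<value>";
# numerical values are passed through unchanged
print(dv_demo.feature_names_)
print(X_demo)
```

This is why the feature names later in the notebook look like `room_type=Entire home/apt`.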

## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [12]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=1)
dtr.fit(X_train,y_train)

DecisionTreeRegressor(max_depth=1)

In [13]:
dtr.feature_importances_

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])

In [14]:
from sklearn.tree import export_text
print(export_text(dtr, feature_names=dv.feature_names_))

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [4.29]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [5.15]
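
The leaf values above are on the `log1p` scale, because the model was trained on the transformed target. A rough sketch of converting them back to dollars with `expm1`; note that the result corresponds to the mean of the log prices in each leaf, so it should not be read as the average price of those listings.

```python
import math

# Leaf values copied from the tree printed above (log1p scale)
leaf_other = 4.29        # room_type=Entire home/apt <= 0.50
leaf_entire_home = 5.15  # room_type=Entire home/apt >  0.50

# Undo log1p to get back to the price scale
price_other = math.expm1(leaf_other)
price_entire_home = math.expm1(leaf_entire_home)

print(round(price_other, 2), round(price_entire_home, 2))
```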



Which feature is used for splitting the data?

* `room_type`
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [15]:
from sklearn.ensemble import RandomForestRegressor
rfg = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rfg.fit(X_train, y_train)

RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=1)

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
y_pred = rfg.predict(X_val)

In [21]:
mean_squared_error(y_val, y_pred, squared=False)

0.4615925727520376
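
For reference, RMSE is the square root of the mean squared difference between targets and predictions. Recent scikit-learn releases deprecate the `squared` argument in favor of a separate `root_mean_squared_error` function, so a version-independent definition can be handy. The helper `rmse` below is a sketch written for this note, not part of scikit-learn, and the numbers are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions: errors are 0, 0 and 3
y_true_demo = [4.0, 5.0, 6.0]
y_pred_demo = [4.0, 5.0, 9.0]

print(rmse(y_true_demo, y_pred_demo))
```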

What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659

## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [23]:
for n in range(10, 201, 10):
    rfg = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rfg.fit(X_train, y_train)
    y_pred = rfg.predict(X_val)
    print(n, mean_squared_error(y_val, y_pred, squared=False))

10 0.4615925727520376
20 0.44819337713865326
30 0.44520237568262205
40 0.443071911922709
50 0.442006832295465
60 0.4413676274724525
70 0.4408812396704666
80 0.4408952796705091
90 0.44037181892894844
100 0.43995205056182096
110 0.4394770432088332
120 0.4392844234239186
130 0.43933429713639516


KeyboardInterrupt: 
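
Instead of scanning the printed numbers by eye, we can pick the smallest RMSE programmatically. The sketch below reuses the values printed above; since the run was interrupted at 130, it only covers the part of the range that actually finished.

```python
# Validation RMSE per n_estimators, copied from the output above
scores = {
    10: 0.4615925727520376,
    20: 0.44819337713865326,
    30: 0.44520237568262205,
    40: 0.443071911922709,
    50: 0.442006832295465,
    60: 0.4413676274724525,
    70: 0.4408812396704666,
    80: 0.4408952796705091,
    90: 0.44037181892894844,
    100: 0.43995205056182096,
    110: 0.4394770432088332,
    120: 0.4392844234239186,
    130: 0.43933429713639516,
}

# n_estimators with the lowest RMSE among the completed runs
best_n = min(scores, key=scores.get)
print(best_n, scores[best_n])
```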

After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- 70
- 120

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [28]:
for max_depth in [None, 10, 15, 20, 25]:
    for n in range(10, 201, 10):
        rfg = RandomForestRegressor(max_depth = max_depth, n_estimators=n, random_state=1, n_jobs=-1)
        rfg.fit(X_train, y_train)
        y_pred = rfg.predict(X_val)
        print(n, max_depth, mean_squared_error(y_val, y_pred, squared=False))

10 None 0.4615925727520376
20 None 0.44819337713865326
30 None 0.44520237568262205
40 None 0.443071911922709
50 None 0.442006832295465
10 10 0.44557361547005103
20 10 0.44192682567439234
30 10 0.4412059459106206
40 10 0.4411708790569971
50 10 0.4408040894361395
10 15 0.4498834880023299
20 15 0.4412309381740603
30 15 0.43971717088484674
40 15 0.4389428281920364
50 15 0.4381403171121294
10 20 0.457952466412639
20 20 0.4456414818640784
30 20 0.4432739694602562


KeyboardInterrupt: 
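
To compare depths, one option is to keep the best RMSE seen for each `max_depth`. The sketch below uses only the values printed above, so it is incomplete: the run was interrupted before `max_depth=20` finished and before `max_depth=25` started.

```python
# (n_estimators, max_depth, validation RMSE), copied from the output above
results = [
    (10, None, 0.4615925727520376), (20, None, 0.44819337713865326),
    (30, None, 0.44520237568262205), (40, None, 0.443071911922709),
    (50, None, 0.442006832295465),
    (10, 10, 0.44557361547005103), (20, 10, 0.44192682567439234),
    (30, 10, 0.4412059459106206), (40, 10, 0.4411708790569971),
    (50, 10, 0.4408040894361395),
    (10, 15, 0.4498834880023299), (20, 15, 0.4412309381740603),
    (30, 15, 0.43971717088484674), (40, 15, 0.4389428281920364),
    (50, 15, 0.4381403171121294),
    (10, 20, 0.457952466412639), (20, 20, 0.4456414818640784),
    (30, 20, 0.4432739694602562),
]

# Track the lowest RMSE (and the n_estimators that produced it) per depth
best_per_depth = {}
for n, depth, score in results:
    if depth not in best_per_depth or score < best_per_depth[depth][1]:
        best_per_depth[depth] = (n, score)

best_depth = min(best_per_depth, key=lambda d: best_per_depth[d][1])
print(best_depth, best_per_depth[best_depth])
```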

What's the best `max_depth`?

* 10
* 15
* 20
* 25

Bonus question (not graded):

Will the answer be different if we change the seed for the model? <b>Sure.</b>

## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing it, we can calculate "gain" - the reduction in impurity achieved by the split. 
This gain is quite useful for understanding which features are important 
for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 
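
The gain described above can be sketched for a regression target, assuming variance is used as the impurity measure and both sides of the split are non-empty. The helper `variance_gain` and the toy arrays are hypothetical; scikit-learn's own importance computation also aggregates over all nodes and trees and normalizes the result.

```python
import numpy as np

def variance_gain(y, mask):
    """Impurity before the split minus the size-weighted impurity after it."""
    y = np.asarray(y, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    left, right = y[~mask], y[mask]
    weighted_child = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return float(y.var() - weighted_child)

# Toy log-prices: the first three rows are cheaper than the last three
y_demo = np.array([4.0, 4.2, 4.4, 5.0, 5.2, 5.4])

is_entire_home = np.array([0, 0, 0, 1, 1, 1])  # separates cheap from expensive
is_odd_row = np.array([1, 0, 1, 0, 1, 0])      # unrelated to the target

gain_informative = variance_gain(y_demo, is_entire_home)
gain_noise = variance_gain(y_demo, is_odd_row)
print(round(gain_informative, 4), round(gain_noise, 4))
```

A feature that separates the target well gets a much larger gain than an unrelated one, which is the intuition behind the importance scores.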

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

In [29]:
rfg = RandomForestRegressor(max_depth = 20, n_estimators=10, random_state=1, n_jobs=-1)
rfg.fit(X_train, y_train)

RandomForestRegressor(max_depth=20, n_estimators=10, n_jobs=-1, random_state=1)

In [34]:
for name, score in zip(dv.feature_names_,rfg.feature_importances_):
    print(name, ':', score.round(3))

availability_365 : 0.076
calculated_host_listings_count : 0.03
latitude : 0.151
longitude : 0.154
minimum_nights : 0.054
neighbourhood_group=Bronx : 0.0
neighbourhood_group=Brooklyn : 0.001
neighbourhood_group=Manhattan : 0.034
neighbourhood_group=Queens : 0.001
neighbourhood_group=Staten Island : 0.0
number_of_reviews : 0.043
reviews_per_month : 0.053
room_type=Entire home/apt : 0.392
room_type=Private room : 0.004
room_type=Shared room : 0.005
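
With many features, it is easier to sort the importances than to read them in alphabetical order. A small sketch using the rounded values printed above:

```python
# Feature importances copied from the output above
importances = {
    'availability_365': 0.076,
    'calculated_host_listings_count': 0.03,
    'latitude': 0.151,
    'longitude': 0.154,
    'minimum_nights': 0.054,
    'neighbourhood_group=Bronx': 0.0,
    'neighbourhood_group=Brooklyn': 0.001,
    'neighbourhood_group=Manhattan': 0.034,
    'neighbourhood_group=Queens': 0.001,
    'neighbourhood_group=Staten Island': 0.0,
    'number_of_reviews': 0.043,
    'reviews_per_month': 0.053,
    'room_type=Entire home/apt': 0.392,
    'room_type=Private room': 0.004,
    'room_type=Shared room': 0.005,
}

# Sort from most to least important and show the top three
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked[:3]:
    print(name, ':', score)
```

In the notebook itself, the same sorting can be applied directly to `zip(dv.feature_names_, rfg.feature_importances_)`.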


What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`
* `longitude`
* `latitude`

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

In [35]:
import xgboost as xgb
features = dv.feature_names_
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

In [54]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

In [55]:
model = xgb.train(xgb_params,
                  dtrain, num_boost_round=100)

In [56]:
y_pred = model.predict(dval)
mean_squared_error(y_val, y_pred, squared=False)

0.43621034591295677

Now change `eta` first to `0.1` and then to `0.01`
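
To build some intuition for `eta` before rerunning the model, here is a deliberately simplified toy, not XGBoost itself: each round "fits" the remaining residual perfectly and adds only a fraction `eta` of it to the prediction. The function `toy_boosting` is hypothetical and ignores everything else a real booster does (trees, regularization, validation data).

```python
def toy_boosting(target, eta, n_rounds):
    """Repeatedly move the prediction a fraction eta toward the target."""
    prediction = 0.0
    for _ in range(n_rounds):
        residual = target - prediction  # what is still unexplained
        prediction += eta * residual    # shrunken update
    return prediction

target = 1.0
for eta in [0.3, 0.1, 0.01]:
    print(eta, round(toy_boosting(target, eta, n_rounds=100), 3))
```

With a fixed budget of 100 rounds, a very small `eta` may not get close to the target, while larger values converge quickly. The toy says nothing about which value generalizes best, so the validation RMSE still has to be checked for each `eta`.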

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

