Homework 
---

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import os

# Dataset
In this homework, we will continue working with the New York City Airbnb Open Data.

We'll keep working with the 'price' variable, and we'll turn it into the target of a classification task.

In [2]:
if not os.path.isfile('./AB_NYC_2019.csv'):
    !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv

In [3]:
df_prev = pd.read_csv('AB_NYC_2019.csv')

# Features
For the rest of the homework, use the features from the previous homework plus two additional ones, 'neighbourhood_group' and 'room_type'. The full feature set is:

* 'neighbourhood_group',
* 'room_type',
* 'latitude',
* 'longitude',
* 'price',
* 'minimum_nights',
* 'number_of_reviews',
* 'reviews_per_month',
* 'calculated_host_listings_count',
* 'availability_365'

Select only them and fill in the missing values with 0.

In [4]:
features = ['neighbourhood_group',
'room_type',
'latitude',
'longitude',
'price',
'minimum_nights',
'number_of_reviews',
'reviews_per_month',
'calculated_host_listings_count',
'availability_365']

In [5]:
df = df_prev[features]
# Drop 'price' (the target) from the feature list; the column stays in df for now
features.remove('price')
df.dtypes

neighbourhood_group                object
room_type                          object
latitude                          float64
longitude                         float64
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [6]:
df.isnull().sum()

neighbourhood_group                   0
room_type                             0
latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [7]:
df = df.fillna(0)
df.isnull().sum()

neighbourhood_group               0
room_type                         0
latitude                          0
longitude                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

---
# Question 1
What is the most frequent observation (mode) for the column 'neighbourhood_group'?

In [8]:
for n in df.neighbourhood_group.unique():
    print(df.neighbourhood_group[df.neighbourhood_group == n].value_counts())

Brooklyn    20104
Name: neighbourhood_group, dtype: int64
Manhattan    21661
Name: neighbourhood_group, dtype: int64
Queens    5666
Name: neighbourhood_group, dtype: int64
Staten Island    373
Name: neighbourhood_group, dtype: int64
Bronx    1091
Name: neighbourhood_group, dtype: int64


## Answer
The most frequent observation for the column 'neighbourhood_group' is **Manhattan**
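
The same answer can be read off more directly from a single `value_counts()` call. The sketch below uses a small made-up Series as a stand-in for `df.neighbourhood_group`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df.neighbourhood_group (illustrative values, not the real data)
groups = pd.Series(
    ["Manhattan", "Brooklyn", "Manhattan", "Queens", "Brooklyn", "Manhattan"]
)

counts = groups.value_counts()   # counts per category, sorted in descending order
most_frequent = counts.idxmax()  # label with the highest count, i.e. the mode

print(counts)
print(most_frequent)
```

On the real data, `df.neighbourhood_group.value_counts()` would list all five groups at once, and `df.neighbourhood_group.mode()[0]` is an equivalent shortcut.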

---
## Split the data
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.
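
A 60%/20%/20% split takes two calls to `train_test_split`: the first sets aside 20% for the test set, and the second takes 25% of the remaining 80% (0.25 × 0.8 = 0.2) for validation. A minimal sketch on a made-up 100-row dataframe, so the resulting sizes are easy to check:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with 100 rows (illustrative only)
toy = pd.DataFrame({"x": np.arange(100)})

# Step 1: hold out 20% of all rows as the test set
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Step 2: 25% of the remaining 80% gives 20% of the original for validation
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
```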

In [9]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

del df_train['price']
del df_val['price']
del df_test['price']

# Question 2
* Create the correlation matrix for the numerical features of your train dataset.
    * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [10]:
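# Computed on the full dataframe (price included); df_train.corr() would restrict it to the training set
df.corr()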

| | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.084788 | 0.033939 | 0.024869 | -0.015389 | -0.018758 | 0.019517 | -0.010983 |
| longitude | 0.084788 | 1.0 | -0.150019 | -0.062747 | 0.059094 | 0.138516 | -0.114713 | 0.082731 |
| price | 0.033939 | -0.150019 | 1.0 | 0.042799 | -0.047954 | -0.050564 | 0.057472 | 0.081829 |
| minimum_nights | 0.024869 | -0.062747 | 0.042799 | 1.0 | -0.080116 | -0.124905 | 0.12796 | 0.144303 |
| number_of_reviews | -0.015389 | 0.059094 | -0.047954 | -0.080116 | 1.0 | 0.589407 | -0.072376 | 0.172028 |
| reviews_per_month | -0.018758 | 0.138516 | -0.050564 | -0.124905 | 0.589407 | 1.0 | -0.047312 | 0.163732 |
| calculated_host_listings_count | 0.019517 | -0.114713 | 0.057472 | 0.12796 | -0.072376 | -0.047312 | 1.0 | 0.225701 |
| availability_365 | -0.010983 | 0.082731 | 0.081829 | 0.144303 | 0.172028 | 0.163732 | 0.225701 | 1.0 |


## Answer 
The two features that have the biggest correlation in this dataset are **number_of_reviews and reviews_per_month**
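
Rather than scanning the matrix by eye, the strongest pair can be picked out programmatically. The sketch below runs on synthetic data (columns `a`, `b`, `c` are made up for illustration): it takes absolute correlations, keeps only the upper triangle so each pair appears once and the diagonal of 1.0 is ignored, and asks for the largest entry.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Mask everything except the strict upper triangle (no diagonal, no duplicates)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# unstack() flattens the matrix to a Series indexed by (column, row);
# idxmax() skips the masked NaN entries
top_pair = upper.unstack().idxmax()
print(top_pair)
```

Applied to the training set, the same steps would start from `df_train.corr().abs()`.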

---
## Make price binary
* We need to turn the price variable from numeric into binary.
* Let's create a variable above_average which is 1 if the price is above (or equal to) 152.

In [11]:
df_full_train = df_full_train.reset_index(drop=True)
df_full_train['above_average'] = df_full_train.price >= 152
df_full_train

| | neighbourhood_group | room_type | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | above_average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | Entire home/apt | 40.71577 | -73.95530 | 295 | 3 | 11 | 0.87 | 1 | 1 | True |
| 1 | Manhattan | Private room | 40.84917 | -73.94048 | 70 | 2 | 2 | 0.16 | 1 | 0 | False |
| 2 | Brooklyn | Private room | 40.68993 | -73.95947 | 58 | 2 | 0 | 0.00 | 2 | 0 | False |
| 3 | Brooklyn | Entire home/apt | 40.68427 | -73.93118 | 75 | 3 | 87 | 4.91 | 1 | 267 | False |
| 4 | Queens | Private room | 40.74705 | -73.89564 | 38 | 5 | 13 | 0.25 | 1 | 0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39111 | Manhattan | Shared room | 40.84650 | -73.94319 | 60 | 1 | 0 | 0.00 | 1 | 0 | False |
| 39112 | Manhattan | Private room | 40.73957 | -74.00082 | 85 | 2 | 4 | 1.90 | 1 | 76 | False |
| 39113 | Manhattan | Entire home/apt | 40.78318 | -73.97372 | 130 | 30 | 1 | 0.34 | 5 | 261 | False |
| 39114 | Manhattan | Entire home/apt | 40.77508 | -73.97990 | 150 | 2 | 11 | 0.13 | 1 | 2 | False |


# Question 3
* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using round(score, 2)

In [12]:
from sklearn.metrics import mutual_info_score


def mutual_info_price_score(series):
    return mutual_info_score(series, df_full_train.above_average)

m_i = df_full_train[['neighbourhood_group','room_type']].apply(mutual_info_price_score)
m_i.sort_values(ascending=False).round(2)

room_type              0.14
neighbourhood_group    0.05
dtype: float64

## Answer
The variable **room_type** has the bigger score
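
For intuition about what the score measures, mutual information between two discrete variables can be computed from its definition, I(X;Y) = Σ p(x,y) · log(p(x,y) / (p(x)·p(y))), and compared with `mutual_info_score`, which also uses the natural logarithm. The helper `mutual_info_manual` and the tiny label lists below are hypothetical, written only for this illustration.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import mutual_info_score


def mutual_info_manual(xs, ys):
    """Mutual information (in nats) of two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) combination
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * np.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Made-up labels standing in for room_type and above_average
room_type = ["Entire", "Entire", "Private", "Private", "Shared", "Entire"]
above_average = [1, 1, 0, 0, 0, 0]

manual = mutual_info_manual(room_type, above_average)
library = mutual_info_score(room_type, above_average)
print(manual, library)
```

A score of 0 means the two variables are independent; larger values mean that knowing one tells more about the other.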

---
# Question 4
* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
    * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    * model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [13]:
y_train_binarized = (y_train >= 152).astype(int)
y_val_binarized = (y_val >= 152).astype(int)
y_test_binarized = (y_test >= 152).astype(int)

In [14]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

dv = DictVectorizer(sparse=False)

dicts_train = df_train[features].to_dict(orient='records')
X_train = dv.fit_transform(dicts_train)

val_dict = df_val[features].to_dict(orient='records')
X_val = dv.transform(val_dict)



In [15]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42, max_iter=1000, n_jobs=4)
model.fit(X_train, y_train_binarized)

LogisticRegression(max_iter=1000, n_jobs=4, random_state=42)

In [16]:
model.score(X_val,y_val_binarized).round(2)

0.79

## Answer
The accuracy on the validation dataset is **0.79**
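
To see what the one-hot encoding step produces, here is a minimal sketch of `DictVectorizer` on two made-up records. String values become one 0/1 column per category (named `feature=value`), while numerical values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up listings (illustrative only)
records = [
    {"room_type": "Private room", "minimum_nights": 2},
    {"room_type": "Shared room", "minimum_nights": 1},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)  # column names, sorted alphabetically
print(X)                  # one row per record, one column per name
```

Calling `dv.transform` on validation records reuses the columns learned during `fit_transform`, which is why the notebook fits the vectorizer on the training set only.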

---
# Question 5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference? 
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive

In [17]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
dict(zip(dv.get_feature_names(), model.coef_[0].round(4)))

{'availability_365': 0.003,
 'calculated_host_listings_count': 0.0036,
 'latitude': -5.6447,
 'longitude': -3.0715,
 'minimum_nights': -0.0113,
 'neighbourhood_group=Bronx': -0.102,
 'neighbourhood_group=Brooklyn': 0.114,
 'neighbourhood_group=Manhattan': 1.5649,
 'neighbourhood_group=Queens': -0.0299,
 'neighbourhood_group=Staten Island': -1.6014,
 'number_of_reviews': -0.0032,
 'reviews_per_month': -0.0414,
 'room_type=Entire home/apt': 1.9113,
 'room_type=Private room': -0.8633,
 'room_type=Shared room': -1.1025}

In [18]:
m_i = df_full_train[features].apply(mutual_info_price_score)
m_i.sort_values(ascending=False).round(3)



latitude                          0.295
longitude                         0.249
room_type                         0.142
neighbourhood_group               0.046
calculated_host_listings_count    0.037
reviews_per_month                 0.016
minimum_nights                    0.013
availability_365                  0.012
number_of_reviews                 0.010
dtype: float64

In [19]:
def train_model(features):
    """Fit the Q4 logistic regression on the given features and return the validation accuracy."""
    # DictVectorizer and LogisticRegression are already imported in cell 14
    dv = DictVectorizer(sparse=False)

    # One-hot encode the categorical columns; numerical columns pass through
    dicts_train = df_train[features].to_dict(orient='records')
    X_train = dv.fit_transform(dicts_train)

    val_dict = df_val[features].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42, max_iter=1000, n_jobs=4)
    model.fit(X_train, y_train_binarized)

    return model.score(X_val, y_val_binarized)

In [20]:
main_score = train_model(features)

In [21]:
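scores = []
diffs = []
for n in range(len(features)):
    # Leave out the n-th feature and retrain
    f = features.copy()
    f.pop(n)
    score = train_model(f)
    # Positive difference: accuracy dropped when the feature was removed
    diff = round(main_score - score, 4)
    scores.append(score)
    diffs.append(diff)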

In [22]:
dict(zip(features,diffs))

{'neighbourhood_group': 0.0399,
 'room_type': 0.0624,
 'latitude': 0.0041,
 'longitude': 0.0038,
 'minimum_nights': -0.0008,
 'number_of_reviews': -0.0004,
 'reviews_per_month': 0.0001,
 'calculated_host_listings_count': 0.0016,
 'availability_365': 0.0096}

## Answer
Among the following features:

* neighbourhood_group
* room_type
* number_of_reviews
* reviews_per_month

the one with the smallest difference in absolute value (0.0001) is **reviews_per_month**
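
Because a difference can be negative, "smallest" can be read two ways. The sketch below takes the four differences from the output of cell 22 and shows both readings; the answer above uses the absolute value, which measures how little the accuracy moves in either direction.

```python
# Accuracy differences copied from the cell 22 output (original minus reduced model)
diffs = {
    "neighbourhood_group": 0.0399,
    "room_type": 0.0624,
    "number_of_reviews": -0.0004,
    "reviews_per_month": 0.0001,
}

# Smallest change in either direction
smallest_abs = min(diffs, key=lambda name: abs(diffs[name]))

# Smallest signed value (the model actually improved without this feature)
smallest_signed = min(diffs, key=diffs.get)

print(smallest_abs)
print(smallest_signed)
```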

---
# Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [23]:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)
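
The cell above uses `np.log1p`, which computes log(1 + x) and therefore stays finite should a price of 0 occur. Its inverse is `np.expm1`, which maps predictions made in log space back to prices. A small round-trip sketch with made-up prices:

```python
import numpy as np

# Made-up prices, including 0 to show that log1p stays finite there
prices = np.array([0, 58, 150, 295])

log_prices = np.log1p(prices)    # log(1 + price)
restored = np.expm1(log_prices)  # exp(value) - 1, the inverse transform

print(log_prices)
print(restored)
```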

In [24]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Track the lowest RMSE seen so far. The comparison uses unrounded values,
# so alphas that tie after rounding to 3 digits are not treated as equal.
RMSEprev = float('inf')
for a in [0, 0.01, 0.1, 1, 10]:
    clf = Ridge(alpha=a)
    clf.fit(X_train, y_train_log)
    y_pred = clf.predict(X_val)
    # squared=False returns the RMSE instead of the MSE
    RMSE = mean_squared_error(y_val_log, y_pred, squared=False)
    if RMSE < RMSEprev:
        RMSEprev = RMSE
        best_alpha = a

    print(f'alpha {a} RMSE {RMSE.round(3)}')  # lower is better
print()
print(f'The best alpha is {best_alpha}')

alpha 0 RMSE 0.497
alpha 0.01 RMSE 0.497
alpha 0.1 RMSE 0.497
alpha 1 RMSE 0.497
alpha 10 RMSE 0.498

The best alpha is 0.01


## Answer
After rounding the RMSE to 3 decimal digits, alphas 0, 0.01, 0.1 and 1 all tie at 0.497, so the smallest of them is the answer: **0**
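
The tie-breaking rule can be applied in code by comparing the rounded RMSE first and the alpha second. A minimal sketch using the rounded scores printed above:

```python
# Rounded RMSE per alpha, copied from the output above
rmse_by_alpha = {0: 0.497, 0.01: 0.497, 0.1: 0.497, 1: 0.497, 10: 0.498}

# Sort key: lowest rounded RMSE first, then smallest alpha among ties
best_alpha = min(rmse_by_alpha, key=lambda a: (rmse_by_alpha[a], a))
print(best_alpha)
```

Inside the training loop, the same effect would come from storing `round(RMSE, 3)` for each alpha before comparing.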