# Machine Learning Zoomcamp Homeworks

## Week 3

In this homework, we will continue with the New York City Airbnb Open Data. You can take it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

We'll keep working with the `'price'` variable, and we'll transform it to a classification task.

### Features

For the rest of the homework, you'll need to use the features from the previous homework plus two additional ones: `'neighbourhood_group'` and `'room_type'`. The full feature set is:

* `'neighbourhood_group'`,
* `'room_type'`,
* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only these columns and fill in the missing values with 0.

In [1]:
import numpy as np
import pandas as pd


# forward slash works on every OS; 'data\AB...' relied on an invalid '\A' escape
df = pd.read_csv('data/AB_NYC_2019.csv')

# column selection
desired_columns = ['neighbourhood_group', 'room_type', 'latitude', 'longitude', 'price', 
                   'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
df_final = df[desired_columns]

# fill missing values with zero
df_final = df_final.fillna(0)

print(df_final.isnull().sum())

df_final


neighbourhood_group               0
room_type                         0
latitude                          0
longitude                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64


| | neighbourhood_group | room_type | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | Private room | 40.64749 | -73.97237 | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | Manhattan | Entire home/apt | 40.75362 | -73.98377 | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | Manhattan | Private room | 40.80902 | -73.94190 | 150 | 3 | 0 | 0.00 | 1 | 365 |
| 3 | Brooklyn | Entire home/apt | 40.68514 | -73.95976 | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | Manhattan | Entire home/apt | 40.79851 | -73.94399 | 80 | 10 | 9 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | Brooklyn | Private room | 40.67853 | -73.94995 | 70 | 2 | 0 | 0.00 | 2 | 9 |
| 48891 | Brooklyn | Private room | 40.70184 | -73.93317 | 40 | 4 | 0 | 0.00 | 2 | 36 |
| 48892 | Manhattan | Entire home/apt | 40.81475 | -73.94867 | 115 | 10 | 0 | 0.00 | 1 | 27 |
| 48893 | Manhattan | Shared room | 40.75751 | -73.99112 | 55 | 1 | 0 | 0.00 | 6 | 2 |


### Question #1

What is the most frequent observation (mode) for the column `'neighbourhood_group'`?

In [2]:
# answer to question #1

df_final['neighbourhood_group'].value_counts()


Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64

As we can see, `'Manhattan'` is the most frequent value in the `'neighbourhood_group'` column.
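
If only the mode itself is needed rather than the full frequency table, pandas offers two direct routes. The sketch below uses a small made-up Series (the values are illustrative, not taken from the dataset) to show both.

```python
import pandas as pd

# Toy data for illustration only; not the real column.
groups = pd.Series(['Manhattan', 'Brooklyn', 'Manhattan', 'Queens'])

# Series.mode() returns a Series (there can be ties); take the first entry.
mode_via_mode = groups.mode()[0]

# value_counts() sorts by frequency, so idxmax() gives the most frequent label.
mode_via_counts = groups.value_counts().idxmax()

print(mode_via_mode, mode_via_counts)
```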

### Split the data

* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value ('price') is not in your dataframe.


In [3]:
from sklearn.model_selection import train_test_split

random_seed = 42
df_train_full, df_test = train_test_split(df_final, test_size=0.2, random_state=random_seed)
df_train, df_valid     = train_test_split(df_train_full, test_size=0.25, random_state=random_seed)

print(len(df_final), len(df_train), len(df_valid), len(df_test))

del df_train['price']
del df_valid['price']
del df_test['price']


48895 29337 9779 9779
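
The second call uses `test_size=0.25` because it is applied to the 80% that remains after the test set is removed: 20% of the whole is a quarter of that remainder. The sketch below checks this arithmetic with exact fractions. The helper name `second_split_fraction` is made up for illustration, and the calculation ignores how `train_test_split` rounds sizes that are not whole numbers.

```python
from fractions import Fraction


def second_split_fraction(val_share, test_share):
    """Share of the post-test remainder that must go to validation."""
    return Fraction(val_share) / (1 - Fraction(test_share))


frac = second_split_fraction(Fraction(1, 5), Fraction(1, 5))
print(frac)  # 1/4

# Reproduce the set sizes printed above for 48895 rows.
n_total = 48895
n_test = int(n_total * Fraction(1, 5))
n_valid = int((n_total - n_test) * frac)
n_train = n_total - n_test - n_valid
print(n_total, n_train, n_valid, n_test)  # 48895 29337 9779 9779
```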


### Question #2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
   * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?


In [4]:
# answer to question #2

columns_numerical = df_train.select_dtypes(include=['int64', 'float64']).columns.to_list()

corr_matrix = []
for column in columns_numerical:
    corr_matrix.append(df_train[columns_numerical].corrwith(df_train[column]).to_numpy())

df_corr_matrix = pd.DataFrame(corr_matrix)
print('Correlation Matrix:\n', df_corr_matrix, '\n')


# scan the upper triangle (i < j) for the pair with the largest absolute correlation;
# this avoids using float correlations as dict keys and works even if no value is negative
best_pair, highest_corr = None, 0.0
for i in range(len(corr_matrix)):
    for j in range(i+1, len(corr_matrix)):
        if abs(corr_matrix[i][j]) > abs(highest_corr):
            best_pair, highest_corr = (i, j), corr_matrix[i][j]

print('Highest Correlated Columns: ',
      (columns_numerical[best_pair[0]], columns_numerical[best_pair[1]]),
      ': ', highest_corr)

Correlation Matrix:
           0         1         2         3         4         5         6
0  1.000000  0.080301  0.027441 -0.006246 -0.007159  0.019375 -0.005891
1  0.080301  1.000000 -0.060660  0.055084  0.134642 -0.117041  0.083666
2  0.027441 -0.060660  1.000000 -0.076020 -0.120703  0.118647  0.138901
3 -0.006246  0.055084 -0.076020  1.000000  0.590374 -0.073167  0.174477
4 -0.007159  0.134642 -0.120703  0.590374  1.000000 -0.048767  0.165376
5  0.019375 -0.117041  0.118647 -0.073167 -0.048767  1.000000  0.225913
6 -0.005891  0.083666  0.138901  0.174477  0.165376  0.225913  1.000000 

Highest Correlated Columns:  ('number_of_reviews', 'reviews_per_month') :  0.5903739015971651
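
The same search can be written more compactly with `DataFrame.corr()`, which also keeps the column names as labels. The sketch below is one possible variant, not the notebook's method: the helper `most_correlated_pair` and the toy DataFrame are made up for illustration.

```python
import numpy as np
import pandas as pd


def most_correlated_pair(df):
    """Return ((col_a, col_b), corr) for the pair with the largest |correlation|."""
    corr = df.select_dtypes(include='number').corr()
    values = corr.to_numpy().copy()
    # Zero out the diagonal and lower triangle so each pair is considered once
    # and a column is never paired with itself.
    values[np.tril_indices_from(values)] = 0.0
    i, j = np.unravel_index(np.nanargmax(np.abs(values)), values.shape)
    return (corr.index[i], corr.columns[j]), float(corr.iloc[i, j])


# Toy data for illustration only: 'a' and 'b' move together, 'c' does not.
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 11],
                    'c': [5, 1, 4, 2, 3]})
pair, value = most_correlated_pair(toy)
print(pair, round(value, 3))
```

On the real data this would be called as `most_correlated_pair(df_train)`.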


### Make price binary

* We need to turn the price variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the price is above (or equal to) `152`.


In [5]:
df_final['above_average'] = (df_final['price'] >= 152).astype(int)


random_seed = 42
df_train_full, df_test = train_test_split(df_final, test_size=0.2, random_state=random_seed)
df_train, df_valid     = train_test_split(df_train_full, test_size=0.25, random_state=random_seed)

del df_train['price']
del df_valid['price']
del df_test['price']

### Question #3

* Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`


In [6]:
# answer to question #3

from sklearn.metrics import mutual_info_score
from IPython.display import display


columns_categorical = df_train.select_dtypes(include=['object', 'bool']).columns.to_list()

def calculate_mi(series):
    return mutual_info_score(series, df_train['above_average'])

df_train_mi = df_train[columns_categorical].apply(calculate_mi)
df_train_mi = df_train_mi.sort_values(ascending=False).to_frame(name='Mutual Information')


display(df_train_mi)


| | Mutual Information |
|---|---|
| room_type | 0.143226 |
| neighbourhood_group | 0.046506 |


In [7]:
print(round(df_train_mi['Mutual Information']['room_type'], 2))


0.14


The `'room_type'` column has the bigger mutual information score.
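
To build intuition for what the score measures, here is a minimal from-scratch sketch of mutual information between two label sequences, estimated from counts and expressed in nats (natural logarithm), which is the convention `mutual_info_score` uses. The function name `mutual_information` and the toy labels are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Count-based estimate of I(X; Y) in nats for two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


target = [1, 1, 0, 0]
informative = ['a', 'a', 'b', 'b']    # determines the target completely
uninformative = ['a', 'b', 'a', 'b']  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(round(mi_informative, 3), round(mi_uninformative, 3))  # 0.693 0.0
```

A feature that fully determines a balanced binary target reaches the maximum of ln 2 ≈ 0.693, while an independent feature scores 0. The values 0.14 and 0.05 above sit between these extremes.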

### Question #4

* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [8]:
# answer to question #4

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from warnings import simplefilter
# ignore all warnings
simplefilter(action='ignore')


y_train = df_train['above_average'].to_numpy()
y_valid = df_valid['above_average'].to_numpy()
y_test  = df_test['above_average'].to_numpy()

del df_train['above_average']
del df_valid['above_average']
del df_test['above_average']


train_dict = df_train[columns_categorical + columns_numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

X_train = dv.transform(train_dict)
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=random_seed)
model.fit(X_train, y_train)

val_dict = df_valid[columns_categorical + columns_numerical].to_dict(orient='records')
X_valid = dv.transform(val_dict)

y_pred = model.predict_proba(X_valid)[:, 1]
target_pred = y_pred > 0.5

pred_score = (y_valid == target_pred).mean()
print('<<Prediction Score>>\n > Actual: {}\n > Rounded: {}\n'.format(
    pred_score, round(pred_score, 2)))


<<Prediction Score>>
 > Actual: 0.7864812353001329
 > Rounded: 0.79



### Question #5

* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference?
   * `neighbourhood_group`
   * `room_type` 
   * `number_of_reviews`
   * `reviews_per_month`

> **note**: the difference doesn't have to be positive


In [9]:
# answer to question #5

def get_elimination_data(drop_columns: list, dataset: str):
    if   dataset == 'train': return df_train.drop(drop_columns, axis=1)
    elif dataset == 'valid': return df_valid.drop(drop_columns, axis=1)
    elif dataset == 'test':  return df_test.drop(drop_columns, axis=1)
    else: raise ValueError('unknown dataset: {!r}'.format(dataset))


feature_elimination_difference_dict = {}
longest_column_name_length = max([len(column) for column in df_train.columns.to_list()])
print('<<Score Status After Feature Elimination>>\nDropped Column'+' '*(longest_column_name_length+2-len('Dropped Column'))+'Score\tDiff.')
for column in df_train.columns.to_list():
    train_data = get_elimination_data(drop_columns=[column], dataset='train')
    train_dict = train_data.to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    dv.fit(train_dict)

    X_train = dv.transform(train_dict)
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=random_seed)
    model.fit(X_train, y_train)

    valid_data = get_elimination_data(drop_columns=[column], dataset='valid')
    val_dict = valid_data.to_dict(orient='records')
    X_valid = dv.transform(val_dict)

    y_pred = model.predict_proba(X_valid)[:, 1]
    target_pred = y_pred > 0.5
    score_after_elimination = (y_valid == target_pred).mean()

    score_difference = (pred_score - score_after_elimination)
    feature_elimination_difference_dict[column] = score_difference
    
    print('{} {} {:.3f} \t {:+.2g}'.format(column, ' '*(longest_column_name_length-len(column)), score_after_elimination, score_difference))

<<Score Status After Feature Elimination>>
Dropped Column                  Score	Diff.
neighbourhood_group             0.751 	 +0.035
room_type                       0.715 	 +0.071
latitude                        0.786 	 +0.0001
longitude                       0.787 	 -0.00031
minimum_nights                  0.786 	 +0.00082
number_of_reviews               0.787 	 -0.00051
reviews_per_month               0.786 	 +0.00061
calculated_host_listings_count  0.787 	 -0.0002
availability_365                0.782 	 +0.0049


In [10]:
print('Column Elimination Effect on Classification Score (Least to Most - Global)')
print(pd.Series(feature_elimination_difference_dict).abs().sort_values(ascending=True), '\n')

only_interested_in = ['neighbourhood_group', 'room_type', 'number_of_reviews', 'reviews_per_month']
print('Column Elimination Effect on Classification Score (Least to Most - Only Question Columns)')
print(pd.Series(feature_elimination_difference_dict)[only_interested_in].abs().sort_values(ascending=True))


Column Elimination Effect on Classification Score (Least to Most - Global)
latitude                          0.000102
calculated_host_listings_count    0.000205
longitude                         0.000307
number_of_reviews                 0.000511
reviews_per_month                 0.000614
minimum_nights                    0.000818
availability_365                  0.004908
neighbourhood_group               0.035484
room_type                         0.071377
dtype: float64 

Column Elimination Effect on Classification Score (Least to Most - Only Question Columns)
number_of_reviews      0.000511
reviews_per_month      0.000614
neighbourhood_group    0.035484
room_type              0.071377
dtype: float64


Among the four features in question, eliminating the `'number_of_reviews'` column changes the validation accuracy the least (an absolute difference of about 0.0005), so it is the least useful of them.
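
Because the difference can be negative, the comparison has to be made on absolute values. A minimal sketch of that selection step, using the signed differences reported above for the four candidate features:

```python
# Signed accuracy differences (original minus without-feature), copied from the output above.
diffs = {
    'neighbourhood_group': 0.035484,
    'room_type': 0.071377,
    'number_of_reviews': -0.000511,
    'reviews_per_month': 0.000614,
}

# Compare by magnitude: a small negative difference is still a small effect.
least_useful = min(diffs, key=lambda name: abs(diffs[name]))
print(least_useful)  # number_of_reviews
```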

### Question #6

* For this question, we'll see how to use a linear regression model from Scikit-Learn
* We'll need to use the original column `'price'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [11]:
# answer to question #6

import math
from sklearn.linear_model import Ridge


# define regression evaluation metrics
def mse(y, y_pred):
    error = 0.0
    for yt, yp in zip(y, y_pred):
        error += (yt - yp) ** 2
    return (error / len(y))

def rmse(y, y_pred):
    return math.sqrt(mse(y, y_pred))


train_dict = df_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train  = dv.transform(train_dict)
val_dict = df_valid.to_dict(orient='records')
X_valid  = dv.transform(val_dict)

# we use indices of train and valid sets to select corresponding rows from df_final (which holds whole data)
y_train_reg = np.log1p(df_final['price'].loc[df_train.index])
y_valid_reg = np.log1p(df_final['price'].loc[df_valid.index])


print('<<RMSE Scores for Ridge Regression Model>>\nalpha\t\tScore')
rmse_scores_dict = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train_reg)
    y_pred = model.predict(X_valid)
    rmse_score = rmse(y_valid_reg, y_pred)
    rmse_scores_dict[alpha] = rmse_score
    print('{} \t\t {:.3f}'.format(alpha, rmse_score))


<<RMSE Scores for Ridge Regression Model>>
alpha		Score
0 		 0.497
0.01 		 0.497
0.1 		 0.497
1 		 0.497
10 		 0.498
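
The loop-based `mse` and `rmse` helpers can also be written as a single vectorized NumPy expression. The sketch below is an equivalent formulation, not a change to the results. Converting with `np.asarray` drops any pandas index, so values are paired by position, as `zip` does in the loop version.

```python
import numpy as np


def rmse_vectorized(y, y_pred):
    """Root mean squared error, pairing the two inputs by position."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))


# Toy check: errors are 0, 0 and 2, so MSE = 4/3 and RMSE = sqrt(4/3).
demo_rmse = rmse_vectorized([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(demo_rmse, 3))  # 1.155
```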


In [12]:
print('RMSE Scores Sorted Lowest to Highest (lower values are better)\nalpha\tScore')
print(pd.Series(rmse_scores_dict).sort_values(ascending=True), '\n')

RMSE Scores Sorted Lowest to Highest (lower values are better)
alpha	Score
0.00     0.497074
0.01     0.497117
0.10     0.497118
1.00     0.497140
10.00    0.497887
dtype: float64 



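The values are close. Rounded to 3 decimal digits, `alpha` values 0, 0.01, 0.1 and 1 all give an RMSE of 0.497, so by the tie-break rule we select the smallest one, `alpha=0`. Unrounded, `alpha=0` also has the slightly lowest RMSE.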
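
The selection rule from the question (round to 3 decimals, then take the smallest `alpha` among the best scores) can be sketched as follows, using the RMSE values printed above:

```python
# RMSE per alpha, copied from the output above.
rmse_scores = {0: 0.497074, 0.01: 0.497117, 0.1: 0.497118, 1: 0.497140, 10: 0.497887}

# Round first, as the question asks, then break ties by the smallest alpha.
rounded = {alpha: round(score, 3) for alpha, score in rmse_scores.items()}
best_score = min(rounded.values())
best_alpha = min(alpha for alpha, score in rounded.items() if score == best_score)
print(best_alpha, best_score)  # 0 0.497
```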