## Week 3 Questions
This notebook is part of the free ML course by Alexey Grigorev. [List of the questions](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/homework.md)

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score, mean_squared_error
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
import warnings

## Loading and Reading Data 

In [2]:
data = pd.read_csv('../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv')
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## Features used for this Project

In [3]:
features  = [
    'neighbourhood_group',
    'room_type',
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'
]
df = data[features].copy()  # explicit copy, so fillna below modifies df itself rather than a slice of data
"""Checking description of this project's data set"""
df.describe()

Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


## Missing Values? Impute them with 0

In [4]:
from IPython.display import display
missing_vals = df.isnull().sum()
print("Before Imputing Missing Values")
display(missing_vals.to_frame().reset_index().rename({'index': 'Variables', 0: 'Missing Values'}, axis = 1).sort_values(by = 'Missing Values', ascending = False).style.background_gradient('Blues'))


df.fillna(0, inplace = True)
print("After Imputing Missing Values")
display(df.isnull().sum().to_frame().reset_index().rename({'index': 'Variables', 0: 'Missing Values'}, axis = 1).style.background_gradient('Blues'))

Before Imputing Missing Values


Unnamed: 0,Variables,Missing Values
7,reviews_per_month,10052
0,neighbourhood_group,0
1,room_type,0
2,latitude,0
3,longitude,0
4,price,0
5,minimum_nights,0
6,number_of_reviews,0
8,calculated_host_listings_count,0
9,availability_365,0


After Imputing Missing Values



Unnamed: 0,Variables,Missing Values
0,neighbourhood_group,0
1,room_type,0
2,latitude,0
3,longitude,0
4,price,0
5,minimum_nights,0
6,number_of_reviews,0
7,reviews_per_month,0
8,calculated_host_listings_count,0
9,availability_365,0
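
Calling `fillna(..., inplace=True)` on a column selection of another DataFrame can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, on made-up data (the `raw` frame and its values are illustrative assumptions, not the Airbnb dataset): take an explicit copy and reassign the result of `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data, with one missing value.
raw = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "reviews_per_month": [0.21, np.nan, 4.64],
    "host_id": [1, 2, 3],
})

# Select the columns of interest and take an explicit copy,
# so later modifications cannot be mistaken for edits to `raw`.
subset = raw[["room_type", "reviews_per_month"]].copy()

# Reassign instead of using inplace=True.
subset = subset.fillna(0)

missing_after = int(subset.isnull().sum().sum())
print(missing_after)
```

The original `raw` frame keeps its missing value; only `subset` is imputed.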


## Q1. What is the most frequent observation (mode) for the column 'neighbourhood_group'?

In [5]:
print("Mode for variable 'neighbourhood_group': %s" %(df['neighbourhood_group'].value_counts().head(1)))

Mode for variable 'neighbourhood_group': Manhattan    21661
Name: neighbourhood_group, dtype: int64
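
`value_counts().head(1)` prints a one-row Series, which is why the output includes the count and dtype. As a small sketch on made-up data (the `boroughs` Series is illustrative, not the real column), `Series.mode()` returns just the most frequent label:

```python
import pandas as pd

# Toy stand-in for the 'neighbourhood_group' column (hypothetical values).
boroughs = pd.Series(["Manhattan", "Brooklyn", "Manhattan", "Queens", "Manhattan"])

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = boroughs.mode()[0]

# value_counts() is sorted by count, so its first index label is the same value.
assert most_frequent == boroughs.value_counts().index[0]
print(most_frequent)
```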


## Split the data

    * Split your data in train/val/test sets, with 60%/20%/20% distribution.
    * Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
    * Make sure that the target value ('price') is not in your dataframe.


In [6]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
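# Note: test_size=0.2 of the remaining 80% gives a 64%/16%/20% split overall;
# test_size=0.25 would give the 60%/20%/20% split requested above.
# The outputs in this notebook were produced with 0.2.
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=42)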
len(df_train), len(df_val), len(df_test)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
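
The second call to `train_test_split` operates on the 80% left after the first call, so its `test_size` is a fraction of that remainder: 0.2 yields 16% of the full data, while 0.25 yields 20%. A small sketch on a toy 100-row frame (the `toy` frame is an assumption for illustration) shows the 60/20/20 arithmetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 rows make the percentages easy to read off as counts.
toy = pd.DataFrame({"x": range(100)})

# First split: hold out 20% of everything as the test set.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (60, 20, 20)
```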

## Q2:

    * Create the correlation matrix for the numerical features of your train dataset.
        - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
    * What are the two features that have the biggest correlation in this dataset?


In [7]:
"""Creating List of Numerical and Categorical columns"""
categorical = [col for col in df.columns if df[col].dtype == 'object']
numerical = [col for col in df.columns if col not in categorical]

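"""Correlation of Numerical Columns"""
# Note: this is computed on the full dataset; df_train[numerical].corr()
# would restrict it to the training set, as the question asks.
display(df[numerical].corr())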


Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.084788,0.033939,0.024869,-0.015389,-0.018758,0.019517,-0.010983
longitude,0.084788,1.0,-0.150019,-0.062747,0.059094,0.138516,-0.114713,0.082731
price,0.033939,-0.150019,1.0,0.042799,-0.047954,-0.050564,0.057472,0.081829
minimum_nights,0.024869,-0.062747,0.042799,1.0,-0.080116,-0.124905,0.12796,0.144303
number_of_reviews,-0.015389,0.059094,-0.047954,-0.080116,1.0,0.589407,-0.072376,0.172028
reviews_per_month,-0.018758,0.138516,-0.050564,-0.124905,0.589407,1.0,-0.047312,0.163732
calculated_host_listings_count,0.019517,-0.114713,0.057472,0.12796,-0.072376,-0.047312,1.0,0.225701
availability_365,-0.010983,0.082731,0.081829,0.144303,0.172028,0.163732,0.225701,1.0


## Highest Correlation is between

    * reviews_per_month and number_of_reviews: 0.589407
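
Reading the largest entry off the matrix by eye works for eight columns, but it can also be done programmatically. The sketch below is one possible approach, not the notebook's method: it keeps only the strict upper triangle of the absolute correlation matrix (dropping the diagonal and the mirrored duplicates) and picks the maximum. The helper name `top_correlated_pair` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def top_correlated_pair(frame):
    """Return (col_a, col_b, |corr|) for the most correlated pair of columns."""
    corr = frame.corr().abs()
    # Boolean mask selecting entries strictly above the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    col_a, col_b = pairs.idxmax()
    return col_a, col_b, float(pairs.max())

# Toy data: y is almost a multiple of x, z is only loosely related.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "z": [3, 1, 4, 1, 5, 9],
})

best = top_correlated_pair(toy)
print(best[:2])
```

Using the absolute value means a strong negative correlation would also be reported; drop `.abs()` if only positive correlations are of interest.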


## Make price binary

    * We need to turn the price variable from numeric into binary.
    * Let's create a variable above_average which is 1 if the price is above (or equal to) 152.


In [8]:
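"""price variable from numeric into binary."""
df_full_train = df_full_train.copy()  # explicit copy, so the assignment below targets this frame directly
df_full_train['above_average'] = (df_full_train['price'] >= 152).values.astype(int)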



## Q3:

    * Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
    * Which of these two variables has bigger score?
    * Round it to 2 decimal digits using round(score, 2)


In [9]:
"""Mutual Information"""
def mutual_info_bin_score(series):
    return mutual_info_score(series, df_full_train.above_average)

mi = df_full_train[categorical].apply(mutual_info_bin_score)
mi.round(2).sort_values(ascending=False)    

room_type              0.14
neighbourhood_group    0.05
dtype: float64
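
To make the score less of a black box, here is a hand-rolled sketch of mutual information for two discrete variables, using the textbook formula with natural logarithms (which, to my understanding, is also what `mutual_info_score` uses). The function name `mutual_information` and the toy inputs are assumptions for illustration only.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Sum over observed (x, y) pairs of p(x,y) * ln(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count_xy in joint.items():
        p_xy = count_xy / n
        # p_xy / (p_x * p_y), written with raw counts to limit rounding error.
        ratio = count_xy * n / (count_x[x] * count_y[y])
        total += p_xy * math.log(ratio)
    return total

# The category fully determines the binary target: MI equals ln(2).
informative = mutual_information(
    ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    [0, 0, 1, 1],
)

# The category tells us nothing about the target: MI equals 0.
uninformative = mutual_information(
    ["Private room", "Entire home/apt", "Private room", "Entire home/apt"],
    [0, 0, 1, 1],
)

print(round(informative, 3), round(uninformative, 3))
```

A higher score means knowing the category reduces more of the uncertainty about the target, which is the sense in which `room_type` (0.14) is more informative than `neighbourhood_group` (0.05) here.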

## Q4:

    * Now let's train a logistic regression
    * Remember that we have two categorical variables in the data. Include them using one-hot encoding.
    * Fit the model on the training dataset.
        - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
        - model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    * Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [10]:
"""Taking Care of Categorical variables"""
y_train = (df_train['price'] >= 152).values.astype(int)
y_val = (df_val['price'] >= 152).values.astype(int)
y_test = (df_test['price'] >= 152).values.astype(int)

del df_train['price']
del df_val['price']
del df_test['price']

numerical.remove('price')  # price is the target, so it must not be used as a feature

In [11]:
def calculate_accuracy(features):
    # one-hot encoding datasets
    dv = DictVectorizer(sparse=False)

    train_dict = df_train[features].to_dict(orient='records')
    val_dict = df_val[features].to_dict(orient='records')

    X_train = dv.fit_transform(train_dict)
    X_val = dv.transform(val_dict)

    """Fitting the Model on Training Set"""
    # Note: the question specifies solver='lbfgs'; this notebook was run with solver='liblinear'.
    model = LogisticRegression(solver='liblinear', C=1.0, random_state=42)
    model.fit(X_train, y_train)

    """Using the model on validation"""
    y_pred = model.predict_proba(X_val)[:,1]

    """Setting up Decision Threshold to 0.5"""
    decision = (y_pred >= 0.5)

    """Calculating accuracy"""
    accuracy = (y_val == decision).mean()
    

    df_pred = pd.DataFrame()
    df_pred['probability'] = y_pred
    df_pred['prediction'] = decision
    df_pred['actual'] = y_val
    df_pred['correct'] = df_pred.prediction == df_pred.actual
    return accuracy, df_pred
acc, df_pred = calculate_accuracy(numerical+categorical)    

print(acc)
df_pred.head()

0.7914110429447853


Unnamed: 0,probability,prediction,actual,correct
0,0.028766,False,0,True
1,0.591596,True,0,False
2,0.413831,False,1,False
3,0.075227,False,0,True
4,0.813254,True,1,True


In [12]:
all_vars_accuracy,_ = calculate_accuracy(numerical+categorical)
all_vars_accuracy.round(2)

0.79
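
For readers unfamiliar with `DictVectorizer`, the sketch below shows what the one-hot encoding step does on two made-up records (the records themselves are assumptions): string-valued keys become one `key=value` column each, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical listings, in the same record format as to_dict(orient='records').
records = [
    {"room_type": "Private room", "minimum_nights": 1},
    {"room_type": "Entire home/apt", "minimum_nights": 3},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# Column names are sorted alphabetically by default.
print(dv.feature_names_)
print(X.tolist())

# A category not seen during fit is ignored: all room_type columns stay 0.
X_new = dv.transform([{"room_type": "Shared room", "minimum_nights": 2}])
print(X_new.tolist())
```

This is why the vectorizer is fitted on the training records only and then reused with `transform` for validation data: the column layout has to stay identical between the two.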

## Q5:

    * We have 9 features: 7 numerical features and 2 categorical.
    * Let's find the least useful one using the feature elimination technique.
    * Train a model with all these features (using the same parameters as in Q4).
    * Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
    * For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
    * Which of the following features has the smallest difference?
        - neighbourhood_group
        - room_type
        - number_of_reviews
        - reviews_per_month

In [13]:
useful_features = numerical + categorical
diff = {}
for i in useful_features:
  features = useful_features.copy()
  features.remove(i)
  acc,_ = calculate_accuracy(features)
  diff["Difference in accuracy when removing %s"%i] = all_vars_accuracy - acc

diff

{'Difference in accuracy when removing latitude': 0.005112474437627745,
 'Difference in accuracy when removing longitude': 0.005240286298568542,
 'Difference in accuracy when removing minimum_nights': 0.0001278118609406853,
 'Difference in accuracy when removing number_of_reviews': -0.0005112474437628522,
 'Difference in accuracy when removing reviews_per_month': -0.0001278118609406853,
 'Difference in accuracy when removing calculated_host_listings_count': 0.0006390593047034265,
 'Difference in accuracy when removing availability_365': 0.011247443762781195,
 'Difference in accuracy when removing neighbourhood_group': 0.04115541922290389,
 'Difference in accuracy when removing room_type': 0.06480061349693245}

In [14]:
pd.DataFrame(diff.values(), index = diff.keys()).rename({0: 'differences'}, axis = 1).sort_values(by = 'differences', ascending = True).style.background_gradient('Blues')

Unnamed: 0,differences
Difference in accuracy when removing number_of_reviews,-0.000511
Difference in accuracy when removing reviews_per_month,-0.000128
Difference in accuracy when removing minimum_nights,0.000128
Difference in accuracy when removing calculated_host_listings_count,0.000639
Difference in accuracy when removing latitude,0.005112
Difference in accuracy when removing longitude,0.00524
Difference in accuracy when removing availability_365,0.011247
Difference in accuracy when removing neighbourhood_group,0.041155
Difference in accuracy when removing room_type,0.064801


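### Smallest Difference:

    number_of_reviews (-0.000511, the lowest signed difference in the table above; by absolute value, reviews_per_month is closer to zero at 0.000128)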
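
Whether "smallest difference" means the lowest signed value or the value closest to zero changes the answer. The sketch below ranks the four candidate features both ways, using the differences copied from the output above; which reading the homework intends is left open here.

```python
# Accuracy differences for the four candidate features, copied from the output above.
differences = {
    "neighbourhood_group": 0.04115541922290389,
    "room_type": 0.06480061349693245,
    "number_of_reviews": -0.0005112474437628522,
    "reviews_per_month": -0.0001278118609406853,
}

# Reading 1: lowest signed difference (removing the feature helped the most).
smallest_signed = min(differences, key=differences.get)

# Reading 2: difference closest to zero (removing the feature changed the least).
smallest_absolute = min(differences, key=lambda name: abs(differences[name]))

print(smallest_signed, smallest_absolute)
```

A negative difference means the model was slightly more accurate without the feature.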


## Q6:

    * For this question, we'll see how to use a linear regression model from Scikit-Learn
    * We'll need to use the original column 'price'. Apply the logarithmic transformation to this column.
    * Fit the Ridge regression model on the training data.
    * This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
    * Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.
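
The `describe()` output earlier shows a minimum price of 0, for which a plain logarithm is undefined. A minimal sketch of why `np.log1p` is a convenient choice, using the minimum, quartiles and maximum of `price` from that output as sample values:

```python
import numpy as np

# min, 25%, 50%, 75% and max of 'price' from the describe() output.
prices = np.array([0.0, 69.0, 106.0, 175.0, 10000.0])

# log1p(x) = log(1 + x): finite at x = 0, whereas log(0) would be -inf.
log_prices = np.log1p(prices)

# expm1 is the inverse, useful for turning predictions back into prices.
restored = np.expm1(log_prices)

print(log_prices.round(3))
```

Because the model is trained on the transformed target, the RMSE values reported for this question are in log units, not dollars.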


In [15]:
"""Preparing Data for Linear Regression with Price Variable included"""
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=42)
len(df_train), len(df_val), len(df_test)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

"""Creating List of Numerical and Categorical columns"""
categorical = [col for col in df.columns if df[col].dtype == 'object']
numerical = [col for col in df.columns if col not in categorical]

"""Apply the log transformation to the price variable using the np.log1p() function."""
y_train = np.log1p(df_train['price'].values)
y_val = np.log1p(df_val['price'].values)
y_test = np.log1p(df_test['price'].values)


"""Make sure that the target value ('price') is not in your dataframe."""
del df_train['price']
del df_val['price']
del df_test['price']

"""Taking care of Categorical Variables"""
dv = DictVectorizer(sparse=False)

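# Caution: `features` was last reassigned in the Q5 loop (In [13]), so at this point
# it holds every column except 'room_type'; the RMSE values below reflect that.
train_dict = df_train[features].to_dict(orient='records')
val_dict = df_val[features].to_dict(orient='records')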

X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

In [16]:
scores = {}
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    score = np.sqrt(mean_squared_error(y_val, y_pred))
    scores[alpha] = score.round(3)
    print("RMSE with alpha = %s and not rounding to 3 digits: %s"%(alpha, score) )
print(" \nRMSE with rounding off to 3 digits")
print(scores)

RMSE with alpha = 0 and not rounding to 3 digits: 0.6249924174058848
RMSE with alpha = 0.01 and not rounding to 3 digits: 0.6250146864354889
RMSE with alpha = 0.1 and not rounding to 3 digits: 0.6250139954471738
RMSE with alpha = 1 and not rounding to 3 digits: 0.6250233985636761
RMSE with alpha = 10 and not rounding to 3 digits: 0.625991202150716
 
RMSE with rounding off to 3 digits
{0: 0.625, 0.01: 0.625, 0.1: 0.625, 1: 0.625, 10: 0.626}


In [17]:
print("Table of RMSE rounded to 3 digits")
pd.DataFrame(scores.values(), index = scores.keys()).rename({0: 'RMSE'}, axis = 1).sort_values(by = 'RMSE', ascending = True).style.background_gradient('Blues')

Table of RMSE rounded to 3 digits


Unnamed: 0,RMSE
0.0,0.625
0.01,0.625
0.1,0.625
1.0,0.625
10.0,0.626


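Alpha that leads to the best RMSE: 0. After rounding to 3 digits, alphas 0, 0.01, 0.1 and 1 all tie at 0.625; unrounded, alpha = 0 is the lowest (0.62499).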
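
As a small sketch of the two steps behind this answer: RMSE computed directly from its definition, and one possible tie-breaking rule (pick the smallest alpha among those with the lowest rounded score). The helper `rmse` and the tie-breaking rule are assumptions for illustration; the scores are copied from the output above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of +1 and -1 give a mean squared error of 1, hence an RMSE of 1.
example = rmse([0, 0], [1, -1])

# Rounded validation RMSEs copied from the output above.
rounded_scores = {0: 0.625, 0.01: 0.625, 0.1: 0.625, 1: 0.625, 10: 0.626}

# Lowest score first, then the smallest alpha among the ties.
best_rmse = min(rounded_scores.values())
best_alpha = min(alpha for alpha, score in rounded_scores.items() if score == best_rmse)

print(example, best_alpha)
```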