# Homework

## Dataset

In this homework, we will continue with the New York City Airbnb Open Data. You can get it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv) or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv) if you don't want to sign up for Kaggle.

We'll keep working with the 'price' variable, but this time we'll turn the problem into a classification task.

In [1]:
!wget -O data.csv https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv

--2021-09-27 13:45:58--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7077973 (6.8M) [text/plain]
Saving to: ‘data.csv’


2021-09-27 13:45:58 (192 MB/s) - ‘data.csv’ saved [7077973/7077973]



In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score, accuracy_score
from sklearn.linear_model import LogisticRegression

In [3]:
cols_to_be_used = ['neighbourhood_group', 'room_type','latitude','longitude','price','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']

In [4]:
# wget saved the file as data.csv in the working directory (/content on Colab)
df = pd.read_csv('data.csv', usecols=cols_to_be_used)
df.head()

| | neighbourhood_group | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brooklyn | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | Manhattan | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | Manhattan | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | 1 | 365 |
| 3 | Brooklyn | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | Manhattan | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.1 | 1 | 0 |


In [5]:
df.isnull().sum()

neighbourhood_group                   0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [6]:
# Fill na with 0
df = df.fillna(0)
df.isnull().sum()

neighbourhood_group               0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

## Questions

### Question 1

What is the most frequent observation (mode) for the column 'neighbourhood_group'?

In [7]:
df['neighbourhood_group'].mode()

0    Manhattan
dtype: object

### Split the data

- Split your data into train/val/test sets with a 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value ('price') is not in your dataframe.
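The split described above can be sketched on a synthetic frame. The `toy` DataFrame and the variable names here are assumptions for illustration only. After holding out 20% for the test set, the validation share has to be 25% of what remains, because 0.25 × 0.8 = 0.2 of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (hypothetical, 1000 rows).
toy = pd.DataFrame({"x": np.arange(1000), "price": np.arange(1000) % 300})

# Step 1: hold out 20% of all rows for the test set.
toy_train_full, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Step 2: take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2 overall).
toy_train, toy_val = train_test_split(toy_train_full, test_size=0.25, random_state=42)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)
```

Using `test_size=0.2` in the second step instead would leave 64% for training and 16% for validation.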



In [8]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)

print(df_train_full.shape)
print(df_test.shape)

(39116, 10)
(9779, 10)


In [9]:
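# Note: test_size=0.2 of the remaining 80% gives a 64%/16%/20% split overall,
# as the shapes below show; test_size=0.25 would give the requested 60%/20%/20%.
df_train, df_val = train_test_split(df_train_full, test_size=0.2, random_state=42)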

print(df_train.shape)
print(df_val.shape)

(31292, 10)
(7824, 10)


In [10]:
X_train = df_train.drop('price', axis=1)
X_val = df_val.drop('price', axis=1)
X_test = df_test.drop('price', axis=1)

y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
print(X_test.shape, y_test.shape)

(31292, 9) (31292,)
(7824, 9) (7824,)
(9779, 9) (9779,)


### Question 2

- Create the correlation matrix for the numerical features of your train dataset.
  - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?
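Reading the largest off-diagonal entry by eye is error-prone, so here is a minimal sketch that picks the pair programmatically. The helper name `top_correlated_pair` and the toy data are hypothetical; the idea is to mask the diagonal (always 1.0) and take the largest absolute value that remains.

```python
import numpy as np
import pandas as pd

def top_correlated_pair(frame):
    """Return (feature_a, feature_b, |corr|) for the most correlated pair of columns."""
    corr = frame.corr().abs()
    values = corr.to_numpy().copy()
    # Every feature correlates perfectly with itself, so exclude the diagonal.
    np.fill_diagonal(values, -1.0)
    i, j = np.unravel_index(np.argmax(values), values.shape)
    return corr.index[i], corr.columns[j], float(values[i, j])

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                    # independent noise
})

pair = top_correlated_pair(toy)
print(pair[:2])
```

On the homework data, the same helper could be applied to the numerical columns of the training set.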


In [11]:
train_corr = X_train.select_dtypes(exclude='object').corr()
train_corr

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.082567 | 0.024824 | -0.007411 | -0.009479 | 0.019051 | -0.006793 |
| longitude | 0.082567 | 1.0 | -0.061049 | 0.055967 | 0.134702 | -0.116155 | 0.083446 |
| minimum_nights | 0.024824 | -0.061049 | 1.0 | -0.074459 | -0.118436 | 0.114218 | 0.136383 |
| number_of_reviews | -0.007411 | 0.055967 | -0.074459 | 1.0 | 0.591234 | -0.072782 | 0.174931 |
| reviews_per_month | -0.009479 | 0.134702 | -0.118436 | 0.591234 | 1.0 | -0.047711 | 0.166007 |
| calculated_host_listings_count | 0.019051 | -0.116155 | 0.114218 | -0.072782 | -0.047711 | 1.0 | 0.226329 |
| availability_365 | -0.006793 | 0.083446 | 0.136383 | 0.174931 | 0.166007 | 0.226329 | 1.0 |


Features with the biggest correlation in the dataset (0.591234):
- `number_of_reviews`
- `reviews_per_month`

### Make price binary

- We need to turn the price variable from numeric into binary.
- Let's create a variable `above_average` which is 1 if the price is above (or equal to) 152.


In [12]:
y_train_above_average = (y_train >= 152).astype(int)
y_val_above_average = (y_val >= 152).astype(int)
y_test_above_average = (y_test >= 152).astype(int)

### Question 3

- Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
- Which of these two variables has the bigger score?
- Round it to 2 decimal digits using round(score, 2)
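To make the score less of a black box, the sketch below computes mutual information by hand from the empirical joint distribution, MI = Σ p(x, y) · log(p(x, y) / (p(x) · p(y))), and compares it with `mutual_info_score`. The helper `mutual_info_by_hand` and the toy lists are hypothetical; the result is in nats (natural logarithm).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def mutual_info_by_hand(x, y):
    """Mutual information (in nats) from the empirical joint distribution."""
    joint = pd.crosstab(pd.Series(x), pd.Series(y), normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)  # marginal distribution of x
    py = joint.sum(axis=0, keepdims=True)  # marginal distribution of y
    nonzero = joint > 0                    # empty cells contribute nothing
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

# Toy categorical feature and binary target (hypothetical values).
room = ["Private", "Private", "Entire", "Entire", "Shared", "Entire", "Private", "Entire"]
above = [0, 0, 1, 1, 0, 1, 0, 0]

by_hand = mutual_info_by_hand(room, above)
from_sklearn = mutual_info_score(room, above)
print(round(by_hand, 4), round(from_sklearn, 4))
```

A score of 0 means the feature tells us nothing about the target; larger values mean knowing the feature reduces more of the uncertainty about it.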


In [13]:
cat_cols = ['neighbourhood_group', 'room_type']

for col in cat_cols:
  print(col, round(mutual_info_score(y_train_above_average, X_train[col]), 2))

neighbourhood_group 0.05
room_type 0.14


`room_type` has the bigger mutual information score (0.14 vs. 0.05).

### Question 4

- Now let's train a logistic regression model.
- Remember that we have two categorical variables in the data. Include them using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
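One thing to watch with one-hot encoding is that every split ends up with the same columns in the same order. Calling `pd.get_dummies` on each split separately can silently drop a column when a category is missing from a smaller split. A minimal sketch of one way to guard against that, using toy frames (hypothetical data) and `reindex` against the training columns:

```python
import pandas as pd

train = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "minimum_nights": [1, 3, 2],
})
val = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt"],  # no "Shared room" here
    "minimum_nights": [5, 1],
})

# Encode the training set; drop_first removes the baseline category.
train_ohe = pd.get_dummies(train, columns=["room_type"], drop_first=True)

# Encode validation with all categories, then align to the training columns:
# categories missing from val become all-zero columns, extras are dropped.
val_ohe = pd.get_dummies(val, columns=["room_type"]).reindex(
    columns=train_ohe.columns, fill_value=0
)

print(list(val_ohe.columns))
```

Scikit-Learn's `DictVectorizer` or `OneHotEncoder`, fitted on the training set only, would be alternative ways to get the same guarantee.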


In [14]:
# Handle categorical variables
# 1. Create dummies for categorical columns in Train, Val, Test
X_train_cat_cols = pd.get_dummies(X_train[cat_cols], drop_first=True)
X_val_cat_cols = pd.get_dummies(X_val[cat_cols], drop_first=True)
X_test_cat_cols = pd.get_dummies(X_test[cat_cols], drop_first=True)

# 2. Concatenate dummy cols in existing Train, Val, Test
X_train_ohe = pd.concat([X_train, X_train_cat_cols], axis=1)
X_val_ohe = pd.concat([X_val, X_val_cat_cols], axis=1)
X_test_ohe = pd.concat([X_test, X_test_cat_cols], axis=1)

# 3. Drop original cols in Train, Val, Test
X_train_ohe.drop(cat_cols, axis=1, inplace=True)
X_val_ohe.drop(cat_cols, axis=1, inplace=True)
X_test_ohe.drop(cat_cols, axis=1, inplace=True)

In [15]:
X_train_ohe.head()

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | neighbourhood_group_Brooklyn | neighbourhood_group_Manhattan | neighbourhood_group_Queens | neighbourhood_group_Staten Island | room_type_Private room | room_type_Shared room |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32249 | 40.71754 | -73.95906 | 2 | 15 | 1.14 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 15113 | 40.78784 | -73.94998 | 5 | 88 | 2.23 | 1 | 326 | 0 | 1 | 0 | 0 | 0 | 0 |
| 23032 | 40.7358 | -73.9889 | 1 | 68 | 2.67 | 1 | 324 | 0 | 1 | 0 | 0 | 0 | 0 |
| 21109 | 40.70921 | -73.94144 | 1 | 2 | 0.07 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 23025 | 40.7241 | -73.98959 | 7 | 9 | 0.35 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [16]:
y_train[:10]

array([125, 149, 287,  77, 133,  88,  62,  49, 300,  40])

In [17]:
y_train_above_average[:10]

array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0])

In [18]:
lr = LogisticRegression(solver = 'lbfgs', C=1.0, random_state=42)
lr.fit(X_train_ohe, y_train_above_average)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
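The warning above means `lbfgs` hit the default `max_iter=100` before converging, which is common when features sit on very different scales (coordinates around 40 next to counts in the hundreds). The homework fixes the model parameters, so the sketch below is only an illustration of the scaling route the warning suggests. The synthetic data and the name `scaled_lr` are assumptions, and scaling changes the fitted model, so accuracy on the real data may differ from the value reported here.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic features on very different scales (hypothetical stand-ins).
X = np.column_stack([
    rng.normal(40.7, 0.05, 2000),  # latitude-like values
    rng.integers(0, 366, 2000),    # availability-like values
    rng.integers(0, 600, 2000),    # review-count-like values
])
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 50, 2000) > 330).astype(int)

# Standardize every feature before the same logistic regression as in Q4.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", C=1.0, random_state=42),
)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    scaled_lr.fit(X, y)

n_iter = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", n_iter)
```

Raising `max_iter` is the other option the warning mentions; it keeps the features unscaled but costs more iterations.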

In [19]:
y_val_preds = lr.predict(X_val_ohe)
print('Accuracy (val):', round(accuracy_score(y_val_above_average, y_val_preds), 2))

Accuracy (val): 0.79


The accuracy on the validation set is 0.79 (79%).

### Question 5

- We have 9 features: 7 numerical features and 2 categorical.
- Let's find the least useful one using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
  - neighbourhood_group
  - room_type
  - number_of_reviews
  - reviews_per_month
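The elimination procedure described above can be written as a single loop instead of one cell per feature. This is a minimal sketch on synthetic data: the `toy` frame, the column names, and the helper `fit_and_score` are all hypothetical. "Smallest difference" is read here as smallest in absolute value, which is an assumption about the question's intent.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(X_tr, y_tr, X_va, y_va):
    """Train the Q4-style model and return its validation accuracy."""
    model = LogisticRegression(solver="lbfgs", C=1.0, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_va, model.predict(X_va))

rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the target
    "weak": rng.normal(size=n),    # small contribution
    "noise": rng.normal(size=n),   # unrelated to the target
})
score = 2.0 * toy["signal"] + 0.3 * toy["weak"] + rng.normal(scale=0.5, size=n)
target = (score > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(toy, target, test_size=0.25, random_state=42)

# Baseline with every feature, then one model per left-out feature.
baseline = fit_and_score(X_tr, y_tr, X_va, y_va)
diffs = {}
for feature in toy.columns:
    kept = [c for c in toy.columns if c != feature]
    diffs[feature] = baseline - fit_and_score(X_tr[kept], y_tr, X_va[kept], y_va)

# The least useful feature is the one whose removal changes accuracy the least.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(diffs, least_useful)
```

A negative difference means the model did slightly better without that feature, which is another sign that the feature adds little.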


In [20]:
# Mutual info score
def mutual_info_w_price_train_score(series):
  return mutual_info_score(series, y_train_above_average)

In [21]:
df_train[cat_cols].apply(mutual_info_w_price_train_score)

neighbourhood_group    0.046690
room_type              0.142102
dtype: float64

In [24]:
# Correlation of each numerical feature with the binarized target.
# Build the target as a separate Series so df_train is not modified
# (assigning a new column to the split slice triggers SettingWithCopyWarning).
price_above_average = pd.Series(y_train_above_average, index=df_train.index)
corr_scores = df_train.drop('price', axis=1).select_dtypes(exclude='object').corrwith(price_above_average).abs()
corr_scores.sort_values(ascending=False)


longitude                         0.266868
calculated_host_listings_count    0.172242
availability_365                  0.103098
latitude                          0.057347
reviews_per_month                 0.056029
number_of_reviews                 0.055563
minimum_nights                    0.030924
dtype: float64

In [28]:
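# Note: the baseline below uses only the four candidate features from the question,
# not all 9 features, so its accuracy differs from the Q4 model.
features = ['neighbourhood_group','room_type', 'number_of_reviews', 'reviews_per_month']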
cat_cols = ['neighbourhood_group','room_type']

In [30]:
def convert_ohe(df, cat_cols):
  dummies = pd.get_dummies(df[cat_cols], drop_first=True)
  df_ohe_ = pd.concat([df, dummies], axis=1)
  df_ohe_.drop(cat_cols, axis=1, inplace=True)
  return df_ohe_

In [47]:
def train_model_acc_on_val(feature_list, cat_cols):
  X_train_ohe_ = convert_ohe(df_train[feature_list], cat_cols)
  X_val_ohe_ = convert_ohe(df_val[feature_list], cat_cols)

  lr_ = LogisticRegression(solver = 'lbfgs', C=1.0, random_state=42)
  lr_.fit(X_train_ohe_, y_train_above_average)
  
  y_val_preds_ = lr_.predict(X_val_ohe_)
  return accuracy_score(y_val_above_average, y_val_preds_)


In [48]:
orig_acc = train_model_acc_on_val(features, cat_cols)
print('Original accuracy', orig_acc)

Original accuracy 0.7792689161554193


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [49]:
features_1 = ['room_type', 'number_of_reviews', 'reviews_per_month']
cat_cols_1 = ['room_type']

acc_1 = train_model_acc_on_val(features_1, cat_cols_1)
print('Diff b/w original and current Accuracy w/o neighbourhood_group', orig_acc - acc_1)

Diff b/w original and current Accuracy w/o neighbourhood_group 0.04971881390593047


In [50]:
features_2 = ['neighbourhood_group', 'number_of_reviews', 'reviews_per_month']
cat_cols_2 = ['neighbourhood_group']

acc_2 = train_model_acc_on_val(features_2, cat_cols_2)
print('Diff b/w original and current Accuracy w/o room_type', orig_acc - acc_2)

Diff b/w original and current Accuracy w/o room_type 0.09061860940695299


In [51]:
features_3 = ['neighbourhood_group', 'room_type', 'reviews_per_month']
cat_cols_3 = ['neighbourhood_group', 'room_type']

acc_3 = train_model_acc_on_val(features_3, cat_cols_3)
print('Diff b/w original and current Accuracy w/o number_of_reviews', orig_acc - acc_3)

Diff b/w original and current Accuracy w/o number_of_reviews -0.0001278118609406853


In [52]:
features_4 = ['neighbourhood_group', 'room_type', 'number_of_reviews']
cat_cols_4 = ['neighbourhood_group', 'room_type']

acc_4 = train_model_acc_on_val(features_4, cat_cols_4)
print('Diff b/w original and current Accuracy w/o reviews_per_month', orig_acc - acc_4)

Diff b/w original and current Accuracy w/o reviews_per_month 0.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


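`reviews_per_month` has the smallest absolute difference (0.0): removing it leaves the validation accuracy unchanged. Removing `number_of_reviews` changes it by only about 0.0001, so both features contribute very little here.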