# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [36]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [37]:
# YOUR CODE HERE
import scipy.stats as stats
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [38]:
# YOUR CODE HERE
df = pd.read_csv(os.path.join(os.getcwd(), "data", "airbnbListingsData.csv"), header = 0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
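
As a minimal sketch of the inspection techniques listed above, the following uses a small made-up DataFrame (the name `toy` and all of its values are assumptions for illustration, not the lab's data) to show `dtypes`, `describe()`, a missing-value count, and a class-balance check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab's DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "price": [70.0, 115.0, 180.0, np.nan, 1000.0],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Private room", "Shared room"],
    "instant_bookable": [True, False, False, False, True],
})

# Data type of each column.
print(toy.dtypes)

# Key statistics for the numeric columns.
print(toy.describe())

# Number of missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Share of each class in the label: a quick check for class imbalance.
class_shares = toy["instant_bookable"].value_counts(normalize=True)
print(class_shares)
```

The same calls can be applied to `df` once it is loaded; a label whose class shares are far from equal is a sign that class imbalance may need to be addressed.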


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [39]:
# YOUR CODE HERE
nan_count = np.sum(df.isnull(), axis=0)

condition = nan_count != 0
col_names = list(nan_count[condition].index)
col_names

['name',
 'description',
 'neighborhood_overview',
 'host_location',
 'host_about',
 'host_response_rate',
 'host_acceptance_rate',
 'bedrooms',
 'beds']

In [40]:
df[col_names].dtypes

name                      object
description               object
neighborhood_overview     object
host_location             object
host_about                object
host_response_rate       float64
host_acceptance_rate     float64
bedrooms                 float64
beds                     float64
dtype: object

In [41]:
response_rate_mean = df['host_response_rate'].mean()
acceptance_rate_mean = df['host_acceptance_rate'].mean()
bedrooms_mean = df['bedrooms'].mean()
beds_mean = df['beds'].mean()
print(response_rate_mean)
print(acceptance_rate_mean)
print(bedrooms_mean)
print(beds_mean)

0.9069009209469064
0.7919528061978829
1.3297084130019121
1.62955602219889


In [42]:
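# Assign the result back to the column; calling fillna(inplace=True) on a
# selected column may not update df in newer versions of Pandas
df['host_response_rate'] = df['host_response_rate'].fillna(value=response_rate_mean)
df['host_acceptance_rate'] = df['host_acceptance_rate'].fillna(value=acceptance_rate_mean)
df['bedrooms'] = df['bedrooms'].fillna(value=bedrooms_mean)
df['beds'] = df['beds'].fillna(value=beds_mean)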

In [43]:
np.sum(df[col_names].isnull(), axis=0)

name                         5
description                570
neighborhood_overview     9816
host_location               60
host_about               10945
host_response_rate           0
host_acceptance_rate         0
bedrooms                     0
beds                         0
dtype: int64

In [44]:
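# Restrict to numeric and boolean columns; newer versions of Pandas raise an
# error when corr() is called on a DataFrame that contains text columns
corrs = df.select_dtypes(include=['number', 'bool']).corr()['instant_bookable']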
corrs_sorted = corrs.sort_values()
corrs_sorted

minimum_nights                                 -0.097435
n_host_verifications                           -0.091419
minimum_minimum_nights                         -0.086198
review_scores_communication                    -0.063706
review_scores_rating                           -0.058469
review_scores_checkin                          -0.058336
review_scores_value                            -0.046112
bedrooms                                       -0.041853
bathrooms                                      -0.030011
review_scores_location                         -0.027357
review_scores_cleanliness                      -0.023509
beds                                           -0.014686
calculated_host_listings_count_entire_homes    -0.012601
minimum_nights_avg_ntm                         -0.012250
maximum_minimum_nights                         -0.008761
accommodates                                   -0.005734
maximum_nights                                 -0.003601
calculated_host_listings_count 

In [45]:
df.describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
count,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,...,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0
mean,0.906901,0.791953,14.554778,14.554778,2.874491,1.142174,1.329708,1.629556,154.228749,18.689387,...,4.8143,4.808041,4.750393,4.64767,9.5819,5.562986,3.902077,0.048283,1.758325,5.16951
std,0.172697,0.214963,120.721287,120.721287,1.860251,0.421132,0.663238,1.070269,140.816605,25.569151,...,0.438603,0.464585,0.415717,0.518023,32.227523,26.121426,17.972386,0.442459,4.446143,2.028497
min,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,29.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01,1.0
25%,0.906901,0.791953,1.0,1.0,2.0,1.0,1.0,1.0,70.0,2.0,...,4.81,4.81,4.67,4.55,1.0,0.0,0.0,0.0,0.13,4.0
50%,0.906901,0.791953,1.0,1.0,2.0,1.0,1.0,1.0,115.0,30.0,...,4.96,4.97,4.88,4.78,1.0,1.0,0.0,0.0,0.51,5.0
75%,1.0,0.95,3.0,3.0,4.0,1.0,1.329708,2.0,180.0,30.0,...,5.0,5.0,5.0,5.0,3.0,1.0,1.0,0.0,1.83,7.0
max,1.0,1.0,3387.0,3387.0,16.0,8.0,12.0,21.0,1000.0,1250.0,...,5.0,5.0,5.0,5.0,421.0,308.0,359.0,8.0,141.0,13.0


In [46]:
true_count = np.sum(df['instant_bookable'] == True)
false_count = np.sum(df['instant_bookable'] == False)
print(true_count)
print(false_count)

7640
20382


In [47]:
majority = df[df['instant_bookable'] == False]
minority = df[df['instant_bookable'] == True]

new_oversampled_minority = minority.sample(n=len(majority), replace=True)
df = pd.concat([majority, new_oversampled_minority])

In [48]:
true_count = np.sum(df['instant_bookable'] == True)
false_count = np.sum(df['instant_bookable'] == False)
print(true_count)
print(false_count)

20382
20382
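
The cells above oversample the minority class on the full data set, before any train/test split. One consequence is that copies of the same row can end up in both the training and the test set. A minimal sketch of an alternative ordering, using made-up toy data (the names `toy`, `train_balanced`, and the `stratify` choice are assumptions for illustration, not part of the lab), splits first and oversamples only the training part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data set: 8 negative and 4 positive examples (made up).
toy = pd.DataFrame({
    "x": range(12),
    "label": [False] * 8 + [True] * 4,
})

# Split first, stratifying so both parts keep the original class ratio.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=0
)

# Oversample the minority class in the training part only.
majority = train[train["label"] == False]
minority = train[train["label"] == True]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["label"].value_counts())
print(test["label"].value_counts())
```

With this ordering the test set keeps the original class distribution and shares no rows with the training set, so the evaluation reflects unseen data.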


In [49]:
df_zscores = df.select_dtypes(include=['number']).apply(stats.zscore)
df_zscores.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,-0.713895,-3.304887,-0.072264,-0.072264,-1.016719,-0.333754,0.019921,-0.591594,-0.055494,0.482138,...,-0.100989,-0.013091,0.274724,-0.435536,-0.213854,-0.089552,-0.244478,-0.113828,-0.339953,1.885576
1,-5.143485,-0.674685,-0.116588,-0.116588,0.065283,-0.333754,-0.491832,1.320307,-0.573029,-0.629665,...,-0.055704,0.00835,-0.085234,0.004931,-0.277801,-0.167046,-0.244478,-0.113828,0.554007,0.451473
2,0.533876,-2.90024,-0.116588,-0.116588,0.606284,0.896814,1.060308,0.364357,0.807062,-0.476313,...,0.442436,0.437172,-0.589175,0.694358,-0.277801,-0.167046,-0.244478,-0.113828,-0.401129,-0.98263
3,0.533876,0.89332,-0.116588,-0.116588,-0.475718,-0.333754,-0.491832,-0.591594,-0.621332,-0.591327,...,-0.327416,-0.806411,0.298721,-0.53129,-0.277801,-0.205793,-0.18629,-0.113828,0.321143,-0.504596
4,-0.046956,-0.158999,-0.116588,-0.116588,-1.016719,-0.333754,-0.491832,-0.591594,-0.573029,-0.591327,...,0.374508,0.329967,0.466701,0.541152,-0.277801,-0.205793,-0.18629,-0.113828,-0.233388,0.929507
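
The z-scores above are computed but not yet used. A minimal sketch of one way to use them, assuming a hypothetical cutoff of 2.5 standard deviations and a made-up `prices` column, flags values whose absolute z-score exceeds the cutoff:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy numeric column with one extreme value (made-up numbers).
prices = pd.Series(
    [70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 5000.0],
    name="price",
)

# z-score: distance from the mean in units of standard deviation.
z = stats.zscore(prices)

# Hypothetical cutoff; the right value depends on the data and the problem.
threshold = 2.5
outliers = prices[np.abs(z) > threshold]
print(outliers)
```

Note that in a small sample a single extreme value inflates the standard deviation, which limits how large its own z-score can get, so a fixed cutoff such as 3 can miss it.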


In [50]:
for c in df.select_dtypes('float64').columns:
    df[c + '_win'] = stats.mstats.winsorize(df[c], limits=[0.01, 0.01])
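
To make the effect of `limits=[0.01, 0.01]` concrete, here is a minimal sketch on a made-up array of 100 values (the array and variable names are assumptions for illustration). Winsorization replaces the lowest 1% and highest 1% of values with the nearest remaining value rather than dropping them.

```python
import numpy as np
from scipy import stats

# 100 made-up values, 1 through 99 plus one extreme high value.
values = np.arange(1, 101, dtype=float)
values[-1] = 10_000.0

# Clip the bottom 1% and top 1% to the nearest remaining values.
wins = stats.mstats.winsorize(values, limits=[0.01, 0.01])

print(values.min(), values.max())
print(wins.min(), wins.max())
print(len(wins))  # the number of observations is unchanged
```

Because the loop above adds `_win` columns next to the originals, both versions of each feature remain in `df`; which one to pass to the model is a choice to make when selecting features.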


In [51]:
one_hot_encoding_col = ['has_availability']

In [52]:
# get_dummies returns a new DataFrame; df itself is not modified here
pd.get_dummies(df, columns=one_hot_encoding_col, prefix=one_hot_encoding_col)

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,maximum_nights_avg_ntm_win,review_scores_rating_win,review_scores_cleanliness_win,review_scores_checkin_win,review_scores_communication_win,review_scores_location_win,review_scores_value_win,reviews_per_month_win,has_availability_False,has_availability_True
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.800000,0.170000,True,8.0,...,1125.0,4.70,4.62,4.76,4.79,4.86,4.41,0.33,0,1
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.090000,0.690000,True,1.0,...,730.0,4.45,4.49,4.78,4.80,4.71,4.64,4.86,0,1
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.000000,0.250000,True,1.0,...,1125.0,5.00,5.00,5.00,5.00,4.50,5.00,0.02,0,1
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.000000,1.000000,True,1.0,...,14.0,4.21,3.73,4.66,4.42,4.87,4.36,3.68,0,1
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,0.906901,0.791953,True,1.0,...,14.0,4.91,4.82,4.97,4.95,4.94,4.92,0.87,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1968,Cozy Entire Apt1Bd APT inGREAT Loc,"Amazing location on Madison Avenue, just a 5 b...",The area is very safe and only a 3 blocks away...,Carlos,"New York, New York, United States",I'm a songwriter/ music producer working for U...,0.906901,0.791953,True,1.0,...,7.0,4.00,4.00,5.00,5.00,5.00,5.00,0.05,0,1
11365,Queens Artist' Corner,My little Queens getaway is cozy and colorful....,"Diverse mix of young professionals, artists, f...",Sarita,"Queens, New York, United States",I am a recently graduated student of Film Stud...,0.906901,0.791953,True,1.0,...,1125.0,5.00,5.00,5.00,5.00,5.00,5.00,0.04,0,1
24410,Private room in the heart of Williamsburg,"A cozy apartment in the heart of Williamsburg,...",,Agnese,"New York, New York, United States",,1.000000,1.000000,True,1.0,...,1125.0,4.00,4.00,5.00,5.00,5.00,4.00,0.37,0,1
24306,Spacious TS Ball Drop Overlooking View Higher Fl,Best for your New York City trip! <br />Enjoy ...,,M,"New York, New York, United States",,0.980000,0.990000,True,11.0,...,28.0,4.89,4.56,4.67,4.22,5.00,4.67,1.59,0,1
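
The cell above displays the encoded DataFrame without storing it. A minimal sketch on made-up data (the names `toy` and `encoded` are assumptions for illustration) shows that `pd.get_dummies()` leaves its input unchanged, so the result has to be assigned if the indicator columns are needed later:

```python
import pandas as pd

# Toy frame; 'has_availability' mirrors the column name used above,
# but the values are made up.
toy = pd.DataFrame({
    "price": [70.0, 115.0, 180.0],
    "has_availability": [True, False, True],
})

# get_dummies returns a new DataFrame and leaves its input unchanged.
encoded = pd.get_dummies(toy, columns=["has_availability"], prefix="has_availability")

print(list(toy.columns))      # the original column is still present
print(list(encoded.columns))  # replaced by one indicator column per value
```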


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.
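
As one possible way to carry out the model selection step, the sketch below runs a cross-validated grid search over the regularization hyperparameter `C` of a logistic regression model. It uses synthetic data from `make_classification`; the grid values, `cv=5`, and `max_iter=1000` are assumptions chosen for illustration, not settings prescribed by the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the real features and label.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234
)

# Candidate values for C (hypothetical grid).
param_grid = {"C": [10**i for i in range(-3, 4)]}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
# The test set is used once, after the best value of C has been chosen.
test_accuracy = search.score(X_test, y_test)
print(best_c, test_accuracy)
```

Selecting `C` by cross-validation on the training data keeps the test set out of the tuning loop, so the final test score is not biased by the search.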

In [98]:
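y = df['instant_bookable']
# top_features is created in the cell labeled In [97], which appears further
# down in this notebook but was executed before this cell
X = df[list(top_features['Feature'])]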

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)

In [100]:
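def train_test_LR(X_train, y_train, X_test, y_test, c=1):
    # Fit a logistic regression model with regularization hyperparameter C=c
    model = LogisticRegression(C=c)
    model.fit(X_train, y_train)
    # Log loss on the test set, from the predicted class probabilities
    # (computed here but not returned)
    probability_predictions = model.predict_proba(X_test)
    l_loss = log_loss(y_test, probability_predictions)
    # Accuracy on the test set, from the predicted class labels
    class_label_predictions = model.predict(X_test)
    acc_score = accuracy_score(y_test, class_label_predictions)

    # Return the test accuracy and the fitted model
    return acc_score, model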

In [101]:
acc, model = train_test_LR(X_train, y_train, X_test, y_test)
print('Accuracy: ' + str(acc))
feature_importances = model.coef_[0]
print(feature_importances)

Accuracy: 0.5086597784880695
[ 1.80270819e-09  1.83399851e-12  1.76464899e-12  1.73258708e-12
  2.12180364e-13  2.12180364e-13  9.51052773e-14  9.42244649e-14
 -6.07061853e-10  1.95834497e-14]
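
The accuracy above is close to 0.5 on a balanced data set, and the coefficients are very close to zero. One possible cause (an assumption, not something the output confirms) is that the features are on very different scales, which can keep the default solver from converging. The sketch below, on synthetic data with made-up feature scales, standardizes the features inside a scikit-learn pipeline before fitting the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Two synthetic features on very different scales; only 'small' drives the label.
big = rng.normal(loc=1e6, scale=1e5, size=n)
small = rng.normal(loc=0.0, scale=1.0, size=n)
X = np.column_stack([big, small])
y = (small + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234
)

# The scaler is fit on the training data only, then applied to the test data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, scaled_model.predict(X_test))
coefs = scaled_model.named_steps["logisticregression"].coef_[0]
print(scaled_acc)
print(coefs)
```

After standardization the coefficients are on a comparable scale, which also makes them easier to interpret when ranking features by coefficient.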


In [97]:
coef_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': feature_importances})

coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

top_n = 10
top_features = coef_df.head(top_n)
print(top_features)
print(list(top_features['Feature']))

                          Feature   Coefficient
11         maximum_maximum_nights  1.331445e-03
32     maximum_maximum_nights_win  6.972662e-05
34     maximum_nights_avg_ntm_win  6.708869e-05
31     minimum_maximum_nights_win  6.588464e-05
2             host_listings_count  8.070622e-06
3       host_total_listings_count  8.070622e-06
28                      price_win  3.590528e-06
7                           price  3.557068e-06
10         minimum_maximum_nights  3.019665e-06
24  host_total_listings_count_win  7.445353e-07
['maximum_maximum_nights', 'maximum_maximum_nights_win', 'maximum_nights_avg_ntm_win', 'minimum_maximum_nights_win', 'host_listings_count', 'host_total_listings_count', 'price_win', 'price', 'minimum_maximum_nights', 'host_total_listings_count_win']


In [107]:
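cs = [10**i for i in range(-10,10)]
ll_cs = []
acc_cs = []
for c in cs:
    result = train_test_LR(X_train, y_train, X_test, y_test, c)
    print(result)
    # train_test_LR returns (accuracy, fitted model)
    acc, fitted_model = result
    acc_cs.append(acc)
    # Compute the log loss from the fitted model's predicted probabilities
    ll_cs.append(log_loss(y_test, fitted_model.predict_proba(X_test)))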

(0.5086597784880695, LogisticRegression(C=1e-10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False))
(0.5086597784880695, LogisticRegression(C=1e-09, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False))
(0.5086597784880695, LogisticRegression(C=1e-08, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001,