# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
import scipy.stats as stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import mean_squared_error, r2_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done previously in this course, and save it to DataFrame `df`.

In [3]:
airbnbDataSet_filename=os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
df=pd.read_csv(airbnbDataSet_filename, header=0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
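The inspection techniques listed above can be sketched on a small example. This is a minimal illustration, assuming a hypothetical toy DataFrame named `toy` with made-up values; it is not one of the four course data sets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real data set
toy = pd.DataFrame({
    "price": [50.0, 75.0, 80.0, 90.0, 2000.0],
    "beds": [1.0, np.nan, 2.0, 1.0, 3.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Private room", "Entire home/apt"],
})

summary = toy.describe()                        # key statistics for numeric columns
column_types = toy.dtypes                       # data type of each column
missing_counts = toy.isnull().sum()             # number of missing values per column
class_counts = toy["room_type"].value_counts()  # how balanced a categorical column is

print(summary)
print(column_types)
print(missing_counts)
print(class_counts)
```

In this toy data, the `max` row of `summary` sits far above the 75% row for `price`, which is one quick hint that an outlier may be present.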


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
df.shape
df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


In [5]:
df['label']=stats.mstats.winsorize(df['price'], limits=[0.01,0.01])
df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,label
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.86,4.41,False,3,3,0,0,0.33,9,150.0
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.71,4.64,False,1,1,0,0,4.86,6,75.0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,4.5,5.0,False,1,1,0,0,0.02,3,275.0
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.87,4.36,False,1,0,1,0,3.68,4,68.0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.94,4.92,False,1,0,1,0,0.87,7,75.0


In [6]:
change=(df['price'])-(df['label'])
print(np.unique(change))

[ -1.   0.   1.   2.   6.   7.  12.  15.  25.  26.  41.  43.  44.  46.
  50.  51.  58.  71.  78.  81.  83.  86.  87.  93.  96.  99. 100. 101.]
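To see why the differences above can be both negative and positive, here is a small sketch of winsorization on made-up numbers. The `prices` array and the 10% limits are assumptions chosen so that exactly one value is clipped at each end; the notebook itself uses 1% limits on the real `price` column.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical prices: eight typical values plus one extreme value at each end
prices = np.array([1.0] + [100.0] * 8 + [5000.0])

# Replace the lowest 10% and highest 10% of values with the nearest remaining value
clipped = np.asarray(winsorize(prices, limits=[0.1, 0.1]))

# Negative differences mark low values that were raised,
# positive differences mark high values that were lowered
change = prices - clipped
print(np.unique(change))
```

A negative entry in `change` therefore comes from the lower tail being raised, and the positive entries come from the upper tail being lowered.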


In [7]:
exclude=['label', 'price']
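#restrict to numeric and boolean columns; recent Pandas versions raise an error if text columns are passed to corr()
corrs=df.select_dtypes(include=['number', 'bool']).corr()['label'].drop(exclude, axis=0)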
corrs_sorted=corrs.sort_values(axis=0)
corrs_sorted

minimum_nights                                 -0.082053
minimum_minimum_nights                         -0.073076
calculated_host_listings_count_shared_rooms    -0.046807
calculated_host_listings_count_private_rooms   -0.045742
number_of_reviews                              -0.033141
n_host_verifications                           -0.025404
minimum_nights_avg_ntm                         -0.010942
maximum_minimum_nights                         -0.007278
review_scores_value                            -0.004888
review_scores_checkin                          -0.003423
maximum_nights                                 -0.001023
review_scores_communication                     0.000932
host_response_rate                              0.007774
calculated_host_listings_count                  0.021968
reviews_per_month                               0.032779
host_acceptance_rate                            0.039622
instant_bookable                                0.040906
review_scores_rating           

In [8]:
my_features=['accommodates', 'bedrooms', 'beds', 'availability_60','availability_90', 'availability_30', 'availability_365', 'maximum_nights', 'minimum_nights', 'instant_bookable', 'room_type', 'bathrooms', 'label']
#take an explicit copy so that columns added later do not trigger a SettingWithCopyWarning
df=df[my_features].copy()
df.head()

Unnamed: 0,accommodates,bedrooms,beds,availability_60,availability_90,availability_30,availability_365,maximum_nights,minimum_nights,instant_bookable,room_type,bathrooms,label
0,1,,1.0,33,63,3,338,1125,30,False,Entire home/apt,1.0,150.0
1,3,1.0,3.0,6,18,3,194,730,1,False,Entire home/apt,1.0,75.0
2,4,2.0,2.0,3,12,3,123,1125,5,False,Entire home/apt,1.5,275.0
3,2,1.0,1.0,16,34,1,192,14,2,False,Private room,1.0,68.0
4,1,1.0,1.0,0,0,0,0,14,2,False,Private room,1.0,75.0


In [9]:
df.isnull().values.any()
nan_count=np.sum(df.isnull(),axis=0)
print(nan_count)
print(df.shape)

accommodates           0
bedrooms            2918
beds                1354
availability_60        0
availability_90        0
availability_30        0
availability_365       0
maximum_nights         0
minimum_nights         0
instant_bookable       0
room_type              0
bathrooms              0
label                  0
dtype: int64
(28022, 13)


In [10]:
df['beds_na']=df['beds'].isnull()
df['bedrooms_na']=df['bedrooms'].isnull()
df.head(50)

Unnamed: 0,accommodates,bedrooms,beds,availability_60,availability_90,availability_30,availability_365,maximum_nights,minimum_nights,instant_bookable,room_type,bathrooms,label,beds_na,bedrooms_na
0,1,,1.0,33,63,3,338,1125,30,False,Entire home/apt,1.0,150.0,False,True
1,3,1.0,3.0,6,18,3,194,730,1,False,Entire home/apt,1.0,75.0,False,False
2,4,2.0,2.0,3,12,3,123,1125,5,False,Entire home/apt,1.5,275.0,False,False
3,2,1.0,1.0,16,34,1,192,14,2,False,Private room,1.0,68.0,False,False
4,1,1.0,1.0,0,0,0,0,14,2,False,Private room,1.0,75.0,False,False
5,2,1.0,,17,47,2,322,21,4,False,Private room,1.5,98.0,True,False
6,3,,1.0,30,30,2,179,730,30,True,Entire home/apt,1.0,89.0,False,True
7,1,1.0,1.0,4,34,1,309,700,30,True,Private room,1.0,62.0,False,False
8,1,1.0,1.0,7,23,0,271,45,27,False,Private room,1.0,90.0,False,False
9,4,1.0,2.0,33,63,6,334,1125,2,True,Entire home/apt,1.0,199.0,False,False


In [11]:
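#replace missing values with the column median
#assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent Pandas versions
median_beds=df['beds'].median()
print(median_beds)
df['beds']=df['beds'].fillna(median_beds)
median_bedrooms=df['bedrooms'].median()
print(median_bedrooms)
df['bedrooms']=df['bedrooms'].fillna(median_bedrooms)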
df.head(50)

1.0
1.0


Unnamed: 0,accommodates,bedrooms,beds,availability_60,availability_90,availability_30,availability_365,maximum_nights,minimum_nights,instant_bookable,room_type,bathrooms,label,beds_na,bedrooms_na
0,1,1.0,1.0,33,63,3,338,1125,30,False,Entire home/apt,1.0,150.0,False,True
1,3,1.0,3.0,6,18,3,194,730,1,False,Entire home/apt,1.0,75.0,False,False
2,4,2.0,2.0,3,12,3,123,1125,5,False,Entire home/apt,1.5,275.0,False,False
3,2,1.0,1.0,16,34,1,192,14,2,False,Private room,1.0,68.0,False,False
4,1,1.0,1.0,0,0,0,0,14,2,False,Private room,1.0,75.0,False,False
5,2,1.0,1.0,17,47,2,322,21,4,False,Private room,1.5,98.0,True,False
6,3,1.0,1.0,30,30,2,179,730,30,True,Entire home/apt,1.0,89.0,False,True
7,1,1.0,1.0,4,34,1,309,700,30,True,Private room,1.0,62.0,False,False
8,1,1.0,1.0,7,23,0,271,45,27,False,Private room,1.0,90.0,False,False
9,4,1.0,2.0,33,63,6,334,1125,2,True,Entire home/apt,1.0,199.0,False,False


In [12]:
print(np.sum(df['beds'].isnull(), axis=0))
print(np.sum(df['bedrooms'].isnull(), axis=0))

0
0
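The imputation steps above (flag the missing rows, fill with the median, then confirm nothing is missing) can be sketched end to end on a small example. The `toy` DataFrame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
toy = pd.DataFrame({"beds": [1.0, np.nan, 3.0, 2.0, np.nan]})

# Record which rows were missing before any values are replaced
toy["beds_na"] = toy["beds"].isnull()

# The median ignores missing values by default
median_beds = toy["beds"].median()

# Assign the filled column back rather than filling a column selection in place
toy["beds"] = toy["beds"].fillna(median_beds)

remaining_missing = int(toy["beds"].isnull().sum())
print(median_beds, remaining_missing)
```

Creating the indicator column before filling matters: once the values are replaced, the information about which rows were originally missing is gone.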


In [13]:
df.dtypes

accommodates          int64
bedrooms            float64
beds                float64
availability_60       int64
availability_90       int64
availability_30       int64
availability_365      int64
maximum_nights        int64
minimum_nights        int64
instant_bookable       bool
room_type            object
bathrooms           float64
label               float64
beds_na                bool
bedrooms_na            bool
dtype: object

In [14]:
df['room_type'].nunique()

4

In [15]:
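#dtype=int keeps the 0/1 encoding shown below; recent Pandas versions return True/False by default
df_room_types = pd.get_dummies(df['room_type'], prefix='room_type', dtype=int)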
df_room_types

Unnamed: 0,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
28017,0,0,1,0
28018,1,0,0,0
28019,0,0,1,0
28020,1,0,0,0


In [16]:
df.drop(columns='room_type', inplace=True)
df=df.join(df_room_types)


In [17]:
df.head()

Unnamed: 0,accommodates,bedrooms,beds,availability_60,availability_90,availability_30,availability_365,maximum_nights,minimum_nights,instant_bookable,bathrooms,label,beds_na,bedrooms_na,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,1,1.0,1.0,33,63,3,338,1125,30,False,1.0,150.0,False,True,1,0,0,0
1,3,1.0,3.0,6,18,3,194,730,1,False,1.0,75.0,False,False,1,0,0,0
2,4,2.0,2.0,3,12,3,123,1125,5,False,1.5,275.0,False,False,1,0,0,0
3,2,1.0,1.0,16,34,1,192,14,2,False,1.0,68.0,False,False,0,0,1,0
4,1,1.0,1.0,0,0,0,0,14,2,False,1.0,75.0,False,False,0,0,1,0


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [18]:
features=['accommodates', 'bedrooms', 'beds', 'availability_60','availability_90', 'availability_30', 'availability_365', 'maximum_nights', 'minimum_nights', 'instant_bookable', 'bathrooms', 'room_type_Entire home/apt', 'room_type_Hotel room', 'room_type_Private room', 'room_type_Shared room' ]
y=df['label']
X=df[features]
print(y.shape)
print(X.shape)

(28022,)
(28022, 15)


In [19]:
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.33, random_state=1234)
print(X_train.shape)
print(X_test.shape)



(18774, 15)
(9248, 15)


In [20]:
print('Begin Random Forest')
#create model 
r1model=RandomForestRegressor(n_estimators=300, min_samples_leaf=2, max_depth=25, max_features=5)
#fit model to training data
r1model.fit(X_train, y_train)
#make prediction 
predicted_price1=r1model.predict(X_test)
#create model 
r2model=RandomForestRegressor(n_estimators=300, min_samples_leaf=2, max_depth=35, max_features=5)
#fit model to training data
r2model.fit(X_train, y_train)
#make prediction 
predicted_price2=r2model.predict(X_test)
print('Finish')

Begin Random Forest


KeyboardInterrupt: 

In [None]:
mse1=mean_squared_error(y_test, predicted_price1)
r2_1=r2_score(y_test, predicted_price1)
mse2=mean_squared_error(y_test, predicted_price2)
r2_2=r2_score(y_test, predicted_price2)
print("Mean Squared Error: (", mse1 ,") (", mse2, ")" )
print("R-squared: (", r2_1 ,") (", r2_2, ")" )

I wanted to try two different ways to choose the hyperparameters. First, I adjusted the hyperparameter values by hand until the R-squared score rose above 0.5; it originally started in the low 0.4s. Next, I will try grid search.

In [21]:
from sklearn.model_selection import GridSearchCV
param_grid= {
    'n_estimators': [200,250,300],
    'max_depth':[10, 15, 20],
    'min_samples_leaf': [2,5,8]
}
print("start")
base_model=RandomForestRegressor()
print("next")
grid_search=GridSearchCV(base_model, param_grid, cv=5)
print("next")
grid_search.fit(X_train, y_train)
print("finish fit")


start
next
next
finish fit


In [22]:
print('start')
best_n=grid_search.best_estimator_.n_estimators
best_md=grid_search.best_estimator_.max_depth
best_msl=grid_search.best_estimator_.min_samples_leaf
best_mf=grid_search.best_estimator_.max_features
print("Best n:", best_n, "Best max depth:", best_md, "Best min samples leaf:", best_msl)

start
Best n: 250 Best max depth: 15 Best min samples leaf: 5
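One possible way to use the search result, sketched under stated assumptions: `GridSearchCV` exposes the winning settings as `best_params_` and, with the default `refit=True`, a model already refit on the full training set as `best_estimator_`. The synthetic data from `make_regression`, the deliberately tiny grid, and the fixed `random_state` values below are illustrative assumptions and not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical; not the Airbnb listings)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234)

# A small grid so the sketch runs quickly
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [2, 5]}
base_model = RandomForestRegressor(n_estimators=20, random_state=0)
grid_search = GridSearchCV(base_model, param_grid, cv=3)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_      # dict of the winning settings
best_model = grid_search.best_estimator_    # refit on all of X_train by default
test_r2 = best_model.score(X_test, y_test)  # R-squared on the held-out test set
print(best_params, test_r2)
```

Scoring `best_estimator_` directly keeps every hyperparameter identical to what the search evaluated. The final model in the next cell additionally sets `max_features=5`, which was not part of `param_grid`, so its score may differ from that of the configuration the search selected.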


In [28]:
print('Begin Random Forest')
#create model 
finalmodel=RandomForestRegressor(n_estimators=250, min_samples_leaf=5, max_depth=15, max_features=5)
#fit model to training data
finalmodel.fit(X_train, y_train)
#make prediction 
predicted_price=finalmodel.predict(X_test)
print("End")


Begin Random Forest
End


In [29]:
mse=mean_squared_error(y_test, predicted_price)
r2=r2_score(y_test, predicted_price)
print("Mean Squared: R2", mse, ":", r2)

Mean Squared: R2 9512.156149538512 : 0.4945017009509709
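Because the MSE is expressed in squared units of the label, its square root (RMSE) is often easier to interpret, since it is in the same units as `price`. A small sketch using the MSE value printed above:

```python
import math

mse = 9512.156149538512  # MSE reported by the final model above
rmse = math.sqrt(mse)    # root mean squared error, in the same units as the label
print(round(rmse, 2))
```

An RMSE of roughly 97.5 is consistent with the R-squared of about 0.49: the model accounts for about half of the variance in the winsorized price.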
