# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor
from scipy import stats

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind, formulate a project plan, and implement that plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename)

df.head()
print(df.columns)
print(df['amenities'])

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. We chose the Airbnb NYC listings data set (`airbnbListingsData.csv`).
2. We will predict the price of an Airbnb listing. The label is `price`.
3. This is a supervised learning problem. Because the label is a continuous value, it is a regression problem.
4. Features: `bedrooms`, `bathrooms`, `amenities`, `review_scores_rating`, `number_of_reviews`, `host_is_superhost`, `accommodates`, `room_type`.
5. A price model can help Airbnb hosts decide how much to charge for their listings. It would work similarly to Zillow's Zestimate feature.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
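
The inspection techniques listed above can be sketched on a tiny example. This is a minimal illustration only: the `toy` DataFrame and its values are hypothetical stand-ins for the real data, and the 1.5 × IQR rule is one common outlier heuristic, not a method this lab prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data set.
toy = pd.DataFrame({
    "price": [75.0, 150.0, 68.0, 275.0, 5000.0],
    "bedrooms": [1.0, np.nan, 1.0, 2.0, 3.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

print(toy.dtypes)        # data type of each column
print(toy.describe())    # key statistics for the numeric columns

missing_counts = toy.isnull().sum()   # number of missing values per column
print(missing_counts)

# Flag potential outliers in `price` with the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
print(outliers)
```

A box plot (for example `sns.boxplot(x=toy["price"])`) shows the same flagged points visually.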

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
# YOUR CODE HERE
df.dtypes
df.isnull().sum()
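# Keep only the columns relevant to predicting price. Copy the selection so that
# later column assignments modify an independent DataFrame rather than a view.
to_include = ['room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'number_of_reviews', 'review_scores_rating', 'price', 'neighbourhood_group_cleansed']
df = df[to_include].copy()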
print(df.head(10))
print(df.dtypes)

         room_type  accommodates  bathrooms  bedrooms  beds  \
0  Entire home/apt             1        1.0       NaN   1.0   
1  Entire home/apt             3        1.0       1.0   3.0   
2  Entire home/apt             4        1.5       2.0   2.0   
3     Private room             2        1.0       1.0   1.0   
4     Private room             1        1.0       1.0   1.0   
5     Private room             2        1.5       1.0   NaN   
6  Entire home/apt             3        1.0       NaN   1.0   
7     Private room             1        1.0       1.0   1.0   
8     Private room             1        1.0       1.0   1.0   
9  Entire home/apt             4        1.0       1.0   2.0   

                                           amenities  number_of_reviews  \
0  ["Extra pillows and blankets", "Baking sheet",...                 48   
1  ["Extra pillows and blankets", "Luggage dropof...                409   
2  ["Kitchen", "BBQ grill", "Cable TV", "Carbon m...                  2   
3  ["R

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Yes, we have a new feature list. We are keeping the label `price` and the features `review_scores_rating`, `neighbourhood_group_cleansed`, `beds`, `bathrooms`, `room_type`, and `accommodates`, because we believe those features affect price significantly.

We plan to use one-hot encoding for the object-typed (categorical) columns. We also plan to address outliers and to create a correlation matrix to see how each feature relates to price.

We will use a neural network.

We will train it with one hidden layer and a `max_iter` of 1000. We will split the data with `train_test_split`, fit the model on the training set, and evaluate it on the held-out test set.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [4]:
# Convert amenities to a numeric feature.
# Note: each entry is a string, so this is the character length of the amenities
# string (a rough proxy for how many amenities a listing offers), not an exact count.
df['amenities'] = df['amenities'].str.len()
print(df['amenities'])

0        530
1        621
2        231
3        424
4        271
        ... 
28017    489
28018    127
28019    239
28020    711
28021    233
Name: amenities, Length: 28022, dtype: int64
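
The values above are string lengths, since each `amenities` entry is stored as text. If an exact count of amenities is wanted instead, one option is to parse each entry first. The sketch below assumes every entry is a valid JSON list (the printed samples look that way, but this has not been verified for the whole column); the `sample` Series is hypothetical.

```python
import json

import pandas as pd

# Hypothetical sample in the same list-as-string format as the `amenities` column.
sample = pd.Series([
    '["Kitchen", "BBQ grill", "Cable TV"]',
    '["Wifi"]',
    '[]',
])

# Character length of each string (what the cell above computes).
char_lengths = sample.str.len()

# Number of items in each parsed list (an exact amenity count).
item_counts = sample.apply(lambda s: len(json.loads(s)))

print(char_lengths.tolist())
print(item_counts.tolist())
```

The two measures are correlated but not interchangeable: a listing with a few long amenity names can have a larger character length than one with many short names.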


In [5]:
df.isnull().sum()

room_type                          0
accommodates                       0
bathrooms                          0
bedrooms                        2918
beds                            1354
amenities                          0
number_of_reviews                  0
review_scores_rating               0
price                              0
neighbourhood_group_cleansed       0
dtype: int64

In [6]:
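# Record which rows were missing before imputing, so the model can use that signal.
df['bedrooms_is_null'] = df['bedrooms'].isna().astype(int)
df['beds_is_null'] = df['beds'].isna().astype(int)

# Fill missing values with the column median. Assigning the result back avoids
# chained in-place modification, which newer pandas versions warn about.
bedroom_median = df['bedrooms'].median()
df['bedrooms'] = df['bedrooms'].fillna(bedroom_median)

beds_median = df['beds'].median()
df['beds'] = df['beds'].fillna(beds_median)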

print(df.isnull().sum())
print(df.shape)
print(df.head())

room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
amenities                       0
number_of_reviews               0
review_scores_rating            0
price                           0
neighbourhood_group_cleansed    0
bedrooms_is_null                0
beds_is_null                    0
dtype: int64
(28022, 12)
         room_type  accommodates  bathrooms  bedrooms  beds  amenities  \
0  Entire home/apt             1        1.0       1.0   1.0        530   
1  Entire home/apt             3        1.0       1.0   3.0        621   
2  Entire home/apt             4        1.5       2.0   2.0        231   
3     Private room             2        1.0       1.0   1.0        424   
4     Private room             1        1.0       1.0   1.0        271   

   number_of_reviews  review_scores_rating  price  \
0                 48                  4.70  150.0   
1               
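
The indicator-plus-median approach used above can also be expressed with scikit-learn's `SimpleImputer`, which learns the median from the training rows only and then applies it to new rows. This is a sketch of an alternative, not what the notebook does; the `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test rows with missing bedroom counts.
train = pd.DataFrame({"bedrooms": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"bedrooms": [np.nan, 4.0]})

# add_indicator=True appends a 0/1 column marking which values were missing.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(train)   # median learned from train only
test_out = imputer.transform(test)         # same median reused on test

print(train_out)
print(test_out)
```

Fitting the imputer on the training split keeps information from the test rows out of the fill value.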

In [7]:
# one hot encoding
df_room_type = pd.get_dummies(df['room_type'])
df_room_type


Unnamed: 0,Entire home/apt,Hotel room,Private room,Shared room
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
28017,0,0,1,0
28018,1,0,0,0
28019,0,0,1,0
28020,1,0,0,0


In [8]:
df = df.join(df_room_type)
df.drop(columns = 'room_type', inplace = True)
#df.shape


In [9]:
df.columns

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities',
       'number_of_reviews', 'review_scores_rating', 'price',
       'neighbourhood_group_cleansed', 'bedrooms_is_null', 'beds_is_null',
       'Entire home/apt', 'Hotel room', 'Private room', 'Shared room'],
      dtype='object')

In [10]:
df_nei_type = pd.get_dummies(df['neighbourhood_group_cleansed'])
df = df.join(df_nei_type)
df.drop(columns = 'neighbourhood_group_cleansed', inplace = True)
df.head(5)

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,amenities,number_of_reviews,review_scores_rating,price,bedrooms_is_null,beds_is_null,Entire home/apt,Hotel room,Private room,Shared room,Bronx,Brooklyn,Manhattan,Queens,Staten Island
0,1,1.0,1.0,1.0,530,48,4.7,150.0,1,0,1,0,0,0,0,0,1,0,0
1,3,1.0,1.0,3.0,621,409,4.45,75.0,0,0,1,0,0,0,0,1,0,0,0
2,4,1.5,2.0,2.0,231,2,5.0,275.0,0,0,1,0,0,0,0,1,0,0,0
3,2,1.0,1.0,1.0,424,507,4.21,68.0,0,0,0,0,1,0,0,0,1,0,0
4,1,1.0,1.0,1.0,271,118,4.91,75.0,0,0,0,0,1,0,0,0,1,0,0
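
The two encoding steps above (one for `room_type`, one for `neighbourhood_group_cleansed`) can also be done in a single `pd.get_dummies` call. A minimal sketch on a hypothetical `toy` frame; the `room` and `borough` prefixes are illustrative choices that make the new column names easier to trace back to their source column.

```python
import pandas as pd

# Hypothetical rows using the same two categorical columns as the notebook.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Manhattan"],
    "accommodates": [1, 2, 1],
})

# Encode both columns at once; the originals are dropped automatically.
encoded = pd.get_dummies(
    toy,
    columns=["room_type", "neighbourhood_group_cleansed"],
    prefix=["room", "borough"],
    dtype=int,
)

print(encoded.columns.tolist())
```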


In [11]:
corr_matrix = round(df.corr(),5)
corr_matrix

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,amenities,number_of_reviews,review_scores_rating,price,bedrooms_is_null,beds_is_null,Entire home/apt,Hotel room,Private room,Shared room,Bronx,Brooklyn,Manhattan,Queens,Staten Island
accommodates,1.0,0.36944,0.72365,0.75425,0.21827,0.0647,0.0078,0.51906,-0.07052,-0.05234,0.45266,-0.01507,-0.43853,-0.06092,-0.00831,0.027,-0.02531,-0.00212,0.01391
bathrooms,0.36944,1.0,0.48365,0.3757,0.11673,-0.03266,-0.00208,0.3313,-0.10914,-0.01265,0.03139,-0.01759,-0.0288,-0.00112,-0.02246,0.07051,-0.04493,-0.02558,0.00349
bedrooms,0.72365,0.48365,1.0,0.73937,0.18951,0.00666,0.01365,0.44824,-0.15012,-0.05084,0.32158,-0.02705,-0.30729,-0.05163,-0.00953,0.07288,-0.07706,0.00587,0.01812
beds,0.75425,0.3757,0.73937,1.0,0.23244,0.07154,0.00423,0.40076,-0.09643,-0.12515,0.33112,-0.01427,-0.33345,0.01154,0.00523,0.02267,-0.05782,0.03712,0.03496
amenities,0.21827,0.11673,0.18951,0.23244,1.0,0.22294,0.14902,0.15548,-0.07315,-0.04131,0.14298,-0.0095,-0.13581,-0.02795,0.06507,0.05596,-0.13418,0.05726,0.07553
number_of_reviews,0.0647,-0.03266,0.00666,0.07154,0.22294,1.0,0.06718,-0.03349,-0.03554,-0.04139,0.00312,0.03595,-0.00688,-0.00569,0.01997,0.03306,-0.07566,0.04421,0.02198
review_scores_rating,0.0078,-0.00208,0.01365,0.00423,0.14902,0.06718,1.0,0.04507,-0.01924,-0.03202,0.096,-0.02559,-0.08842,-0.01901,-0.0054,0.0512,-0.03569,-0.023,0.0145
price,0.51906,0.3313,0.44824,0.40076,0.15548,-0.03349,0.04507,1.0,0.025,-0.02859,0.3469,0.12791,-0.35546,-0.04794,-0.07212,-0.11197,0.23764,-0.1321,-0.03698
bedrooms_is_null,-0.07052,-0.10914,-0.15012,-0.09643,-0.07315,-0.03554,-0.01924,0.025,1.0,0.05939,0.20348,-0.0105,-0.19366,-0.03997,-0.02145,-0.11471,0.17166,-0.06806,-0.00883
beds_is_null,-0.05234,-0.01265,-0.05084,-0.12515,-0.04131,-0.04139,-0.03202,-0.02859,0.05939,1.0,-0.04086,-0.01102,0.03533,0.03114,0.01418,-0.00882,0.00088,0.00313,0.00378


In [12]:
corrs = corr_matrix['price']
corrs

accommodates            0.51906
bathrooms               0.33130
bedrooms                0.44824
beds                    0.40076
amenities               0.15548
number_of_reviews      -0.03349
review_scores_rating    0.04507
price                   1.00000
bedrooms_is_null        0.02500
beds_is_null           -0.02859
Entire home/apt         0.34690
Hotel room              0.12791
Private room           -0.35546
Shared room            -0.04794
Bronx                  -0.07212
Brooklyn               -0.11197
Manhattan               0.23764
Queens                 -0.13210
Staten Island          -0.03698
Name: price, dtype: float64
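
Because strong negative correlations are as informative as strong positive ones, it can help to rank features by absolute correlation with `price`. A short sketch using a few of the values printed above (copied by hand into a hypothetical `corrs_subset` Series):

```python
import pandas as pd

# A handful of the correlations with price shown in the output above.
corrs_subset = pd.Series({
    "accommodates": 0.51906,
    "number_of_reviews": -0.03349,
    "Private room": -0.35546,
    "Manhattan": 0.23764,
    "price": 1.0,
})

# Drop the label's correlation with itself, then sort by magnitude.
ranked = corrs_subset.drop("price").abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low value here does not by itself show that a feature is useless to a neural network.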

In [13]:
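# Winsorize price: values below the 5th percentile or above the 95th percentile
# are replaced with the nearest retained value, which limits the effect of outliers.
df['price'] = stats.mstats.winsorize(df['price'], limits=[0.05, 0.05])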
print(df['price'])
print(df.head(5))

0        150.0
1         75.0
2        275.0
3         68.0
4         75.0
         ...  
28017     89.0
28018    397.0
28019     64.0
28020     84.0
28021     70.0
Name: price, Length: 28022, dtype: float64
   accommodates  bathrooms  bedrooms  beds  amenities  number_of_reviews  \
0             1        1.0       1.0   1.0        530                 48   
1             3        1.0       1.0   3.0        621                409   
2             4        1.5       2.0   2.0        231                  2   
3             2        1.0       1.0   1.0        424                507   
4             1        1.0       1.0   1.0        271                118   

   review_scores_rating  price  bedrooms_is_null  beds_is_null  \
0                  4.70  150.0                 1             0   
1                  4.45   75.0                 0             0   
2                  5.00  275.0                 0             0   
3                  4.21   68.0                 0             0   
4    
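
To see what winsorizing does, here is a small sketch on a hypothetical `prices` array, using a 10% limit on each side so the effect is visible with only ten values:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical prices with one extreme value at the top.
prices = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0])

# Replace the lowest 10% and highest 10% of values with the nearest kept value.
# winsorize returns a masked array, so convert it back to a plain ndarray.
clipped = np.asarray(winsorize(prices, limits=[0.1, 0.1]))

print(clipped)
```

The extreme values are pulled in to the nearest remaining value rather than removed, so the number of rows stays the same.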

In [14]:
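# Optional (disabled): restrict the data to the features most correlated with price.
#corr_features = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'Entire home/apt', 'Private room', 'Manhattan', 'price']
#df = df[corr_features]

# Label is price; features are all remaining columns.
y = df['price']
#X = df[corr_features]
X = df.drop(columns="price")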

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1234)
df.head()

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,amenities,number_of_reviews,review_scores_rating,price,bedrooms_is_null,beds_is_null,Entire home/apt,Hotel room,Private room,Shared room,Bronx,Brooklyn,Manhattan,Queens,Staten Island
0,1,1.0,1.0,1.0,530,48,4.7,150.0,1,0,1,0,0,0,0,0,1,0,0
1,3,1.0,1.0,3.0,621,409,4.45,75.0,0,0,1,0,0,0,0,1,0,0,0
2,4,1.5,2.0,2.0,231,2,5.0,275.0,0,0,1,0,0,0,0,1,0,0,0
3,2,1.0,1.0,1.0,424,507,4.21,68.0,0,0,0,0,1,0,0,0,1,0,0
4,1,1.0,1.0,1.0,271,118,4.91,75.0,0,0,0,0,1,0,0,0,1,0,0


In [None]:
# The grid search is commented out below because it takes a very long time to run.
# When it was run, it produced:
# Best Parameters: {'hidden_layer_sizes': (100,), 'learning_rate_init': 0.001, 'max_iter': 1000}
# The model below uses those parameters (learning_rate_init=0.001 is the MLPRegressor default).

# Fit the scaler on the training data only, then apply the same scaling to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model = MLPRegressor(max_iter = 1000, hidden_layer_sizes = (100,))
# model = MLPRegressor()
# param_grid = {
#     'hidden_layer_sizes': [(50,), (100,)],
#     'learning_rate_init': [0.001, 0.01, 0.1],
#     'max_iter': [1000]
# }

# grid = GridSearchCV(model, param_grid = param_grid,cv=5)

# grid_search = grid.fit(X_train_scaled, y_train)
# print("Best Parameters:", grid_search.best_params_)
# best_model = grid_search.best_estimator_
# test_preds = best_model.predict(X_test_scaled)

model.fit(X_train_scaled, y_train)
prediction = model.predict(X_test_scaled)
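
The scaling and grid-search steps in the cell above can also be combined in a scikit-learn `Pipeline`, so that the scaler is refit inside each cross-validation fold. The sketch below runs on small synthetic data to keep it fast; `X_demo`, `y_demo`, the reduced grid, and the RMSE-based scoring are all assumptions for illustration and are not the search that produced the parameters reported above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (hypothetical stand-in for the listings data).
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Scaling happens inside the pipeline, so each CV fold scales with its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=300, random_state=1234)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"mlp__hidden_layer_sizes": [(8,), (16,)]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
```

Setting `random_state` on the network makes repeated runs comparable, since `MLPRegressor` initializes its weights randomly.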

In [None]:
print('\nModel Performance\n\nRMSE =   %.2f'
      % np.sqrt(mean_squared_error(y_test, prediction)))
print(' R^2 =   %.2f'
      % r2_score(y_test, prediction))
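
For reference, the two metrics printed above can be computed by hand. A small worked sketch on hypothetical `y_true` and `y_pred` values, checking the manual formulas against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true prices and predictions; every prediction is off by 10.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared error, in the same units as the label.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
r2 = r2_score(y_true, y_pred)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("RMSE = %.2f" % rmse)
print(" R^2 = %.3f" % r2)
```

RMSE reads as a typical error in price units, while R^2 compares the model against always predicting the mean price.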

In [None]:
#Analysis:
#I tested multiple models on the data, including a decision tree, linear regression, and a neural network.
#Of these models, the neural network produced the lowest RMSE and the highest R^2 score.
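
A comparison like the one described in the analysis can be organized as a loop over candidate models evaluated on the same held-out split. The sketch below uses synthetic data and illustrative hyperparameters; `X_demo`, `y_demo`, and the settings for each estimator are assumptions, and the numbers it prints say nothing about the results on the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the listings data).
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# Candidate models; only the neural network needs scaled inputs.
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=1234),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=1234),
    ),
}

# Fit each model on the same training split and score it on the same test split.
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    preds = estimator.predict(X_te)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_te, preds))),
        "r2": float(r2_score(y_te, preds)),
    }

for name, scores in results.items():
    print("%-18s RMSE = %.3f  R^2 = %.3f" % (name, scores["rmse"], scores["r2"]))
```

Keeping the split and the metrics identical across models is what makes the RMSE and R^2 values directly comparable.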