# Using Keras (machine learning/neural network) to predict host ratings on Airbnb

In this notebook, I demonstrate with a simple example how easily we can build a model with Keras. The application is forecasting the review score of Airbnb hosts based on only a few features. I use a small dataset from Inside Airbnb.

To keep the notebook short, I have not shown the data collection and most of the preparation work that went into this. I still keep a small part of the variable creation and cleaning here, as a reminder of how important data cleaning and visualization are in any data project. I am going to add a demonstration of EDA later.

### Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout


Using TensorFlow backend.


### Read the data

In [2]:
df=pd.read_csv(r"E:\JY\Boston\bostondetailedlistingsdf20200803.csv", low_memory=False)
# the file contains some text data, so use the low_memory=False option.

I am going to work with a smaller number of features in this example. 

In [3]:
features=["host_since", "host_response_time", "host_response_rate", "host_acceptance_rate", "host_is_superhost",
   "host_total_listings_count", "host_has_profile_pic", "host_identity_verified", "room_type",
   "bathrooms", "bedrooms", "beds"]

target="review_scores_rating"

In [4]:
df=df[features+[target]].copy() # create a smaller df, as I know I am only using this part.

### Quick inspection of the data

In [5]:
df.head() # take a quick look at the data. notice the % sign in some variables.

|  | host_since | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | host_has_profile_pic | host_identity_verified | room_type | bathrooms | bedrooms | beds | review_scores_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-09-14 | within an hour | 100% | 100% | t | 2.0 | t | t | Private room | 1.0 | 1.0 | 1.0 | 96.0 |
| 1 | 2009-05-11 | within an hour | 100% | 50% | t | 1.0 | t | t | Private room | 1.0 | 1.0 | 1.0 | 98.0 |
| 2 | 2012-06-07 | within a few hours | 100% | 100% | f | 1.0 | t | t | Private room | 1.0 | 1.0 | 1.0 | 95.0 |
| 3 | 2011-01-02 | within a few hours | 94% | 96% | f | 1.0 | t | t | Entire home/apt | 1.0 | 1.0 | 2.0 | 86.0 |
| 4 | 2012-08-05 | within a few hours | 100% | 87% | f | 2.0 | t | t | Private room | 2.0 | 1.0 | 1.0 | 97.0 |


### Create new variables

I am interested in the hosting experience, measured in months since the host joined. I create some intermediate variables so the work is easier to follow.

In [6]:
df["host_since"]=pd.to_datetime(df["host_since"])


In [7]:
df["beg_month"]=df["host_since"].dt.month
df["beg_year"]=df["host_since"].dt.year

In [8]:
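df["curr_year"]=2015 # reference year and month treated as "now" when computing experience;
df["curr_month"]=10 # hosts who joined after October 2015 get a negative value.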

In [9]:
df["exper"]=(df["curr_year"]-df["beg_year"])*12+df["curr_month"]-df["beg_month"]
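The same month count can be computed directly from the date, without the intermediate columns. This is a minimal sketch, assuming the same October 2015 reference date; the function name `months_of_experience` is made up for illustration, and the sample uses the first two dates from the preview above plus one missing value.

```python
import pandas as pd

def months_of_experience(host_since, ref_year=2015, ref_month=10):
    """Calendar months between each host_since date and the reference month."""
    dates = pd.to_datetime(host_since)
    return (ref_year - dates.dt.year) * 12 + (ref_month - dates.dt.month)

# two dates from the preview and one missing value (illustrative only)
sample = pd.Series(["2013-09-14", "2009-05-11", None])
exper = months_of_experience(sample)
print(exper.tolist())
```

A missing `host_since` stays missing in the result, which is why `exper` shows up later among the variables with missing values.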

### Check the variable types

In [10]:
df[features].dtypes # check the data types of these variables. Notice that the % signs caused the numeric rates to be read as strings.

host_since                   datetime64[ns]
host_response_time                   object
host_response_rate                   object
host_acceptance_rate                 object
host_is_superhost                    object
host_total_listings_count           float64
host_has_profile_pic                 object
host_identity_verified               object
room_type                            object
bathrooms                           float64
bedrooms                            float64
beds                                float64
dtype: object

### Clean data

In [11]:
df["host_response_rate"]=df["host_response_rate"].str.replace("%", "").astype(float)

In [12]:
df["host_acceptance_rate"]=df["host_acceptance_rate"].str.replace("%", "").astype(float)
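Since the same cleaning step is applied to both rate columns, it could be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's method: the name `percent_to_float` is hypothetical, and it uses `pd.to_numeric(..., errors="coerce")` so that any unparseable entry becomes NaN instead of raising an error as `astype(float)` would.

```python
import pandas as pd

def percent_to_float(col):
    """Convert strings such as '94%' to floats such as 94.0; missing values stay NaN."""
    return pd.to_numeric(col.str.rstrip("%"), errors="coerce")

# illustrative values only
rates = pd.Series(["100%", "94%", None])
cleaned = percent_to_float(rates)
print(cleaned.tolist())
```

With such a helper, both columns could be cleaned in a loop over `["host_response_rate", "host_acceptance_rate"]`.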


Create another DataFrame with the dummy variables. I prefer creating a new DataFrame rather than overwriting the original one under the same name, because I think this helps when something goes wrong. This is not critical, just a personal preference.

In [13]:
df2=pd.get_dummies(df, columns=["host_is_superhost", "host_has_profile_pic", "host_identity_verified",
                                "room_type", "host_response_time"])

In [14]:
df2.shape
# check the shape after creating the dummies: get_dummies adds columns but keeps every row.

(152354, 27)
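One detail of `pd.get_dummies` is worth knowing: by default a missing category is encoded as zeros in every dummy column, so the dummy columns themselves never contain missing values. The sketch below shows this on a made-up three-row frame (the frame `toy` is an assumption for illustration), together with the `dummy_na=True` option that adds an explicit indicator for missing values.

```python
import pandas as pd

# toy frame with one missing response time (illustrative only)
toy = pd.DataFrame({"host_response_time": ["within an hour", None, "within a day"]})

# default: the missing category becomes zeros in every dummy column
plain = pd.get_dummies(toy, columns=["host_response_time"])

# dummy_na=True adds an explicit indicator column for the missing values
with_na = pd.get_dummies(toy, columns=["host_response_time"], dummy_na=True)

print(plain.columns.tolist())
print(with_na.columns.tolist())
```

Whether a missing response time should count as its own category is a modelling choice; this notebook uses the default behaviour.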

In [15]:
df["room_type"].value_counts() 
# this helps to see what room types are more common and if the data is really sparse in some dimension.
# I have checked other categorical variables as well. Not showing all of them to keep the notebook shorter.

Entire home/apt    96295
Private room       53895
Shared room         1817
Hotel room           347
Name: room_type, dtype: int64

In [16]:
df2.keys() # column names

Index(['host_since', 'host_response_rate', 'host_acceptance_rate',
       'host_total_listings_count', 'bathrooms', 'bedrooms', 'beds',
       'review_scores_rating', 'beg_month', 'beg_year', 'curr_year',
       'curr_month', 'exper', 'host_is_superhost_f', 'host_is_superhost_t',
       'host_has_profile_pic_f', 'host_has_profile_pic_t',
       'host_identity_verified_f', 'host_identity_verified_t',
       'room_type_Entire home/apt', 'room_type_Hotel room',
       'room_type_Private room', 'room_type_Shared room',
       'host_response_time_a few days or more',
       'host_response_time_within a day',
       'host_response_time_within a few hours',
       'host_response_time_within an hour'],
      dtype='object')

In [17]:
varlist=["host_total_listings_count", "bathrooms", "bedrooms", "beds",
          "exper", 'host_is_superhost_t', 'host_has_profile_pic_t',
         'host_identity_verified_t', 'room_type_Entire home/apt', 'room_type_Private room', 
          'host_response_time_within an hour',
         'host_response_time_within a few hours', 'review_scores_rating'] 
# all the variables I am considering for the model.
# see how this is different from the original list of features we started with.

In [18]:
df2[varlist].isnull().any() # check which variables have missing values.

host_total_listings_count                 True
bathrooms                                 True
bedrooms                                  True
beds                                      True
exper                                     True
host_is_superhost_t                      False
host_has_profile_pic_t                   False
host_identity_verified_t                 False
room_type_Entire home/apt                False
room_type_Private room                   False
host_response_time_within an hour        False
host_response_time_within a few hours    False
review_scores_rating                      True
dtype: bool
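Knowing which columns have missing values is a start, but it also helps to see how many values are missing. A small sketch on a made-up frame (the values in `toy` are assumptions, not taken from the data) that counts missing values and computes their share per column:

```python
import numpy as np
import pandas as pd

# toy frame with a few gaps (illustrative only)
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "beds": [1.0, 2.0, np.nan, np.nan],
    "exper": [25, 77, 40, 12],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of rows missing per column
print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

The same two calls applied to `df2[varlist]` would show whether the missing values are concentrated in one variable or spread across several.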

In [19]:
# for this example I am going to drop the rows (observations) with missing values,
# but there are better ways to deal with missing values, like imputing them with some function of the observed values.
# I don't recommend dropping observations because of missing values.

df3=df2[varlist].dropna()

In [20]:
df3.shape # see how many rows we lost by dropping the missing values

(120749, 13)
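As an alternative to dropping rows, missing feature values can be imputed. The sketch below uses median imputation, which is my choice for illustration and not something this notebook does. The frame `toy` and the list `feature_cols` are made up. Rows with a missing target are still dropped, since the target itself should not be imputed.

```python
import numpy as np
import pandas as pd

# toy frame with gaps in features and target (illustrative only)
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0, 2.0],
    "beds": [1.0, 2.0, np.nan, 5.0],
    "review_scores_rating": [96.0, 98.0, np.nan, 86.0],
})
feature_cols = ["bedrooms", "beds"]

# keep only rows where the target is observed
kept = toy.dropna(subset=["review_scores_rating"]).copy()

# fill missing feature values with the median of the observed values
medians = kept[feature_cols].median()
kept[feature_cols] = kept[feature_cols].fillna(medians)
print(kept)
```

In a real project the medians would ideally be computed on the training split only and then reused for the test split.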

### Scale the data

In [21]:
scalar=StandardScaler() 
# I recommend scaling the data before using it in Keras models.
# create the standard scaler from sklearn.

In [22]:
trans=scalar.fit(df3[varlist]) # fit the standard scaler on the data. 
# I have scaled the target variable as well. This is not recommended for a categorical target.

In [23]:
clean_ft=["host_total_listings_count", "bathrooms", "bedrooms", "beds",
          "exper", 'host_is_superhost_t', 'host_has_profile_pic_t',
         'host_identity_verified_t', 'room_type_Entire home/apt', 'room_type_Private room', 
          'host_response_time_within an hour',
         'host_response_time_within a few hours'] # these are the clean feature variables I am using.

In [24]:
df4=pd.DataFrame(trans.transform(df3[varlist]), columns=varlist) # creating another df for the transformed variables.

In [25]:
y=df4["review_scores_rating"] # target variable in the model

l=len(clean_ft) # used to set the input shape in the network
X=df4[clean_ft] # features data used in the model

In [26]:
# split the sample data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.1,
                                                    random_state = 1) 
# split the sample for checking overfitting later. 
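In this notebook the scaler is fit on the full sample before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; all names (`toy_X`, `toy_y`, `feature_scaler`, `target_scaler`) and values are assumptions for illustration. It also keeps a separate scaler for the target so that predictions can be mapped back to rating points.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (not the Airbnb sample)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "beds": rng.integers(1, 6, size=200).astype(float),
    "exper": rng.integers(0, 120, size=200).astype(float),
})
toy_y = pd.Series(rng.normal(90, 8, size=200), name="review_scores_rating")

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.1, random_state=1)

feature_scaler = StandardScaler().fit(X_tr)   # statistics from the training rows only
X_tr_s = feature_scaler.transform(X_tr)
X_te_s = feature_scaler.transform(X_te)       # test rows reuse the training mean and std

target_scaler = StandardScaler().fit(y_tr.to_frame())
y_tr_s = target_scaler.transform(y_tr.to_frame())

# values on the scaled target can be mapped back to the original rating scale
y_back = target_scaler.inverse_transform(y_tr_s)
```

With a sample this large the difference from fitting on all rows is likely small, but fitting on the training split keeps the test data untouched until evaluation.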

### Use keras to build a neural net model

In [27]:
model=Sequential() # start the Keras model. There are ways to suppress the warnings, but I prefer keeping them.




In [28]:
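# add layers to the model. I have kept the dimensions small, but you can play around and try a much more complex structure.

model.add(Dense(8, kernel_initializer='he_normal', input_shape=(l, ), activation='relu'))
model.add(Dropout(0.5)) 
# randomly drop 50% of the previous layer's outputs during training by setting them to zero.
model.add(Dense(4, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.5))
# caution: the ReLU output cannot produce negative predictions, although the standardized target takes negative values,
# and dropout after the output layer zeroes the prediction at random during training.
# a linear output layer with no dropout after it is the more common choice for regression.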




Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
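To make the effect of `Dropout(0.5)` concrete, here is a small NumPy sketch of inverted dropout as it is applied at training time: a random share of the activations is set to zero and the survivors are rescaled by `1 / (1 - rate)` so the expected value is unchanged. The function `dropout` and the array `hidden` are illustrative stand-ins, not Keras internals.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout at training time: zero a random share of the
    activations and rescale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((10000, 8))   # stand-in for the output of the Dense(8) layer
dropped = dropout(hidden, rate=0.5, rng=rng)

zero_share = (dropped == 0).mean()   # share of activations set to zero
mean_after = dropped.mean()          # stays close to the original mean of 1
print(zero_share, mean_after)
```

At prediction time the dropout layer passes its input through unchanged, so the random zeroing only affects training.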


In [29]:
model.compile(optimizer='adam', loss='mse') # compile the model.
# for a continuous target variable, it is common to use mean squared error (mse) as the loss function.
# for a binary target, binary cross-entropy is the usual choice.
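For reference, the two loss functions mentioned in the comments can be written out in a few lines of NumPy. This is a sketch of the formulas on made-up numbers, not the Keras implementation; the small `eps` clipping is an assumption added here to avoid taking the log of zero.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for a continuous target."""
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy for a 0/1 target and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# illustrative values only
y_cont = np.array([0.5, -1.0, 2.0])
y_hat = np.array([0.0, -1.0, 1.0])
mse_value = mse(y_cont, y_hat)

y_bin = np.array([1.0, 0.0])
p_hat = np.array([0.9, 0.2])
bce_value = binary_cross_entropy(y_bin, p_hat)
print(mse_value, bce_value)
```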




In [30]:
model.fit(X_train, y_train, epochs=20, batch_size=128) # fit the model with the train data




Epoch 1/20
...
Epoch 20/20


<keras.callbacks.History at 0x2611b837278>

In [31]:
model.evaluate(X_test, y_test, verbose=0) 
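# evaluate the MSE on the held-out test data. The target was standardized (variance about 1), so an MSE of about 0.93
# is only slightly better than always predicting the mean rating. There is room to improve on this simple network.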

0.9319346206578162
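A test MSE is easier to judge next to a naive baseline. The sketch below compares the reported value with the MSE of always predicting the mean, using a synthetic standardized target as a stand-in. Treating the variance of the real test split as exactly 1 is an assumption, because the scaler was fit on the full sample rather than on the test rows.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic stand-in for a standardized test target (mean 0, variance 1)
y_test_toy = rng.standard_normal(12000)
y_test_toy = (y_test_toy - y_test_toy.mean()) / y_test_toy.std()

# baseline: always predict the mean of the target
baseline_mse = float(np.mean((y_test_toy - y_test_toy.mean()) ** 2))

model_mse = 0.9319346206578162   # test MSE reported by model.evaluate above
r_squared = 1.0 - model_mse / baseline_mse
print(round(baseline_mse, 4), round(r_squared, 4))
```

Under that assumption the network explains roughly 7% of the variance in the standardized rating, which gives a sense of how much room is left for a richer model or more features.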