# Airbnb Room Price Prediction

### Washington DC

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn.metrics as metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

### Load in Dataset with Pandas

In [2]:
listings_csv = 'airbnb_listings_cleaned.csv'

# read the file into a dataframe
df = pd.read_csv(listings_csv)

# Peek at the first 5 rows
df.head()

Unnamed: 0,neighbourhood,room_type,accommodates,bedrooms,beds,price,availability_30,number_of_reviews,review_scores_rating,instant_bookable,cancellation_policy,reviews_per_month
0,"Capitol View, Marshall Heights, Benning Heights",Private room,2,1,1,38,25,1,100,t,moderate,1.0
1,"Takoma, Brightwood, Manor Park",Private room,2,1,1,71,17,4,90,t,flexible,0.33
2,"Colonial Village, Shepherd Park, North Portal ...",Entire home/apt,1,1,1,55,0,1,100,f,strict,1.0
3,"Lamont Riggs, Queens Chapel, Fort Totten, Plea...",Private room,2,1,1,60,26,2,90,t,flexible,0.43
4,"Woodridge, Fort Lincoln, Gateway",Private room,2,1,1,52,24,1,80,f,flexible,1.0


We can see various features of an Airbnb room, most of which are numeric. *room type* and *cancellation policy* appear to take only a few distinct values, which we could encode as numbers so that the model can use them. However, *neighbourhood* appears to take many different values. Let's look at the number of unique values in this column:

In [3]:
df['neighbourhood'].value_counts()

Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View                                           238
Dupont Circle, Connecticut Avenue/K Street                                                           214
Union Station, Stanton Park, Kingman Park                                                            172
Shaw, Logan Circle                                                                                   156
Edgewood, Bloomingdale, Truxton Circle, Eckington                                                    149
Capitol Hill, Lincoln Park                                                                           143
Kalorama Heights, Adams Morgan, Lanier Heights                                                       126
Brightwood Park, Crestwood, Petworth                                                                 104
Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street                         94
Howard University, Le Droit Park, Cardozo/Shaw         

In [4]:
df['neighbourhood'].value_counts().size

39

### One hot encoding

One hot encoding transforms categorical features into a numeric format that classification and regression algorithms can work with.

Neighbourhood, room type and cancellation policy are all string values that fall into a fixed set of categories, so they can be represented as numbers.

To do this we will use a Pandas function called *get_dummies*. It expands a column of categorical values into n columns, one for each distinct value in the original column. Each new column holds 1 for the rows that had that value and 0 for all other rows.
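As a minimal sketch of what *get_dummies* produces, here is a toy column that is made up for illustration (it is not read from the listings file). The `dtype=int` argument is an assumption about a reasonably recent Pandas version; it makes the output 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical toy column, not taken from the listings file
toy = pd.DataFrame(
    {"room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"]}
)

# One new column per distinct value; dtype=int gives 0/1 instead of booleans
toy_dummies = pd.get_dummies(toy["room_type"], dtype=int)
print(toy_dummies)
```

Each row has exactly one 1, in the column that matches its original value.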

Below we do this for neighbourhood, room type and cancellation policy. We also transform the instant_bookable column from 't' and 'f' values to 1s and 0s.

In [5]:
# Create one hot encoded columns from original column values
n_dummies = pd.get_dummies(df.neighbourhood)
rt_dummies = pd.get_dummies(df.room_type)
xcl_dummies = pd.get_dummies(df.cancellation_policy)

# convert boolean column to a single boolean value 1 or 0
ib_dummies = pd.get_dummies(df.instant_bookable, prefix="instant")

ib_dummies.head()

Unnamed: 0,instant_f,instant_t
0,0,1
1,0,1
2,1,0
3,0,1
4,1,0


In [6]:
#the above line will create 2 columns, one for t and one for f, so we drop f
ib_dummies = ib_dummies.drop('instant_f', axis=1) 

# replace the old columns with our new one-hot encoded ones
alldata = pd.concat((df.drop(['neighbourhood',
    'room_type', 'cancellation_policy', 'instant_bookable'], axis=1),
    n_dummies.astype(int), rt_dummies.astype(int),
    xcl_dummies.astype(int), ib_dummies.astype(int)),
    axis=1)

In [7]:
#Look at transformed dataframe:
alldata.head()

Unnamed: 0,accommodates,bedrooms,beds,price,availability_30,number_of_reviews,review_scores_rating,reviews_per_month,"Brightwood Park, Crestwood, Petworth","Brookland, Brentwood, Langdon",...,"Woodland/Fort Stanton, Garfield Heights, Knox Hill","Woodridge, Fort Lincoln, Gateway",Entire home/apt,Private room,Shared room,flexible,moderate,strict,super_strict_30,instant_t
0,2,1,1,38,25,1,100,1.0,0,0,...,0,0,0,1,0,0,1,0,0,1
1,2,1,1,71,17,4,90,0.33,0,0,...,0,0,0,1,0,1,0,0,0,1
2,1,1,1,55,0,1,100,1.0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,2,1,1,60,26,2,90,0.43,0,0,...,0,0,0,1,0,1,0,0,0,1
4,2,1,1,52,24,1,80,1.0,0,0,...,0,1,0,1,0,1,0,0,0,0


### Training and Test split

We split our data into training and test data so that we can train our model on the training data and evaluate its performance on the test data, which it has never seen before.

This lets us tell whether the model is *overfitting* (low training error and high test error) or whether it *generalizes* well (low training and test error).

To do this we use sklearn's train_test_split function, which randomly separates the data into two sets: training and test.

In [8]:
#Split data into features (X) and labels (y)
x = alldata.drop(['price'], axis=1).values
y = alldata['price']

#Split x and y further into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size =0.2)
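Because the split is random, the error values in this notebook will change from run to run. A minimal sketch of a reproducible split, using made-up toy arrays and an arbitrarily chosen `random_state=42` (neither is part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(Xa_train.shape, Xa_test.shape)     # 8 training rows, 2 test rows
print(np.array_equal(Xa_test, Xb_test))  # same seed, same split
```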

### Create  & train the model

Now we will create our model. As this is a regression problem, we will use a Linear Regression model, which is one of the simplest models available. It fits a linear function to the data set by adjusting one weight per feature (plus an intercept) so that the squared error between predictions and actual prices is as small as possible.
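To make the idea concrete, here is a small sketch on synthetic data. The data, the variable names and the true weights (3, -2 and an intercept of 5) are all made up for illustration. Since the toy target is an exact linear function of the features, the fitted weights should recover those values and the error should be essentially zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

# Fit by least squares: one weight per feature plus an intercept
toy_lm = LinearRegression().fit(X_toy, y_toy)
print(toy_lm.coef_, toy_lm.intercept_)

# Root mean squared error on the data the model was fitted to
toy_rmse = np.sqrt(np.mean((y_toy - toy_lm.predict(X_toy)) ** 2))
print(toy_rmse)
```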

In [9]:
#Create model
lm = LinearRegression()

#Train it with training data
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [10]:
# Look at the weights assigned to each feature
# Note that the neighbourhood features all get nearly identical, very large weights,
# which usually indicates that the one-hot columns are collinear
lm.coef_

array([  6.55995509e+00,   2.22159124e+13,   4.07167405e-01,
         8.61791898e-01,   3.82698676e-02,   5.08193644e-01,
        -2.91059252e+00,  -9.27278389e+09,  -9.27278389e+09,
        -9.27278386e+09,  -9.27278391e+09,  -9.27278388e+09,
        -9.27278387e+09,  -9.27278390e+09,  -9.27278388e+09,
        -9.27278387e+09,  -9.27278392e+09,  -9.27278389e+09,
        -9.27278384e+09,  -9.27278386e+09,  -9.27278392e+09,
        -9.27278388e+09,  -9.27278393e+09,  -9.27278387e+09,
        -9.27278384e+09,  -9.27278388e+09,  -9.27278390e+09,
        -9.27278388e+09,  -9.27278389e+09,  -9.27278387e+09,
        -9.27278391e+09,  -9.27278389e+09,  -9.27278386e+09,
        -9.27278388e+09,  -9.27278389e+09,  -9.27278390e+09,
        -9.27278386e+09,  -9.27278387e+09,  -9.27278387e+09,
        -9.27278388e+09,  -9.27278391e+09,  -9.27278389e+09,
        -9.27278387e+09,  -9.27278385e+09,  -9.27278390e+09,
        -9.27278391e+09,   6.80474592e+11,   6.80474592e+11,
         6.80474592e+11,
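One plausible explanation for the huge, nearly identical weights is that each full set of one-hot columns sums to 1 in every row, which makes it collinear with the model's intercept. The sketch below illustrates this on a made-up toy column and shows one common remedy, the `drop_first=True` option of *get_dummies*. The helper `with_intercept` is hypothetical and only used here to check the rank of the design matrix; this is not what the notebook above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the listings file
toy = pd.DataFrame(
    {"policy": ["flexible", "moderate", "strict", "flexible", "strict", "moderate"]}
)

full = pd.get_dummies(toy["policy"], dtype=int)                      # 3 columns
reduced = pd.get_dummies(toy["policy"], drop_first=True, dtype=int)  # 2 columns


def with_intercept(dummies):
    """Prepend a column of ones, mimicking the intercept of a linear model."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy(dtype=float)])


full_design = with_intercept(full)
reduced_design = with_intercept(reduced)

# Columns vs. rank: fewer independent columns than columns means collinearity
print(full_design.shape[1], np.linalg.matrix_rank(full_design))
print(reduced_design.shape[1], np.linalg.matrix_rank(reduced_design))
```

With the full set of dummies the design matrix has 4 columns but rank 3, so the weights are not uniquely determined. After dropping one category it has 3 columns and rank 3.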

#### Predict

In [11]:
# do predictions on training data and test data
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)

#### Calculate Error

##### Mean Squared Error & Mean Absolute Error
More information on the difference here: https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d

In [12]:
# Training error (root mean squared error)
np.sqrt(np.mean((y_train - pred_train) ** 2))

30.339101719804546

In [13]:
# Test error (root mean squared error)
np.sqrt(np.mean((y_test - pred_test) ** 2))

34.160716589440391

In [14]:
#Can also use sklearn metrics module. 
train_mse = metrics.mean_squared_error(y_train, pred_train)
test_mse = metrics.mean_squared_error(y_test, pred_test)

train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

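# Let's try an absolute-error metric too. Note that median_absolute_error returns the
# median of the absolute errors; metrics.mean_absolute_error would give the mean (MAE)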
train_mae = metrics.median_absolute_error(y_train, pred_train)
test_mae = metrics.median_absolute_error(y_test, pred_test)

In [15]:
train_rmse

30.339101719804546

In [16]:
train_mae

16.453125
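The three metrics can give quite different numbers on the same predictions. As a small sketch with made-up values (not from the listings data), the absolute errors below are 2, 2, 3 and 10. The single large error pulls the RMSE and the mean absolute error up, while the median absolute error barely notices it.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical actual and predicted prices
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 30.0])

toy_rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
toy_mean_ae = metrics.mean_absolute_error(y_true, y_hat)
toy_median_ae = metrics.median_absolute_error(y_true, y_hat)

print(toy_rmse)       # sqrt((4 + 4 + 9 + 100) / 4), about 5.41
print(toy_mean_ae)    # (2 + 2 + 3 + 10) / 4 = 4.25
print(toy_median_ae)  # median of [2, 2, 3, 10] = 2.5
```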

## Decision Tree Model

In [17]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

#### Predict

In [18]:
dt_pred_train = dt.predict(X_train)
dt_pred_test = dt.predict(X_test)

#### Calculate Error

In [19]:
dt_train_mse = metrics.mean_squared_error(y_train, dt_pred_train)
dt_test_mse = metrics.mean_squared_error(y_test, dt_pred_test)

dt_train_rmse = np.sqrt(dt_train_mse)
dt_test_rmse = np.sqrt(dt_test_mse)

In [20]:
dt_train_rmse

0.44374814180854677

In [21]:
dt_test_rmse

49.507734174298946
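A training RMSE of about 0.44 next to a test RMSE of about 49.5 is the overfitting pattern described earlier: the tree has nearly memorized the training data. One common way to counter this is to limit the depth of the tree. The sketch below uses synthetic data, and `max_depth=4` is an arbitrary illustrative value, not a tuned setting for the Airbnb data.

```python
import numpy as np
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: a smooth signal plus random noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = 20 * np.sin(X_syn[:, 0]) + rng.normal(scale=5, size=300)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return np.sqrt(metrics.mean_squared_error(y, model.predict(X)))


# Unlimited depth: the tree can keep splitting until every leaf is pure
deep = DecisionTreeRegressor(random_state=0).fit(Xs_train, ys_train)
# Limited depth: fewer leaves, so the tree has to average over noise
shallow = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xs_train, ys_train)

for name, model in [("unlimited depth", deep), ("max_depth=4", shallow)]:
    print(name, rmse(model, Xs_train, ys_train), rmse(model, Xs_test, ys_test))
```

The unlimited tree reaches a training error of essentially zero, so comparing the test errors of the two trees is what shows whether limiting the depth helps.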

#### Predict value from test set

In [22]:
X_test[20]

array([  4.  ,   1.  ,   2.  ,  26.  ,  29.  ,  99.  ,   2.45,   0.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,   0.  ,   0.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,   0.  ,
         0.  ,   0.  ,   0.  ,   1.  ,   0.  ,   0.  ])

In [23]:
#Look at actual value
y_test[20:21]

1480    175
Name: price, dtype: int64

In [24]:
lm.predict([X_test[20]])

array([ 141.5546875])

In [25]:
dt.predict([X_test[20]])

array([ 135.])