# Predict Airbnb Prices, New Orleans, Louisiana:

*Walker Blackston, June 27th, 2019*

Having now lived in New Orleans for the past year, several things have become apparent. 

*1) The heat is as advertised.* 

*2) The people are as advertised.*

*3) Be careful about the company in which you mention an "Airbnb" - whether you are on the business end or renting.*

Regardless, I love this place and it's clear that many of our millions of yearly visitors do as well. I am not here to provide support for our current zoning or short-term rental policies. Like it or not, Airbnb is here to stay. Inspired by some machine learning models and numerous competitions, I want to throw my hat into the Airbnb predictive modelling ring. 

For my purposes, this analysis will only evaluate pricing for New Orleans, where a close family friend is considering renting out several units. Finding the "ideal" price for these units will be fundamental to attracting and keeping clients.

Because it is a technique I would like to learn (by doing!), **we will construct K-nearest neighbors (KNN) models to predict optimal pricing.**

In [1]:
import pandas as pd
nola_list = pd.read_csv('listings.csv')

#check that data read in correctly and print its dimensions:
print(nola_list.shape)

#check the first few observations:
nola_list.head(4)

(6962, 106)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10291,https://www.airbnb.com/rooms/10291,20190505154604,2019-05-05,Spacious Cottage in Mid-City!,,"Spacious house located in Mid-City, New Orlean...","Spacious house located in Mid-City, New Orlean...",none,,...,f,f,moderate,t,f,1,1,0,0,1.04
1,19091,https://www.airbnb.com/rooms/19091,20190505154604,2019-05-05,Fully Furnished Cozy Apartment,CITY OF NEW ORLEANS STR LICENSE PERMIT NUMBER:...,SEE SPECIAL REDUCED SUMMER PRICING! This apa...,CITY OF NEW ORLEANS STR LICENSE PERMIT NUMBER:...,none,"NorthWest Carrollton, where this apartment is ...",...,t,f,strict_14_with_grace_period,f,f,1,1,0,0,4.0
2,26834,https://www.airbnb.com/rooms/26834,20190505154604,2019-05-05,Maison Mandeville in the Marigny,,Charming shotgun apartment in the Marigny neig...,Charming shotgun apartment in the Marigny neig...,none,,...,f,f,strict_14_with_grace_period,f,t,1,1,0,0,2.16
3,53173,https://www.airbnb.com/rooms/53173,20190505154604,2019-05-05,$95. BEST VALUE > HUNDREDS OF 5 STAR REVIEWS !,"Huge room, bath and sitting room. Current ar...",This is a VERY large bedroom - boasting very c...,"Huge room, bath and sitting room. Current ar...",none,"Funky creative types (sorta like Williamsburg,...",...,t,f,strict_14_with_grace_period,f,f,1,0,1,0,1.35


In [2]:
import numpy as np

acc_value = 3
first_living_space_value = nola_list.loc[0,'accommodates']
first_distance = np.abs(first_living_space_value - acc_value)

print(first_distance)

nola_list['distance'] = np.abs(nola_list.accommodates - acc_value)
nola_list.distance.value_counts().sort_index()

1


0      306
1     4088
2      286
3     1342
4       82
5      435
6       32
7      238
8        6
9       42
10      12
11      24
12      10
13      57
17       2
Name: distance, dtype: int64

The code above accomplishes a few tasks. 

1) It defines a simple distance between a target unit and every listing: the absolute difference in the number of guests each accommodates. With a single feature, this is the same as the Euclidean distance, a straightforward measure of how close two numeric values are.

2) It sets the target to a unit that accommodates three guests (`acc_value = 3`) and compares it first against the 'accommodates' value of the first listing (a distance of 1), then against every listing in the data.

3) In our data, we have (n=306) listings with a distance of zero, meaning they accommodate exactly three guests.
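To make the distance step concrete, here is a minimal sketch on a made-up five-row table. The `toy` DataFrame and its values are hypothetical and are not taken from `listings.csv`; only the absolute-difference calculation mirrors the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real listings
toy = pd.DataFrame({"accommodates": [2, 3, 4, 8, 3]})

target_guests = 3  # the unit we want to price sleeps three guests

# With one feature, Euclidean distance reduces to an absolute difference
toy["distance"] = np.abs(toy["accommodates"] - target_guests)

# How many toy listings sit at each distance from the target
distance_counts = toy["distance"].value_counts().sort_index()
print(distance_counts)
```

Listings with a distance of zero are the closest possible neighbors, so they are the first candidates when averaging prices.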

In [3]:
nola_list = nola_list.sample(frac=1,random_state=0)
nola_list = nola_list.sort_values('distance')
nola_list.price.head()

4148    $150.00
2500     $70.00
1105    $110.00
2998    $275.00
5914     $94.00
Name: price, dtype: object

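This code shuffles our data with a fixed seed (`random_state=0`), so the result is reproducible, and then sorts the listings by distance. The shuffle is intended to randomize the order of listings that are tied at the same distance, so the first five rows are not simply the first five in the file.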

In [4]:
#strip dollar signs and thousands separators, then convert to float
nola_list['price'] = nola_list.price.str.replace(r"[\$,]", "", regex=True).astype(float)
mean_price = nola_list.price.iloc[:5].mean()
mean_price

139.8

#### The predicted price of a New Orleans Airbnb rental that accommodates three guests = $139.80 (the mean of the five nearest listings, when only considering accommodation capacity)

But... how accurate is our model and what variables really matter?

In [5]:
#split the data (dropping the helper 'distance' column from both halves):
train_df = nola_list.drop('distance',axis=1).iloc[:2792].copy()
test_df = nola_list.drop('distance',axis=1).iloc[2792:].copy()

def pred_price(new_listing_value,feature_column):
    #work on a copy so train_df itself is not modified
    temp_df = train_df.copy()
    #distance from the new listing to every training listing
    temp_df['distance'] = np.abs(temp_df[feature_column] - new_listing_value)
    temp_df = temp_df.sort_values('distance')
    #average the prices of the five nearest neighbors
    knn_5 = temp_df.price.iloc[:5]
    predicted_price = knn_5.mean()
    return predicted_price

test_df['predicted_price'] = test_df.accommodates.apply(pred_price,feature_column='accommodates')

In [6]:
#evaluate using RMSE
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)
rmse

359.6783678473854

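This RMSE is quite large: at roughly $360, it is more than 2.5 times our earlier predicted price of $139.80, so predictions from `accommodates` alone are not reliable. Part of the problem may be the split itself: `nola_list` was still sorted by `distance` when we split it, so the training rows are all listings that accommodate two to four guests, while every listing of any other size falls in the test set. What should we do?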
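One way an RMSE can end up far above a typical price is through a handful of extreme values, because squaring the errors weights them heavily. Whether that is what happens in the New Orleans data is an assumption that is not checked here; the sketch below uses made-up prices purely to show the effect, comparing RMSE with the mean absolute error (MAE).

```python
import numpy as np

# Hypothetical nightly prices: four ordinary listings and one luxury outlier
actual = np.array([100.0, 120.0, 90.0, 110.0, 2000.0])
predicted = np.full_like(actual, 105.0)  # a flat $105 prediction for every listing

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# RMSE with the outlier removed, for comparison
rmse_no_outlier = np.sqrt(np.mean(errors[:-1] ** 2))

print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}, RMSE without outlier: {rmse_no_outlier:.2f}")
```

If the real data behave similarly, inspecting the price distribution before modeling would be a sensible next step.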

### Multivariate Modeling: 

Before we dive right into modelling with several variables, let's inspect the variables that might correlate, perhaps strongly, with price. 

First, let's look at the scope of our data:

In [7]:
list(nola_list.columns.values)

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


In [8]:
#this will visualize correlations between variables, but without labels (too many vars)
import matplotlib.pyplot as plt

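#restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
plt.matshow(nola_list.corr(numeric_only=True))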
plt.show()

<Figure size 480x480 with 1 Axes>

**TODO:**

1) Clean data for relevant variables, reproduce (updated) correlation matrix
2) Create a minimum of three models and add in sequentially
3) Implement models into Scikit-learn
4) Split data
5) Train KNN models from sklearn
6) Fit model
7) Make predictions
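As a rough idea of what the scikit-learn steps in this list could look like, here is a minimal sketch on synthetic data. The data, the two feature names, and the choice to standardize features are all assumptions made for illustration; this is not the notebook's final model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the New Orleans listings): price depends on
# guests and bedrooms plus noise
rng = np.random.default_rng(0)
n = 500
accommodates = rng.integers(1, 9, size=n)
bedrooms = rng.integers(1, 5, size=n)
price = 40 * accommodates + 25 * bedrooms + rng.normal(0, 10, size=n)

X = np.column_stack([accommodates, bedrooms])
y = price

# Split at random, so the test rows never influence the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features so no single column dominates the Euclidean distance
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
rmse = np.sqrt(np.mean((predictions - y_test) ** 2))
print(f"Test RMSE: {rmse:.2f}")
```

Putting the scaler and the regressor in one pipeline means the scaling is learned from the training rows only and then reused on the test rows.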

## Airbnb prices within a ... Random Forest:

Now, we would like to apply a common feature selection technique, a Random Forest (RF) model, to our data. The assumptions and steps of this process are as follows: 

0) Our data are in numeric format

0) a) Drop variables that are clearly not useful for our analysis

1) Assess and address missingness

2) Assess and address variables with low to minimal inherent variance

3) Filter for variables highly correlated with our target, prices

4) We train a random forest regressor on random data 
 
This assumes that we align our measure of variable importance with the goal of the final model. Here, since we aim to predict price, a continuous outcome, we use regression techniques, so we will be assessing features based on how much they reduce the *variance* of price.
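Steps 2 and 3 above can be sketched with plain pandas. The data below are synthetic, the column names `constant_flag` and `noise` are hypothetical, and the 0.3 correlation cutoff is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; values are made up for illustration
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "constant_flag": np.ones(n),    # constant column: zero variance
    "noise": rng.normal(size=n),    # unrelated to price
})
toy["price"] = 40 * toy["accommodates"] + rng.normal(0, 5, size=n)

predictors = toy.drop(columns="price")

# Step 2: drop predictors whose variance is (near) zero
variances = predictors.var()
kept = predictors.loc[:, variances > 1e-8]

# Step 3: keep predictors whose absolute correlation with price clears
# the (arbitrary) threshold
corr_with_price = kept.corrwith(toy["price"]).abs()
selected = corr_with_price[corr_with_price > 0.3].index.tolist()
print(selected)
```

A correlation filter only captures linear relationships, which is one reason to follow it with a model-based ranking such as a random forest.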

## Dimensionality reduction & Feature Selection:


id                                                0.000000
listing_url                                       0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
summary                                           1.881643
space                                            15.713875
description                                       0.703821
experiences_offered                               0.000000
neighborhood_overview                            23.340994
notes                                            41.166332
transit                                          24.446998
access                                           30.810112
interaction                                      24.849181
house_rules                                      27.894283
thumbnail_url                                   100.000000
medium_url                                      100.0000

Unnamed: 0,id,scrape_id,thumbnail_url,medium_url,xl_picture_url,host_id,host_acceptance_rate,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,distance
count,6962.0,6962.0,0.0,0.0,0.0,6962.0,0.0,6962.0,6962.0,0.0,...,6349.0,6348.0,6349.0,6349.0,6962.0,6962.0,6962.0,6962.0,6377.0,6962.0
mean,20624600.0,20190510000000.0,,,,80273200.0,,90.583022,90.583022,,...,9.864546,9.828765,9.618523,9.584501,16.948578,16.371445,0.545102,0.032031,2.381015,2.118931
std,9442324.0,2.976776,,,,75720840.0,,301.150496,301.150496,,...,0.504783,0.547833,0.663199,0.704642,42.183436,42.337097,1.611103,0.373266,1.738637,2.096423
min,10291.0,20190510000000.0,,,,971.0,,0.0,0.0,,...,2.0,2.0,2.0,2.0,1.0,0.0,0.0,0.0,0.01,0.0
25%,14493190.0,20190510000000.0,,,,13203910.0,,1.0,1.0,,...,10.0,10.0,9.0,9.0,1.0,1.0,0.0,0.0,1.01,1.0
50%,22099900.0,20190510000000.0,,,,52871960.0,,2.0,2.0,,...,10.0,10.0,10.0,10.0,2.0,1.0,0.0,0.0,2.07,1.0
75%,28967060.0,20190510000000.0,,,,126818600.0,,7.0,7.0,,...,10.0,10.0,10.0,10.0,6.0,5.0,0.0,0.0,3.38,3.0
max,34421590.0,20190510000000.0,,,,259528400.0,,1653.0,1653.0,,...,10.0,10.0,10.0,10.0,167.0,167.0,16.0,6.0,18.81,17.0


['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'description', 'experiences_offered', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_response_time', 'host_response_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'price', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', '

**This means that a total of (n=88) variables remain after excluding those with more than 10% missing values.**
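The missingness filter described here can be sketched as follows. The `toy` table and the name `kept_columns` are hypothetical; only the 10% threshold comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical 20-row table; the real listings frame has 106 columns
toy = pd.DataFrame({
    "price": [100.0, 150.0, 80.0, 120.0] * 5,
    "notes": ["a", None, None, None] * 5,    # 75% missing
    "thumbnail_url": [np.nan] * 20,          # 100% missing
    "bedrooms": [1.0] * 19 + [np.nan],       # 5% missing
})

# Percentage of missing values per column
pct_missing = toy.isnull().mean() * 100

# Keep only columns with at most 10% missing values
kept_columns = pct_missing[pct_missing <= 10].index.tolist()
reduced = toy[kept_columns]
print(kept_columns)
```

Selecting with a list of column names, as in `toy[kept_columns]`, returns a DataFrame that contains only the retained variables.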

In [29]:
#updating data set: feats is a list of column names, so select those columns directly
final_df = nola_list[feats]

final_df.head(5)

In [20]:
labs = np.array(nola_list['price'])

features = nola_list.drop('price', axis = 1)

In [None]:
#filtering and dropping nonsensical variables (IDs, URLs, etc.)
update_nola = nola_list.drop(['id', 'scrape_id', 'last_scraped','listing_url', 'name', 'description',
                              'host_url', 'host_id', 'country', 'latitude', 'longitude', 'summary'], axis = 1)
       
print(update_nola.shape)

In [None]:
update_nola.head(4)

In [None]:
#correlations among the numeric predictors, excluding the target (price)
df = update_nola.drop(columns='price')
df.corr(numeric_only=True)

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_mod = RandomForestRegressor(random_state=1, max_depth=10)

#separate the target (price) so it cannot leak into the predictors
target = update_nola['price']
update_nola = pd.get_dummies(update_nola.drop(columns='price'))
rf_mod.fit(update_nola, target)

*This will take a bit to run: it fits an ensemble of decision trees (each up to 10 levels deep) on bootstrap samples of our data, using all of the remaining dummy-encoded features.*

In [None]:
features = update_nola.columns
importances = rf_mod.feature_importances_
indices = np.argsort(importances)[-10:]  # top 10 
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
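For a self-contained view of the same idea, here is a minimal sketch that fits a random forest on synthetic data and ranks the importances. The data and the `noise_a`/`noise_b` columns are made up; `n_estimators=100` is an illustrative setting, not necessarily what the notebook would use.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the real listings): only two columns drive price
rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bedrooms": rng.integers(1, 5, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 40 * X["accommodates"] + 25 * X["bedrooms"] + rng.normal(0, 5, size=n)

rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=1)
rf.fit(X, y)

# Pair each importance with its column name, then rank from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
top_two = ranked.index[:2].tolist()
print(ranked)
```

Keeping the importances in a named Series avoids index bookkeeping: `ranked.head(10)` returns the ten most important features with their labels attached.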