# AirBnB Price Prediction

### Introduction/Background
New York City has the second-most available AirBnB listings in the US. It's a common destination for vacations and work trips. New York is a contained microcosm of AirBnB that's representative of many large metropolitan destinations on the app. It would be useful to know which factors determine how much an AirBnB costs within New York, and eventually to check whether the same factors hold in other cities.

This notebook will attempt to predict prices for AirBnB listings in NYC.

### Importing Data
I'll use an overkill method for loading our data efficiently. It sets the datatypes to be much smaller than pandas' large default datatypes, so the data will occupy less system memory while we execute this notebook.
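To see why the smaller datatypes matter, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the listings data). It compares the memory used by 64-bit columns with the same columns downcast to `uint16` and `float32`, and shows the rounding that `float16` introduces.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
n = 10_000
toy = pd.DataFrame({
    "price": np.arange(n, dtype="int64") % 500,   # every value fits in uint16
    "latitude": np.linspace(40.5, 40.9, n),       # float64 by default
})

# 64-bit columns use 8 bytes per value.
default_bytes = toy.memory_usage(index=False).sum()

# Downcast to narrower types: 2 bytes for uint16, 4 bytes for float32.
small = toy.astype({"price": "uint16", "latitude": "float32"})
small_bytes = small.memory_usage(index=False).sum()

print(f"default: {default_bytes:,} bytes, downcast: {small_bytes:,} bytes")

# float16 keeps only about three significant decimal digits.
rounded = float(np.float16(0.21))
print(rounded)
```

The saving comes at a cost: narrower floats round their values, so `float16` is best kept for columns where a few significant digits are enough.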

In [29]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm

from sklearn.model_selection import train_test_split

In [30]:
bnb_path = "../data/AB_NYC_2019.csv"

df_tmp = pd.read_csv(
    bnb_path,
    nrows=5
)

traintypes = {
    'id': 'int32',
    'name': 'str',
    'host_id': 'int32',
    'host_name': 'str',
    'neighbourhood_group': 'str',
    'neighbourhood': 'str',
    'latitude': 'float32',
    'longitude': 'float32',
    'room_type': 'str',
    'price': 'uint16',
    'minimum_nights': 'uint16',
    'number_of_reviews': 'uint16',
    'last_review': 'str',
    'reviews_per_month': 'float16',
    'calculated_host_listings_count': 'uint16',
    'availability_365': 'uint16',
}

df_list = []

chunksize = 1_000_000

for df_chunk in tqdm(
    pd.read_csv(
        bnb_path,
        dtype=traintypes, 
        chunksize=chunksize
    )
):
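    # Keep only the date part (YYYY-MM-DD) and parse it as a UTC timestamp.
    # Assumption: the column holds dates without a time of day, as the
    # midnight timestamps in the output suggest.
    df_chunk['last_review'] = df_chunk['last_review'].str.slice(0, 10)
    df_chunk['last_review'] = pd.to_datetime(df_chunk['last_review'], utc=True, format='%Y-%m-%d')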
    
    df_list.append(df_chunk)
    
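# Only the first chunk is kept; with this chunksize the file loads as a single chunk.
bnb_df = pd.concat(df_list[0:1])
# Drop every row that contains a missing or infinite value in any column.
bnb_df = bnb_df[~bnb_df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]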

del df_list

bnb_df.head()

1it [00:00,  2.37it/s]


| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.647491 | -73.972366 | Private room | 149 | 1 | 9 | 2018-10-19 00:00:00+00:00 | 0.209961 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.983772 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 00:00:00+00:00 | 0.379883 | 2 | 355 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.685139 | -73.959763 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 00:00:00+00:00 | 4.640625 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.798512 | -73.943993 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 00:00:00+00:00 | 0.099976 | 1 | 0 |
| 5 | 5099 | Large Cozy 1 BR Apartment In Midtown East | 7322 | Chris | Manhattan | Murray Hill | 40.747669 | -73.974998 | Entire home/apt | 200 | 3 | 74 | 2019-06-22 00:00:00+00:00 | 0.589844 | 1 | 129 |


We've got a fair number of features that we might be able to use to predict price here. Let's bring in another dataset, NYC rolling property sales, to cross-reference against the AirBnB prices we see.

In [31]:
housing_path = "../data/nyc-rolling-sales.csv"

housing_df = pd.read_csv(housing_path)

# Lower-case every column name.
housing_df = housing_df.rename(columns=str.lower)
# Drop every row that contains a missing or infinite value in any column.
housing_df = housing_df[~housing_df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

housing_df.head()

| | unnamed: 0 | borough | neighborhood | building class category | tax class at present | block | lot | ease-ment | building class at present | address | ... | residential units | commercial units | total units | land square feet | gross square feet | year built | tax class at time of sale | building class at time of sale | sale price | sale date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 |
| 1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 |
| 2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |


In [32]:
print(housing_df.describe())

         unnamed: 0       borough         block           lot      zip code  \
count  84548.000000  84548.000000  84548.000000  84548.000000  84548.000000   
mean   10344.359878      2.998758   4237.218976    376.224015  10731.991614   
std     7151.779436      1.289790   3568.263407    658.136814   1290.879147   
min        4.000000      1.000000      1.000000      1.000000      0.000000   
25%     4231.000000      2.000000   1322.750000     22.000000  10305.000000   
50%     8942.000000      3.000000   3311.000000     50.000000  11209.000000   
75%    15987.250000      4.000000   6281.000000   1001.000000  11357.000000   
max    26739.000000      5.000000  16322.000000   9106.000000  11694.000000   

       residential units  commercial units   total units    year built  \
count       84548.000000      84548.000000  84548.000000  84548.000000   
mean            2.025264          0.193559      2.249184   1789.322976   
std            16.721037          8.713183     18.972584    537.34

In [33]:
# Housing names are upper-case; title-case them to compare with the AirBnB names.
housing_df["neighborhood"] = housing_df["neighborhood"].str.title()

housing_neighborhoods = housing_df["neighborhood"].unique()
bnb_neighborhoods = bnb_df["neighbourhood"].unique()

print("In housing but not bnb:")
print(len(list(set(housing_neighborhoods) - set(bnb_neighborhoods))))

print("In airbnb but not housing:")
print(len(list(set(bnb_neighborhoods) - set(housing_neighborhoods))))

print("In both:")
print(len(list(set(bnb_neighborhoods) & set(housing_neighborhoods))))

# Note: this adds the lengths of the two lists, so names found in both are counted twice.
print("Total:")
print(len(housing_neighborhoods) + len(bnb_neighborhoods))

In housing but not bnb:
119
In airbnb but not housing:
83
In both:
135
Total:
472
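The "Total" above adds the lengths of the two lists, so the 135 shared names are counted twice; the number of distinct names is 119 + 83 + 135 = 337. As a rough alternative to the set arithmetic, an outer merge with `indicator=True` produces all three counts in one pass. The frames and names below are made-up stand-ins, not the real data.

```python
import pandas as pd

# Toy neighborhood names (made up for illustration).
housing_names = pd.DataFrame({"name": ["Midtown", "Chelsea", "Harlem"]})
bnb_names = pd.DataFrame({"name": ["Midtown", "Harlem", "Kensington", "Astoria"]})

# An outer merge keeps every name; the `_merge` column records whether a name
# came from the left frame only, the right frame only, or both.
overlap = housing_names.merge(bnb_names, on="name", how="outer", indicator=True)
counts = overlap["_merge"].value_counts()
print(counts)

# Each distinct name appears on exactly one row, so shared names count once.
n_distinct = len(overlap)
print("Distinct names:", n_distinct)
```

This only works as a count of distinct names when each frame lists every name once, which is why the sketch starts from unique values.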


It turns out that many of the AirBnB neighborhood names aren't exact matches. For example, a listing near a boundary line might be labeled "neighborhood1/neighborhood2". We'll work around that when we run into the issue.
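One possible workaround, sketched here under assumptions: split a compound name on "/" and accept the first part that appears in the other dataset. The helpers `name_parts` and `match_neighborhood` are hypothetical names introduced for this illustration, and the sample names are made up.

```python
from typing import List, Optional, Set


def name_parts(name: str) -> List[str]:
    """Split a compound name such as "A/B" into title-cased parts."""
    return [part.strip().title() for part in name.split("/") if part.strip()]


def match_neighborhood(name: str, known: Set[str]) -> Optional[str]:
    """Return the first part of `name` that is in `known`, else None."""
    for part in name_parts(name):
        if part in known:
            return part
    return None


# Made-up names standing in for the housing neighborhoods.
known = {"Midtown", "Harlem"}

print(match_neighborhood("chelsea/midtown", known))
print(match_neighborhood("Kensington", known))
```

Taking the first matching part is an arbitrary tie-break. Note also that `str.title()` capitalizes the letter after an apostrophe, so names containing one may need separate handling.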

In [34]:
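# Draft: mean sale price per neighborhood. The housing column is named "sale price"
# and contains "-" placeholders, so it must be converted to numbers before averaging.
# for neighborhood in housing_neighborhoods:
#     neighborhood_mean_price = np.mean(housing_df[housing_df["neighborhood"] == neighborhood]["sale price"])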
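The loop drafted above could be written as a single `groupby`. This is a minimal sketch on a made-up frame, assuming the "sale price" column holds numeric strings mixed with "-" placeholders as in the housing output; `toy_sales` and its values are invented for illustration.

```python
import pandas as pd

# Made-up rows shaped like the housing data.
toy_sales = pd.DataFrame({
    "neighborhood": ["Midtown", "Midtown", "Harlem", "Harlem"],
    "sale price": ["1000000", " -  ", "400000", "600000"],
})

# Convert to numbers; placeholders that cannot be parsed become NaN.
toy_sales["sale price"] = pd.to_numeric(toy_sales["sale price"], errors="coerce")

# mean() skips NaN by default, so placeholder rows do not affect the average.
mean_price = toy_sales.groupby("neighborhood")["sale price"].mean()
print(mean_price)
```

Skipping the placeholder rows is one choice among several; whether those sales should be dropped or treated differently depends on what "-" means in the source data.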