# Using k Nearest Neighbors to Estimate AirBnB Listing Price

Using the October 2015 Washington, D.C. listings from http://data.insideairbnb.com/united-states/dc/washington-dc/2015-10-03/data/listings.csv.gz, I will predict the price of a listing from the prices of its nearest neighbors.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

In [5]:
listings = pd.read_csv('listings.csv')
listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

In [9]:
cols_to_keep = ['host_response_rate',
               'host_acceptance_rate',
               'host_listings_count',
               'latitude',
               'longitude',
               'city',
               'zipcode',
               'state',
               'accommodates',
               'room_type',
               'bedrooms',
               'bathrooms',
               'beds',
               'price',
               'cleaning_fee',
               'security_deposit',
               'minimum_nights',
               'maximum_nights',
               'number_of_reviews']

In [10]:
dc_listings = listings[cols_to_keep]

In [12]:
print(dc_listings.iloc[0])

host_response_rate                  92%
host_acceptance_rate                91%
host_listings_count                  26
latitude                          38.89
longitude                      -77.0028
city                         Washington
zipcode                           20003
state                                DC
accommodates                          4
room_type               Entire home/apt
bedrooms                              1
bathrooms                             1
beds                                  2
price                           $160.00
cleaning_fee                    $115.00
security_deposit                $100.00
minimum_nights                        1
maximum_nights                     1125
number_of_reviews                     0
Name: 0, dtype: object


## Distance Metric

We will use the standard Euclidean distance, $d(p, q) = \sqrt{\sum_i \left(p_i - q_i\right)^2}$, computed across the numeric features.
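As a minimal sketch of how this distance behaves across several numeric features, the snippet below compares two made-up listings. The helper name `euclidean_distance` and the feature values are assumptions for illustration only; they do not come from the dataset.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Hypothetical listings described by (accommodates, bedrooms, bathrooms).
listing_a = [4, 1, 1]
listing_b = [3, 2, 1]
print(euclidean_distance(listing_a, listing_b))  # sqrt(1 + 1 + 0)

# With a single feature the formula reduces to an absolute difference.
print(euclidean_distance([3], [4]))
```

With only one feature, the square and the square root cancel, so the distance is simply the absolute difference between the two values.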

First, I will test with a single feature, *accommodates*, for which the distance reduces to the absolute difference $|p - q|$.

In [32]:
new_listing = 3
first_distance = np.sqrt((new_listing - dc_listings.iloc[0]['accommodates'])**2)
print(first_distance)

1.0


In [33]:
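# For a single feature, the Euclidean distance is just the absolute difference.
# dc_listings is a slice of listings, so this assignment raises the
# SettingWithCopyWarning shown below; selecting the columns with .copy() would avoid it.
dc_listings['distance'] = (dc_listings['accommodates'] - new_listing).abs()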
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [34]:
len(dc_listings)

3723

In [35]:
np.random.seed(1)

In [41]:
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

In [42]:
dc_listings = dc_listings.sort_values(by=['distance'])
print(dc_listings.head(10))

     host_response_rate host_acceptance_rate  host_listings_count   latitude  \
2318                66%                  96%                    5  38.921236   
3211               100%                 100%                    2  38.931935   
3356                50%                 100%                    1  38.934330   
1524                91%                  91%                    1  38.949829   
2477                66%                  96%                    5  38.920730   
334                 32%                  80%                    8  38.899747   
3140                90%                  60%                    1  38.928391   
3179                70%                  75%                    2  38.923038   
1011                NaN                  NaN                    1  38.907382   
293                 92%                  76%                  206  38.901786   

      longitude        city zipcode state  accommodates        room_type  \
2318 -77.040380  Washington   20009    DC  

In [45]:
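# Strip thousands separators and dollar signs, then convert to float.
# regex=False makes sure '$' is treated as a literal character, not a regex anchor.
dc_listings['price'] = dc_listings['price'].str.replace(',', '', regex=False)
dc_listings['price'] = dc_listings['price'].str.replace('$', '', regex=False)
dc_listings['price'] = dc_listings['price'].astype(float)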
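To make the price cleanup easier to check in isolation, here is a small sketch on a made-up Series in the same `$1,234.00` format as the `price` column. The sample values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Made-up price strings in the same format as the `price` column.
raw_prices = pd.Series(['$160.00', '$1,250.00', '$85.00'])

# regex=False treats ',' and '$' as literal characters, not regex patterns.
clean_prices = (raw_prices
                .str.replace(',', '', regex=False)
                .str.replace('$', '', regex=False)
                .astype(float))

print(clean_prices.tolist())  # [160.0, 1250.0, 85.0]
```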

In [46]:
mean_price = np.mean(dc_listings['price'].head(5))

In [47]:
mean_price

86.8

## Resetting and Making Predictions for Various Accommodation Sizes

In [49]:
dc_listings = pd.read_csv('listings.csv')
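# Strip thousands separators and dollar signs, then convert to float.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# Shuffle the rows so that ties in distance are broken at random.
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing):
    """Return the mean price of the 5 listings closest in `accommodates`."""
    temp_df = dc_listings.copy()
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    temp_df = temp_df.sort_values(by=['distance'])
    # Average the five nearest neighbors without overwriting the input argument.
    predicted_price = temp_df['price'].head(5).mean()
    return predicted_price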

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)

In [50]:
print(acc_one,acc_two,acc_four)

77.0 92.0 172.6
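The function above hard-codes five neighbors and reads the global `dc_listings`. As a hedged sketch of one way to generalize it, the version below takes the data frame and the number of neighbors `k` as arguments. The name `predict_price_knn`, its `seed` parameter, and the toy data are assumptions for illustration; this is not part of the original analysis.

```python
import pandas as pd

def predict_price_knn(df, new_accommodates, k=5, seed=1):
    """Mean price of the k listings closest in `accommodates`."""
    # Shuffle first so that ties in distance are broken at random.
    shuffled = df.sample(frac=1, random_state=seed)
    distance = (shuffled['accommodates'] - new_accommodates).abs()
    # A stable sort keeps the shuffled order among equal distances.
    nearest = distance.sort_values(kind='mergesort').index[:k]
    return shuffled.loc[nearest, 'price'].mean()

# Toy data (made up) with the two columns the function needs.
toy = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 4, 6],
    'price': [50.0, 80.0, 90.0, 110.0, 150.0, 170.0, 300.0],
})

# The two nearest listings to accommodates == 2 are the two exact matches.
print(predict_price_knn(toy, 2, k=2))  # 85.0
```

Passing the data frame in explicitly keeps the function free of global state, which makes it easier to try different values of `k` on the same data.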
