# AIRBNB PROJECT
---
As an AirBnB host I want to:
- find listings that are similar to mine
- average the listings' prices 
- set my place's price to the averaged price

This way I don't lose business by pricing too high, and I don't lose money by pricing too low.

Machine Learning Model: `K-Nearest Neighbors` - supervised model (used here for regression, since price is a continuous value)

Data: 
- [Inside AirBnB](http://insideairbnb.com/get-the-data.html)
    - `October 3, 2015: listings.csv.gz` Archived
- Dataquest.io: `dc_airbnb.csv`
    - Cleaned to keep the essential columns only (from 92 to 19)

## 1. Data Exploration

### 1.1 Introducing the data

In [1]:
import pandas as pd
import numpy as np

dc = pd.read_csv('data/dc_airbnb.csv')

In [2]:
len(dc.columns)

19

In [3]:
dc.columns

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
       'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode',
       'state'],
      dtype='object')

In [4]:
dc.shape

(3723, 19)

In [5]:
# display first row
dc.iloc[0]

host_response_rate                  92%
host_acceptance_rate                91%
host_listings_count                  26
accommodates                          4
room_type               Entire home/apt
bedrooms                              1
bathrooms                             1
beds                                  2
price                           $160.00
cleaning_fee                    $115.00
security_deposit                $100.00
minimum_nights                        1
maximum_nights                     1125
number_of_reviews                     0
latitude                          38.89
longitude                      -77.0028
city                         Washington
zipcode                           20003
state                                DC
Name: 0, dtype: object

### 1.2 Euclidean distance

Here's the strategy I wanted to use:
- Find a few similar listings.
- Calculate the average nightly rental price of these listings.
- Set the average price as the price for my listing.

When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance.
$$
d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}
$$
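
As a minimal sketch of the general formula, the function below computes the distance between two feature vectors with NumPy. The function name `euclidean_distance` and the two example listings are made up for illustration; they are not part of the dataset.

```python
import numpy as np

def euclidean_distance(q, p):
    """Return the Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Square the per-feature differences, sum them, and take the square root
    return np.sqrt(np.sum((q - p) ** 2))

# Hypothetical listings described by (accommodates, bedrooms, bathrooms)
listing_a = [4, 1, 1]
listing_b = [3, 1, 2]
dist = euclidean_distance(listing_a, listing_b)
```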

In my case, since I am using only one feature, the *univariate Euclidean distance* is:
$$
d = |q_1 - p_1|
$$

The living space that I want to rent can accommodate 3 people. I will first calculate the distance, using just the `accommodates` feature, between the first living space in the dataset and my own.

In [6]:
my_acc = 3
acc_1st = dc.iloc[0]['accommodates']
euclid_1 = np.abs(acc_1st - my_acc)
euclid_1

1

- The lowest value $d$ can have is 0, meaning that $q_1 = p_1$
- The closer to 0 the distance, the more similar the living spaces are
- I can calculate the distance between my place and each place in the dataset and rank the result by ascending distance values
    - The higher the ranking, the higher the similarity
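
The ranking idea can be sketched on a small made-up table before applying it to the full dataset. The `toy` DataFrame and its values are assumptions for illustration only; `kind='mergesort'` is chosen here because it is a stable sort, so tied rows keep their original relative order.

```python
import pandas as pd

# Hypothetical listings; only the `accommodates` feature is used
toy = pd.DataFrame({'accommodates': [4, 3, 2, 6, 3]})
my_acc = 3

# Univariate Euclidean distance to my own listing
toy['euclid'] = (toy['accommodates'] - my_acc).abs()

# Rank by ascending distance: the most similar listings come first
ranked = toy.sort_values('euclid', kind='mergesort')
closest = ranked.head(3)
```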

In [7]:
dc['euclid'] = (dc['accommodates'] - my_acc).abs()
dc['euclid'].value_counts()

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: euclid, dtype: int64

- There are 461 places that can accommodate 3 people, just like my place
- I will now select only these entries

In [8]:
dc[dc['euclid'] == 0]['accommodates']

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64

### 1.3 Randomizing and sorting

- If I selected the first 5 living spaces with a Euclidean distance of 0, I would bias the result toward the ordering of the dataset
- Instead I will randomize the ordering of the dataset and then sort the dataset by the `euclid` column
    - This means that all of the similar places will still be on top but in random order across the first 461 rows
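
A minimal sketch of the shuffle-then-sort idea on made-up data is shown here. It uses `DataFrame.sample(frac=1, random_state=...)` as an alternative way to shuffle, and an explicitly stable sort (`kind='mergesort'`) so that rows with equal distance stay in their shuffled order; both choices are assumptions of this sketch rather than what the notebook cell does.

```python
import pandas as pd

# Hypothetical data: four listings at distance 0 and two at distance 1
toy = pd.DataFrame({
    'euclid': [0, 0, 0, 0, 1, 1],
    'price': [100.0, 120.0, 140.0, 160.0, 90.0, 200.0],
})

# Shuffle all rows reproducibly
shuffled = toy.sample(frac=1, random_state=1)

# Stable sort: ties on `euclid` keep the shuffled order
ranked = shuffled.sort_values('euclid', kind='mergesort')
```

The pandas default for `sort_values` is `quicksort`, which is not guaranteed to be stable, so passing a stable `kind` makes the tie order depend only on the shuffle.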

In [12]:
np.random.seed(1)
dc = dc.loc[np.random.permutation(len(dc))]
dc = dc.sort_values('euclid')
dc.iloc[0:10]['price']

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object

### 1.4 Calculating avg price
- Before I can select the 5 most similar living spaces and compute the average price, I need to clean the `price` column.
- Right now, the `price` column
    - contains comma characters (`,`) and dollar sign characters (`$`) and
    - is stored as text instead of as a numeric type.
- I need to remove these characters and convert the entire column to the float datatype.
- Then, I can calculate the average price.
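
The cleaning steps can be sketched on a small made-up Series. The values in `raw` are illustrative, not taken from the dataset. Passing `regex=False` treats `$` as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Hypothetical price strings in the same format as the `price` column
raw = pd.Series(['$160.00', '$1,250.00', '$94.00'])

cleaned = (
    raw.str.replace(',', '', regex=False)   # drop thousands separators
       .str.replace('$', '', regex=False)   # drop the currency symbol
       .astype(float)                       # convert text to numbers
)
```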

In [13]:
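# regex=False treats ',' and '$' as literal characters
# ('$' would otherwise be the regex end-of-string anchor)
remove_comma = dc['price'].str.replace(',', '', regex=False)
remove_dollar = remove_comma.str.replace('$', '', regex=False)
dc['price'] = remove_dollar.astype('float')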
avg_price = dc.iloc[0:5]['price'].mean()
avg_price

156.6
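
The steps above (distance, shuffle, sort, average) could be combined into a single helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `predict_price`, its parameters, and the `toy` data are all hypothetical, and it assumes the `price` column has already been converted to floats.

```python
import pandas as pd

def predict_price(listings, accommodates, k=5, random_state=1):
    """Hypothetical helper: mean price of the k listings closest in `accommodates`."""
    # Shuffle so that ties are not resolved by the original file order
    shuffled = listings.sample(frac=1, random_state=random_state)
    # Univariate Euclidean distance to the target listing
    distance = (shuffled['accommodates'] - accommodates).abs()
    # Stable sort, then keep the index labels of the k nearest rows
    nearest = distance.sort_values(kind='mergesort').index[:k]
    return shuffled.loc[nearest, 'price'].mean()

# Made-up listings with numeric prices
toy = pd.DataFrame({
    'accommodates': [3, 3, 2, 4, 8, 3],
    'price': [100.0, 120.0, 90.0, 150.0, 400.0, 110.0],
})
estimate = predict_price(toy, accommodates=3, k=3)
```

In this toy table exactly three listings accommodate 3 people, so with `k=3` the estimate is the mean of their prices regardless of the shuffle.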