## Introduction 

Hosts who want to rent out their living space face a pricing challenge:

- if we charge above the market price for our living space, potential renters will choose a more affordable alternative that is similar to ours


- if we set the rent too low, we miss out on potential revenue.


One strategy to solve this is to:

- find a few listings that are similar


- average the listed price for the ones most similar to ours


- set our listing price to this calculated average


I'm going to use a machine learning technique called **k-nearest neighbors**, which mirrors the strategy outlined above.
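Before touching the real data, the three steps above can be sketched in plain Python. This is only an illustration: the `toy_listings` values are made up, and the helper name `knn_price` and its `k` parameter are assumptions that do not appear in the notebook.

```python
# Toy data: (accommodates, nightly price). Values are made up for illustration.
toy_listings = [(1, 50.0), (2, 80.0), (3, 120.0), (3, 140.0), (4, 160.0), (6, 300.0)]


def knn_price(target_accommodates, listings, k=3):
    """Average the price of the k listings closest in capacity to the target."""
    # Rank listings by how far their capacity is from ours.
    # sorted() is stable, so listings with equal distance keep their input order.
    ranked = sorted(listings, key=lambda item: abs(item[0] - target_accommodates))
    nearest = ranked[:k]
    # The suggested price is the mean price of the k nearest listings.
    return sum(price for _, price in nearest) / len(nearest)


suggested = knn_price(3, toy_listings, k=3)
print(round(suggested, 2))
```

Here the three nearest listings are the two that accommodate 3 people and the one that accommodates 2, so the suggestion is the mean of their prices.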

### Getting familiar with the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
dc_listings = pd.read_csv('/storage/emulated/0/DataQuest/Datasets/dc_airbnb.csv')


In [3]:
dc_listings.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


We are going to calculate the Euclidean distance between our living space, which can accommodate **3** people, and every living space in the `dc_listings` dataframe, using the `accommodates` column as the only feature.
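With a single feature, the general Euclidean distance formula collapses to an absolute difference, which is why the next cell only needs a subtraction and an absolute value. The symbols $q$ and $p$ are notation introduced here for the two listings being compared; they are not names from the notebook.

$$
d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
$$

$$
n = 1 \;\Rightarrow\; d(q, p) = \sqrt{(q_1 - p_1)^2} = |q_1 - p_1|
$$

As a worked case, the first row of `dc_listings` accommodates 4 people and ours accommodates 3, so $d = |4 - 3| = 1$.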

In [4]:
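our_listing = 3  # number of people our living space accommodates
# With one feature, the Euclidean distance is the absolute difference
dc_listings['distance'] = (dc_listings['accommodates'] - our_listing).abs()
dc_listings['distance'].value_counts()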

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

From the distance counts above, we can see that there are **461** listings that can accommodate **3** people, like ours.

Now let's filter the dataframe down to the listings whose Euclidean distance is equal to **0**.

In [5]:
dc_listings[dc_listings.distance == 0]['accommodates']


26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64

We are going to randomize the order of the listings with `np.random.permutation()` so that the choice among listings with the same distance is random, and then sort the listings by the `distance` column.

In [6]:
np.random.seed(1)
dc_listings = dc_listings.loc[
    np.random.permutation(len(dc_listings))
]

In [7]:
dc_listings = dc_listings.sort_values('distance')
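The shuffle-then-sort step can be tried on a small frame. The sketch below uses a made-up `toy` dataframe and passes `kind='mergesort'` to `sort_values`, which is an assumption added here, not something the notebook does. Mergesort is a stable sort, so tied rows keep their shuffled order, whereas pandas' default `quicksort` does not promise to preserve the order of ties.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "distance": [0, 0, 0, 1, 1],
    "price": [100.0, 120.0, 140.0, 90.0, 200.0],
})

# Shuffle the rows with a fixed seed so the result is reproducible.
np.random.seed(1)
shuffled = toy.iloc[np.random.permutation(len(toy))]

# A stable sort keeps the shuffled order within each group of equal distances.
ranked = shuffled.sort_values("distance", kind="mergesort")
print(ranked)
```

The rows with distance 0 still come first, but which of them lands in the top positions now depends on the shuffle rather than on the original file order.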

In [8]:
dc_listings.iloc[0:10]['price']

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object

The output above shows the prices of the first **10** listings in the sorted dataframe, all of which accommodate **3** people like ours.

Before we can average the prices of the five most similar listings, we have to clean the `price` column by removing the commas and dollar signs, and then convert it to the `float` dtype.

In [9]:
dc_listings['price'] = dc_listings['price'].apply(lambda x: x.replace(',', ''))
dc_listings['price'] = dc_listings['price'].apply(lambda x: x.replace('$', ''))
dc_listings['price'] = dc_listings['price'].astype('float')
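The same cleaning can also be written with pandas' vectorized string methods. This is a minimal sketch on a made-up `raw` series, not the notebook's own data; it assumes every value is a string with no missing entries.

```python
import pandas as pd

# Hypothetical sample of raw price strings, for illustration only.
raw = pd.Series(["$160.00", "$1,250.00", "$50.00"])

# Remove dollar signs and commas in one pass, then convert to float.
# Inside a character class, "$" is a literal character, not an end-of-line anchor.
cleaned = raw.str.replace(r"[$,]", "", regex=True).astype(float)
print(cleaned.tolist())
```

Removing both characters with a single character class avoids the two separate `apply` passes.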

In [10]:
mean_price = dc_listings['price'].iloc[0:5].mean()

In [11]:
mean_price

156.6

Based on the average price of the five selected listings that accommodate 3 people, we should charge **156.6** dollars per night for a guest to stay at our living space.

We can go further and write a general function that suggests a price for other values of the `accommodates` column.


In [26]:
listings = pd.read_csv('/storage/emulated/0/DataQuest/Datasets/dc_airbnb.csv')


In [52]:
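def predict_price(new_listing):
    # Work on a copy so the global `listings` dataframe is left unchanged
    temp_df = listings.copy()
    # Distance between each listing and the requested number of guests
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))

    # Shuffle the rows, then sort so the closest listings come first
    np.random.seed(1)
    temp_df = temp_df.iloc[np.random.permutation(len(temp_df))]
    sorted_values = temp_df.sort_values('distance')

    # Strip commas and dollar signs, then convert the prices to float
    sorted_values['price'] = sorted_values['price'].apply(lambda x: x.replace(',', ''))
    sorted_values['price'] = sorted_values['price'].apply(lambda x: x.replace('$', ''))
    sorted_values['price'] = sorted_values['price'].astype('float')

    # Average the price of the five nearest listings
    mean_price = sorted_values['price'].iloc[0:5].mean()

    return mean_price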


In [57]:
predict_price(10)

404.6

We can now use the function above to suggest a price based on how many people a listing can accommodate.
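One possible refinement is to pass the dataframe and the number of neighbors in as arguments instead of relying on a global dataframe and a fixed five. The sketch below is a hypothetical variant: the name `predict_price_k`, the `k` and `seed` parameters, and the `toy` data are all assumptions made for illustration, and it assumes the `price` column holds strings such as `$160.00`.

```python
import numpy as np
import pandas as pd


def predict_price_k(df, accommodates, k=5, seed=1):
    """Hypothetical variant of predict_price: averages the k nearest prices."""
    # Copy only the needed columns so the caller's dataframe is not modified.
    work = df[["accommodates", "price"]].copy()
    # Clean the price strings and convert them to float.
    work["price"] = work["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Single-feature Euclidean distance is the absolute difference.
    work["distance"] = (work["accommodates"] - accommodates).abs()
    # Shuffle with a local random state, then sort stably so ties stay shuffled.
    rng = np.random.RandomState(seed)
    work = work.iloc[rng.permutation(len(work))]
    nearest = work.sort_values("distance", kind="mergesort").head(k)
    return nearest["price"].mean()


# Made-up listings, for illustration only.
toy = pd.DataFrame({
    "accommodates": [2, 3, 3, 4, 8],
    "price": ["$80.00", "$120.00", "$140.00", "$160.00", "$1,000.00"],
})
estimate = predict_price_k(toy, accommodates=3, k=2)
print(estimate)
```

With `k=2` only the two toy listings that accommodate exactly 3 people are averaged; a larger `k` pulls in listings that are further away in capacity.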