# 1. Problem Definition

The AirBnB problem:
As a host, if we charge above market price for a living space we'd like to rent, renters will choose more affordable alternatives that are similar to ours. If we set our nightly rent price too low, we'll miss out on potential revenue.

One strategy we could use is to:

- find a few listings that are similar to ours,
- average the listed price for the ones most similar to ours,
- set our listing price to this calculated average price.

The process of discovering patterns in existing data to make a prediction is called machine learning. In our case, we want to use data on local listings to predict the optimal price for us to set. In this mission, we'll explore a specific machine learning technique called k-nearest neighbors, which mirrors the strategy we just described. Before we dive further into machine learning and k-nearest neighbors, let's get familiar with the dataset we'll be working with.

# 2. Intro to the Data
While AirBnB doesn't release any data on the listings in their marketplace, a separate group named Inside AirBnB has extracted data on a sample of the listings for many of the major cities on the website. In this post, we'll be working with their dataset from October 3, 2015 on the listings from Washington, D.C., the capital of the United States. Here's a direct link to that dataset. Each row in the dataset is a specific listing that's available for renting on AirBnB in the Washington, D.C. area.

To make the dataset less cumbersome to work with, we've removed many of the columns in the original dataset and renamed the file to dc_airbnb.csv. Here are the columns we kept:

- host_response_rate: the response rate of the host
- host_acceptance_rate: percentage of requests to the host that convert to rentals
- host_listings_count: number of other listings the host has
- latitude: latitude dimension of the geographic coordinates
- longitude: longitude part of the coordinates
- city: the city the living space resides in
- zipcode: the zip code the living space resides in
- state: the state the living space resides in
- accommodates: the number of guests the rental can accommodate
- room_type: the type of living space (Private room, Shared room or Entire home/apt)
- bedrooms: number of bedrooms included in the rental
- bathrooms: number of bathrooms included in the rental
- beds: number of beds included in the rental
- price: nightly price for the rental
- cleaning_fee: additional fee used for cleaning the living space after the guest leaves
- security_deposit: refundable security deposit, in case of damages
- minimum_nights: minimum number of nights a guest can stay for the rental
- maximum_nights: maximum number of nights a guest can stay for the rental
- number_of_reviews: number of reviews that previous guests have left

Let's read the dataset into Pandas and become more familiar with it.

In [63]:
import pandas as pd
dc_listings = pd.read_csv('dc_airbnb.csv')
print(dc_listings.columns, '\n','\n', dc_listings.shape, '\n','\n', dc_listings.head())

Index(['host_response_rate', 'host_acceptance_rate', 'host_listings_count',
       'accommodates', 'room_type', 'bedrooms', 'bathrooms', 'beds', 'price',
       'cleaning_fee', 'security_deposit', 'minimum_nights', 'maximum_nights',
       'number_of_reviews', 'latitude', 'longitude', 'city', 'zipcode',
       'state'],
      dtype='object') 
 
 (3723, 19) 
 
   host_response_rate host_acceptance_rate  host_listings_count  accommodates  \
0                92%                  91%                   26             4   
1                90%                 100%                    1             6   
2                90%                 100%                    2             1   
3               100%                  NaN                    1             2   
4                92%                  67%                    1             4   

         room_type  bedrooms  bathrooms  beds     price cleaning_fee  \
0  Entire home/apt       1.0        1.0   2.0  $160.00      $115.00    
1  Entire ho

# 3. K-Nearest Neighbors

Here's the strategy we wanted to use:

- Find a few similar listings.
- Calculate the average nightly rental price of these listings.
- Set the average price as the price for our listing.

The k-nearest neighbors algorithm is similar to this strategy. Here's an overview:
![title](knn_infographic.png)

There are 2 things we need to unpack in more detail:
- the similarity metric
- how to choose the k value

In this mission, we'll define what similarity metric we're going to use. Then, we'll implement the k-nearest neighbors algorithm and use it to suggest a price for a new, unpriced listing. We'll use a k value of 5 in this mission. In later missions, we'll learn how to evaluate how good the suggested prices are, how to choose the optimal k value, and more.
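As a rough illustration of the overall idea, here is a minimal sketch in plain Python. The function name `knn_price` and the toy `(accommodates, price)` pairs are made up for this example and are not taken from the dataset; the distance used is simply the absolute difference in the number of guests.

```python
def knn_price(listings, target_accommodates, k):
    """Average the prices of the k listings closest to the target."""
    # Rank listings by how far their 'accommodates' value is from ours.
    ranked = sorted(listings, key=lambda item: abs(item[0] - target_accommodates))
    nearest = ranked[:k]
    # The suggested price is the mean price of the k nearest listings.
    return sum(price for _, price in nearest) / len(nearest)

# Hypothetical (accommodates, nightly price) pairs, NOT real dataset rows.
toy_listings = [(2, 100.0), (3, 120.0), (3, 140.0), (4, 150.0), (8, 400.0)]
suggested = knn_price(toy_listings, target_accommodates=3, k=3)
print(suggested)  # 120.0
```

With k set to 3, the two listings that accommodate exactly 3 people and one listing at distance 1 are averaged, so the distant, expensive listing has no influence on the suggestion.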

# 4. Euclidean Distance

The similarity metric works by comparing a fixed set of numerical features, another word for attributes, between 2 observations, or living spaces in our case. When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance:

$$d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}$$

where q1 to qn represent the feature values for one observation and p1 to pn represent the feature values for the other observation. Here's a diagram that breaks down the Euclidean distance between the first 2 observations in the dataset using only the host_listings_count, accommodates, bedrooms, bathrooms, and beds columns:

![Title](euclidean_distance_five_features.png)
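The formula translates almost directly into NumPy. The sketch below assumes two hypothetical feature vectors in the order host_listings_count, accommodates, bedrooms, bathrooms, beds; the values are invented for illustration and are not the actual rows from the dataset.

```python
import numpy as np

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Square the per-feature differences, sum them, then take the square root.
    return np.sqrt(np.sum((q - p) ** 2))

# Hypothetical values: [host_listings_count, accommodates, bedrooms, bathrooms, beds]
listing_a = [1, 4, 2, 1, 2]
listing_b = [4, 8, 2, 1, 2]
d = euclidean_distance(listing_a, listing_b)
print(d)  # 5.0
```

Only the first two features differ here (by 3 and 4), so the distance is the square root of 9 + 16.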

In this mission, we'll use just one feature to keep things simple as you become familiar with the machine learning workflow. Since we're only using one feature, this is known as the univariate case. The square root and the squared power cancel and the formula simplifies to:

$$d = | q_1 - p_1 |$$
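The intermediate step, written out, relies on the fact that the square root of a squared real number is its absolute value:

$$d = \sqrt{(q_1 - p_1)^2} = | q_1 - p_1 |$$

The absolute value matters because the difference can be negative, while a distance never is.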

The living space that we want to rent can accommodate 3 people. Let's first calculate the distance, using just the accommodates feature, between the first living space in the dataset and our own.

In [64]:
import numpy as np
our_acc_value = 3
first_living_space_value = dc_listings.iloc[0]['accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

1


# 5. Calculate Distance for All Observations
The Euclidean distance between the first row in the dc_listings DataFrame and our own living space is 1. How do we know if this is high or low? Looking at the Euclidean distance equation, the lowest possible value is 0. This happens when the value of the feature is exactly the same for both observations being compared: if p1 = q1, then d = |q1 − p1| = 0. The closer the distance is to 0, the more similar the living spaces are.

Let's try calculating the Euclidean distance between each living space in the dataset and a living space that accommodates 3 people (the number of people our listing accommodates):

In [65]:
# One way: a list comprehension over the column
# dc_listings['distance'] = [abs(x - 3) for x in dc_listings.accommodates]

# Another way: apply a function to every value in the column
new_listing = 3
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x - new_listing))
print(dc_listings['distance'].value_counts())

    

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64
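A third option is to let pandas do the arithmetic on the whole column at once instead of calling a function per value. The sketch below uses a small made-up DataFrame named `toy` (not the real dataset) and checks that the vectorized result matches the `apply` version.

```python
import pandas as pd

# Hypothetical 'accommodates' values, not the real dataset.
toy = pd.DataFrame({"accommodates": [4, 6, 1, 2, 3]})
new_listing = 3

# Vectorized: subtract from the whole column, then take the absolute value.
toy["distance"] = (toy["accommodates"] - new_listing).abs()

# Same computation, one value at a time, for comparison.
via_apply = toy["accommodates"].apply(lambda x: abs(x - new_listing))

print(toy["distance"].tolist())  # [1, 3, 2, 1, 0]
```

Both approaches give the same distances; the vectorized form is usually shorter and faster on large columns.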


# 6. Randomizing and Sorting
It looks like there are quite a few living spaces, 461 to be precise, that can accommodate 3 people just like ours. This means the 5 "nearest neighbors" we select after sorting will all have a distance value of 0. If we sort by the distance column and then just select the first 5 living spaces, we would be biasing the result to the ordering of the dataset.

Let's instead randomize the ordering of the dataset and then sort the DataFrame by the distance column. This way, all of the living spaces that accommodate the same number of people as ours will still be at the top of the DataFrame but will be in random order across the first 461 rows. We've already done the first step of setting the random seed, so we can perform answer checking on our end.

In [66]:
import numpy as np
np.random.seed(1)

rand_order = np.random.permutation(len(dc_listings))
dc_listings = dc_listings.loc[rand_order]
dc_listings = dc_listings.sort_values('distance')
print(dc_listings.iloc[0:10]['price'])

577     $185.00 
2166    $180.00 
3631    $175.00 
71      $128.00 
1011    $115.00 
380     $219.00 
943     $125.00 
3107    $250.00 
1499     $94.00 
625     $150.00 
Name: price, dtype: object
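To see the shuffle-then-sort idea in isolation, here is a small sketch on made-up data in which four rows tie at distance 0. It passes `kind="mergesort"` to `sort_values`, a choice made for this example only: mergesort is a stable sort, so tied rows are guaranteed to keep their shuffled order, whereas the default quicksort makes no such guarantee.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four listings tie at distance 0.
toy = pd.DataFrame({
    "distance": [0, 0, 0, 0, 1, 2],
    "price": [100.0, 110.0, 120.0, 130.0, 200.0, 300.0],
})

# Shuffle the rows with a seeded generator so the run is repeatable.
rng = np.random.RandomState(1)
shuffled = toy.iloc[rng.permutation(len(toy))]

# A stable sort keeps the shuffled order among rows with equal distance.
ranked = shuffled.sort_values("distance", kind="mergesort")
print(ranked)
```

The tied rows still come first, but which of them land in the top positions now depends on the shuffle rather than on the original file order.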


# 7. Average Price

Before we can select the 5 most similar living spaces and compute the average price, we need to clean the price column. Right now, the price column contains comma characters (,) and dollar sign characters ($) and is stored as text instead of as numbers. We need to remove these characters and convert the entire column to the float datatype. Then, we can calculate the average price.

In [67]:
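# regex=False treats ',' and '$' as literal characters ('$' is a regex anchor).
no_commas = dc_listings['price'].str.replace(',', '', regex=False)
no_dollar_sign = no_commas.str.replace('$', '', regex=False).astype('float')
dc_listings['price'] = no_dollar_sign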

In [72]:
# Use .iloc so the first 5 rows are selected by position, not by index label.
mean_price = dc_listings['price'].iloc[0:5].mean()
mean_price

156.6
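The cleaning step can be checked on a few made-up price strings. The sketch below assumes prices formatted like the ones in the dataset; the `raw` values themselves are hypothetical. Passing `regex=False` makes pandas treat `$` as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Hypothetical price strings in the same format as the dataset.
raw = pd.Series(["$160.00", "$1,250.00", "$94.00"])

cleaned = (
    raw.str.replace(",", "", regex=False)   # drop thousands separators
       .str.replace("$", "", regex=False)   # drop the literal dollar sign
       .astype(float)                       # text -> numeric
)
print(cleaned.tolist())  # [160.0, 1250.0, 94.0]
```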

# 8. Function to Make Predictions
Congrats! You've just made your first prediction! Based on the average price of other listings that accommodate 3 people, we should charge 156.6 dollars per night for a guest to stay at our living space. In the next mission, we'll dive into evaluating how good this prediction is.

Let's write a more general function that can suggest the optimal price for other values of the accommodates column. The dc_listings Dataframe has information specific to our living space, e.g. the distance column. To save you time, we've reset the dc_listings Dataframe to a clean slate and only kept the data cleaning and randomization we did since those weren't unique to the prediction we were making for our living space.

In [85]:
# Brought along the changes we made to the `dc_listings` Dataframe.
import numpy as np
np.random.seed(1)
dc_listings = pd.read_csv('dc_airbnb.csv')
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing, k):
    # Work on a copy so the global dc_listings DataFrame is not modified.
    temp_df = dc_listings.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # Average the prices of the k listings with the smallest distance.
    nearest_neighbors = temp_df.iloc[0:k]['price']
    predicted_price = nearest_neighbors.mean()
    return predicted_price

acc_one = predict_price(1,5)
acc_two = predict_price(2,5)
acc_four = predict_price(4,5)
print(acc_one)
print(acc_two)
print(acc_four)

68.0
112.8
124.8
