We'll explore a specific machine learning technique called k-nearest neighbors. Before we dive further into machine learning and k-nearest neighbors, let's get familiar with the dataset we'll be working with.

In [1]:
import pandas as pd
dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings 

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3718,100%,60%,1,4,Entire home/apt,1.0,1.0,2.0,$135.00,$45.00,$400.00,3,60,19,38.885492,-76.987765,Washington,20003,DC
3719,100%,50%,1,2,Private room,1.0,2.0,1.0,$79.00,,,3,365,36,38.889401,-76.986646,Washington,20003,DC
3720,100%,100%,2,6,Entire home/apt,2.0,1.0,3.0,$275.00,$100.00,$500.00,2,2147483647,12,38.889533,-77.001010,Washington,20003,DC
3721,88%,100%,1,2,Entire home/apt,1.0,1.0,1.0,$179.00,$25.00,,2,21,48,38.890815,-77.002283,Washington,20002,DC


### K-nearest neighbors 

Here's the strategy we'll use to suggest a price for our listing:

1. Find a few similar listings.
2. Calculate the average nightly rental price of these listings.
3. Set the average price as the price for our listing.
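
The three steps above can be sketched on a tiny, made-up table. The data, the variable names, and the choice of `k = 3` are assumptions for illustration only, and "similar" is taken here to mean the closest `accommodates` value:

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from dc_airbnb.csv).
listings = pd.DataFrame({
    "accommodates": [1, 3, 3, 4, 8],
    "price": [60.0, 100.0, 120.0, 140.0, 300.0],
})

our_accommodates = 3  # the listing we want to price
k = 3                 # how many similar listings to average

# Step 1: find the k most similar listings (smallest absolute difference).
distance = (listings["accommodates"] - our_accommodates).abs()
nearest = listings.loc[distance.nsmallest(k).index]

# Step 2: average their nightly prices.
average_price = nearest["price"].mean()

# Step 3: use the average as the suggested price.
suggested_price = average_price
print(suggested_price)  # 120.0
```

The three nearest rows here are the two listings that accommodate 3 people and the one that accommodates 4, so the suggestion is the mean of 100, 120, and 140.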

### Euclidean distance

A similarity metric works by comparing a fixed set of numerical features (another word for attributes) between 2 observations, which in our case are living spaces. When trying to predict a continuous value, like price, the main similarity metric that's used is Euclidean distance. Here's the general formula for Euclidean distance:

d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}

where q_1 to q_n represent the feature values for one observation and p_1 to p_n represent the feature values for the other observation.
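
As a rough illustration of the formula, here is a minimal NumPy sketch. The function name `euclidean_distance` and the sample vectors are assumptions made for this example and are not part of the dataset:

```python
import numpy as np

def euclidean_distance(q, p):
    """Return the Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if q.shape != p.shape:
        raise ValueError("q and p must have the same number of features")
    # Square each feature difference, sum the squares, then take the square root.
    return float(np.sqrt(np.sum((q - p) ** 2)))

# The differences are 3, 4, and 0, so the distance is sqrt(9 + 16 + 0).
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```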

Here's a breakdown of the Euclidean distance between the first 2 observations in the dataset, using only the `host_listings_count`, `accommodates`, `bedrooms`, `bathrooms`, and `beds` columns:

In [4]:
a = dc_listings[['host_listings_count','accommodates','bedrooms','bathrooms','beds']].head(2)
a

Unnamed: 0,host_listings_count,accommodates,bedrooms,bathrooms,beds
0,26,4,1.0,1.0,2.0
1,1,6,3.0,3.0,3.0


In [5]:
import math
# Sum the squared differences across all five columns, then take the square root.
diff = a.iloc[0].to_numpy() - a.iloc[1].to_numpy()
math.sqrt((diff ** 2).sum())

25.25866188063018

To start, we'll use just one feature to keep things simple as you become familiar with the machine learning workflow. Because we're using only one feature, this is known as the univariate case. Here's what the formula looks like for the univariate case:




d = \sqrt{(q_1 - p_1)^2}

The square root and the squared power cancel and the formula simplifies to:

d = | q_1 - p_1 |
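
A quick numerical check of this simplification, as a sketch: the helper name `univariate_distance` and the sample values are assumptions for illustration. The result is an absolute value because a square root is never negative, so the order of the two observations does not matter.

```python
import math

def univariate_distance(q1, p1):
    """Univariate Euclidean distance: the absolute difference of the two values."""
    return abs(q1 - p1)

# Compare the full formula with the simplified one for a few value pairs.
for q1, p1 in [(3, 4), (4, 3), (3, 3), (16, 3)]:
    full_formula = math.sqrt((q1 - p1) ** 2)
    assert full_formula == univariate_distance(q1, p1)
```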

### Instructions
1. Calculate the Euclidean distance between our living space, which can accommodate 3 people, and the first living space in the dc_listings DataFrame.
2. Assign the result to first_distance and display the value using the print function.

In [8]:
import numpy as np
our_acc_value = 3
first_living_space_value = dc_listings.iloc[0]['accommodates']
first_distance = np.abs(first_living_space_value - our_acc_value)
print(first_distance)

1


### Calculate distance for all observations

The Euclidean distance between the first row in the dc_listings DataFrame and our own living space is 1. How do we know if this is high or low? If you look at the Euclidean distance equation itself, the lowest value you can achieve is 0. This happens when the value for the feature is exactly the same for both observations you're comparing. If `q_1 = p_1`, then `d = | q_1 - p_1 |` evaluates to `0`. The closer the distance is to 0, the more similar the living spaces are.

To see how our living space compares with the rest of the dataset, we need to calculate the Euclidean distance between it and every living space in dc_listings.

### Instructions

1. Calculate the distance between each value in the accommodates column from dc_listings and the value 3, which is the number of people our listing accommodates:
   - Use the apply method to calculate the absolute value of the difference between each value in accommodates and 3, and return a new Series containing the distance values.
   - Assign the distance values to the distance column.
2. Use the Series method value_counts and the print function to display the unique value counts for the distance column.

In [9]:
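import numpy as np
# Vectorized version: subtract 3 from every value, then take the absolute value.
our_acc_value = 3
dc_listings['distance'] = np.abs(dc_listings['accommodates'] - our_acc_value)
dc_listings['distance'].value_counts()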

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64

In [10]:
new_listing = 3
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x - new_listing))
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64
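
The two cells above produce the same counts because `apply` with a lambda and plain Series arithmetic compute the same distances. A small sketch on invented values (the Series below is an assumption, not data from the file) makes the equivalence explicit:

```python
import numpy as np
import pandas as pd

accommodates = pd.Series([4, 6, 1, 2, 3], name="accommodates")  # toy values
new_listing = 3

# Element-by-element version, as in the instructions.
via_apply = accommodates.apply(lambda x: np.abs(x - new_listing))

# Vectorized version: operate on the whole Series at once.
via_vectorized = (accommodates - new_listing).abs()

print(via_vectorized.tolist())  # [1, 3, 2, 1, 0]
```

The vectorized form is usually the more idiomatic choice in pandas, since it avoids calling a Python function once per row.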


### Randomizing, and sorting

It looks like there are quite a few living spaces, 461 to be precise, that can accommodate 3 people just like ours. This means the 5 "nearest neighbors" we select after sorting will all have a distance value of 0. If we sort by the distance column and then just select the first 5 living spaces, we would be biasing the result toward the existing ordering of the dataset.

In [11]:
print(dc_listings[dc_listings["distance"] == 0]["accommodates"])

26      3
34      3
36      3
40      3
44      3
       ..
3675    3
3697    3
3707    3
3714    3
3722    3
Name: accommodates, Length: 461, dtype: int64


Let's instead randomize the ordering of the dataset and then sort the DataFrame by the distance column. This way, all of the living spaces that accommodate 3 people will still be at the top of the DataFrame but will be in random order across the first 461 rows. Setting the random seed first makes the shuffle reproducible.

In [12]:
np.random.seed(1)

1. Randomize the order of the rows in dc_listings:
   - Use the np.random.permutation() function to return a NumPy array of shuffled index values.
   - Use the DataFrame method loc[] to return a new DataFrame containing the shuffled order.
   - Assign the new DataFrame back to dc_listings.
2. After randomization, sort dc_listings by the distance column and assign back to dc_listings.
3. Display the first 10 values in the price column using the print function.

##### np.random.permutation(len(dc_listings)) returns a NumPy array of shuffled index values for dc_listings.

In [21]:
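import numpy as np
np.random.seed(1)
# Shuffle the rows, then sort by distance so tied listings appear in random order.
shuffled_index = np.random.permutation(len(dc_listings))
dc_listings = dc_listings.loc[shuffled_index]
dc_listings = dc_listings.sort_values(by='distance')
dc_listings['price'].head(10)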

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
625     $150.00
Name: price, dtype: object
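
As an alternative sketch of the same shuffle-then-sort idea, pandas can shuffle rows directly with `DataFrame.sample`. The toy table below is invented for illustration, and the use of a stable sort is an assumption added here so that tied rows keep their shuffled order:

```python
import pandas as pd

# Toy data: three rows tie at distance 0.
toy = pd.DataFrame({
    "distance": [0, 0, 0, 1, 2],
    "price": [100.0, 110.0, 120.0, 90.0, 200.0],
})

# sample(frac=1) returns every row in random order; random_state makes it repeatable.
shuffled = toy.sample(frac=1, random_state=1)

# A stable sort preserves the shuffled order among rows with equal distance.
ordered = shuffled.sort_values("distance", kind="stable")
print(ordered)
```

The three distance-0 rows still come first, but which of them lands in the top positions now depends on the shuffle rather than on the original file order.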

In [22]:
dc_listings 

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state,distance
577,98%,52%,49,3,Private room,1.0,1.0,2.0,$185.00,,,2,14,1,38.908356,-77.028146,Washington,20005,DC,0
2166,100%,89%,2,3,Entire home/apt,1.0,1.0,1.0,$180.00,,$100.00,1,14,10,38.905808,-77.000012,Washington,20002,DC,0
3631,98%,52%,49,3,Entire home/apt,1.0,1.0,2.0,$175.00,,,3,14,1,38.889065,-76.993576,Washington,20003,DC,0
71,100%,94%,1,3,Entire home/apt,1.0,1.0,1.0,$128.00,$40.00,,1,1125,9,38.879960,-77.006491,Washington,20003,DC,0
1011,,,1,3,Entire home/apt,0.0,1.0,1.0,$115.00,,,1,1125,0,38.907382,-77.035075,Washington,20005,DC,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1596,100%,95%,3,16,Entire home/apt,5.0,3.5,5.0,$299.00,$155.00,,3,365,8,38.944327,-77.015149,"Washington, D.C.",20011,DC,13
1818,50%,100%,1,16,Shared room,1.0,0.5,16.0,$10.00,$5.00,,1,2,0,38.960445,-77.008756,Washington,20011,DC,13
1402,92%,94%,30,16,Entire home/apt,8.0,6.0,13.0,"$1,200.00",$300.00,,3,1125,10,38.914757,-77.033483,Washington,20009,DC,13
763,100%,,1,16,Private room,1.0,1.0,1.0,"$1,000.00",,,1,1125,0,38.901322,-76.986356,Washington,20002,DC,13


### Average price

Before we can select the 5 most similar living spaces and compute the average price, we need to clean the price column. Right now, the price column contains comma characters (,) and dollar sign characters ($) and is stored as text instead of as numbers. We need to remove these characters and convert the entire column to the float datatype. Then, we can calculate the average price.

#### Instructions
1. Remove the commas (,) and dollar sign characters ($) from the price column:
   - Use the str accessor so we can apply string methods to each value in the column, followed by the string method replace to replace all comma characters with the empty string: stripped_commas = dc_listings['price'].str.replace(',', '')
   - Repeat to remove the dollar sign characters as well.
2. Convert the new Series object containing the cleaned values to the float datatype and assign back to the price column in dc_listings.
3. Calculate the mean of the first 5 values in the price column and assign to mean_price.
4. Display mean_price, for example with the print function.

In [23]:
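# Strip commas and dollar signs as literal characters (regex=False keeps '$'
# from being read as a regular-expression anchor), then convert to floats.
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype(float)
# Average the prices of the 5 nearest neighbors.
mean_price = dc_listings['price'].head().mean()
mean_price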

156.6
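
The two replacements can also be combined into one step. This is a sketch of an alternative, not the approach used in the cell above: it assumes a regular-expression character class, and the sample strings are made up in the same format as the price column.

```python
import pandas as pd

# Sample strings in the same format as the price column.
prices = pd.Series(["$160.00", "$1,200.00", "$95.00"])

# The character class [$,] matches either a dollar sign or a comma.
cleaned = prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(cleaned.tolist())  # [160.0, 1200.0, 95.0]
```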

### Function to make predictions

### Instructions
Write a function named predict_price that uses the k-nearest neighbors machine learning technique to calculate the suggested price for any value of accommodates. This function should:
1. Take in a single parameter, new_listing, that describes the number of people the listing accommodates.
2. Assign a copy of dc_listings to a new DataFrame named temp_df. Using the pandas.DataFrame.copy() method gives temp_df its own copy of the underlying data, instead of just a reference to dc_listings.
3. Calculate the distance between each value in the accommodates column and the new_listing value that was passed in. Assign the resulting Series object to the distance column in temp_df.
4. Sort temp_df by the distance column and select the first 5 values in the price column. Don't randomize the ordering of temp_df.
5. Calculate the mean of these 5 values and use that as the return value for the entire predict_price function.
6. Use the predict_price function to suggest a price for a living space that:
   accommodates 1 person, assign the suggested price to acc_one.
   accommodates 2 people, assign the suggested price to acc_two.
   accommodates 4 people, assign the suggested price to acc_four.

In [24]:
# Bring along the changes we made to the `dc_listings` DataFrame.
dc_listings = pd.read_csv('dc_airbnb.csv')
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
# The rows are not shuffled here, so predict_price sees dc_listings in file order.

def predict_price(new_listing):
    temp_df = dc_listings.copy()
    # Distance between each listing's accommodates value and the new listing.
    temp_df['distance'] = np.abs(temp_df['accommodates'] - new_listing)
    # Average the price of the 5 listings with the smallest distance.
    nearest = temp_df.sort_values(by='distance')
    return nearest['price'].head().mean()

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)

In [26]:
acc_two

95.8

In [27]:
acc_four

161.0

In [25]:
acc_one

72.0
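
The function above always averages 5 neighbors and reads from the global dc_listings. A hedged generalization might take the table and the number of neighbors as arguments. The name `predict_price_k`, its parameters, and the toy data below are all assumptions made for this sketch:

```python
import pandas as pd

def predict_price_k(listings, new_listing, k=5):
    """Average the price of the k listings closest in accommodates value."""
    temp_df = listings.copy()
    temp_df["distance"] = (temp_df["accommodates"] - new_listing).abs()
    # A stable sort keeps tied rows in their existing order.
    nearest = temp_df.sort_values("distance", kind="stable").head(k)
    return nearest["price"].mean()

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 6],
    "price": [50.0, 80.0, 100.0, 150.0, 300.0],
})

# The two listings that accommodate 2 people are nearest: mean of 80.0 and 100.0.
print(predict_price_k(toy, new_listing=2, k=2))  # 90.0
```

Passing the DataFrame in explicitly makes the function easier to test on small inputs, and exposing `k` makes it possible to compare predictions for different neighborhood sizes.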

### Randomizing the order of a DataFrame:
1. import numpy as np
2. np.random.seed(1)
3. np.random.permutation(len(dc_listings))

### Returning a new DataFrame containing the shuffled order:
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

### Applying string methods to replace a comma with an empty character:
stripped_commas = dc_listings['price'].str.replace(',', '')

### Converting a Series object to a float datatype:
dc_listings['price'] = dc_listings['price'].astype('float')

### Concepts
Machine learning is the process of discovering patterns in existing data to make a prediction.

In machine learning, a feature is an individual measurable characteristic.

When predicting a continuous value, the main similarity metric that's used is Euclidean distance.

K-nearest neighbors uses Euclidean distance to find the most similar observations and averages their values to predict an unseen value.

Let q_1 to q_n represent the feature values for one observation, and p_1 to p_n represent the feature values for the other observation. Then the formula for Euclidean distance is as follows:

d = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}
