# Part 1 - Introduction to K-Nearest Neighbors

* Machine learning: the process of discovering patterns in existing data
* Summary of K-nearest neighbors: for a new element, find the k most similar elements in the data based on some similarity measure
* Application to the problem of selecting AirBnB prices:
    * Find a few similar listings.
    * Calculate the average nightly rental price of these listings.
    * Set the average price as the price for our listing.
* Measure of similarity: Euclidean distance

d = sqrt((q1-p1)^2 + (q2-p2)^2 + ... + (qn-pn)^2)
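
As a rough illustration of the formula above, the distance can be computed with NumPy. The function name `euclidean_distance` and the sample vectors are made up for this sketch and do not come from the dataset.

```python
import numpy as np

def euclidean_distance(q, p):
    """Return the Euclidean distance between two equal-length vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Square the per-feature differences, sum them, then take the square root.
    return float(np.sqrt(np.sum((q - p) ** 2)))

# Hypothetical listings described by two numeric features each.
print(euclidean_distance([3, 1], [6, 5]))  # 5.0
```

With a single feature, the same function returns the absolute difference between the two values.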


In [2]:
import numpy as np
import pandas as pd
dc_listings = pd.read_csv('dc_airbnb.csv')

* Univariate case - the use of one feature/variable

d = | q1 - p1 |

# 5/10 Calculate a distance column

In [3]:
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x:abs(x-3))
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


# 6/10 Introduction to K-Nearest Neighbors


* Randomize the order of the rows in dc_listings:
    * Use the np.random.permutation() function to return a NumPy array of shuffled index values.
    * Use the Dataframe method loc[] to return a new Dataframe containing the shuffled order.
    * Assign the new Dataframe back to dc_listings.
* After randomization, sort dc_listings by the distance column and assign back to dc_listings.
* Display the first 10 values in the price column using the print function.
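
The steps above can be sketched on a small made-up DataFrame. The `toy` frame and its values are hypothetical, and `kind='mergesort'` is an assumption added here (the exercise does not specify a sort algorithm) so that rows with tied distances keep their shuffled order.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four listings, three of them tied at distance 1.
toy = pd.DataFrame({'distance': [1, 0, 1, 1],
                    'price': [100.0, 80.0, 120.0, 90.0]})

np.random.seed(1)
# Shuffle the positional index values, then reorder the rows with loc[].
shuffle_order = np.random.permutation(len(toy))
toy = toy.loc[shuffle_order]

# mergesort is a stable sort, so tied rows stay in their shuffled order.
toy = toy.sort_values(by='distance', kind='mergesort')
print(toy['price'].iloc[0:3])
```

Shuffling first means that which of the tied listings end up among the nearest neighbors is decided at random rather than by their position in the file.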


In [4]:
np.random.seed(1)
shuffle_order = np.random.permutation(len(dc_listings))
dc_listings = dc_listings.loc[shuffle_order]
dc_listings = dc_listings.sort_values(by='distance')
# Note: [0:9] selects 9 rows; use dc_listings.iloc[0:10] to get the first 10 values.
print(dc_listings[0:9]['price'])

577     $185.00
2166    $180.00
3631    $175.00
71      $128.00
1011    $115.00
380     $219.00
943     $125.00
3107    $250.00
1499     $94.00
Name: price, dtype: object


# 7/10 Average Price

* Remove the commas (,) and dollar sign characters ($) from the price column:
    * Use the str accessor so we can apply string methods to each value in the column followed by the string method replace to replace all comma characters with the empty character: stripped_commas = dc_listings['price'].str.replace(',', '')
    * Repeat to remove the dollar sign characters as well.
* Convert the new Series object containing the cleaned values to the float datatype and assign back to the price column in dc_listings.
* Calculate the mean of the first 5 values in the price column and assign to mean_price.
* Use the print function or the variable inspector below to display mean_price.
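
A minimal sketch of the cleaning steps on a hypothetical Series (the values in `raw` are made up). Passing `regex=False` is an assumption added here so that `$` is always treated as a literal character; it requires a pandas version that supports that parameter.

```python
import pandas as pd

# Hypothetical raw prices in the same '$1,234.00' style as the dataset.
raw = pd.Series(['$185.00', '$1,250.00', '$94.00'])

# regex=False treats ',' and '$' as literal characters, not regex patterns.
cleaned = raw.str.replace(',', '', regex=False).str.replace('$', '', regex=False)
prices = cleaned.astype(float)

print(prices)
```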


In [5]:
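# regex=False makes ',' and '$' literal characters regardless of the pandas version's default.
cleaned_prices = dc_listings['price'].str.replace(',', '', regex=False)
cleaned_prices = cleaned_prices.str.replace('$', '', regex=False)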
dc_listings['price'] = cleaned_prices.astype(float)
mean_price = dc_listings.iloc[0:5]['price'].mean()

In [6]:
mean_price

156.6

* Write a function named predict_price that can use the k-nearest neighbors machine learning technique to calculate the suggested price for any value for accommodates. This function should:
    * Take in a single parameter, new_listing, that describes the number of people the new listing accommodates.
    * We've added code that assigns dc_listings to a new Dataframe named temp_df. We used the pandas.DataFrame.copy() method so that temp_df is an independent copy of the data, instead of just a reference to dc_listings.
    * Calculate the distance between each value in the accommodates column and the new_listing value that was passed in. Assign the resulting Series object to the distance column in temp_df.
    * Sort temp_df by the distance column and select the first 5 values in the price column. Don't randomize the ordering of temp_df.
    * Calculate the mean of these 5 values and use that as the return value for the entire predict_price function.

* Use the predict_price function to suggest a price for a living space that:
    * accommodates 1 person, assign the suggested price to acc_one.
    * accommodates 2 people, assign the suggested price to acc_two.
    * accommodates 4 people, assign the suggested price to acc_four.


In [7]:
# Brought along the changes we made to the `dc_listings` Dataframe.
dc_listings = pd.read_csv('dc_airbnb.csv')
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing):
    temp_df = dc_listings.copy()
    ## Complete the function.
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: abs(x -new_listing))
    
    return(temp_df.sort_values(by = 'distance').iloc[0:5]['price'].mean())

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)

In [9]:
print("Price prediction for accommodation 1: "+str(acc_one))
print("Price prediction for accommodation 2: "+str(acc_two))
print("Price prediction for accommodation 4: "+str(acc_four))

Price prediction for accommodation 1: 71.8
Price prediction for accommodation 2: 96.8
Price prediction for accommodation 4: 96.0


* The function we wrote represents a machine learning model, which means that it outputs a prediction based on the input to the model.
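
One way to see the function as a model is to make the number of neighbors a parameter. The sketch below is not the notebook's function: `predict_price_k`, its `k` argument, and the small `listings` frame are hypothetical names and data chosen for illustration.

```python
import pandas as pd

# Hypothetical training data.
listings = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 6],
    'price': [50.0, 80.0, 90.0, 120.0, 150.0, 300.0],
})

def predict_price_k(new_listing, k=3):
    """Average the prices of the k listings closest in `accommodates`."""
    temp_df = listings.copy()
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    # mergesort is stable, so tied distances keep their original row order.
    nearest = temp_df.sort_values('distance', kind='mergesort').iloc[0:k]
    return nearest['price'].mean()

print(predict_price_k(2, k=2))  # 85.0
```

The input is a single number and the output is a predicted price, which is all that is needed for the function to act as a (very simple) model.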

# Part 2 - Evaluating Model Performance

# 1/8 Testing quality of predictions

* A simple way to test the quality of your model is to:

    * split the dataset into 2 partitions:
        * the training set: contains the majority of the rows (75%)
        * the test set: contains the remaining minority of the rows (25%)

    * use the rows in the training set to predict the price value for the rows in the test set
        * add new column named predicted_price to the test set
    * compare the predicted_price values with the actual price values in the test set to see how accurate the predicted values were.

* This validation process, where we use the training set to make predictions for the rows in the test set and then compare them with the actual values, is known as **train/test validation**.

* Within the predict_price function, change the Dataframe that temp_df is assigned to. Change it from dc_listings to train_df, so only the training set is used.
* Use the Series method apply to pass all of the values in the accommodates column from test_df through the predict_price function.
* Assign the resulting Series object to the predicted_price column in test_df.
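
The train/test procedure can be sketched end to end on a tiny made-up dataset. Everything here is an assumption for illustration: the `data` frame, the 6/2 split it produces, and the `k` parameter (the notebook fixes the number of neighbors at 5).

```python
import pandas as pd

# Hypothetical dataset of 8 listings.
data = pd.DataFrame({
    'accommodates': [1, 2, 2, 3, 4, 4, 6, 8],
    'price': [50.0, 80.0, 90.0, 120.0, 150.0, 160.0, 300.0, 400.0],
})

# First 75% of the rows for training, remaining 25% for testing.
split = int(len(data) * 0.75)
train_df = data.iloc[0:split]
# copy() gives an independent frame, so a new column can be added safely.
test_df = data.iloc[split:].copy()

def predict_price(new_listing, k=2):
    # Only rows from the training set are candidates for nearest neighbors.
    distance = (train_df['accommodates'] - new_listing).abs()
    nearest = distance.sort_values(kind='mergesort').index[0:k]
    return train_df.loc[nearest, 'price'].mean()

test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)
print(test_df)
```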


In [23]:
import pandas as pd
import numpy as np

def predict_price(new_listing):
    ## DataFrame.copy() performs a deep copy
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return(predicted_price)

dc_listings = pd.read_csv("dc_airbnb.csv")
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
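train_df = dc_listings.iloc[0:2792]
# copy() makes test_df an independent DataFrame, so adding a column does not raise SettingWithCopyWarning.
test_df = dc_listings.iloc[2792:].copy()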
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)


In [24]:
print(test_df.head(5))

     host_response_rate host_acceptance_rate  host_listings_count  \
2792                20%                  75%                    1   
2793               100%                  25%                    2   
2794                NaN                  NaN                    1   
2795               100%                 100%                    1   
2796               100%                 100%                    1   

      accommodates        room_type  bedrooms  bathrooms  beds  price  \
2792             2  Entire home/apt       0.0        1.0   1.0  120.0   
2793             3  Entire home/apt       2.0        2.0   1.0  140.0   
2794             4  Entire home/apt       2.0        1.0   1.0  299.0   
2795             3  Entire home/apt       1.0        1.0   1.0   85.0   
2796             6  Entire home/apt       2.0        2.0   3.0  175.0   

     cleaning_fee security_deposit  minimum_nights  maximum_nights  \
2792          NaN              NaN               1            1125   
2793  

# 2/8 - Error Metrics

* To judge the model, we need a metric that quantifies how good the predictions were on the test set. This class of metrics is called an **error metric**.
* We could start by calculating the difference between each predicted and actual value and then averaging these differences. This is referred to as the **mean error**. It is a poor measure of quality because positive and negative differences cancel each other out.
* The **mean absolute error** (MAE) avoids this: we compute the absolute value of each error before we average all the errors.
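
A small numeric sketch of the two metrics, using made-up prices (the `actual` and `predicted` arrays are hypothetical). It shows how signed errors can average to zero even when individual predictions are off.

```python
import numpy as np

# Hypothetical actual and predicted prices.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([150.0, 150.0, 300.0])

errors = predicted - actual      # 50, -50 and 0
mean_error = errors.mean()       # signed errors cancel out
mae = np.abs(errors).mean()      # only the magnitudes are averaged

print(mean_error)  # 0.0
print(mae)
```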

In [25]:
import numpy as np

mae = sum(np.absolute(test_df['predicted_price'] - test_df['price']))/len(test_df)

In [26]:
mae

56.29001074113876

# 3/8 Mean Squared Error

* We can instead take the mean of the squared error values, which is called the mean squared error or MSE for short. The MSE makes the gap between the predicted and actual values more clear. A prediction that's off by 100 dollars will have an error (of 10,000) that's 100 times more than a prediction that's off by only 10 dollars (which will have an error of 100).
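
The arithmetic in the example above can be checked directly. The two error values are the ones mentioned in the text; the variable names are arbitrary.

```python
import numpy as np

# One prediction off by 10 dollars, one off by 100 dollars.
errors = np.array([10.0, 100.0])
squared = errors ** 2

print(squared.tolist())  # [100.0, 10000.0]
print(squared.mean())    # 5050.0
```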

In [27]:
import numpy as np

mse = sum((test_df['predicted_price']-test_df['price'])**2)/len(test_df)

print(mse)

18646.525370569325


# 4/8  Train another model

* Modify the predict_price function to use the bathrooms column instead of the accommodates column to make predictions.
* Apply the function to test_df and assign the resulting Series object containing the predicted price values to the predicted_price column in test_df.
* Calculate the squared error between the price and predicted_price columns in test_df and assign the resulting Series object to the squared_error column in test_df.
* Calculate the mean of the squared_error column in test_df and assign to mse.
* Use the print function or the variables inspector to display the MSE value.


In [28]:
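train_df = dc_listings.iloc[0:2792]
# copy() makes test_df an independent DataFrame, so adding columns does not raise SettingWithCopyWarning.
test_df = dc_listings.iloc[2792:].copy()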

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return(predicted_price)

test_df['predicted_price'] = test_df['bathrooms'].apply(predict_price)

test_df['squared_error'] = np.abs(test_df['price'] - test_df['predicted_price'])**2

mse = test_df['squared_error'].mean()
print(mse)

18405.444081632548



# 5/8 Root Mean Squared Error

Root mean squared error (RMSE) is an error metric whose units are the same as the units of the target column (in our case, dollars). It is calculated by taking the square root of the MSE value.
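
As a standalone sketch, RMSE can be wrapped in a small helper. The function name `rmse` and the sample prices are hypothetical and not part of the notebook.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error, in the same units as the inputs."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mse = np.mean((predicted - actual) ** 2)
    return float(np.sqrt(mse))

# Hypothetical prices in dollars; the two errors are 3 and -4.
print(rmse([103.0, 196.0], [100.0, 200.0]))
```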

In [30]:
import math as mt

def predict_price(new_listing):
    temp_df = train_df.copy()
    temp_df['distance'] = temp_df['bathrooms'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors_prices.mean()
    return(predicted_price)

test_df['predicted_price'] = test_df['bathrooms'].apply(lambda x: predict_price(x))
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**(2)
mse = test_df['squared_error'].mean()
rmse = mt.sqrt(mse)



In [31]:
print(rmse)

135.6666653295221


# 6/8 Comparing MAE and RMSE

* If you look at the equation for MAE, you'll notice that each difference between a predicted and an actual value contributes linearly. A prediction that's off by 10 dollars has a 10 times higher error than a prediction that's off by 1 dollar.
* If you look at the equation for RMSE, however, you'll notice that each error is squared before the errors are averaged and the square root is taken. This means that individual errors grow quadratically, so large errors have a much bigger effect on the final RMSE value.
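
For reference, the two metrics can be written out as below. The notation is introduced here and is not from the original notes: $n$ is the number of rows in the test set, $\hat{y}_i$ is the predicted price and $y_i$ the actual price for row $i$.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$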

In [32]:
import math as mt

errors_one = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10])
errors_two = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 1000])

mae_one = 1/len(errors_one)*sum(errors_one)
rmse_one = mt.sqrt(sum(errors_one.apply(lambda x: x**2))/len(errors_one))

mae_two = 1/len(errors_two)*sum(errors_two)
rmse_two = mt.sqrt(sum(errors_two.apply(lambda x: x**2))/len(errors_two))

In [33]:
print(mae_one)
print(rmse_one)

7.5
7.905694150420948


In [34]:
print(mae_two)
print(rmse_two)

62.5
235.82302686548658


# 7/8

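* RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest. The only difference between the 2 sets of errors is the extreme 1000 value in errors_two instead of 10, yet it moves the MAE from 7.5 to 62.5 and the RMSE from about 7.9 to about 235.8.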

# Part 3 - Multivariate K-Nearest Neighbors

# 1/8  Removing features



Use the DataFrame.info() method to return the number of non-null values in each column.

In [35]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), object(

# 2/8

* Main idea: remove columns from the dataset that can't be used in a Euclidean distance formula.

Remove the 9 columns we discussed above from dc_listings:

* 3 containing non-numerical values
* 3 containing numerical but non-ordinal values
* 3 describing the host instead of the living space itself



In [36]:
dc_listings = dc_listings.drop(columns=['room_type','city','state','latitude','longitude','zipcode','host_response_rate','host_acceptance_rate','host_listings_count'])

# 3/8


* Drop the cleaning_fee and security_deposit columns from dc_listings.
* Then, remove all rows that contain a missing value for the bedrooms, bathrooms, or beds column from dc_listings.
    * You can accomplish this by using the Dataframe method dropna() and setting the axis parameter to 0.
    * Since only the bedrooms, bathrooms, and beds columns contain any missing values, rows containing missing values in these columns will be removed.
* Display the null value counts for the updated dc_listings Dataframe to confirm that there are no missing values left.
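
The same sequence of steps, sketched on a tiny hypothetical frame (the `df` columns and values are made up and only mimic the shape of the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in `bedrooms` and `cleaning_fee`.
df = pd.DataFrame({
    'bedrooms': [1.0, np.nan, 2.0],
    'beds': [1.0, 1.0, 3.0],
    'cleaning_fee': ['$20.00', np.nan, np.nan],
})

# Drop the sparse column entirely, then drop rows missing a required feature.
df = df.drop(columns=['cleaning_fee'])
df = df.dropna(axis=0, subset=['bedrooms', 'beds'])

# Count the remaining missing values per column.
print(df.isnull().sum())
```

Dropping the sparse column first matters: calling dropna() on all columns while `cleaning_fee` is still present would discard most of the rows.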


In [37]:
dc_listings = dc_listings.drop(columns=['cleaning_fee','security_deposit'])
dc_listings.dropna(axis=0,subset=['bedrooms','bathrooms','beds'],inplace=True)
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671 entries, 0 to 3722
Data columns (total 8 columns):
accommodates         3671 non-null int64
bedrooms             3671 non-null float64
bathrooms            3671 non-null float64
beds                 3671 non-null float64
price                3671 non-null float64
minimum_nights       3671 non-null int64
maximum_nights       3671 non-null int64
number_of_reviews    3671 non-null int64
dtypes: float64(4), int64(4)
memory usage: 258.1 KB


# 4/8

* Note that some columns span much wider ranges of values than others, so they could have a disproportionate effect on the distance calculation and therefore on the ML model.
* To prevent this, we need to normalize all columns to have a mean of 0 and a standard deviation of 1.
* Normalizing the values in each column to the standard normal distribution (mean of 0, standard deviation of 1) preserves the distribution of the values in each column while aligning the scales.

* To normalize the values in a column to the standard normal distribution, you need to:

    * subtract the mean of the column from each value
    * divide each difference by the standard deviation of the column
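
A minimal sketch of this normalization on hypothetical data (the `df` values are made up). pandas applies the arithmetic column by column, so one expression handles every feature.

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    'accommodates': [1.0, 2.0, 3.0, 4.0],
    'maximum_nights': [30.0, 365.0, 1125.0, 90.0],
})

# Subtract each column's mean, then divide by that column's standard deviation.
normalized = (df - df.mean()) / df.std()

print(normalized.mean().round(6))
print(normalized.std().round(6))
```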


In [55]:
normalized_listings = (dc_listings - dc_listings.mean())/dc_listings.std()
normalized_listings['price'] = dc_listings['price']
#normalized_listings.head(3)
normalized_listings.iloc[574]
#normalized_listings.iloc[1593]
#normalized_listings.iloc[3091]

accommodates          -0.596544
bedrooms              -0.249467
bathrooms             -0.439151
beds                  -0.546858
price                100.000000
minimum_nights        -0.065038
maximum_nights        -0.016573
number_of_reviews     -0.516709
Name: 586, dtype: float64

In [52]:
normalized_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3671 entries, 0 to 3722
Data columns (total 8 columns):
accommodates         3671 non-null float64
bedrooms             3671 non-null float64
bathrooms            3671 non-null float64
beds                 3671 non-null float64
price                3671 non-null float64
minimum_nights       3671 non-null float64
maximum_nights       3671 non-null float64
number_of_reviews    3671 non-null float64
dtypes: float64(8)
memory usage: 418.1 KB


# 5/8 Euclidean distance for multivariate case

# Note

The results I got in this case weren't the same as the online solution, even though the code was the same. Perhaps the rows of the dataset were shuffled differently.

In [56]:
from scipy.spatial import distance

first_fifth_distance = distance.euclidean(normalized_listings.iloc[0][['accommodates','bathrooms']],normalized_listings.iloc[4][['accommodates','bathrooms']])
print("Vector 1:"+str(normalized_listings.iloc[0][['accommodates','bathrooms']]))
print("Vector 2:"+str(normalized_listings.iloc[4][['accommodates','bathrooms']]))
print(first_fifth_distance)

Vector 1:accommodates    0.401366
bathrooms      -0.439151
Name: 0, dtype: float64
Vector 2:accommodates    0.401366
bathrooms      -0.439151
Name: 4, dtype: float64
0.0


# 6/8 Introduction to scikit-learn

* The scikit-learn workflow consists of 4 main steps:

    * instantiate the specific machine learning model you want to use
    * fit the model to the training data
    * use the model to make predictions
    * evaluate the accuracy of the predictions
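
The four steps can be sketched on a toy dataset. The `train` and `test` frames are hypothetical, and `n_neighbors=1` is chosen here only to keep the result easy to verify by hand; it is not the setting used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data, already split into train and test rows.
train = pd.DataFrame({'accommodates': [1, 2, 3, 4, 6],
                      'price': [50.0, 80.0, 120.0, 150.0, 300.0]})
test = pd.DataFrame({'accommodates': [2, 7], 'price': [90.0, 280.0]})

# 1. Instantiate the model.
knn = KNeighborsRegressor(n_neighbors=1, algorithm='brute')
# 2. Fit it to the training features and target.
knn.fit(train[['accommodates']], train['price'])
# 3. Predict prices for the test features.
predictions = knn.predict(test[['accommodates']])
# 4. Evaluate the predictions with RMSE.
rmse = np.sqrt(mean_squared_error(test['price'], predictions))

print(predictions, rmse)
```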

* Any model that helps us predict numerical values, like listing price in our case, is known as a regression model. 
* The other main class of machine learning models is called classification, where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender)

* By default, KNeighborsRegressor uses these parameter values:

    * n_neighbors: the number of neighbors, is set to 5
    * algorithm: for computing nearest neighbors, is set to auto
    * p: set to 2, corresponding to Euclidean distance


```python
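from sklearn.neighbors import KNeighborsRegressor

# Default settings: n_neighbors=5, algorithm='auto', p=2.
knn = KNeighborsRegressor()
# Alternatively, force a brute-force search over all training rows.
knn = KNeighborsRegressor(algorithm='brute')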
```


# 7/8  Fitting a model and making predictions

You can select the target column from the Dataframe and use that as the second parameter to the fit method:

```python
# Split full dataset into train and test sets.

train_df = normalized_listings.iloc[0:2792]

test_df = normalized_listings.iloc[2792:]

# Matrix-like object, containing just the 2 columns of interest from training set.

train_features = train_df[['accommodates', 'bathrooms']]

# List-like object, containing just the target column, `price`.

train_target = train_df['price']

# Pass everything into the fit method.

knn.fit(train_features, train_target)
```



The number of feature columns used during training and testing needs to match, or scikit-learn will raise an error:

```python
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])
```

The predict() method returns a NumPy array containing the predicted price values for the test set.




* Create an instance of the KNeighborsRegressor class with the following parameters:
    * n_neighbors: 5
    * algorithm: brute
* Use the fit method to specify the data we want the k-nearest neighbor model to use. Use the following parameters:
    * training data, feature columns: just the accommodates and bathrooms columns, in that order, from train_df.
    * training data, target column: the price column from train_df.
* Call the predict method to make predictions on:
    * the accommodates and bathrooms columns from test_df
    * assign the resulting NumPy array of predicted price values to predictions.


In [57]:
from sklearn.neighbors import KNeighborsRegressor

train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

knn = KNeighborsRegressor(n_neighbors=5,algorithm='brute')

knn.fit(train_df[['accommodates','bathrooms']], train_df['price'])

predictions = knn.predict(test_df[['accommodates','bathrooms']])

In [58]:
predictions

array([  62.8,  120.2,   91. ,  305.6,  120.2,  120.2,  266.6,  120.2,
         91. ,  120.2,  120.2,   62.8,   91. ,  120.2,  120.2,  120.2,
        134.6,  120.2,  113. ,  234.8,  120.2,  120.2,  120.2,  120.2,
         74.2,   91. ,   91. ,   91. ,  120.2,  113. ,  120.2,   91. ,
        120.2,  120.2,  120.2,  120.2,  120.2,  120.2,  120.2,  120.2,
        120.2,  120.2,  120.2,  204.8,  120.2,  120.2,   91. ,   91. ,
        305.6,  120.8,  569.8,  120.2,  204.8,   62.8,  113. ,  120.2,
        120.8,  120.2,  146.4,   91. ,  120.2,   91. ,  113. ,  120.2,
         91. ,  120.2,   62.8,  120.2,   74.2,  146.4,   65.4,  120.2,
        120.8,   91. ,  305.6,  120.2,  120.2,  288. ,  297.8,  113. ,
        120.2,  134.6,  213.6,  120.8,  120.2,  113. ,   91. ,  494.8,
        120.2,   62.8,  120.2,  113. ,  113. ,  120.2,  113. ,   91. ,
        305.6,  120.2,  120.2,  234.8,  120.2,   91. ,  120.2,   91. ,
        204.8,  120.2,  120.2,  141.4,  120.2,   74.2,   91. ,   91. ,
      

# 8/8