# K Nearest Neighbors

https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/

### KNN for Regression

When KNN is used for regression, the prediction is the mean or the median of the target values of the K most similar instances.
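
As a rough illustration of the idea, the sketch below predicts a value by averaging the targets of the K closest training rows. The function name `knn_regress` and the toy numbers are made up for this example; it is not the scikit-learn implementation.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=3, agg=np.mean):
    """Predict a value for `query` from its k nearest training rows (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((X_train - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    # Positions of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Aggregate the neighbors' targets (mean by default, median if requested)
    return agg(y_train[nearest])

# Toy data, invented for illustration: one feature, one target
X_toy = [[1.0], [2.0], [3.0], [10.0]]
y_toy = [100.0, 120.0, 200.0, 500.0]

pred_mean = knn_regress(X_toy, y_toy, [2.1], k=3)
pred_median = knn_regress(X_toy, y_toy, [2.1], k=3, agg=np.median)
print(pred_mean, pred_median)
```

The median variant is less sensitive to a single neighbor with an extreme target value.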

### KNN for Classification

When KNN is used for classification, the prediction is the most frequent class among the K most similar instances. In essence, each instance votes for its own class, and the class with the most votes is taken as the prediction.
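
The voting step can be sketched in a few lines. The helper `knn_classify` and the toy points are assumptions for illustration only. In this sketch a tie goes to whichever class `Counter` saw first, which is the class of the nearer neighbor.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, labels, query, k=3):
    """Return the majority class among the k nearest training rows (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((X_train - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    # Positions of the k closest rows, nearest first
    nearest = np.argsort(dists)[:k]
    # Each neighbor casts one vote for its own class
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration
X_toy = [[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [1.2, 0.8]]
labels_toy = ["Private room", "Private room", "Entire home/apt",
              "Entire home/apt", "Private room"]

predicted = knn_classify(X_toy, labels_toy, [1.1, 1.0], k=3)
print(predicted)
```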

## Euclidean Distance

KNN needs a measure of similarity between observations. A common choice is the Euclidean distance:

$$
    d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}
$$

where $q_1$ to $q_n$ represent the feature values for one observation and $p_1$ to $p_n$ represent the feature values for the other observation.
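
A quick numerical check of the Euclidean distance with NumPy. The two feature vectors `q` and `p` are illustrative values chosen for this example.

```python
import numpy as np

# Two observations with four features each (illustrative values)
q = np.array([4.0, 1.0, 1.0, 2.0])
p = np.array([6.0, 3.0, 3.0, 3.0])

# Square the feature-wise differences, sum them, take the square root
d = np.sqrt(np.sum((q - p) ** 2))

# np.linalg.norm computes the same quantity directly
d_alt = np.linalg.norm(q - p)

print(d, d_alt)
```

Here the squared differences are 4, 4, 4 and 1, so the distance is the square root of 13.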

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import accuracy_score, classification_report

In [2]:
listings = pd.read_csv('./data/dc_airbnb.csv')
listings.head()

,host_response_rate,host_acceptance_rate,host_listings_count,accommodates,room_type,bedrooms,bathrooms,beds,price,cleaning_fee,security_deposit,minimum_nights,maximum_nights,number_of_reviews,latitude,longitude,city,zipcode,state
0,92%,91%,26,4,Entire home/apt,1.0,1.0,2.0,$160.00,$115.00,$100.00,1,1125,0,38.890046,-77.002808,Washington,20003,DC
1,90%,100%,1,6,Entire home/apt,3.0,3.0,3.0,$350.00,$100.00,,2,30,65,38.880413,-76.990485,Washington,20003,DC
2,90%,100%,2,1,Private room,1.0,2.0,1.0,$50.00,,,2,1125,1,38.955291,-76.986006,Hyattsville,20782,MD
3,100%,,1,2,Private room,1.0,1.0,1.0,$95.00,,,1,1125,0,38.872134,-77.019639,Washington,20024,DC
4,92%,67%,1,4,Entire home/apt,1.0,1.0,1.0,$50.00,$15.00,$450.00,7,1125,0,38.996382,-77.041541,Silver Spring,20910,MD


In [3]:
listings.dtypes

host_response_rate       object
host_acceptance_rate     object
host_listings_count       int64
accommodates              int64
room_type                object
bedrooms                float64
bathrooms               float64
beds                    float64
price                    object
cleaning_fee             object
security_deposit         object
minimum_nights            int64
maximum_nights            int64
number_of_reviews         int64
latitude                float64
longitude               float64
city                     object
zipcode                  object
state                    object
dtype: object

In [4]:
listings.isna().sum()

host_response_rate       434
host_acceptance_rate     614
host_listings_count        0
accommodates               0
room_type                  0
bedrooms                  21
bathrooms                 27
beds                      11
price                      0
cleaning_fee            1388
security_deposit        2297
minimum_nights             0
maximum_nights             0
number_of_reviews          0
latitude                   0
longitude                  0
city                       0
zipcode                    9
state                      0
dtype: int64

In [5]:
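# Strip "$" and "," before converting to float; regex=True is required in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default
listings['price'] = listings['price'].str.replace(r'[,$]', '', regex=True).astype('float')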
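
To see what the price cleanup does, here is a small sketch on a made-up Series (`prices_toy` is not part of the dataset). It passes `regex=True` explicitly so the character class `[,$]` is treated as a pattern rather than as literal text.

```python
import pandas as pd

# Made-up price strings in the same format as the `price` column
prices_toy = pd.Series(["$160.00", "$1,250.00", "$50.00"])

# Remove dollar signs and thousands separators, then convert to float
cleaned = prices_toy.str.replace(r"[,$]", "", regex=True).astype(float)
print(cleaned.tolist())
```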

In [8]:
listings.loc[listings['bedrooms'].isna(), 'bedrooms'] = listings['bedrooms'].mean()

In [9]:
listings.loc[listings['bathrooms'].isna(), 'bathrooms'] = listings['bathrooms'].mean()

In [10]:
listings.loc[listings['beds'].isna(), 'beds'] = listings['beds'].mean()
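
The three mean-imputation cells above could also be written as a single `fillna` call. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in two numeric columns
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0],
    "beds": [2.0, 1.0, np.nan],
})

cols = ["bedrooms", "beds"]
# toy[cols].mean() is a Series indexed by column name, so fillna
# fills each column's gaps with that column's own mean
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```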

In [11]:
X = listings[['accommodates', 'bedrooms', 'bathrooms', 'beds']]
y = listings['price']

In [12]:
X.head()

,accommodates,bedrooms,bathrooms,beds
0,4,1.0,1.0,2.0
1,6,3.0,3.0,3.0
2,1,1.0,2.0,1.0
3,2,1.0,1.0,1.0
4,4,1.0,1.0,1.0


In [13]:
y.head()

0    160.0
1    350.0
2     50.0
3     95.0
4     50.0
Name: price, dtype: float64

## KNN with scikit-learn

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
nn = NearestNeighbors(n_neighbors=5).fit(X_train)

In [17]:
distances, indices = nn.kneighbors(X_train)

In [18]:
distances

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [19]:
indices

array([[1049, 1646, 1050, 2191, 1055],
       [2238,  640, 1544,  395, 2183],
       [2977,  335,  334, 1529,  332],
       ...,
       [ 525, 1353, 1153,  343,  852],
       [2977,  335,  334, 1529,  332],
       [2977,  335,  334, 1529,  332]])
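
Because the model was queried with the same rows it was fitted on, each row finds itself at distance 0. The all-zero rows suggest that many listings share identical values for these four features. The `indices` are row positions within the training data, so they can be used to look up neighbor prices. The sketch below shows one way to turn neighbors into a price prediction for unseen rows and compares it with `KNeighborsRegressor`. It uses a small made-up frame (all `*_toy` names and values are assumptions), not the Airbnb data.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

# Made-up stand-ins for X_train, y_train and X_test
X_train_toy = pd.DataFrame({
    "accommodates": [2, 4, 6, 1, 8],
    "bedrooms": [1.0, 1.0, 3.0, 1.0, 4.0],
})
y_train_toy = pd.Series([95.0, 160.0, 350.0, 85.0, 500.0])
X_test_toy = pd.DataFrame({"accommodates": [5], "bedrooms": [1.0]})

# Fit on the training rows, then query with the unseen rows
nn_toy = NearestNeighbors(n_neighbors=2).fit(X_train_toy)
distances_toy, indices_toy = nn_toy.kneighbors(X_test_toy)

# indices are positional, so index the underlying array rather than by label;
# averaging each row's neighbor prices gives the KNN regression prediction
manual_pred = y_train_toy.to_numpy()[indices_toy].mean(axis=1)

# KNeighborsRegressor performs the same lookup-and-average in one step
knr = KNeighborsRegressor(n_neighbors=2).fit(X_train_toy, y_train_toy)
sk_pred = knr.predict(X_test_toy)

print(manual_pred, sk_pred)
```

With the default uniform weights, both approaches average the prices of the same two neighbors, so the predictions agree.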