# k-Nearest Neighbors

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## k-Neighbors classification

In [None]:
from helpers import plot_knn_classification

plot_knn_classification.plot_knn_classification(n_neighbors=1)

> To find the points most similar to a query point, k-NN computes the distance between points using a measure such as Euclidean, Manhattan, Minkowski, or Hamming distance.
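As a rough illustration of the distance measures named above, the sketch below computes them with NumPy on made-up vectors. The helper names `minkowski_distance` and `hamming_distance` are hypothetical and not part of the `helpers` package used in this notebook. Manhattan and Euclidean distance are the Minkowski distance with `p=1` and `p=2` respectively.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D arrays (illustrative helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    a, b = np.asarray(a), np.asarray(b)
    return int(np.sum(a != b))

# made-up points for illustration
a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski_distance(a, b, p=1)  # sum of absolute differences
euclidean = minkowski_distance(a, b, p=2)  # straight-line distance

print(f"Manhattan: {manhattan}, Euclidean: {euclidean}")
print(f"Hamming: {hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])}")
```

Hamming distance counts mismatching positions, so it is usually applied to binary or categorical vectors rather than continuous features.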

In [None]:
plot_knn_classification.plot_knn_classification(n_neighbors=3)

In [None]:
import helpers
from sklearn.model_selection import train_test_split

X, y = helpers.datasets.make_forge()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
X,y

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

In [None]:
clf.fit(X_train, y_train)

In [None]:
print(f"Test set predictions: {clf.predict(X_test)}")

In [None]:
print(f"Test set accuracy: {clf.score(X_test, y_test):.2f}")

### Analyzing KNeighborsClassifier

In [None]:
from helpers.plot_2d_separator import plot_2d_separator
from helpers.tools import discrete_scatter

fig, axes = plt.subplots(1, 4, figsize=(10, 3))

for n_neighbors, ax in zip([1, 3, 9, 12], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")

# add the legend once, after all panels have been drawn
axes[0].legend(loc=3)

- Using more neighbors gives a smoother decision boundary, and a **smoother boundary corresponds to a simpler model.**


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

training_accuracy = []
test_accuracy = []

# try n_neighbors from 1 to 24 (range excludes the upper bound)
neighbors_settings = range(1, 25)

for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train_scaled, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train_scaled, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test_scaled, y_test))
    
    
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()

## k-Neighbors regression

In [None]:
from helpers import plot_knn_regression

plot_knn_regression.plot_knn_regression(n_neighbors=1)
plt.show()

In [None]:
plot_knn_regression.plot_knn_regression(n_neighbors=3)
plt.show()

In [None]:
import helpers
from sklearn.neighbors import KNeighborsRegressor

X, y = helpers.datasets.make_wave(n_samples=40)


plt.plot(X,y, ".")
plt.show()

In [None]:
# split the wave dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# instantiate the model and set the number of neighbors to consider to 3
reg = KNeighborsRegressor(n_neighbors=3)

# fit the model using the training data and training targets
reg.fit(X_train, y_train)

In [None]:
print(X_test)

In [None]:
print(f"Test set predictions:\n{reg.predict(X_test)}")

In [None]:
print(f"Test set R^2: {reg.score(X_test, y_test):.2f}")

### Analyzing KNeighborsRegressor

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(15, 4))

# create 1,000 data points, evenly spaced between -3 and 3
line = np.linspace(-3, 3, 1000).reshape(-1, 1)

for n_neighbors, ax in zip([1, 3, 9, 12], axes):
    # make predictions using 1, 3, 9, or 12 neighbors
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    ax.plot(line, reg.predict(line))
    ax.plot(X_train, y_train, '^', c="blue", markersize=8)
    ax.plot(X_test, y_test, 'v', c="red", markersize=8)
    ax.set_title(f"{n_neighbors} neighbor(s)\n train score: {reg.score(X_train, y_train):.2f} test score: {reg.score(X_test, y_test):.2f}")
    ax.set_xlabel("Feature")
    ax.set_ylabel("Target")
# add the legend once, after all panels have been drawn
axes[0].legend(["Model predictions", "Training data/target", "Test data/target"], loc="best")

 
- Considering **more neighbors** leads to **smoother predictions**, but these do **not fit the training data as well**.
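To make the smoothing effect concrete, the sketch below checks that, with the default uniform weights, a `KNeighborsRegressor` prediction is the mean of the targets of the k nearest training points. The toy arrays `X_toy` and `y_toy` are made up for this illustration and are unrelated to the wave dataset above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# made-up 1-D training data for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 4.0, 9.0, 100.0])

k = 3
reg_toy = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)

query = np.array([[1.2]])

# indices of the k training points closest to the query
_, neighbor_idx = reg_toy.kneighbors(query)

# average the targets of those neighbors by hand
manual_prediction = y_toy[neighbor_idx[0]].mean()
model_prediction = reg_toy.predict(query)[0]

print(f"manual: {manual_prediction:.3f}, model: {model_prediction:.3f}")
```

Because each prediction averages over k targets, a larger k flattens local variation, which is why the curves above become smoother and fit the training points less closely.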

### Example: Airbnb dataset

In [None]:
dc_listings = pd.read_csv('data/dc_airbnb.csv')
dc_listings.head(2)

In [None]:
# clean the price column: strip commas and dollar signs, then convert to float
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollar_sign = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollar_sign.astype('float')


# dropping columns 
drop_columns = ['room_type', 'city', 'state', 'latitude', 
                'longitude', 'zipcode', 'host_response_rate', 
                'host_acceptance_rate', 'host_listings_count',
                'cleaning_fee', 'security_deposit']

dc_listings.drop(drop_columns, axis=1, inplace=True)
dc_listings = dc_listings.dropna(axis=0)

In [None]:
# check whether the dataset contains missing values
dc_listings.isnull().sum()

In [None]:
dc_listings.head(2)

In [None]:
features = dc_listings.columns.tolist()
features.remove('price')

print(features)

X = dc_listings[features].values
y = dc_listings["price"].values

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.20)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Instantiate ML model.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

In [None]:
knn.score(X_test, y_test)

In [None]:
hyper_params = list(range(7,50,3))
scores = {}

for n_neighbors in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    scores[n_neighbors] = score

In [None]:
scores

In [None]:
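# convert the dict views to lists so matplotlib receives plain sequences
plt.scatter(x=list(scores.keys()), y=list(scores.values()))
plt.xlabel("n_neighbors")
plt.ylabel("Test set R^2")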
plt.show()

## Parameters

The KNeighbors classifier has two important parameters:
- the number of neighbors
- how you measure distance between data points

In practice, **using a small number of neighbors like three or five often works well**, but you should still tune this parameter for your data.

By default, **Euclidean distance is used, which works well in many settings**.
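One possible way to tune both parameters is a cross-validated grid search, sketched below. It assumes the breast cancer dataset used earlier and scikit-learn's `GridSearchCV`. The variable names (`pipe`, `search`, `Xc_train`, and so on) and the candidate values in `param_grid` are illustrative choices, not recommendations. The `p` parameter selects the order of the Minkowski metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)

# scaling lives inside the pipeline so it is refit on each cross-validation fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9],
    "kneighborsclassifier__p": [1, 2],  # 1 = Manhattan, 2 = Euclidean
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(Xc_train, yc_train)

print(search.best_params_)
print(f"Test set accuracy: {search.score(Xc_test, yc_test):.2f}")
```

Selecting the parameters by cross-validation on the training set keeps the test set untouched until the final evaluation.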

## Strengths

- The model is very easy to understand.
- Often gives reasonable performance without a lot of adjustments.
- Using this algorithm is a good baseline method to try before considering more advanced techniques. 
- KNN can be useful in case of nonlinear data.

## Weaknesses

- The k-nearest neighbors algorithm is not often used in practice, because **prediction can be slow and it handles many features poorly.**

- It requires large memory for storing the entire training dataset for prediction.

- Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples) prediction can be slow.

- When using the k-NN algorithm, it’s important to preprocess your data, for example by scaling the features so that they contribute comparably to the distance.

- The algorithm often does not perform well on datasets with many features (hundreds or more), and it does particularly badly on datasets where most features are 0 most of the time (so-called sparse datasets).

- Finally, k-NN does not work well with categorical features out of the box, because it is hard to define a meaningful distance along categorical dimensions.
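The small sketch below illustrates why preprocessing matters for a distance-based method. The `listings` array and the `nearest_other` helper are made up for this example. Without scaling, the feature with the larger numeric range dominates the Euclidean distance and decides which point counts as the nearest neighbor.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical listings: [bedrooms, size in square feet]
listings = np.array([
    [1.0, 1000.0],
    [5.0, 1010.0],
    [1.0, 1100.0],
    [3.0, 1200.0],
])

def nearest_other(points, idx):
    """Index of the point closest (Euclidean) to points[idx], excluding itself."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    dists[idx] = np.inf  # ignore the query point itself
    return int(np.argmin(dists))

raw_neighbor = nearest_other(listings, 0)
scaled_neighbor = nearest_other(StandardScaler().fit_transform(listings), 0)

print(f"nearest to listing 0 without scaling: {raw_neighbor}")
print(f"nearest to listing 0 with scaling:    {scaled_neighbor}")
```

On the raw values the five-bedroom listing is nearest to the one-bedroom listing, because a 10 square foot difference is numerically small. After standardization the other one-bedroom listing becomes the nearest neighbor.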

## Conclusion

- KNN **performs better with a small number of features than with a large number**.
    - As the number of features increases, more data is required.
    - Higher dimensionality also increases the risk of overfitting.
    - To avoid overfitting, the amount of data needed grows exponentially with the number of dimensions.

- Research has shown that in high dimensions Euclidean distance becomes far less useful, because the distances between points tend to become nearly equal.
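The effect of dimensionality on Euclidean distance can be observed in a small simulation. This is a sketch on uniformly random data, so real datasets may behave differently. The `relative_contrast` helper is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at similar distances, so "nearest" carries less information.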