# K-Nearest Neighbors (kNN)
- Step 1: Choose the number of neighbors, K
    - A common default is 5 (this is also scikit-learn's default for `n_neighbors`)
- Step 2: Take the K nearest neighbors of the new data point
    - Typically use the Euclidean distance (distance formula) to determine nearest neighbors
- Step 3: Among these K neighbors, count the number of data points in each category
- Step 4: Classify the new data point to the category with the most neighbors
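
The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements kNN; the function name `knn_classify` and the toy points are hypothetical and chosen only so that the vote comes out 3 to 2.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep only the k closest points (sorted by distance)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count how many of the k neighbors fall in each category
    votes = Counter(label for _, label in nearest)
    # Step 4: return the category with the most neighbors
    return votes.most_common(1)[0][0]


# hypothetical toy data: three points per category
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = [1, 1, 1, 2, 2, 2]

# with K = 5, the new point has 3 Category 1 neighbors and 2 Category 2 neighbors
prediction = knn_classify(points, labels, (2.5, 2.0), k=5)
print(prediction)
```

Sorting every distance is fine for a small data set; library implementations typically use faster neighbor searches for larger ones.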

### kNN Visualization
<img src="images/knn/knn_example.png" height="75%" width="75%"></img>
- Notice that the graph's axes are X1 and X2 (the two independent variables)

Let's classify the "new data point" (the grey point) to the correct category.

If K = 5, then find the 5 nearest neighbors of the grey data point.
- It has 3 Category 1 neighbors
- It has 2 Category 2 neighbors

Therefore, the grey data point is classified as "Category 1."

### When To Use?
1. K-Nearest Neighbors handles multinomial (multiple categories) classification natively, whereas basic Logistic Regression is a binary classifier.  
2. K-Nearest Neighbors is also a non-linear model that is based on distance rather than on a fitted probability function.
    - If you graph the kNN classifier, the decision boundary is irregular; it does NOT look like a straight line or a smooth curve

So if you need multi-category classification or a non-linear decision boundary, kNN is a strong candidate.

In [1]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import the data set
ads_df = pd.read_csv("datasets/social_network_ads.csv")

ads_df.head()

|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [3]:
# x is the Age and Estimated Salary columns
x = ads_df.iloc[:, [2, 3]].values

# y is the Purchased column
y = ads_df.iloc[:, 4].values

In [4]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

In [5]:
# import a Standardization Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

# feature scale the training and testing sets
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
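
Feature scaling matters for kNN because Age and EstimatedSalary sit on very different scales, so an unscaled salary would dominate the distance. The sketch below shows roughly what standardization does to one column, assuming the usual z-score formula with the population standard deviation (which is what `StandardScaler` uses). The helper names and the test ages are hypothetical; the training ages are the five values shown in `ads_df.head()`.

```python
import statistics


def fit_standardizer(values):
    """Learn the mean and population standard deviation from the training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (value - mean) / std using previously learned statistics."""
    return [(value - mean) / std for value in values]


train_ages = [19.0, 35.0, 26.0, 27.0, 19.0]  # Age values from ads_df.head()
test_ages = [30.0, 45.0]  # hypothetical unseen values

# learn the statistics from the training data only
mean, std = fit_standardizer(train_ages)

# reuse the training statistics for both sets, mirroring fit_transform / transform
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)
```

Reusing the training mean and standard deviation on the test set is why the cell above calls `fit_transform` on `x_train` but only `transform` on `x_test`: the test data should not influence the scaling.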



# K-Nearest Neighbors Model

In [6]:
# import the k neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

In [18]:
"""
create a kNN classifier with 5 neighbors and Euclidean distance, then fit it to the training set
- the minkowski metric is a family of distances controlled by the power parameter p
- p = 2 selects Euclidean distance (p = 1 would select Manhattan distance)
"""
classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
classifier.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
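
To make the `metric="minkowski", p=2` arguments concrete, here is a small sketch of the Minkowski distance formula. The function `minkowski_distance` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance: (sum of |a_i - b_i| ** p) ** (1 / p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a = (0.0, 0.0)
b = (3.0, 4.0)

# p = 2 gives Euclidean distance (the straight-line distance formula)
euclidean = minkowski_distance(a, b, p=2)

# p = 1 gives Manhattan distance (sum of absolute differences)
manhattan = minkowski_distance(a, b, p=1)

print(euclidean, manhattan)
```

For these two points the Euclidean distance is 5 (a 3-4-5 triangle) and the Manhattan distance is 7.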

In [19]:
# predict the testing set
y_pred = classifier.predict(x_test)

# Confusion Matrix

In [20]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [21]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the Confusion Matrix diagonally:
64 + 29 = 93 correct predictions
4 + 3 = 7 incorrect predictions
"""
cm

array([[64,  4],
       [ 3, 29]])
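
The counts in the confusion matrix can be turned into summary metrics. The sketch below copies the matrix values from the output above into a plain list and assumes scikit-learn's layout, where rows are the actual classes and columns are the predicted classes.

```python
# confusion matrix from the output above: rows = actual, columns = predicted
cm = [[64, 4],
      [3, 29]]

true_negatives, false_positives = cm[0]
false_negatives, true_positives = cm[1]

# the diagonal holds the correct predictions
correct = true_negatives + true_positives
total = sum(sum(row) for row in cm)

# accuracy: share of all predictions that were correct
accuracy = correct / total

# precision: of the predicted purchases, how many were real purchases
precision = true_positives / (true_positives + false_positives)

# recall: of the real purchases, how many were predicted
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With 93 of 100 test points classified correctly, the accuracy is 0.93.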