In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# K-Nearest Neighbors

![wilson](img/wilson.jpg)

## A breakdown of the algorithm

**Fit**

- KNN is considered a lazy learning algorithm. This is because, when we fit a KNN model, all the model does is store the training data as a model attribute.
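
To make the "lazy" fit step concrete, here is a minimal sketch. `LazyKNN`, `X_demo`, and `y_demo` are hypothetical names made up for this illustration; this is not how Sklearn implements its classifier.

```python
import numpy as np


class LazyKNN:
    """Hypothetical minimal class, used only to illustrate lazy fitting."""

    def fit(self, X, y):
        # No parameters are estimated here: the training data is
        # simply stored on the object for use at prediction time.
        self.X_train_ = np.asarray(X)
        self.y_train_ = np.asarray(y)
        return self


X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_demo = np.array([0, 0, 1])

model = LazyKNN().fit(X_demo, y_demo)
print(model.X_train_.shape)  # (3, 2)
```

After `fit`, the object holds an exact copy of the training data and nothing else.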

**Predict**

When we pass features to a model in order to make predictions, this is what happens:

1. We loop over every row in the training data.
2. We calculate the distance between the row of features from the training data and the row of features in the data meant for prediction.
3. We store the distance for every row in the training data.
4. We sort the array of distances to find the `k` rows in the training data that are closest to the prediction data.
5. We then count how many times each class appears in those `k` training rows. The class that appears most often becomes the prediction.

> **Note:** It is important to point out that all of the true functionality of KNN takes place during the prediction phase. 
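
The five steps above can be traced on a tiny example. The data below (`X_toy`, `y_toy`, `new_point`) is made up for illustration, and the sketch uses NumPy broadcasting in place of an explicit loop, so treat it as one possible way to follow the steps rather than the implementation this notebook builds.

```python
import numpy as np

# Toy training data: class 0 sits near (0, 0), class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1])
new_point = np.array([4.0, 4.0])
k = 3

# Steps 1-3: one Euclidean distance per training row
distances = np.sqrt(((X_toy - new_point) ** 2).sum(axis=1))

# Step 4: indices of the k closest training rows
nearest = np.argsort(distances)[:k]

# Step 5: count the class labels among those rows, keep the most frequent
votes = np.bincount(y_toy[nearest])
prediction = votes.argmax()

print(prediction)  # 1
```

Two of the three nearest rows belong to class 1, so class 1 wins the vote.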

![](img/visual-explainer.png)

**Distance Metrics**

Euclidean distance:

`sqrt(sum((x - y)^2))`

Manhattan Distance:

`sum(|x - y|)`
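
As a quick sanity check on the two formulas, the sketch below uses NumPy's `np.linalg.norm` on a 3-4-5 right triangle. The vectors are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

# ord=2 gives sqrt(sum((x - y)^2)); ord=1 gives sum(|x - y|)
euclidean_distance = np.linalg.norm(x - y, ord=2)
manhattan_distance = np.linalg.norm(x - y, ord=1)

print(euclidean_distance)  # 5.0
print(manhattan_distance)  # 7.0
```

Euclidean distance measures the straight line between the points, while Manhattan distance adds up the moves along each axis.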

There are a number of distance metrics that can be used with KNN. When using Sklearn, the available metrics are listed [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric).

For a more detailed breakdown of each metric, check out [this](https://www.kdnuggets.com/2020/11/most-popular-distance-metrics-knn.html) article.

Let's code out both of these metrics!

In [None]:
def euclidean(vector_1, vector_2):
    """
    Find the euclidean distance between two vectors:
             
             `sqrt(sum((vector_1 - vector_2)^2))`
    
    vector_1 = An array containing numeric data
    vector_2 = An array containing numeric data
    
    Both vectors must have the same dimensions.
    """
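    # Subtract vector_2 from vector_1
    difference = np.asarray(vector_1) - np.asarray(vector_2)

    # Square the difference
    squared = difference ** 2

    # Sum the squared distance
    total = np.sum(squared)

    # Return the square root of the sum
    return np.sqrt(total)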
    
    
def manhattan(vector_1, vector_2):
    """
    Find the manhattan distance between two vectors:
             
             `sum(|vector_1 - vector_2|)`
    
    vector_1 = An array containing numeric data
    vector_2 = An array containing numeric data
    
    Both vectors must have the same dimensions.
    """
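    # Subtract vector_2 from vector_1
    difference = np.asarray(vector_1) - np.asarray(vector_2)

    # Find the absolute value of the difference
    absolute = np.abs(difference)

    # Return the sum of the absolute differences
    return np.sum(absolute)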

**Let's test out our functions!**

In [None]:
np.random.seed(2021)
vector_1 = np.random.normal(size=100)
vector_2 = np.random.normal(loc=3, scale=3, size=100)

vector_a = np.random.exponential(1, 100)
vector_b = np.random.exponential(3, 100)

print(euclidean(vector_1, vector_2)) # 43.2407625261724
print(manhattan(vector_1, vector_2)) # 357.6012129494707
print(euclidean(vector_a, vector_b)) # 32.61377501006896
print(manhattan(vector_a, vector_b)) # 198.35518667980242

**Ok, let's write code to generate predictions.**

In the cell below, create two functions:

1. `find_neighbors`
        
2. `predict`

Both functions should receive 5 arguments:
    
1. `X_train` - The training data
2. `y_train` - The labels for the training data
3. `test_observation` - A single vector for which we would like to generate a prediction (`predict` receives `X_test`, a matrix of such vectors, in its place)
4. `distance_function` - A function for calculating the distance between vectors
5. `k` - The number of nearest neighbors we would like to consider
    

In [None]:
def find_neighbors(X_train, y_train, test_observation, distance_function, k):
    """
    Loops over every observation in the training data and calculates the distance
    between each training observation and the test_observation vector.
    ==============================================================================
    X_train =           Training Features Matrix
    
    y_train =           Training Class Labels
    
    test_observation =  Vector of features for a single observation
    
    distance_function = Function used for calculating the distance between vectors
    
    k =                 The number of nearest training observations that contribute 
                        a vote to the predicted class     
    ==============================================================================
    Returns: array. k observations from y_train that have the k smallest distances.
    """
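    # Create an array for storing distances
    X_train = np.asarray(X_train)
    distances = np.zeros(len(X_train))

    # Loop over the index of training data
    for idx in range(len(X_train)):
        # Find the distance between the test_observation and
        # the training observation at the given index,
        # then store the distance in the distance array
        distances[idx] = distance_function(X_train[idx], test_observation)

    # Find the k indices in the distances array
    # with the smallest distance
    nearest = np.argsort(distances)[:k]

    # Use the indices for smallest distances
    # to slice y_train and return an array of
    # class labels for the k nearest neighbors
    return np.asarray(y_train)[nearest]


def predict(X_train, y_train, X_test, distance_function, k):
    """
    Generates predictions for every observation in X_test
    ==============================================================================
    X_train =           Training Features Matrix
    
    y_train =           Training Class Labels
    
    X_test =            Matrix of observations for which prediction 
                        will be generated
    
    distance_function = Function used for calculating the distance between vectors
    
    k =                 The number of nearest training observations that contribute 
                        a vote to the predicted class
    """
    # Create a list to store predictions
    X_test = np.asarray(X_test)
    predictions = []

    # Loop over the index of the testing data
    for idx in range(len(X_test)):
        # Find the k nearest neighbors class labels
        neighbors = find_neighbors(X_train, y_train, X_test[idx], distance_function, k)

        # Find the most common class label
        # (ties go to the smallest label, because np.unique sorts its output)
        labels, counts = np.unique(neighbors, return_counts=True)
        most_common = labels[np.argmax(counts)]

        # Append the most common class label to the predictions list
        predictions.append(most_common)

    # Return predictions
    return np.array(predictions)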


Let's test our code!

In [None]:
# Load a classification dataset
data = load_breast_cancer()
X = data['data']
y = data['target']

# Create a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Calculate predictions
y_hat_euclidean = predict(X_train, y_train, X_test, euclidean, 5)
y_hat_manhattan = predict(X_train, y_train, X_test, manhattan, 5)

print('Euclidean:', accuracy_score(y_test, y_hat_euclidean))
print('Manhattan:', accuracy_score(y_test, y_hat_manhattan))

Can we improve our score?

Below, we run a manual grid search, which scores the model on every combination of hyperparameters from a list of options we would like to consider. Because each combination here is scored on the test set, treat the best score as optimistic.

In [None]:
def score_model(X_train, y_train, X_test, y_test, distance_function, k):
    preds = predict(X_train, y_train, X_test, distance_function, k)
    return accuracy_score(y_test, preds)

functions = {'euclidean': euclidean, 'manhattan': manhattan}

ks = np.arange(1, 10, 2)
scores = []
for k in ks:
    for function in functions:
        score = score_model(X_train, y_train, X_test, y_test, functions[function], k)
        scores.append({'k': k, 'distance': function, 'score': score})
        
best_model = sorted(scores, key=lambda x: x['score'], reverse=True)[0]

In [None]:
best_model

### Implementation with Sklearn

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# p=1 selects Manhattan distance; the default, p=2, is Euclidean
model = KNeighborsClassifier(n_neighbors=3, p=1)
cross_val_score(model, X_train, y_train, cv=5)

**Running a GridSearch with Sklearn**

In [None]:
from sklearn.model_selection import GridSearchCV
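# Same search space as the manual version: odd k from 1 to 9, p=1 (Manhattan) or p=2 (Euclidean)
parameters = {'n_neighbors': np.arange(1, 10, 2), 'p': [1, 2]}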

model = KNeighborsClassifier()
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_train, y_train)

**Accessing the best parameters from a gridsearch:**

In [None]:
grid_search.best_params_

**Accessing the best score from the gridsearch:**

In [None]:
grid_search.best_score_
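
To look at every combination the search tried, not only the winner, one option is to load `cv_results_` into a pandas DataFrame. The sketch below is self-contained, so it rebuilds the split and the search under its own names (`search`, `results`, and `random_state=0` are choices made for this example, not part of the cells above).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': np.arange(1, 10, 2), 'p': [1, 2]},
    cv=5,
)
search.fit(X_train, y_train)

# One row per hyperparameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ['param_n_neighbors', 'param_p', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

The top row of this table should match `best_params_` and `best_score_` for the same search object.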

**Accessing the best model from the gridsearch:**

In [None]:
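# With the default refit=True, best_estimator_ has already been refit
# on all of X_train using the best parameters, so no extra fit is needed
best_model = grid_search.best_estimator_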

In [None]:
best_model.score(X_test, y_test)

**Some final notes about KNN**

K-Nearest Neighbors is the second classification algorithm in our toolbelt, joining our logistic regression classifier.

Recall that logistic regression is a supervised, parametric, discriminative model.

KNN is a **supervised, non-parametric, discriminative, lazy-learning algorithm.**

Let's break down those terms:

**Supervised:** The problem is supervised because we have a `target` column.

**Non Parametric:** Linear and Logistic Regression have a set number of parameters (i.e. the coefficients and the intercept) that is tied to the shape of the data, specifically the number of features. Non Parametric models do not have a set number of parameters tied to the data's shape. For KNN, the stored training data plays that role, so the model grows with the number of training rows.
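
One way to see the difference is to fit both kinds of model on training sets of different sizes. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the dataset used earlier) and compares the shape of logistic regression's `coef_` with the `n_samples_fit_` attribute of Sklearn's KNN classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with 5 features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

summary = []
for n_rows in (50, 200):
    X_sub, y_sub = X_demo[:n_rows], y_demo[:n_rows]
    logreg = LogisticRegression().fit(X_sub, y_sub)
    knn = KNeighborsClassifier().fit(X_sub, y_sub)
    # coef_ depends only on the number of features;
    # n_samples_fit_ is the number of training rows KNN stored
    summary.append((n_rows, logreg.coef_.shape, knn.n_samples_fit_))

for row in summary:
    print(row)
# (50, (1, 5), 50)
# (200, (1, 5), 200)
```

The coefficient array stays at five values however many rows are used, while the amount of data KNN holds on to grows with the training set.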

**Discriminative:** "Discriminative models, also referred to as conditional models, are a class of logistical models used for classification or regression. They distinguish decision boundaries through observed data" - [Wikipedia](https://en.wikipedia.org/wiki/Discriminative_model#:~:text=Discriminative%20models%2C%20also%20referred%20to,%2Fdead%20or%20healthy%2Fsick.)

**Lazy Learner:** KNN is a lazy learner because all of the data crunching happens in the prediction phase instead of the fit phase. 

![alt text](img/K-NN_Neighborhood_Size_print.png)