# Exercise: code your own KNN classifier

Now it's your turn! In this exercise, you'll complete the KNN classifier class below. Skeleton code is provided and we'll discuss some strategies for constructing the largest method of the class, `predict`.

In [None]:
import numpy as np

class Knn:

    def __init__(self):
        """
        Initialize the Knn class
        self.x_train: training data
        self.y_train: training labels
        """
        # Placeholders for the training data; `fit` fills these in
        self.x_train = []
        self.y_train = []

    def fit(self, x, y):
        """
        Save the training data to properties of this class
        Parameters
        ----------
        x: training data
        y: training labels

        Returns
        -------
        None
        """


    def predict(self, x, k):
        """
        Predict the class labels for the provided data
        Parameters
        ----------
        x: data to classify
        k: number of neighbors to use

        Returns
        -------
        np.array(y_hat): array of predicted class labels
        """

        y_hat = []  # Variable to store the estimated class labels
    
        # For each sample in x for which we wish to make a prediction:
        #   - Calculate the Euclidean distance to every training sample
        #   - Determine the k nearest samples
        #   - Determine which class was most prevalent among the k nearest
        #     samples and assign that label
        #   - Append the assigned label to y_hat

        # Return the estimated targets

Start by completing the `fit` function. Here, you're simply storing the training data for later comparison during prediction.

Next, we'll walk through `predict`. There are three main steps as you loop through observations for which you're creating predictions:
1. For each sample, calculate the Euclidean distance to every training sample
2. Determine the k nearest samples
3. Determine which class of the k nearest observations was most prevalent and assign that label as the prediction

By breaking the larger method into these three smaller steps, we can create and test functions that do each of these. To help you work through it, we'll provide you with inputs and outputs for each function and then you'll piece them together in the final method as part of the skeleton code above.
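
As a rough illustration of how the three steps chain together for a single observation, here is a NumPy-only sketch. The toy arrays and the names `X_toy`, `y_toy`, and `x_new` are made up for this example; it is one possible way to do it, not the implementation the skeleton requires.

```python
import numpy as np

# Toy data (hypothetical): four training observations with two features each
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array(["a", "a", "b", "b"])
x_new = np.array([0.5, 0.0])
k = 3

# Step 1: Euclidean distance from x_new to every training observation
distances = np.sqrt(np.sum((X_toy - x_new) ** 2, axis=1))

# Step 2: labels of the k nearest training observations
nearest_labels = y_toy[np.argsort(distances)[:k]]

# Step 3: most prevalent label among those neighbors
# (np.argmax returns the first maximum, so a tie goes to the class that sorts first)
classes, counts = np.unique(nearest_labels, return_counts=True)
prediction = classes[np.argmax(counts)]
print(prediction)
```

Two of the three nearest neighbors carry the label `a`, so that is the prediction for `x_new`.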

(1) For each sample, calculate the Euclidean distance to every training sample

In [7]:
def get_distance(x,X_train):
    """
    Compute the distance between one observation and a set of observations
    Parameters
    ----------
    x: observation with M features [size M]
    X_train: collection of N observations to compare against [size N x M]

    Returns
    -------
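    Array of squared Euclidean distances between x and each observation in
    X_train [size N]. The square root is omitted because it does not change
    which neighbors are nearest.
    """
    diff = X_train - x  # Broadcasting subtracts x from every row of X_train
    return np.sum(diff**2, axis=1)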

A test case to help you with (1):

In [14]:
import numpy as np

# Inputs
x = np.array([0, 1])
X_train = np.array([[0,0],[1,1],[2,2]])

out = get_distance(x,X_train)
out

# Outputs
correct_output = np.array([1, 1, 5])
if np.array_equal(get_distance(x,X_train), correct_output): print("PASSED")


PASSED
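
The expected output in this test, `[1, 1, 5]`, corresponds to squared Euclidean distances (the true Euclidean distance for the last observation would be the square root of 5). The sketch below, which reuses the test inputs, suggests why this is harmless for KNN: taking the square root changes the values but not the order of the neighbors.

```python
import numpy as np

x = np.array([0, 1])
X_train = np.array([[0, 0], [1, 1], [2, 2]])

squared = np.sum((X_train - x) ** 2, axis=1)  # squared Euclidean distances
euclidean = np.sqrt(squared)                  # true Euclidean distances

print(squared)
print(euclidean)

# The ranking of the neighbors is the same either way
same_order = np.array_equal(np.argsort(squared, kind="stable"),
                            np.argsort(euclidean, kind="stable"))
print(same_order)
```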


(2) Determine the k nearest samples based on the distances you calculated in `get_distance`

In [19]:
import pandas as pd

def get_nearest(dist,k,labels):
    """
    Gets the labels of the k nearest neighbors by distance
    Parameters
    ----------
    dist: distance from the observation to each training observation (from `get_distance`) [size N]
    k: number of neighbors to identify [scalar]
    labels: training data labels corresponding to each observation that was
        compared when computing `dist` using `get_distance` [size N]

    Returns
    -------
    The target variable class of the k nearest neighbors [size k]
    """
    # Pair each distance with its label so they stay aligned when sorted
    df_distance = pd.DataFrame({
        'distance':dist,
        'y':labels
    })
    df_sorted = df_distance.sort_values('distance')
    return df_sorted['y'].iloc[0:k].values

A test case to help you with (2):

In [35]:
import numpy as np
import pandas as pd

# Inputs
dist = np.array([0,6,2,78,3,7,8])
k = 3
labels = np.array(['elephant', 'giraffe', 'tiger', 'lion', 'eagle', 'mouse', 'skunk'])

# Outputs
output = get_nearest(dist,k,labels)
print(output)
correct_output = np.array(['elephant', 'tiger', 'eagle'])
if np.array_equal(output, correct_output): print("PASSED")

['elephant' 'tiger' 'eagle']
PASSED


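A few things to keep in mind:
- Check the shape of the labels you take from the data; they may carry an extra dimension.
- `dist` and `labels` must be in the same order, because each distance is matched to its label by position. Don't reorder one without the other.
- What happens if multiple distances are the same?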
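
On the question of tied distances: when two training observations are equally far away and only one of them fits within the k nearest, which one is kept depends on the sorting algorithm. The default `quicksort` used by `sort_values` and `np.argsort` is not guaranteed to preserve the original order of ties. A minimal sketch, using made-up inputs, of one way to make the result predictable with a stable sort:

```python
import numpy as np

dist = np.array([2, 0, 2, 1])
labels = np.array(["giraffe", "elephant", "tiger", "lion"])
k = 3

# A stable sort keeps tied distances in their original order,
# so "giraffe" (index 0) is ranked ahead of "tiger" (index 2)
order = np.argsort(dist, kind="stable")
nearest = labels[order[:k]]
print(nearest)
```

Whether "first in the training data wins" is the right rule is a design choice; the point is only that the behavior becomes repeatable.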

(3) Determine which class was most prevalent among the k nearest observations returned by `get_nearest`; that label is assigned as the prediction

In [36]:
def get_most_frequent_class(labels):
    """
    Gets the most frequent class among a set of labels
    Parameters
    ----------
    labels: class labels of the k nearest neighbors (from `get_nearest`) [size k]

    Returns
    -------
    The most frequent class label [scalar]; ties are broken at random
    """
    label_series = pd.Series(labels)
    df = label_series.value_counts()
    max_value = df.max()
    options = df[df==max_value].index.values
    return np.random.choice(options) # If there's one option, return it; else, pick one at random

A test case to help you with (3):

In [37]:
import numpy as np
import pandas as pd

# Inputs
labels = np.array(['elephant', 'elephant', 'tiger', 'tiger', 'eagle', 'tiger', 'skunk'])

# Outputs
output = get_most_frequent_class(labels)
correct_output = 'tiger'
if output == correct_output: print("PASSED")

PASSED


What happens if you have a tie?
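
Because `get_most_frequent_class` breaks ties with `np.random.choice`, repeated calls on tied labels can return different answers. The sketch below uses made-up labels to show how you might list the tied classes and make the random tie-break repeatable with a seeded generator; the seed value is arbitrary.

```python
from collections import Counter

import numpy as np

labels = ["elephant", "tiger", "elephant", "tiger", "eagle"]

# Count the votes and collect every class that shares the top count
counts = Counter(labels)
top = max(counts.values())
tied = sorted(label for label, n in counts.items() if n == top)
print(tied)

# A seeded generator picks the same winner every time the cell is rerun
rng = np.random.default_rng(0)
choice = rng.choice(tied)
print(choice)
```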

To test your classifier, you'll apply it to the iris data that we split previously. Let's start by loading our training and test data:

In [None]:
data_train = None  # TODO: load the training data
data_test = None  # TODO: load the test data

In [None]:
# Solution

import pandas as pd
data_train = pd.read_csv("data/train.csv")
data_test = pd.read_csv("data/test.csv")

With the data loaded, use the `fit` method of your KNN classifier to train the model. Once your model is trained, use it to predict labels for your test data.

Then compare your predictions to the target variable class in your test dataset and compute the accuracy of your model's predictions on the test data, using the accuracy function we developed previously and included below.

In [None]:
# Metric of overall classification accuracy
#  both y and y_hat should be numpy arrays of the same length
def accuracy(y,y_hat):
    # Fraction of predictions that match the true labels
    return (y == y_hat).sum() / len(y)
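
A quick way to sanity-check the accuracy metric before using it on real predictions is to try it on a few hand-made labels. The arrays below are invented for illustration, and the function is restated so the cell runs on its own.

```python
import numpy as np

def accuracy(y, y_hat):
    # Fraction of predictions that match the true labels
    return np.sum(y == y_hat) / len(y)

# Hypothetical true and predicted labels: three of the four agree
y_true = np.array(["setosa", "virginica", "setosa", "versicolor"])
y_pred = np.array(["setosa", "virginica", "versicolor", "versicolor"])

print(accuracy(y_true, y_pred))
```

Note that both inputs need to be NumPy arrays: plain Python lists compare as a whole with `==` rather than element by element.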