# Exercise: code your own KNN classifier

Now it's your turn! In this exercise, you'll complete the KNN classifier class below. Skeleton code is provided and we'll discuss some strategies for constructing the largest method of the class, `predict`.

In [None]:
import numpy as np
import pandas as pd

class Knn_classification:

    def __init__(self):
        """
        Initialize the Knn class
        self.x_train: training data
        self.y_train: training labels
        """
        # Initialize empty placeholders for the training data (filled in by `fit`)
        self.x_train = []
        self.y_train = []

    def fit(self, x, y):
        """
        Save the training data to properties of this class
        Parameters
        ----------
        x: training data
        y: training labels

        Returns
        -------
        None
        """


    def predict(self, x, k):
        """
        Predict the class labels for the provided data
        Parameters
        ----------
        x: data to classify
        k: number of neighbors to use

        Returns
        -------
        np.array(y_hat): array of predicted class labels
        """

        y_hat = []  # Variable to store the estimated class labels
    
        # Calculate the distance from each vector in x to the training data
            # - Loop through each of the samples for which we wish to make predictions
            #   - For each sample, calculate the Euclidean distance to every training sample
            #   - Determine the k nearest samples
            #   - Determine which class of the k nearest observations was most prevalent and assign that label
            #   - Append the assigned label to y_hat

        # Return the estimated targets

## 1. `fit()`
Start by completing the `fit` function. Here, you're simply storing the training data for later comparison during prediction.

## 2. `predict()`
Next, we'll walk through `predict`. There are three main steps as you loop through observations for which you're creating predictions:
1. For each sample, calculate the Euclidean distance to every training sample (`get_distance()`)
2. Determine the k nearest samples (`get_nearest()`)
3. Determine which class of the k nearest observations was most prevalent and assign that label as the prediction (`get_most_frequent_class()`)

By breaking the larger method into these three smaller steps, we can create and test functions that do each of these. To help you work through it, we'll provide you with inputs and outputs for each function and then you'll piece them together in the final method as part of the skeleton code above.

### 2.1. `get_distance()`

**Goal:** For each observation, calculate the Euclidean distance to every training sample.
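
For reference, here is the standard definition of Euclidean distance. The notation is introduced here for illustration only: $x$ is the observation with $M$ features and $x^{(i)}$ is the $i$-th row of `X_train`.

$$
d\left(x, x^{(i)}\right) = \sqrt{\sum_{j=1}^{M} \left(x_j - x^{(i)}_j\right)^2}
$$

For example, the distance between $(0, 0)$ and $(3, 4)$ is $\sqrt{3^2 + 4^2} = 5$.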

In [None]:
def get_distance(x,X_train):
    """
    Compute the distance between one observation and a set of observations
    Parameters
    ----------
    x: observation with M features [size M]
    X_train: collection of N observations to compare against [size N x M]

    Returns
    -------
    Array of Euclidean distances between x and each observation in X_train
    """
    

Here's a test case to help you with 2.1. We'll share one test case for each component of `predict`. These tests are not exhaustive, so you may want to test your functions further.

In [None]:
# Inputs
x = np.array([0, 1])
X_train = np.array([[0,0],[1,1],[2,2]])

out = get_distance(x,X_train)
out

# Outputs
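# The Euclidean distance from [0, 1] to [2, 2] is sqrt(2**2 + 1**2) = sqrt(5)
correct_output = np.array([1, 1, np.sqrt(5)])
# Compare with a tolerance, since the distances are floating-point values
if np.allclose(get_distance(x,X_train), correct_output): print("PASSED")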
else: print("FAILED")

For every function you write, **always** check the dimensions of each input array. Inputs are sometimes not shaped the way you expect, so take a moment to confirm their dimensions. This can be tricky, as we'll discuss further in the next function.
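
As a minimal sketch of what such a check might look like, the snippet below inspects the shapes of the arrays from the test case above. The `assert` guards are one possible convention, not a required part of the exercise.

```python
import numpy as np

# Same arrays as the test case above: one observation and three training rows
x = np.array([0, 1])
X_train = np.array([[0, 0], [1, 1], [2, 2]])

print(x.shape)        # (2,)   -> one observation with M = 2 features
print(X_train.shape)  # (3, 2) -> N = 3 observations, M = 2 features

# A lightweight guard: the feature counts must match before computing distances
assert x.ndim == 1 and X_train.ndim == 2
assert x.shape[0] == X_train.shape[1]
```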

### 2.2. `get_nearest()`
**Goal:** Determine the k nearest samples based on the distances you calculated in `get_distance`.

In [None]:
def get_nearest(dist,k,labels):
    """
    Gets the labels of the k nearest neighbors by distance
    Parameters
    ----------
    dist: Euclidean distance from the observation to each training observation (from `get_distance`) [size N]
    k: number of neighbors to identify [scalar]
    labels: corresponding training data labels for each observation that was 
        compared when computing `dist` using `get_distance` [size N]

    Returns
    -------
    The target variable class of the k nearest neighbors [size k]
    """
    

Here's a test case to help you with 2.2:

In [None]:
# Inputs
dist = np.array([0,6,2,78,3,7,8])
k = 3
labels = np.array(['elephant', 'giraffe', 'tiger', 'lion', 'eagle', 'mouse', 'skunk'])

# Outputs
output = get_nearest(dist,k,labels)
correct_output = np.array(['elephant', 'tiger', 'eagle'])
if np.array_equal(output, correct_output): print("PASSED")
else: print("FAILED")

There are a few things to keep in mind with this function.

First, make sure you **always** check the dimensions of each array that you use as inputs to these functions; mismatched dimensions are a common source of errors. A numpy array that represents a vector may be a two-dimensional array of shape `(N,1)` or a one-dimensional array of shape `(N,)`. The two are not always compatible and can produce unexpected results if you don't make sure the dimensions match beforehand. It's easy to go from shape `(N,1)` to shape `(N,)` by using the `np.squeeze()` function.
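
To illustrate the pitfall, here is a small sketch with made-up arrays (the names `col` and `flat` are purely illustrative). Comparing a `(3,1)` array with a `(3,)` array broadcasts to a `(3,3)` result instead of comparing element by element.

```python
import numpy as np

col = np.array([[1], [2], [3]])   # shape (3, 1): a column vector
flat = np.array([1, 2, 3])        # shape (3,): a one-dimensional array

# Broadcasting compares every element of `col` with every element of `flat`
mismatch = (col == flat)
print(mismatch.shape)             # (3, 3), not an element-wise (3,) result

# np.squeeze removes axes of length one, so the shapes agree again
fixed = (np.squeeze(col) == flat)
print(fixed.shape)                # (3,)
```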

Second, the distance and label arrays must be in the same order, so that element `i` of each refers to the same training observation. Otherwise, the labels won't correspond to the distances.

Lastly, consider what happens if the distances are the same. What if you're calculating the $k=3$ nearest neighbors and the shortest 5 distances to training samples are $[0.1, 0.2, 0.3, 0.3, 0.4]$? In that case, there's a tie for the third nearest neighbor. This may be uncommon, but it can happen. There are a few ways to resolve this. To keep things simple, if there is a tie, randomly select among the labels that correspond to the equidistant training observations.
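
One possible way to implement random tie-breaking is sketched below. The helper name `pick_k_with_random_ties` and its `rng` argument are hypothetical, and this is only one of several reasonable approaches. It returns indices (not labels), and those indices are not sorted by distance.

```python
import numpy as np

def pick_k_with_random_ties(dist, k, rng=None):
    """Return indices of k nearest points, breaking ties at the cutoff at random."""
    rng = np.random.default_rng() if rng is None else rng
    dist = np.asarray(dist, dtype=float)
    cutoff = np.sort(dist)[k - 1]            # the k-th smallest distance
    closer = np.flatnonzero(dist < cutoff)   # strictly closer than the cutoff: always kept
    tied = np.flatnonzero(dist == cutoff)    # candidates sharing the cutoff distance
    n_needed = k - closer.size               # how many tied candidates still fit
    chosen = rng.choice(tied, size=n_needed, replace=False)
    return np.concatenate([closer, chosen])

# The tie from the text: indices 2 and 3 are both 0.3 away
dist = np.array([0.1, 0.2, 0.3, 0.3, 0.4])
idx = pick_k_with_random_ties(dist, k=3, rng=np.random.default_rng(0))
print(idx)  # indices 0 and 1, plus either 2 or 3
```

Passing a seeded generator, as in the usage line, makes the random choice repeatable while you debug.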

### 2.3. `get_most_frequent_class()`

**Goal:** Determine which class was most prevalent among the k nearest observations returned by `get_nearest`; that class is the label assigned as the prediction.

In [None]:
def get_most_frequent_class(labels):
    """
    Gets the most frequent class label of the k nearest neighbors
    Parameters
    ----------
    labels: categorical training data labels for each observation that was 
        identified as one of the k nearest training data observations [size k]

    Returns
    -------
    The most frequent target variable class among the k nearest neighbors [scalar]
    """
    

Here's a test case to help you with 2.3:

In [None]:
# Inputs
labels = np.array(['elephant', 'elephant', 'tiger', 'tiger', 'eagle', 'tiger', 'skunk'])

# Outputs
output = get_most_frequent_class(labels)
correct_output = 'tiger'
if output == correct_output: print("PASSED")
else: print("FAILED")

What happens if you have a tie? For example, imagine you're applying KNN with $k=5$ and the labels of the nearest 5 training observations to the observation that you're making a prediction for are: `['elephant', 'elephant', 'tiger', 'tiger', 'eagle']`. In this case there is a tie between elephant and tiger. 
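
As a small sketch of how such a tie can be detected, the snippet below counts the labels with `np.unique`. Resolving the tie by a random choice is an assumption made here for illustration; other policies (for example, preferring the class of the single nearest neighbor) are equally valid.

```python
import numpy as np

labels = np.array(['elephant', 'elephant', 'tiger', 'tiger', 'eagle'])

# np.unique returns the sorted distinct labels and how often each occurs
classes, counts = np.unique(labels, return_counts=True)

# Every class whose count equals the maximum is a candidate
tied = classes[counts == counts.max()]
print(tied)  # ['elephant' 'tiger']

# One possible policy (an assumption): pick among the tied classes at random
rng = np.random.default_rng(0)
choice = rng.choice(tied)
```

Note that `np.argmax(counts)` returns the first maximum, so relying on it alone would silently favor whichever tied class comes first alphabetically.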

With all three pieces complete, you should be able to connect them and place them inside a loop to complete the `predict` function of the `Knn_classification` class.

## 3. Test your code

With your KNN class complete, it's time to apply it to some real data! You'll apply this to the iris data that we split previously. 

3.1. Start by loading the training and test data we prepared previously as `train.csv` and `test.csv`:

In [None]:
import pandas as pd
data_train = 
data_test = 

3.2. For the training and test data, create separate variables from your data for the features and target data. For the training data this would be `x_train` and `y_train` and for the test data, `x_test` and `y_test`. We recommend you save these as numpy arrays (which you can extract from a pandas dataframe using the `.values` property).
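
The sketch below shows how `.values` behaves on a small stand-in dataframe. The dataframe contents and the column names (`feature_a`, `feature_b`, `label`) are invented for illustration and will differ from the columns in your files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataframe; names and values are invented
df = pd.DataFrame({
    "feature_a": [5.1, 4.9, 6.3],
    "feature_b": [3.5, 3.0, 3.3],
    "label": ["x", "y", "x"],
})

x = df[["feature_a", "feature_b"]].values  # shape (3, 2): N rows, M features
y = df["label"].values                     # shape (3,): single brackets give 1-D
y_2d = df[["label"]].values                # shape (3, 1): double brackets give 2-D

print(x.shape, y.shape, y_2d.shape)
```

Selecting the label column with single brackets avoids the `(N,1)` shape altogether.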

In [None]:
x_train = 
y_train = 

x_test = 
y_test = 


3.3. With the data loaded, use the `fit` method of your KNN classifier to train the model.

In [None]:
# Initialize the KNN model
myknn = Knn_classification()

# Train the model
myknn.fit(COMPLETE_THE_INPUTS_HERE)

3.4. Once your model is trained, use it to predict labels for your test data with $k=5$ nearest neighbors and call those predictions `y_prediction`. Look at your predictions to make sure they seem reasonable: check that they have the right dimensions, contain only valid class labels, etc.

In [None]:
# Make predictions on the test data

y_prediction = myknn.predict(COMPLETE_THE_INPUTS_HERE)
y_prediction


3.5. Finally, compare your predictions to the target variable classes in the test dataset (`y_test`) and determine the accuracy of your model's predictions on the test data using the accuracy function we previously developed (included below). Make sure to check the dimensions of your data in case one array is `(N,1)` and the other is `(N,)`; you can always use `np.squeeze()` to correct this discrepancy.

In [None]:
# Metric of overall classification accuracy
#  both y and y_hat should be numpy arrays
def accuracy(y,y_hat):
    nvalues = len(y)
    accuracy = sum(y == y_hat) / nvalues
    return accuracy

In [None]:
# Compare your predictions to the labels from the test data to evaluate accuracy on the test data
accuracy(COMPLETE_THE_INPUTS_HERE)

3.6. What was the overall accuracy on your test data? 

Congratulations! You've coded up your own prediction algorithm and used it to classify data! 

This is a big step, and while it's worth taking a moment to celebrate, there's a lot of nuance here that we didn't explore in this exercise. First, not every classification problem will work as well as this one. Accuracy will vary by problem, and many problems are far more challenging. Second, we used overall accuracy to evaluate our model's predictive performance, but that is a summary value. It's often important to know how well the model predicted *each class*, since it may work perfectly on some classes and poorly on others, and a single overall accuracy value can hide those differences.
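
To make the last point concrete, here is a minimal sketch of a per-class breakdown. The helper `per_class_accuracy` is hypothetical, and the labels are invented for illustration rather than taken from the iris data.

```python
import numpy as np

def per_class_accuracy(y, y_hat):
    """Fraction of correct predictions within each true class (hypothetical helper)."""
    y = np.squeeze(np.asarray(y))
    y_hat = np.squeeze(np.asarray(y_hat))
    return {str(c): float(np.mean(y_hat[y == c] == c)) for c in np.unique(y)}

# Invented labels, for illustration only: class 'dog' is rare and often missed
y = np.array(['cat', 'cat', 'cat', 'cat', 'dog', 'dog'])
y_hat = np.array(['cat', 'cat', 'cat', 'cat', 'cat', 'dog'])

overall = float(np.mean(y == y_hat))     # 5 of 6 predictions are correct
by_class = per_class_accuracy(y, y_hat)  # 'cat': 4 of 4 correct, 'dog': 1 of 2
print(overall, by_class)
```

Here the overall accuracy looks respectable even though half of the `dog` observations are misclassified.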

With all of this, you've taken a step deeper into the world of programming, data science, and machine learning!