# First steps in coding your own KNN Classifier

Now it's time to start coding your own KNN classifier! Let's begin by reviewing the pseudocode for our KNN classifier:

**Model training** is the process of fitting the model to the data so that its predictions for unlabeled data are as accurate as possible. In this case, it has one simple step:
1. Load and save the training data (features and the outputs we wish to predict). 

**Model prediction** is the process of making predictions for one or more samples of data for which we input only the features (not the corresponding outputs), since those outputs are what we are trying to predict.
1. Input the features of a sample for which we wish to make a prediction of the outputs
2. Find the distance between the sample features and each of the training data sample features
3. Identify the $k$ nearest samples in the training data to the input sample (for example, $k = 5$)
4. Determine which class is most prevalent among the $k$ nearest samples and assign that class to the sample
5. Repeat steps 1-4 for each sample for which predictions are being made

We'll review each step in the pseudocode above and provide some skeleton code to get you thinking through how to implement this.

First, let's plan to implement this as a Python class with three methods:
1. `__init__`. Python calls this method automatically whenever an object of the class is created, so it is the natural place to set up the object. In our case, this method will simply initialize the variables that the `fit` method will later fill with our training data.
2. `fit`. This method will perform our training step and represents the process of fitting our model to the data.
3. `predict`. This method will take one or more samples and make predictions of their corresponding class labels.

The simplest version of skeleton code for the class we're working to create is written as follows:

In [None]:
# Skeleton code to write your own kNN classifier

class Knn:
    # k-Nearest Neighbor class object for classification training and testing
    def __init__(self):
        pass  # Placeholder so the empty method is valid Python

    def fit(self, x, y):
        # Save the training data to properties of this class
        pass  # Placeholder so the empty method is valid Python

    def predict(self, x, k):
        y_hat = []  # Variable to store the estimated class labels
        # Calculate the distance from each vector in x to the training data
        
        # Return the estimated targets
        return y_hat


We'll walk through these one at a time. First, let's start with `__init__`. We can use this method to initialize any variables that will be shared across the class. For example, when we run `fit`, we need to store our training data (this is a quirk of the KNN algorithm that's not the case for most other machine learning techniques). The `__init__` method is a good place to create the variables that will hold that data, so let's add that in along with some documentation to get us started:

In [None]:
def __init__(self):
    """
    Initialize the Knn class
    self.x_train: training data features
    self.y_train: training output labels
    """
    # Create empty placeholders; fit() will store the training data in them
    self.x_train = []
    self.y_train = []
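To see what `__init__` does in practice, here is a minimal sketch that places the method inside the class and creates an object from it. The variable name `model` is only an illustrative choice.

```python
class Knn:
    def __init__(self):
        # Empty placeholders that fit() is expected to overwrite later
        self.x_train = []
        self.y_train = []


model = Knn()  # __init__ runs automatically when the object is created
print(model.x_train, model.y_train)  # prints: [] []
```

Both attributes start out empty, which makes it easy to tell whether `fit` has been called yet.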

## Model Training

Next, let's move on to model training and review the pseudocode for this section:

Model training:
1. Load and save the training data (features and the outputs we wish to predict). 

This method takes both the features and output labels as inputs and stores them in the receptacle variables we initialized in `__init__`. In the code below, you'll find blanks for you to fill in for this method.

In [None]:
def fit(self, x, y):
    """
    Save the training data to properties of this class
    Parameters
    ----------
    x: training data features
    y: training data output labels

    Returns
    -------
    None
    """
    self.x_train = _______
    self.y_train = _______

## Model Prediction

The last component is model prediction.

Model prediction has a few steps:
1. Input the features of a sample for which we wish to make a prediction of the outputs
2. Find the distance between the sample features and each of the training data sample features
3. Identify the $k$ nearest samples in the training data to the input sample (for example, $k = 5$)
4. Determine which class is most prevalent among the $k$ nearest samples and assign that class to the sample
5. Repeat steps 1-4 for each sample for which predictions are being made
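Steps 3 and 4 can be sketched on their own, assuming the distances from step 2 have already been computed. The distances and labels below are made-up toy values, and using `np.argsort` with `collections.Counter` is one possible approach, not the only one.

```python
from collections import Counter

import numpy as np

# Toy values (assumed for illustration): the distance from one new sample to
# each of six training samples, and the class label of each training sample
distances = np.array([2.5, 0.4, 1.1, 3.0, 0.9, 1.7])
labels = np.array(["a", "b", "a", "b", "b", "a"])
k = 3  # number of neighbors to consult

# Step 3: indices of the k smallest distances
nearest = np.argsort(distances)[:k]

# Step 4: the most common label among those k neighbors
votes = Counter(labels[nearest].tolist())
prediction = votes.most_common(1)[0][0]

print(nearest, prediction)
```

Here the three nearest training samples sit at indices 1, 4, and 2, with labels `b`, `b`, and `a`, so the predicted class is `b`. When two classes tie, this sketch simply returns whichever tied label was counted first, so you may want a more deliberate tie-breaking rule in your own version.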

### Determining "nearest"

Here we have some nuance in steps 2 and 3, since we need to determine the nearest samples. How do we define nearest? There are many ways, but for this case we'll define it in terms of Euclidean distance, $d(\cdot, \cdot)$. The Euclidean distance between two 2-dimensional vectors, $\mathbf{x}_1 = [x_{1,1}, x_{1,2}]$ and $\mathbf{x}_2 = [x_{2,1}, x_{2,2}]$, is:
$$d(\mathbf{x}_1,\mathbf{x}_2) = \sqrt{(x_{1,1}-x_{2,1})^2 + (x_{1,2}-x_{2,2})^2}$$
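As a quick check of the formula, here is a small worked example using two made-up points chosen so the arithmetic is easy to verify by hand: $(3 - 0)^2 + (4 - 0)^2 = 25$, and $\sqrt{25} = 5$.

```python
import math

# Two illustrative 2-dimensional feature vectors (assumed values)
x1 = [0.0, 0.0]
x2 = [3.0, 4.0]

# Direct translation of the 2-dimensional Euclidean distance formula
d = math.sqrt((x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2)
print(d)  # prints: 5.0
```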

What if we had more than 2 features for each observation? For example, if we were classifying flowers, we might use sepal width and petal width, but we could also measure the sepal length and petal length. If we did that, we would have four total features. In this case, we would want to compare two four-dimensional feature vectors, $\mathbf{x}_1 = [x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}]$ and $\mathbf{x}_2 = [x_{2,1}, x_{2,2}, x_{2,3}, x_{2,4}]$, and the Euclidean distance between them is:
$$d(\mathbf{x}_1,\mathbf{x}_2) = \sqrt{(x_{1,1}-x_{2,1})^2 + (x_{1,2}-x_{2,2})^2 + (x_{1,3}-x_{2,3})^2 + (x_{1,4}-x_{2,4})^2}$$
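The same pattern extends to any number of features: subtract element by element, square, sum, and take the square root. Below is a minimal sketch using NumPy. The helper name `euclidean_distance` and the feature values are assumptions made for illustration.

```python
import numpy as np


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared difference per feature, summed, then square-rooted
    return np.sqrt(np.sum((a - b) ** 2))


# Two illustrative four-feature samples (made-up measurements)
x1 = [5.1, 3.5, 1.4, 0.2]
x2 = [6.2, 2.9, 4.3, 1.3]
print(euclidean_distance(x1, x2))
```

Because the function never refers to the number of features, it works unchanged for 2, 4, or more dimensions, provided both vectors have the same length.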

For step 2 above, we need to measure this distance between the sample we are trying to make the prediction for and every sample in the training dataset. Below is skeleton code for the `predict` method. We will explore this more deeply in an exercise. For now, make sure you follow the logic of what is supposed to happen for all three methods.

In [None]:
import numpy as np  # Needed for the np.array call in the return statement


def predict(self, x, k):
    """
    Predict the class labels for the provided data
    Parameters
    ----------
    x: data to classify
    k: number of neighbors to use

    Returns
    -------
    np.array(y_hat): array of predicted class labels
    """
    y_hat = []  # Variable to store the estimated class labels

    # Calculate the distance from each vector in x to the training data

    # Loop through each of the samples for which we wish to make predictions

    # For each sample, calculate the Euclidean distance to every training sample

    # Determine the k nearest samples

    # Determine which of the k nearest samples was most prevalent and assign that label

    # Append the assigned label to y_hat

    # Return the estimated targets
    return np.array(y_hat)
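The distance step in the skeleton above can be tried out in isolation before writing the full method. The sketch below computes the distance from one sample to every training sample at once using NumPy broadcasting. The toy arrays are assumed values, and vectorizing in this way is one option; an explicit loop over the training samples would work just as well.

```python
import numpy as np

# Toy training features (assumed values): 4 samples with 2 features each
x_train = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [6.0, 8.0],
                    [1.0, 1.0]])
sample = np.array([0.0, 0.0])  # One sample we want to classify

# Broadcasting subtracts `sample` from every row of x_train; summing along
# axis=1 then gives one squared distance per training sample
distances = np.sqrt(np.sum((x_train - sample) ** 2, axis=1))
print(distances)
```

The result has one entry per training sample, which is exactly the input needed to pick out the `k` nearest neighbors.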

At this point, you should be able to successfully implement the `__init__` method and the `fit` method. Before moving on, try it out for yourself. 

Before we develop `predict`, we need to gather some data for training and testing our KNN algorithm and discuss how to evaluate classifier performance. Those will be our next two topics.