# The K-nearest neighbors algorithm

In this notebook you will implement the k-nearest neighbors (KNN) algorithm with k-fold cross validation.

We will use the Iris dataset.

Let's remember what we need to write the KNN algorithm:

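* Labeled, multidimensional data (a training set and a validation set)
* A distance metric (L1 or L2)
* A number of neighbors k (an odd k avoids tied votes between two classes; with the three Iris classes, ties are still possible)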

In [None]:
# The Iris dataset has 150 examples and four features with continuous values.

# let's load the iris dataset
# your code here
X = iris.data
y = iris.target

# We will only use the first two features in the Iris dataset so that we can generate 2D scatter plots
X = X[:,:2]

# finally, we will convert X and y into numpy arrays
import numpy as np
X = np.array(X)
y = np.array(y)

print('The shape of X is: ', X.shape)
print('The shape of y is: ', y.shape)

In [None]:
# before running k-fold cross validation, we randomly hold out ~10% of the data as the test set
Ntotal = X.shape[0]
Ntest = X.shape[0]//10
Ntrain = Ntotal - Ntest

# generate a list of Ntest random indices, without repetitions, in the range of Ntotal
# these indices will be the indices of your test data. Use the function np.random.choice()
test_idx = # your code here

# separate training and testing data
X_test = X[test_idx]
y_test = y[test_idx]
X_train = np.delete(X, test_idx, axis=0)
y_train = np.delete(y, test_idx)

print('The shape of X_test is: ', X_test.shape)
print('The shape of y_test is: ', y_test.shape)
print('The shape of X_train is: ', X_train.shape)
print('The shape of y_train is: ', y_train.shape)

# We will forget about the test data for now. We will only use it at the very end
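
As a hedged illustration of the sampling step above, the sketch below draws indices without repetition by calling `np.random.choice` with `replace=False`. The sizes and the names `n_total`, `n_test`, `demo_idx`, and `demo_train_idx` are made up for this example; they are not the notebook's variables.

```python
import numpy as np

n_total = 20            # hypothetical dataset size (not the Iris size)
n_test = n_total // 10  # ~10% held out, here 2 points

# replace=False guarantees that no index is drawn twice
demo_idx = np.random.choice(n_total, size=n_test, replace=False)

# every index that was not drawn stays in the training set
demo_train_idx = np.delete(np.arange(n_total), demo_idx)

print(demo_idx, demo_train_idx.shape)
```

Because the draw is random, the printed indices change from run to run unless a seed is set first, for example with `np.random.seed(0)`.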

In [None]:
# Now we will split the training data into five folds. We will store each of the folds 
# as an element of a python list. We will have a list for X and another list for y

# we want to randomly split the training data into 5 chunks of equal size
# let's first generate random integers without repetition in the range of Ntrain
train_idx = # your code here

# now let's iterate over the training data to obtain the 
# 5 folds
Nfolds = 5
fold_size = Ntrain//Nfolds
X_folds = [] # this will be a list of 5 numpy arrays with datapoints
y_folds = [] # this will be a list of 5 numpy arrays with the corresponding labels
for i in range(Nfolds):
    
    X_folds # your code here
    y_folds # your code here
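
One possible way to build the folds, shown on toy data, is to shuffle the indices once and then take consecutive slices. This is only a sketch: `demo_X`, `demo_y`, `shuffled`, and the sizes are hypothetical, and `np.random.permutation` is just one of several ways to get a random ordering without repetition.

```python
import numpy as np

n_train, n_folds = 10, 5  # hypothetical sizes for the illustration
demo_X = np.arange(n_train * 2).reshape(n_train, 2)  # toy 2D datapoints
demo_y = np.arange(n_train)                          # toy labels

# a random ordering of 0..n_train-1 with no repeated values
shuffled = np.random.permutation(n_train)

size = n_train // n_folds
demo_X_folds, demo_y_folds = [], []
for i in range(n_folds):
    idx = shuffled[i * size:(i + 1) * size]  # indices that belong to fold i
    demo_X_folds.append(demo_X[idx])
    demo_y_folds.append(demo_y[idx])

print([fold.shape for fold in demo_X_folds])
```

Note that integer division drops any leftover points when the number of training examples is not a multiple of the number of folds.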


In [None]:
# we will use k-fold cross validation to find the best k value for the KNN algorithm
# we will test the following K values

Ks = [1, 3, 5, 11, 21, 51, 101] # you can try other k values if you want

In [None]:
# we will also need the distance function

def L1_norm(X,a):
    '''
    X is a numpy matrix where each row is a 2D training datapoint.
    a is a numpy vector representing a 2D validation datapoint
    
    Write a function that returns the L1 distance between a and each
    row of X, using np.abs and np.sum(B, axis=1). The axis=1 argument
    makes the sum run across the columns of each row of B independently,
    giving one value per row.
    '''

    norm = # your code here

    return norm
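
To make the difference between the two distances concrete, here is a minimal sketch on toy data. The function names `l1_distance` and `l2_distance` and the arrays `demo_X` and `demo_a` are assumptions for this example. Both functions rely on NumPy broadcasting, so `X - a` subtracts `a` from every row of `X`.

```python
import numpy as np

def l1_distance(X, a):
    # sum of absolute coordinate differences, one value per row of X
    return np.sum(np.abs(X - a), axis=1)

def l2_distance(X, a):
    # Euclidean distance: square root of the summed squared differences
    return np.sqrt(np.sum((X - a) ** 2, axis=1))

# toy values, chosen only for illustration
demo_X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
demo_a = np.array([1.0, 1.0])

print(l1_distance(demo_X, demo_a))  # [2. 1. 5.]
print(l2_distance(demo_X, demo_a))
```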

OK, we have everything we need. Now we can write the KNN algorithm with k-fold cross validation.

The algorithm is:
* For each value of k (the number of neighbors) that we want to try
    * For each fold, used in turn as the validation set while the remaining folds form the training set
        * Store the features and classes of the training data
        * For each validation point:
            * Compute the distance between the validation datapoint and all the training datapoints.
            * Find the top k nearest training neighbors
            * The category of the validation datapoint is presumed to be the most common category among the k nearest training neighbors.
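
The three innermost steps can be sketched for a single validation point as follows. This is an illustrative version under stated assumptions, not the notebook's implementation: the helper `knn_predict` and the toy arrays are hypothetical, it uses the L1 distance, and it takes the majority vote with `collections.Counter` instead of `scipy.stats.mode`.

```python
from collections import Counter

import numpy as np

def knn_predict(X_tr, y_tr, x, k):
    # L1 distance from x to every training point
    distances = np.sum(np.abs(X_tr - x), axis=1)
    # indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # most common label among those k neighbors
    votes = Counter(y_tr[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data: two points near the origin (class 0), three near (5, 5) (class 1)
demo_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
demo_y = np.array([0, 0, 1, 1, 1])

print(knn_predict(demo_X, demo_y, np.array([0.2, 0.1]), k=3))  # 0
print(knn_predict(demo_X, demo_y, np.array([5.0, 5.0]), k=3))  # 1
```

Using `np.argsort` avoids building a list of `(distance, index)` tuples, but either approach returns the same neighbors when there are no ties in distance.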





In [None]:
# we will import scipy.stats to use its mode function
from scipy import stats

# we will have three nested for loops here:

all_knn_fold_acc = []
for k in Ks: # for all the k values that we want to test
    all_fold_acc = []
    for ifold in range(Nfolds): # for all the possible validation folds 
        
        # first, convert the folds into training and validation sets
        X_vl = # your code here
        y_vl = # your code here
        X_tr = # your code here. hint: use list comprehension in this line
        y_tr = # your code here. hint: use list comprehension in this line
        # because you used list comprehension in the two lines above,
        # convert the training set into a numpy array:
        X_tr = np.vstack(X_tr) # do not change this line
        y_tr = np.hstack(y_tr) # do not change this line
        
        KNN = []
        for x_vl in X_vl: # for all datapoints in the validation set
            all_distances = # your code here. hint: use your L1 function
            sorted_distances = sorted([(d,i) for i, d in enumerate(all_distances)])
            K_close_examps = sorted_distances[:k]

            K_close_categories = []
            for examp in K_close_examps:
                
                K_close_categories.append(y_tr[examp[1]])
                
            KNN.append(K_close_categories)

        correct_count = 0
        for iknn, knn in enumerate(KNN):
            
            if stats.mode(knn)[0] == y_vl[iknn]:
                correct_count += 1
        all_fold_acc.append(correct_count/len(y_vl))
    all_knn_fold_acc.append(all_fold_acc)
    
# only write code where you see 'your code here'. Everything else will work on its own
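
The hint about list comprehensions can be illustrated on toy folds. In this sketch the fold contents, `val_fold`, and all `demo_` names are made up; the idea is simply to keep every fold whose index differs from the validation fold and then stack the pieces.

```python
import numpy as np

n_folds = 3  # hypothetical number of folds
# toy folds: fold i holds two 2D points filled with the value i, labeled i
demo_X_folds = [np.full((2, 2), i, dtype=float) for i in range(n_folds)]
demo_y_folds = [np.full(2, i) for i in range(n_folds)]

val_fold = 1  # the fold currently used for validation
X_val = demo_X_folds[val_fold]
y_val = demo_y_folds[val_fold]

# keep every fold except the validation one
X_train_parts = [demo_X_folds[j] for j in range(n_folds) if j != val_fold]
y_train_parts = [demo_y_folds[j] for j in range(n_folds) if j != val_fold]

# stack the list of arrays back into single arrays
X_train_demo = np.vstack(X_train_parts)
y_train_demo = np.hstack(y_train_parts)

print(X_train_demo.shape, y_train_demo)  # (4, 2) [0 0 2 2]
```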

In [None]:
# this cell generates a plot that will let you see what the best k
# for your model is

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(Ks,all_knn_fold_acc,'ok')
plt.errorbar(Ks,np.mean(all_knn_fold_acc,axis=1),np.std(all_knn_fold_acc,axis=1))
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()

In [None]:
# now find which value of k gave you the best results
# use the entire training set to create your model
# and test the performance with the test set that we
# separated at the top

# your code here

# What is the test accuracy that you obtained?
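
One way to pick the best k is to average the fold accuracies for each k and take the largest mean. The sketch below uses placeholder accuracies, which are not real results, and hypothetical names (`demo_Ks`, `demo_acc`) only to show the mechanics.

```python
import numpy as np

demo_Ks = [1, 3, 5]
# placeholder accuracies, one row per k and one column per fold
demo_acc = [[0.6, 0.7], [0.8, 0.9], [0.7, 0.8]]

mean_acc = np.mean(demo_acc, axis=1)  # mean accuracy for each k
best_k = demo_Ks[int(np.argmax(mean_acc))]

print(best_k)  # 3
```

`np.argmax` returns the first maximum, so if several k values tie, this picks the smallest of them when the list is sorted in increasing order.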

Once you are done, try the following:
* Use the L2 norm instead of L1 
* Use another set of 2 features from the Iris dataset
* Use more than two features. Start with 3 features and generate 3D plots.

Do you understand what is happening in all the lines of code that were provided to you and that you did not have to write?