# k-Nearest Neighbour Classification

As discussed in class, the k-Nearest Neighbours method works by exploiting an existing database of _labelled_ observations. To predict the (unknown) value of a new observation, we embed it in the space of the existing observations, measure the distance between the new instance and each existing observation, and then use the labels of the $k$ nearest observations to determine the prediction. In the context of classification, the prediction is the majority label among the $k$ nearest neighbours.
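The procedure described above can be sketched directly in numpy. This is only a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the use of Euclidean distance, the tie-breaking rule, and the tiny hand-made data set are all assumptions made for the example.

```python
import numpy as np

def knn_predict(X_known, y_known, X_new, k):
    """Label each row of X_new by majority vote among its k nearest known points."""
    # pairwise Euclidean distances, shape (n_new, n_known)
    diffs = X_new[:, None, :] - X_known[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # indices of the k closest known points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # majority vote over the neighbours' labels (ties go to the smallest label)
    preds = []
    for neighbour_labels in y_known[nearest]:
        labels, counts = np.unique(neighbour_labels, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

## a tiny hand-made example (hypothetical data, for illustration only)
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]])
y_demo = np.array([0, 0, 0, 1, 1])
X_query = np.array([[0.05, 0.05], [0.95, 0.95]])

demo_pred = knn_predict(X_demo, y_demo, X_query, k=3)
print(demo_pred)
```

The first query point sits among the three class-0 points, so all three of its neighbours vote for class 0. The second sits next to the two class-1 points, which outvote the single class-0 neighbour needed to make up $k=3$.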

To work with k-Nearest Neighbours in Python, we can make use of the existing [numpy](https://numpy.org/) and [scikit-learn](https://scikit-learn.org/) libraries. Let's start by importing them (along with [matplotlib](https://matplotlib.org/) for plotting the results):

In [None]:
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

## Our "Training" Data

We'll also need some data to act as the historical observations from our problem. For this example, we'll just make some up (but usually, you'd be given this data). In this case, we (the people doing the work) actually know the underlying function $f(X)$, but from the modelling perspective this function is not known:

In [None]:
## This is the true underlying "generating" function of our problem, in this case
## we are using it to separate our data into two classes "a" and "b" depending
## upon where in a 2-D space an instance sits.
##
## AS WITH REGRESSION, LET'S PRETEND THAT WE DON'T KNOW THIS ONE :)
def f(X):
    t = 0.444*(X[:, 0] + 0.5)**2 + 0.5*np.sin(np.pi * X[:, 0])
    return np.where(X[:, 1] > t, 1, 0)
class_labels = np.array([ 'a', 'b' ])

Now, we will use this function to generate some training data (it is still called training data, even though no real training takes place in k-Nearest Neighbours):

In [None]:
rng = np.random.default_rng(1234) ## notice the fixed seed for reproducibility

n_points = 50
X_train = rng.uniform(-1, 1, size=(n_points, 2))
y_train = f(X_train)
X_train += rng.normal(0, 0.05, size=X_train.shape) ## let's just add a little noise to our data to make it interesting

## we'll also generate some "test" data on a regular grid, and use it shortly to examine the shape of our learned function
n_test = 50
xx1, xx2 = np.meshgrid(np.linspace(-1.1, 1.1, n_test), np.linspace(-1.1, 1.1, n_test))
X = np.c_[xx1.ravel(), xx2.ravel()]
y = f(X)
Z = y.reshape(xx1.shape)

Let's take a look at the data (and underlying generating function) before moving on to modelling:

In [None]:
a = np.linspace(-1, 1, 500)
t = 0.444*(a + 0.5)**2 + 0.5*np.sin(np.pi * a)
plt.scatter(X_train[:, 0], X_train[:, 1], color=np.where(y_train==1, 'orange', 'cornflowerblue'))
plt.plot(a, t, color='black', label='True Underlying Class Separator - f(X)')
plt.title('Our Sampled Training Data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend() ## display the label given to the separator line above
plt.show()

Note that the black line ($f(X)$) serves as the point of separation between the two classes (although we've added a little noise to our instances, so a couple of them manage to sneak over to the opposite side of this boundary). Remember, the function that this line represents is not known to k-Nearest Neighbours - its job is to estimate this function from the available training data.

## Applying k-Nearest Neighbours

Now that we have a training set of data, we can move on to modelling. We do this by instantiating a KNeighborsClassifier model, calling the fit function, and then making predictions from the resulting model with the predict function. When building a k-Nearest Neighbours model, we need to specify the size of the neighbourhood of similar observations ($k$) - here we will examine this process for a range of $k$ values:

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(16,10))
for (i, k) in enumerate([ 1, 3, 5, 10, 20, n_points ]):
    mdl = KNeighborsClassifier(n_neighbors=k)
    mdl.fit(X_train, y_train)

    y_pred = mdl.predict(X)
    Z = y_pred.reshape(xx1.shape) ## match the test grid rather than hard-coding its size

    score = accuracy_score(y, y_pred) ## accuracy on the test grid (higher is better)
    r = i // 3
    c = i % 3
    ax[r, c].contourf(xx1, xx2, Z, cmap=ListedColormap(['cornflowerblue', 'orange']), alpha=0.2)
    ax[r, c].scatter(X_train[:, 0], X_train[:, 1], color=np.where(y_train == 1, 'orange', 'cornflowerblue'), label='Sampled Training Data')
    ax[r, c].plot(a, t, color='black', label='True Underlying Class Separator - f(X)')
    ax[r, c].set_title("kNN Performance, k={}, Accuracy={}".format(k, np.round(score, 2)))
    ax[r, c].set_xlabel('x1')
    ax[r, c].set_ylabel('x2')
    ax[r, c].set_xlim(-1.1, 1.1)
    ax[r, c].set_ylim(-1.1, 1.1)
    ax[r, c].legend()
plt.show()

In each plot, the shaded regions represent the model that was extracted from the training data by the algorithm for a given $k$ value (blue for one class, orange for the other), while the black line is the true underlying separator. Notice that when $k=1$, the resulting model is quite sensitive to noise in the training data (it captures a lot of this noise in the model and so the class boundaries that it produces deviate from the true underlying function). As $k$ is increased, the resulting models become less sensitive to the noise in the training data. However, there is a balancing act here: as $k$ becomes very large, k-Nearest Neighbours starts to lose some of the detail in the underlying function. At its extreme ($k$ equals the size of the training data), every prediction is a vote over the entire training set, so the algorithm simply predicts the majority class of the training data for all cases.
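The extreme case is easy to check in isolation. The sketch below uses a hypothetical toy data set (30 random points with made-up labels, 18 in class 0 and 12 in class 1) rather than the training data above, and sets $k$ to the full training-set size. With the default uniform vote weighting, every query point should receive the majority label regardless of where it sits.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

## hypothetical toy data: 30 points, 18 labelled class 0 and 12 labelled class 1
rng_demo = np.random.default_rng(0)
X_toy = rng_demo.uniform(-1, 1, size=(30, 2))
y_toy = np.array([0] * 18 + [1] * 12)

## with k equal to the training-set size, every query sees the same 30 neighbours
mdl_all = KNeighborsClassifier(n_neighbors=len(X_toy))
mdl_all.fit(X_toy, y_toy)

X_probe = rng_demo.uniform(-1, 1, size=(100, 2))
pred_all = mdl_all.predict(X_probe)

majority = np.bincount(y_toy).argmax() ## the most common training label
print(np.unique(pred_all), majority)
```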

In this case, we used a training set of 50 observations - you may wish to modify the code above so that a larger training set is used (this can be done by setting the n_points variable to a larger value, say 250 observations). Try this and note the impact that it has on the estimated class boundaries.

## Examining the effect of $k$
The main hyperparameter for k-Nearest Neighbours (indeed, the ONLY hyperparameter for the basic version of k-Nearest Neighbours) is the neighbourhood size $k$. In the previous step, we looked at an arbitrary set of possible values for $k$ - let's now be a little more rigorous and examine the performance of k-Nearest Neighbours over a more thorough sweep of values of $k$:

In [None]:
all_k = np.arange(1, n_points+1)
acc = []
for k in all_k:
    mdl = KNeighborsClassifier(n_neighbors=k)
    mdl.fit(X_train, y_train)
    acc.append(accuracy_score(y, mdl.predict(X)))
best_k = np.argmax(acc) ## index into all_k of the best accuracy (the first one, if there are ties)

Finally, we can plot the accuracy against the neighbourhood size to see how $k$ influences the behaviour of the algorithm and the resulting model:

In [None]:
plt.plot(all_k, acc)
plt.scatter(all_k[best_k], acc[best_k], color='#ce2227', label="Best Accuracy, k={}".format(all_k[best_k]))
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('kNN Performance for Neighbourhood Size')
plt.legend()
plt.show()

Note here that we are measuring accuracy, so a larger score is better (unlike in regression, where we were measuring error, so a lower score was better).

In this result, we can see that (for this sample of training data!) the best accuracy can be achieved with $k$ set to 1. However, we can also see that the trend for increasing $k$ is quite noisy up to around $k$ equals 20, so really any of these values would have worked well. After $k=20$, performance gets progressively worse with larger values of $k$ (due to increasing underfitting of the data). We will discuss this (and strategies for algorithmic tuning and model selection) in later lectures.

As mentioned above, you should repeat this step with a larger training set to see if there is any noteworthy change in behaviour.
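One possible way to run that comparison without editing the cells above is to wrap the data generation and the sweep in a single function. This is only a sketch: the helper name `sweep_k` and its parameters are made up for this example, and it repeats the generating function, noise level and test grid used earlier so that it can run on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def f(X):
    ## same generating function as earlier in the notebook
    t = 0.444*(X[:, 0] + 0.5)**2 + 0.5*np.sin(np.pi * X[:, 0])
    return np.where(X[:, 1] > t, 1, 0)

def sweep_k(n_points, k_values, seed=1234, n_grid=50):
    """Return the test-grid accuracy of kNN for each k, using a fresh training set of n_points."""
    rng = np.random.default_rng(seed)
    X_tr = rng.uniform(-1, 1, size=(n_points, 2))
    y_tr = f(X_tr)
    X_tr += rng.normal(0, 0.05, size=X_tr.shape) ## same noise level as before

    ## regular test grid, labelled by the true function
    g = np.linspace(-1.1, 1.1, n_grid)
    g1, g2 = np.meshgrid(g, g)
    X_te = np.c_[g1.ravel(), g2.ravel()]
    y_te = f(X_te)

    scores = []
    for k in k_values:
        mdl = KNeighborsClassifier(n_neighbors=k)
        mdl.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, mdl.predict(X_te)))
    return np.array(scores)

k_values = np.arange(1, 51) ## k may not exceed the smallest training-set size used below
for n in (50, 250):
    acc_n = sweep_k(n, k_values)
    print("n_points={}: best k={}, accuracy={:.3f}".format(n, k_values[np.argmax(acc_n)], acc_n.max()))
```

Keeping the seed fixed makes each run repeatable, but the best $k$ reported for each training-set size still reflects one particular sample of data, so it is worth trying a few different seeds before drawing conclusions.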