In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

# $k$ nearest neighbours for classification

The [$k$ nearest neighbours](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) method is also very useful for classification.

The algorithm is as follows:
1. Given a new datapoint $x_i$, find the $k$ closest datapoints in the dataset $X$.
2. Estimate the class of the point as the most common class among the $k$ closest datapoints.
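
The two steps above can be sketched on a tiny example. The dataset below (`X_toy`, `y_toy`) and the query point `x_new` are made up purely for illustration, and Euclidean distance is assumed as the notion of "closest":

```python
import numpy as np

# Toy training set (made up for illustration): four points, two classes.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

x_new = np.array([0.2, 0.1])   # the point we want to classify
k = 3

# Step 1: find the k closest training points (Euclidean distance assumed).
dists = np.linalg.norm(X_toy - x_new, axis=1)
nearest = np.argsort(dists)[:k]

# Step 2: majority vote among the labels of those k points.
labels, counts = np.unique(y_toy[nearest], return_counts=True)
prediction = labels[np.argmax(counts)]
print(nearest, prediction)
```

Two of the three nearest points belong to class 0, so the vote returns 0. Note that with an even $k$ the vote can tie; in this sketch `np.argmax` then returns the first maximum, which corresponds to the smallest label.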

As in the regression version, $k$ is typically rather small and is chosen using cross validation.


## Measuring classification error

When doing regression, one often uses the RMSE to measure how accurate a model is. The RMSE is based on the difference between the predictions and the true values, but for (multi-class) classification this difference is meaningless: either a prediction is correct or it is not.

For this reason, the *0-1 loss* is often used:
$$
L(\hat{y}, y) = 
\begin{cases}
    1 & \text{if } \hat{y} \neq y\\
    0 & \text{if } \hat{y} = y
\end{cases}
$$
where $y$ is the true label and $\hat{y}$ is the predicted label.
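
As a small illustration of this definition, the loss can be evaluated elementwise with NumPy. The label arrays below are made up; summing the losses gives the number of misclassified points, and averaging them gives the misclassification rate:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])   # made-up true labels
y_pred = np.array([0, 1, 0, 0, 0])   # made-up predictions

# Elementwise 0-1 loss: 1 where the prediction is wrong, 0 where it is right.
loss = (y_pred != y_true).astype(int)

n_errors = loss.sum()      # total number of misclassified points
error_rate = loss.mean()   # fraction of misclassified points
print(loss, n_errors, error_rate)
```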



## Exercises

1. Implement $k$ nearest neighbours classification and test it by fixing the value of $k$ manually.
2. Use cross validation to find the best $k$ for the dataset.

Test the implemented methods on the data in `ex4.dat` and `ex5.dat`.


## Data

Load exercise data.

In [None]:
data = np.loadtxt("../data/ex4.dat")
X = data[:, :2]               # coordinates
y = data[:, 2].astype(int)    # classes

In [None]:
fig, ax = plt.subplots()
ax.scatter(X[y == 0, 0], X[y == 0, 1], color="C0", label="Class 1")
ax.scatter(X[y == 1, 0], X[y == 1, 1], color="C1", label="Class 2")
plt.legend()
plt.show()

## $k$ nearest neighbours

Let's modify the function from before to find the most common class among the neighbours.

In [None]:
def knn(k, xf, X, y):
    """
    Compute the k nearest neighbours' estimate for all x's in `xf`.
    X: training set inputs.
    y: training set labels.
    """
    
    # Allocate an array for predictions:
    yf = np.zeros(len(xf), dtype=int)
    
    for i, x in enumerate(xf):
        # For each point we want a prediction for, calculate the distance
        # to all training points...
        dists = np.linalg.norm(np.subtract(X, x), axis=1)
        
        # ... locate the k nearest...
        idx = np.argsort(dists)[:k]
        
        # ... and find the most common class.
        unique, count = np.unique(y[idx], return_counts=True)
        yf[i] = unique[np.argmax(count)]
        
    return yf

With this function, it is easy to compute the $k$-NN estimate for a range of points:

In [None]:
k = 3

# Create input points and compute outputs:
xf = np.random.rand(10, 2)*2 - 1
yf = knn(k, xf, X, y)

In [None]:
fig, ax = plt.subplots()
# Plot the training data:
ax.scatter(X[y == 0, 0], X[y == 0, 1], color="C0", label="Class 1")
ax.scatter(X[y == 1, 0], X[y == 1, 1], color="C1", label="Class 2")

# Plot the predictions:
ax.scatter(xf[:,0], xf[:,1], marker="x", s=60, color=["C{}".format(c) for c in yf])

plt.legend()
plt.show()

We can make the plot even fancier by plotting the decision regions as a background.

In [None]:
k = 1

# Define resolution for the background:
nx, ny = (100, 100)

fig, ax = plt.subplots()
# Plot the training data:
ax.scatter(X[y == 0, 0], X[y == 0, 1], color="C0", label="Class 1")
ax.scatter(X[y == 1, 0], X[y == 1, 1], color="C1", label="Class 2")

# Get plot bounding box:
l = ax.axis()

# Create grid inside the plotting area:
X1, X2 = np.meshgrid(np.linspace(l[0], l[1], nx),
                     np.linspace(l[2], l[3], ny),
                    )

# Change shape to match what our knn function expects:
xlist = np.dstack([X1, X2]).reshape(nx*ny, 2)

# Make predictions and change shape back to a grid
# (meshgrid returns arrays of shape (ny, nx)):
C = knn(k, xlist, X, y).reshape(ny, nx)

# Create a background image using a colour scheme with lighter colours:
background = plt.cm.tab20c(C*4 + 3)


# Plot the predictions:
ax.imshow(background, extent=l, origin="lower", interpolation='none', zorder=-100)

# Adjust plot area:
ax.axis(l)

plt.legend()
plt.show()

## Cross validation

Again, we use $K$-fold CV to find the best value for $k$. First, we create random splits:

In [None]:
# Number of folds to use
K = 10

fold_size = len(X)//K

# Create indices and shuffle them
idx = np.arange(len(X))
np.random.shuffle(idx)

# Divide list of indices into chunks of size fold_size
folds = []
for i in range(K):
    folds.append(idx[i * fold_size:(i + 1) * fold_size])

folds = np.array(folds)
print("Indices for each fold:\n", folds)
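
Because `len(X)//K` rounds down, any indices left over when the number of samples is not divisible by $K$ end up in no fold. One possible alternative, sketched here with a made-up sample count, is `np.array_split`, which allows folds of unequal size so that every index is used:

```python
import numpy as np

n_samples = 23   # made-up sample count that is not divisible by K
K = 5

rng = np.random.default_rng(0)      # seeded only to make the sketch repeatable
idx = rng.permutation(n_samples)    # shuffled indices 0..n_samples-1

# array_split tolerates uneven division: the first folds get one extra index.
folds = np.array_split(idx, K)
print([len(f) for f in folds])
```

With this approach `folds` is a list of arrays rather than a 2-D array, so the training indices for fold `f` would have to be gathered with something like `np.concatenate([folds[j] for j in range(K) if j != f])`.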

Using these splits, we can test how many neighbours would be the optimal choice for our $k$-NN and the current dataset.

In [None]:
CV_err = []    # list to store the mean and std of the 0-1 loss for each k
k_range = np.arange(1,20, dtype=int)

for k in k_range:
    losses = []
    
    for f, fold in enumerate(folds):
        # Get indices for which folds to use for training:
        train_folds = np.delete(np.arange(K, dtype=int), f)
        # Construct training set by concatenating the indices:
        train_idx = np.concatenate(folds[train_folds])
        
        # Select data for training and testing:
        X_train = X[train_idx]
        y_train = y[train_idx]
        X_test = X[fold]
        y_test = y[fold]

        # Get predictions for each test point:
        yp = knn(k, X_test, X_train, y_train)

        # Sum the 0-1 loss over this fold (number of misclassified points) and store it
        losses.append(np.sum(yp != y_test))

    print("k = {}: mean 0-1 loss = {:g}".format(k, np.mean(losses)))

    # Save mean and std of the 0-1 loss for each k
    CV_err.append([np.mean(losses), np.std(losses)])
CV_err = np.array(CV_err)

From this, we can now locate the $k$ with the smallest 0-1 loss and use that for further predictions on this dataset.

In [None]:
idx_of_lowest_loss = np.argmin(CV_err[:,0])

print("Smallest loss of {:g} is achieved for k = {}.".format(CV_err[idx_of_lowest_loss, 0], 
                                                             k_range[idx_of_lowest_loss]))

Finally, let's visualise the mean and std of the loss for each $k$:

In [None]:
fig, ax = plt.subplots()
ax.errorbar(k_range, CV_err[:,0], yerr=CV_err[:,1])

ax.set_xticks(k_range)   # Make tick marks for all k
ax.set_xlabel("$k$")
ax.set_ylabel("0-1 loss")
plt.grid()
plt.show()

As with $k$-NN regression, lower values of $k$ are preferred. For the data in `ex4.dat`, many values of $k$ will give no error, i.e. a perfect fit, because the two classes in the dataset are well separated. For datasets where the classes overlap, or nearly overlap as in `ex5.dat`, the classification error will generally not reach zero.
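
The contrast between well-separated and overlapping classes can be illustrated without the exercise files. The sketch below uses synthetic Gaussian blobs as a stand-in for `ex4.dat` and `ex5.dat`; the helper names (`knn_predict`, `cv_error_rate`, `two_blobs`) and all parameter values are assumptions made for this example, and the loss is reported as a misclassification rate rather than a count:

```python
import numpy as np

def knn_predict(k, X_test, X_train, y_train):
    """Majority vote among the k nearest training points (Euclidean)."""
    y_pred = np.zeros(len(X_test), dtype=int)
    for i, x in enumerate(X_test):
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        y_pred[i] = labels[np.argmax(counts)]
    return y_pred

def cv_error_rate(k, X, y, K, rng):
    """Mean fraction of misclassified points over K shuffled folds."""
    folds = np.array_split(rng.permutation(len(X)), K)
    rates = []
    for f in range(K):
        train_idx = np.concatenate([folds[j] for j in range(K) if j != f])
        y_pred = knn_predict(k, X[folds[f]], X[train_idx], y[train_idx])
        rates.append(np.mean(y_pred != y[folds[f]]))
    return np.mean(rates)

def two_blobs(offset, n, rng):
    """Synthetic stand-in data: two unit-variance Gaussian blobs at +/- offset."""
    X = np.vstack([rng.normal(-offset, 1.0, size=(n, 2)),
                   rng.normal(+offset, 1.0, size=(n, 2))])
    y = np.repeat([0, 1], n)
    return X, y

rng = np.random.default_rng(0)
X_sep, y_sep = two_blobs(10.0, 100, rng)   # classes far apart
X_ovl, y_ovl = two_blobs(0.5, 100, rng)    # classes heavily overlapping

err_sep = cv_error_rate(3, X_sep, y_sep, K=10, rng=rng)
err_ovl = cv_error_rate(3, X_ovl, y_ovl, K=10, rng=rng)
print("separated: {:g}, overlapping: {:g}".format(err_sep, err_ovl))
```

The far-apart blobs should be classified without error, while the overlapping blobs should leave a clearly non-zero error rate regardless of $k$.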