In [None]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt

In this notebook, I apply k-nearest neighbours (KNN) classification to a toy dataset. KNN is a non-parametric method that works well when the classification boundary is irregular. To illustrate this, I first create such a dataset by splitting the 2D plane with an irregular classification border and then sampling some training data.

## Create the toy dataset

In [None]:
np.random.seed(0)

We use Gaussian mixtures to split the 2D space into regions in which one of the two classes in the toy dataset has the higher likelihood. Each class is defined by a mixture of `NUMBER_MEANS` (here 10) 2D Gaussian distributions with random means and identity covariance matrices.
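Written out, the labelling rule implemented in the next cell is the following, where $M$ stands for `NUMBER_MEANS` and $\boldsymbol{\mu}^{(0)}_{i}$, $\boldsymbol{\mu}^{(1)}_{i}$ for the rows of `means_0` and `means_1` (this notation is introduced here only for illustration). The components are summed with equal weights, so a common factor $1/M$ would not change the comparison.

$$
\mathcal{N}(\mathbf{p};\boldsymbol{\mu},I) = \frac{1}{2\pi}\exp\!\left(-\tfrac{1}{2}\lVert\mathbf{p}-\boldsymbol{\mu}\rVert^{2}\right),
\qquad
\text{label}(\mathbf{p}) =
\begin{cases}
1 & \text{if } \sum_{i=1}^{M}\mathcal{N}(\mathbf{p};\boldsymbol{\mu}^{(1)}_{i},I) > \sum_{i=1}^{M}\mathcal{N}(\mathbf{p};\boldsymbol{\mu}^{(0)}_{i},I),\\
0 & \text{otherwise.}
\end{cases}
$$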

In [None]:
NUMBER_MEANS = 10
NUMBER_SAMPLES = 200

means_0 = (np.random.random_sample(2 * NUMBER_MEANS) * 10).reshape(NUMBER_MEANS, 2)
means_1 = (np.random.random_sample(2 * NUMBER_MEANS) * 10).reshape(NUMBER_MEANS, 2)

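def pdf_2d_norm(y, x, mean):
    """Density at (y, x) of a 2D Gaussian with the given mean and identity covariance."""
    cov = np.identity(2)
    dif = np.array([y, x]) - mean
    # Squared Mahalanobis distance; with identity covariance it equals the squared Euclidean distance.
    product = np.dot(np.dot(dif.T, np.linalg.inv(cov)), dif)
    return (2 * math.pi) ** (-1) * np.linalg.det(cov) ** (-0.5) * math.exp(-0.5 * product)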

xs = []
ys = []
labels = []
for s in range(NUMBER_SAMPLES):
    ys.append((np.random.random_sample(1) * 10)[0])
    xs.append((np.random.random_sample(1) * 10)[0])
    value_0 = 0
    value_1 = 0
    for i in range(NUMBER_MEANS):
        value_0 = value_0 + pdf_2d_norm(ys[-1], xs[-1], means_0[i, :])
        value_1 = value_1 + pdf_2d_norm(ys[-1], xs[-1], means_1[i, :])
    labels.append(int(value_1 > value_0))

xs = np.array(xs)
ys = np.array(ys)
labels = np.array(labels)
    
index_0 = np.where(labels == 0)[0]
index_1 = np.where(labels == 1)[0]

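fig, ax = plt.subplots()
# One scatter call per class so that each class gets its own colour and legend entry.
ax.scatter(ys[index_0], xs[index_0], label="class 0")
ax.scatter(ys[index_1], xs[index_1], label="class 1")
ax.legend()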

## Classification

KNN does not build an explicit, generalised model of the data. In fact, it requires no training step at all: we only need to store the training data. To classify a new observation, we find its k nearest training observations and assign the class by a voting mechanism.

The usual solution is a uniformly weighted one, in which the votes of all the nearest observations count equally.
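The uniformly weighted vote can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's own implementation: the function name `knn_predict_uniform`, the Euclidean distance, the 0/1 label encoding and the tiny hand-made dataset are all assumptions made for the example.

```python
import numpy as np

def knn_predict_uniform(train_points, train_labels, query_points, k):
    """Classify each query point by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_queries, n_train).
    diffs = query_points[:, None, :] - train_points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points for every query.
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Labels are 0/1, so the mean of the neighbours' labels is the share of votes for class 1.
    votes_for_1 = train_labels[nearest].mean(axis=1)
    # With an even k a tie is possible; here it falls to class 0 (an arbitrary choice).
    return (votes_for_1 > 0.5).astype(int)

# Hypothetical data: class 0 near the origin, class 1 near (5, 5).
train_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                         [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.5, 0.5], [5.5, 5.5]])

predictions = knn_predict_uniform(train_points, train_labels, queries, k=3)
print(predictions)  # [0 1]
```

Note that "fitting" amounts to keeping `train_points` and `train_labels` around; all the work happens at prediction time.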

The best value of k depends strongly on the data. Increasing k reduces the effect of noise, but makes the classification borders less distinct.

In this example we observe the effect of increasing the value of k on the test set:
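A minimal sketch of such a comparison is given here. It does not reuse the notebook's arrays: the stand-in data generator, the helper names (`mixture_density`, `sample`, `knn_predict`), the size of the test set and the list of k values are assumptions chosen for illustration, so the printed accuracies are not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (an assumption): points uniform in [0, 10]^2, labelled by which
# of two random Gaussian mixtures (identity covariance) has the higher density.
means_a = rng.random((10, 2)) * 10
means_b = rng.random((10, 2)) * 10

def mixture_density(points, means):
    # Unnormalised sum of Gaussian bumps; the shared constant cancels in the comparison.
    sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dist).sum(axis=1)

def sample(n):
    points = rng.random((n, 2)) * 10
    labels = (mixture_density(points, means_b) > mixture_density(points, means_a)).astype(int)
    return points, labels

def knn_predict(train_points, train_labels, query_points, k):
    # Uniformly weighted majority vote over the k nearest training points.
    sq_dist = ((query_points[:, None, :] - train_points[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    return (train_labels[nearest].mean(axis=1) > 0.5).astype(int)

train_points, train_labels = sample(200)
test_points, test_labels = sample(200)

accuracies = {}
for k in (1, 3, 5, 11, 21, 51):  # odd values avoid tied votes
    predictions = knn_predict(train_points, train_labels, test_points, k)
    accuracies[k] = float((predictions == test_labels).mean())
    print(f"k={k:2d}  test accuracy={accuracies[k]:.3f}")
```

Only odd values of k are used so that a two-class vote can never tie.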



A weighted approach, in which each neighbour's vote is weighted by the inverse of its distance to the test observation, is recommended when the data is not uniformly sampled. Let's see some classification results as we increase the value of k in the weighted case.
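The inverse-distance vote can be sketched as follows. Again this is an illustrative assumption rather than the notebook's code: the function name `knn_predict_weighted`, the small constant `eps` that guards against division by zero when a query coincides with a training point, and the three-point dataset are all made up for the example.

```python
import numpy as np

def knn_predict_weighted(train_points, train_labels, query_points, k, eps=1e-12):
    """Classify each query point by an inverse-distance weighted vote of its k nearest neighbours."""
    diffs = query_points[:, None, :] - train_points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Distances and labels of the selected neighbours, shape (n_queries, k).
    nearest_dist = np.take_along_axis(distances, nearest, axis=1)
    nearest_labels = train_labels[nearest]
    # Closer neighbours get larger weights.
    weights = 1.0 / (nearest_dist + eps)
    weight_1 = (weights * (nearest_labels == 1)).sum(axis=1)
    weight_0 = (weights * (nearest_labels == 0)).sum(axis=1)
    return (weight_1 > weight_0).astype(int)

# Hypothetical case: one very close class-1 point, two farther class-0 points.
train_points = np.array([[0.1, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_labels = np.array([1, 0, 0])
query = np.array([[0.0, 0.0]])

weighted_prediction = knn_predict_weighted(train_points, train_labels, query, k=3)
print(weighted_prediction)  # [1]
```

In this example a uniform vote with k = 3 would pick class 0 (two votes against one), whereas the weighted vote picks class 1 because the single class-1 neighbour is ten times closer than the others.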

## Model selection

In this section we apply model selection to find the best value of k. To do so, we apply 10-fold cross-validation on the training data, considering the uniformly weighted case.
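One way this selection could look is sketched here, under stated assumptions: the helper names (`knn_predict`, `cross_validated_accuracy`), the candidate values of k and the two-cluster stand-in data are hypothetical, and the folds are formed by a simple random permutation without stratification.

```python
import numpy as np

def knn_predict(train_points, train_labels, query_points, k):
    # Uniformly weighted majority vote over the k nearest training points.
    sq_dist = ((query_points[:, None, :] - train_points[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    return (train_labels[nearest].mean(axis=1) > 0.5).astype(int)

def cross_validated_accuracy(points, labels, k, n_folds=10, seed=0):
    """Mean validation accuracy of KNN over n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(points)), n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        # Every fold except the i-th one serves as training data.
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        predictions = knn_predict(points[train_idx], labels[train_idx], points[val_idx], k)
        scores.append((predictions == labels[val_idx]).mean())
    return float(np.mean(scores))

# Stand-in training data (an assumption): two overlapping Gaussian clusters.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(3.0, 1.5, size=(100, 2)),
                    rng.normal(7.0, 1.5, size=(100, 2))])
labels = np.repeat([0, 1], 100)

candidate_ks = [1, 3, 5, 7, 9, 15, 25]
cv_scores = {k: cross_validated_accuracy(points, labels, k) for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

The same `seed` is used for every k, so all candidates are scored on identical folds and their differences are not due to a different split.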