**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install umap-learn
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder
from sklearn.neighbors import NearestNeighbors
from umap import UMAP

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numbers

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("iris.csv"), directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

In [None]:
#@title -- Auxiliary Functions -- { display-mode: "form" }
cmap = 'viridis'

def plot_data(data, colors, alpha=1.0, ax=None,
              edgecolors=None, cmap=cmap, s=50):
    if ax is None:
        ax = plt.gca()
    
    ax.scatter(data[:, 0], data[:, 1], s=s,
               c=colors, edgecolors=edgecolors,
               alpha=alpha, cmap=cmap)

    ax.grid(ls='--')
    ax.set_axisbelow(True)
    ax.set_xlabel("$d_1$")
    ax.set_ylabel("$d_2$")
    
def make_legend(class_names, colors=None, cmap=cmap,
                ax=None, alpha=1.0, num_colors=None):
    if ax is None:
        ax = plt.gca()
        
    if isinstance(alpha, numbers.Number):
        alpha_seq = (alpha for i in range(len(class_names)))
    else:
        alpha_seq = alpha
    
    if num_colors is None:
        num_colors = len(class_names)
    
    if colors is None:
        colors = range(len(class_names))
        
    cm = plt.get_cmap(cmap, num_colors)

    legend_handles = []
    for ic, cn, al in zip(colors, class_names, alpha_seq):
        legend_handles.append(
            mpatches.Patch(color=cm(ic), label=cn, alpha=al)
        )

    ax.legend(handles=legend_handles)

## Classification using KNN: An Illustration

This notebook is meant to illustrate in a graphical way how the $k$ nearest neighbours (KNN) method works. It is not meant to be a reference example of how to use KNN in practice (there is a different notebook for that).

### The Dataset

For this illustration we will use the well-known Iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers: setosa, versicolor and virginica. The task is to build a classifier that can tell these three classes apart.

We will first read the data from the CSV file using `pandas`:



In [None]:
df = pd.read_csv("data/iris.csv")
df.head()

As we can see, the original data is 4-dimensional. Since this illustrative example relies heavily on graphical visualizations, we will reduce the dimensionality of the data to 2 (using a method called UMAP, but let's not worry about that now) before we do anything else with it:



In [None]:
data_raw = df.iloc[:, :-1]
umap = UMAP(spread=20.0)
data = umap.fit_transform(data_raw)

We will also extract the last column, which contains the name of the species. The names are given as strings. To make the later stages of processing easier, we will assign a numeric ID to each species and replace the strings with the numbers:



In [None]:
str_labels = df[['species']].values
ordenc = OrdinalEncoder(dtype='int')
num_labels = ordenc.fit_transform(str_labels).flatten()
class_names = ordenc.categories_[0]
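
To make the effect of the encoding concrete, here is a minimal plain-Python sketch of the same idea. The list `species` is a made-up handful of labels for illustration, not data read from `iris.csv`. Like `OrdinalEncoder` with its default settings, the sketch numbers the categories in sorted order.

```python
# Hypothetical sample of string labels (not taken from iris.csv).
species = ["setosa", "versicolor", "virginica", "setosa", "virginica"]

# The sorted unique names play the role of ordenc.categories_[0].
categories = sorted(set(species))

# Map each name to its position in the sorted list.
name_to_id = {name: i for i, name in enumerate(categories)}
ids = [name_to_id[name] for name in species]

print(categories)  # ['setosa', 'versicolor', 'virginica']
print(ids)         # [0, 1, 2, 0, 2]
```

The numeric IDs carry no meaning beyond identifying the class; here they mainly serve as indices into the colour map.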

Having done that, we are now ready to plot our data in a 2D scatter plot:



In [None]:
plot_data(data, num_labels)
make_legend(class_names)
plt.savefig('output/knn_algo_data.pdf', bbox_inches='tight', pad_inches=0)

### $k$ Nearest Neighbours

The idea behind the $k$ nearest neighbours method is very simple. Whenever we get a new point, we will look back at our dataset and find $k$ datapoints that are closest to our new point (its $k$ nearest neighbours in the input space). The class of the new point will then be determined by voting among these nearest neighbours.

To illustrate this more concretely, let us now pick a new point and visualize its position using a black cross:



In [None]:
point = [-8, -6]
plot_data(data, num_labels)
plt.scatter(point[0], point[1], marker='x', s=75, linewidth=3, c='k')
plt.savefig('output/knn_algo_point.pdf', bbox_inches='tight', pad_inches=0)

Note that before getting that new point, we did not fit anything to the data (beyond preparing it for plotting). We did not use it to fit the parameters of a model, we just stored it for later use. This is why KNN is often referred to as a "*lazy*" method. It is also called "*non-parametric*", because it keeps the data itself rather than summarizing it in a fixed set of parameters.
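
The "store now, compute later" behaviour can be sketched in a few lines of plain Python. The class `LazyKNN` and the toy data below are hypothetical and only meant as an illustration, assuming Euclidean distance and simple majority voting; this is not how scikit-learn implements the method.

```python
import math
from collections import Counter

class LazyKNN:
    """Toy k-nearest-neighbours classifier (hypothetical, for illustration)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # "Training" only stores the data; nothing is estimated here.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # All the work happens at query time: order the stored points by
        # Euclidean distance to the query and keep the k closest ones.
        order = sorted(range(len(self.points)),
                       key=lambda i: math.dist(self.points[i], query))
        nearest = [self.labels[i] for i in order[:self.k]]
        # Majority vote among the labels of the k nearest points.
        return Counter(nearest).most_common(1)[0][0]

# Hypothetical 2D toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

model = LazyKNN(k=3).fit(points, labels)
print(model.predict_one((0.5, 0.5)))  # 0
print(model.predict_one((5.5, 5.5)))  # 1
```

Because `fit` does no computation, every prediction has to scan the whole stored dataset, which is the price paid for skipping the training phase.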

In any case, let us now find the 3 nearest neighbours of our new point and highlight them:



In [None]:
knn = NearestNeighbors(n_neighbors=3).fit(data)
dist, ind = knn.kneighbors([point])
neigh_colors = num_labels[ind[0]]

plot_data(data, num_labels, alpha=0.4)
plt.scatter(point[0], point[1], marker='x', s=75, linewidth=3, c='k')
plt.scatter(data[ind[0], 0], data[ind[0], 1], s=90,
            c=neigh_colors, cmap=cmap,
            edgecolors='k', linewidths=1.5,
            # same colour scale as plot_data, where labels span 0..n_classes-1
            vmin=0, vmax=len(class_names) - 1)
plt.savefig('output/knn_algo_neighbours.pdf', bbox_inches='tight', pad_inches=0)
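
With its default settings, `NearestNeighbors` measures closeness by Euclidean distance. For a small dataset, the same neighbours can be found by brute force, as in the following sketch. The arrays `toy_data` and `query` are made-up values standing in for the embedded dataset and the new point; the library itself may use faster data structures internally.

```python
import numpy as np

# Hypothetical 2D toy data standing in for the embedded dataset.
toy_data = np.array([[5.0, 5.0],
                     [0.0, 0.0],
                     [1.0, 0.0],
                     [6.0, 5.0],
                     [0.0, 2.0]])
query = np.array([0.0, 0.5])
k = 3

# Euclidean distance from the query to every stored point.
dists = np.linalg.norm(toy_data - query, axis=1)

# Indices of the k smallest distances, closest first.
nearest = np.argsort(dists)[:k]

print(nearest)         # [1 2 4]
print(dists[nearest])  # the corresponding distances, in ascending order
```

Here `nearest` corresponds to `ind[0]` and `dists[nearest]` to `dist[0]` in the cell above.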

The class of our new point will be determined by voting: we will classify our point into whichever class is most frequent among the nearest neighbours.



In [None]:
point_color = np.bincount(neigh_colors).argmax()
plot_data(data, num_labels, alpha=0.4, edgecolors=None)
plt.scatter([point[0]], [point[1]], marker='x', s=75, linewidth=3,
            c=[point_color], cmap=cmap,
            vmin=0, vmax=len(class_names) - 1)
plt.savefig('output/knn_algo_class.pdf', bbox_inches='tight', pad_inches=0)
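
The voting step itself deserves a closer look, because ties are possible: with $k = 3$ and three classes, each neighbour could belong to a different class. The sketch below uses made-up neighbour labels to show how `np.bincount(...).argmax()` behaves in both cases. When several classes share the top count, `argmax` returns the first of them, that is, the class with the lowest ID.

```python
import numpy as np

# Hypothetical neighbour labels for two different query points.
clear_majority = np.array([2, 1, 2])   # k = 3: class 2 gets two votes
tie = np.array([0, 2, 1, 2, 0])        # k = 5: classes 0 and 2 get two votes each

results = []
for votes in (clear_majority, tie):
    counts = np.bincount(votes, minlength=3)  # votes for classes 0, 1, 2
    winner = int(counts.argmax())             # first class with the top count
    results.append((counts.tolist(), winner))

print(results)  # [([0, 1, 2], 2), ([2, 1, 2], 0)]
```

Breaking ties by the lowest class ID is arbitrary. Alternatives such as weighting votes by distance exist, but they are beyond the scope of this illustration.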