# The *k*-nearest neighbours classifier

In [None]:
# Include some standard imports.

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## The KNeighborsClassifier library functions

In the module materials you have seen how the *k*-nearest neighbours algorithm (*k*-NN) can be used as a simple technique for classifying a new object based on how closely it matches the properties of other objects which have already been classified. In this Notebook we will work through some examples of how to use the Python libraries to build and use a *k*-NN classifier.

The scikit-learn library (imported in Python as `sklearn`) provides a set of tools for carrying out *k*-nearest neighbours analyses. In this Notebook you will use this library to carry out some nearest neighbour classification tasks. The relevant classes are implemented in the `sklearn.neighbors` module.

To see how to use the library on a simple example, we will start by using the patient data (Part 20, Figure 20.3).

The data has been saved in the file [patients.csv](./data/patients.csv), which we can import as a DataFrame:

In [None]:
patients_df = pd.read_csv('data/patients.csv')
patients_df.head()

The columns `Exercise time (hours)` and `Sleep time (hours)` give the values in hours of the two features of each patient. This DataFrame also contains a column `Patient group` which contains the classification of each of the patients into groups A and B.

To get a feel for the data, we can treat the `Exercise time (hours)` and `Sleep time (hours)` columns as points in a 2-dimensional space, and plot them with a scatter plot:

In [None]:
groupA_df = patients_df[patients_df['Patient group']=='A']
groupB_df = patients_df[patients_df['Patient group']=='B']

ax = groupA_df.plot(x='Exercise time (hours)',
                    y='Sleep time (hours)',
                    kind='scatter', color='DarkBlue', label="Group A", marker="o",
                    title="Scatter plot of patient data")

groupB_df.plot(x='Exercise time (hours)', 
               y='Sleep time (hours)',
               kind='scatter', color='Red', label="Group B", marker='s', ax=ax)


Our aim is to classify new patients: if we are told how much time a patient has spent exercising and how much time sleeping, can we decide whether we think they are of type A or type B?

To carry out the classification, we will use the `KNeighborsClassifier` class from the `sklearn.neighbors` module. Import the class with:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Now, the first step is to create a classifier instance from the `KNeighborsClassifier` class. In the first instance, we will build a classifier with *k*=3, which is set using the parameter `n_neighbors` in the initialisation. We will also set the chosen metric to be Euclidean separation, as discussed in Section 3.1 of Part 20.

In [None]:
classifier_3NN = KNeighborsClassifier(n_neighbors=3, metric='euclidean')

Next we need to train the classifier on the training data. The `classifier_3NN` object has a method `fit(X, y)`, which takes an array of training data, `X`, and a vector of classification values, `y`, and uses them to train the classifier.

When we use this library with *pandas*, we will usually pass the training data, `X`, to `fit` as a DataFrame, and the classification values, `y`, as a Series. 

In this case, we want the training data to be the columns `'Exercise time (hours)'` and `'Sleep time (hours)'` of `patients_df`, and the target values to be the column `patients_df['Patient group']`.


In [None]:

trainingData_df = patients_df[['Exercise time (hours)', 'Sleep time (hours)']]
targetValues_ss = patients_df['Patient group']

classifier_3NN.fit(trainingData_df, targetValues_ss)

Our 3-NN classifier is now ready to be used. To use the classifier to classify a new instance, we use the method `predict(X)` where `X` is an array of test data which the classifier will attempt to classify.

In this case, we will try to classify a new patient who has registered an exercise time of 2.5 hours, and a sleep time of 6.5 hours. This test case should be presented in the same format as the training data, so let's define a DataFrame with a single row and columns with the same headings as we used in the training data:

In [None]:
testData_df = pd.DataFrame({'Exercise time (hours)':[2.5],
                            'Sleep time (hours)':[6.5]})
testData_df

We then pass this to `predict`, which returns the class of the submitted data point.

In [None]:
classifier_3NN.predict(testData_df)

In this case, the classifier has predicted that the new patient is of type A.
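
To make the prediction step less of a black box, the following is a minimal sketch of what a 3-NN classifier with Euclidean distance does conceptually: measure the distance from the new point to every training point, keep the three closest, and take a majority vote. The training values here are a hypothetical toy set, not the contents of `patients.csv`, and the names `train_points`, `train_labels` and `classify_by_hand` are introduced only for this illustration. It is not how the library is implemented internally.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data, NOT the values in patients.csv:
# each row is (exercise hours, sleep hours).
train_points = np.array([[1.0, 6.0], [1.5, 7.0], [2.0, 6.5],
                         [4.0, 5.0], [4.5, 5.5], [5.0, 6.0]])
train_labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])


def classify_by_hand(new_point, k=3):
    """Classify new_point by a majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = np.sqrt(((train_points - new_point) ** 2).sum(axis=1))
    # Positions of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


hand_prediction = classify_by_hand(np.array([2.5, 6.5]))
print(hand_prediction)
```

With this toy data, the three nearest neighbours of (2.5, 6.5) all belong to group A, so the vote is unanimous. On real data the vote is often split, which is where the choice of *k* starts to matter.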

To classify several instances at once, we use more rows in the test data DataFrame:

In [None]:
testData_df = pd.DataFrame({'Exercise time (hours)':[2.5, 1.7, 2.8, 3],
                            'Sleep time (hours)':[6.5, 6.7, 7.0, 5.5]})
testData_df

When we pass this DataFrame to the classifier, a NumPy array is returned with the *n*<sup>th</sup> value in the array being the class of the data point represented by the *n*<sup>th</sup> row of the test data.

In [None]:
classifier_3NN.predict(testData_df)

In fact, because the output of the classifier is a sequence of values rather than just a single value, it can be easier to see the classifications in a single DataFrame.

In [None]:
output_df = testData_df.copy()
output_df['Patient group'] = classifier_3NN.predict(testData_df)

output_df

To see how well the classifier is working, we can plot the test data on the same axes as the training data:

In [None]:
trainGroupA_df = patients_df[patients_df['Patient group']=='A']
trainGroupB_df = patients_df[patients_df['Patient group']=='B']

ax = trainGroupA_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                         kind='scatter', color='DarkBlue', label="Group A (train)", marker="o",
                         title="Patient sleep data with test cases")

trainGroupB_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                    kind='scatter', color='Red', label="Group B (train)", marker='s', ax=ax)

testData_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                 kind='scatter', color='LightGreen', label="Test data",
                 marker='^', ax=ax)

# Extend the x-axis to better accommodate the labelling box:
plt.xlim((0, 8))

pass # Don't show any return values


In the previous scatter plot, the test data is shown as a collection of green triangles. To see how these points are classified, we can make another plot in which the test cases are still shown as triangles, but each is now given the colour of the class into which it has been classified.

In [None]:
trainGroupA_df = patients_df[patients_df['Patient group']=='A']
trainGroupB_df = patients_df[patients_df['Patient group']=='B']

ax = trainGroupA_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                         kind='scatter', color='DarkBlue', label="Group A (train)", marker="o",
                         title="Scatter plot of patient sleep data")

trainGroupB_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                    kind='scatter', color='Red', label="Group B (train)", marker='s', ax=ax)

testGroupA_df=output_df[output_df['Patient group']=='A']
testGroupB_df=output_df[output_df['Patient group']=='B']

testGroupA_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                   kind='scatter', color='DarkBlue', label="Group A (test)", 
                   marker='^', ax=ax)

testGroupB_df.plot(x='Exercise time (hours)', y='Sleep time (hours)',
                   kind='scatter', color='Red', label="Group B (test)",
                   marker='^', ax=ax)

# Extend the x-axis to better accommodate the labelling box:
plt.xlim((0, 8))

pass # Don't show any return values

### Activity 1
As we discussed in Section 3.4 of Part 20, the choice of *k* for a *k*-NN classifier can affect the results of the classification process.

Use the same training data that we used previously in the Notebook to train a *k*-NN classifier for *k*=2, *k*=4 and *k*=5. Then use these classifiers to classify the test data in the `testData_df` DataFrame.

Which of the data points are classified differently for different values of *k*?

Our solution is below.

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

A simple example which would generate the classifications for the new points at different values of *k* is given here:

In [None]:
output_df = testData_df.copy()

for k in range(2, 6):
    classifier_kNN = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    classifier_kNN.fit(trainingData_df, targetValues_ss)
    
    output_df['Patient group (k={})'.format(k)] = classifier_kNN.predict(testData_df)

output_df

The test data points at (2.5, 6.5) and (2.8, 7.0) are both sensitive to the size of *k*. This could suggest that they are borderline cases. In a real task, the data analyst might want to single out these points for special consideration.
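
One possible way to single out borderline points is to look at how the neighbours voted, rather than only at the winning class. The sketch below uses the classifier's `predict_proba` method, which for uniform weights gives the share of the *k* neighbours in each class, and its `kneighbors` method, which reports which training rows were used. The data is a hypothetical toy set, not the contents of `patients.csv`, and the `toy_` names and `vote_shares` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data, NOT the values in patients.csv.
toy_train_df = pd.DataFrame({'Exercise time (hours)': [1.0, 2.0, 3.0, 4.0],
                             'Sleep time (hours)':    [6.0, 6.0, 6.0, 6.0]})
toy_target_ss = pd.Series(['A', 'A', 'B', 'B'], name='Patient group')

# One clear-cut point and one that sits between the two groups.
toy_test_df = pd.DataFrame({'Exercise time (hours)': [1.2, 2.4],
                            'Sleep time (hours)':    [6.0, 6.0]})

vote_shares = {}
for k in (2, 3):
    toy_classifier = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    toy_classifier.fit(toy_train_df, toy_target_ss)
    # predict_proba returns one row per test point and one column per class,
    # in the order given by toy_classifier.classes_ (here 'A' then 'B').
    vote_shares[k] = toy_classifier.predict_proba(toy_test_df)
    print('k =', k)
    print(vote_shares[k])

# kneighbors reports the distances to, and row positions of, the neighbours
# used by the last classifier built in the loop (k=3).
distances, positions = toy_classifier.kneighbors(toy_test_df)
print(positions)
```

In this toy example the second test point gets an even split between A and B when *k*=2. How `predict` resolves such a tie is an implementation detail, so an evenly split vote is a reasonable signal that a point deserves a closer look.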

## Weighted voting

In Section 3.3 of Part 20, we discussed that a possible tweak to the general *k*-NN model might be to use a weighted voting strategy, whereby each node's contribution is scaled according to its proximity to the test node.

A weighted voting scheme has been implemented in the `KNeighborsClassifier` constructor. To use a weighted classifier, the call:

    KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='distance')
   
returns a classifier for *k* nearest neighbours, with Euclidean distance, and where the contribution of each point is weighted by the inverse of its distance from the new point.

Note: The default value of `weights` is `'uniform'`, where each of the *k* nearest points contributes equally to the class selection.
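
To see the difference between the two settings, here is a small sketch in which the uniform and the distance-weighted classifiers disagree. The data is a hypothetical toy set chosen to force that disagreement, not the contents of `patients.csv`, and the `toy_` names and `predictions` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data, NOT the values in patients.csv: one 'B' point very
# close to the test point, two 'A' points further away, and a distant 'B'.
toy_train_df = pd.DataFrame({'Exercise time (hours)': [2.1, 3.0, 2.0, 6.0],
                             'Sleep time (hours)':    [6.0, 6.0, 7.0, 9.0]})
toy_target_ss = pd.Series(['B', 'A', 'A', 'B'], name='Patient group')

toy_test_df = pd.DataFrame({'Exercise time (hours)': [2.0],
                            'Sleep time (hours)':    [6.0]})

predictions = {}
for weights in ('uniform', 'distance'):
    toy_classifier = KNeighborsClassifier(n_neighbors=3, metric='euclidean',
                                          weights=weights)
    toy_classifier.fit(toy_train_df, toy_target_ss)
    predictions[weights] = str(toy_classifier.predict(toy_test_df)[0])

# The three nearest neighbours are at distances of about 0.1 ('B'),
# 1.0 ('A') and 1.0 ('A').
# Uniform voting: A gets 2 votes, B gets 1.
# Inverse-distance voting: A gets 1/1.0 + 1/1.0 = 2, B gets about 1/0.1 = 10.
print(predictions)
```

The single very close B neighbour outweighs the two more distant A neighbours once the votes are scaled by inverse distance, so the two classifiers give different answers for the same test point.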

### Activity 2
Repeat Activity 1, but using a weighted classifier rather than the uniform classifier.

Which data points appear to be most susceptible to the size of *k* for the weighted classifier?


#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

For the weighted classifier, each data point is given the same classification for every value of *k* tried. A weighted classification strategy can be more robust than the uniform strategy, but may be less informative about the borderline cases.

A code snippet to generate the figures for the test set is:

In [None]:
output_df = testData_df.copy()

for k in range(2, 9):
    classifier_kNN = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='distance')
    classifier_kNN.fit(trainingData_df, targetValues_ss)
    
    output_df['Patient group (k={})'.format(k)] = classifier_kNN.predict(testData_df)

output_df

## What next?

You have now completed this Notebook. You should now be able to tackle iCMA 46 Question 1.

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `20.2 The leave-one-out algorithm`.