# $k$-nearest Neighbours <img src="img/logo.png" width="140" height="140" align="right">

$k$-nearest neighbours (kNN) is a simple classification algorithm. It classifies a new case by measuring its similarity to the points in the training set and looking at the labels of the $k$ nearest ones. In this notebook, we will show you how to do it with scikit-learn, and in the appendix we'll show you how to do it from scratch. Let's start by importing the required libraries for this exercise.

In [None]:
import math
import operator

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
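import sklearn
# Import the submodules used below explicitly, so that sklearn.metrics and
# sklearn.preprocessing do not depend on being loaded as a side effect.
import sklearn.metrics
import sklearn.preprocessing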
from sklearn import feature_selection
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline

### Data

You're already familiar with the dataset we use for this exercise, as we'll use the wine data again. This time, we have added the labels to the wines. The wines come from the same region in Tuscany, Italy, but are derived from three different cultivars. Did you discover 3 clusters in the last exercise? We're still working with the chemical features of the data, but we've added classes.

The classes are given under the variables "Class" and "Origin" and are respectively:

1. from Siena
2. from Lucca
3. from Pisa

With our classification algorithm we can predict where new, unseen wines come from. Let's start by reading in the new data.

In [None]:
data = pd.read_csv('data/wine_data.csv')
data.head()

In [None]:
data.shape

Ok great, so it looks like we've just added two columns to our dataset. Let's have a look at how our classes are divided. This is always important to do in classification problems. We would ideally like each of our classes to be quite equal in size, otherwise we would have to address a problem of unbalanced classes. So let's create a bar plot of our classes. We'll take the variable `Origin` but we could do the same with `Class`, as they're just the numeric and string representation of our classes. 

In [None]:
wine_origin = data['Origin'].value_counts()
plt.figure()
wine_origin.plot(kind='bar')
plt.xlabel('Origin of wine')
plt.ylabel('Number of instances')

So...do you think these classes are unbalanced?
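One way to put a number on the balance is to look at class proportions rather than raw counts. The sketch below uses a small made-up `Origin` column (the counts are illustrative, not taken from the wine data); `value_counts(normalize=True)` returns each class's share of the rows.

```python
import pandas as pd

# Hypothetical labels, for illustration only; substitute data['Origin'].
toy_origin = pd.Series(['Siena'] * 5 + ['Lucca'] * 3 + ['Pisa'] * 2)

# Share of rows per class, sorted from most to least frequent.
proportions = toy_origin.value_counts(normalize=True)
print(proportions)

# Ratio between the largest and the smallest class; 1.0 means perfectly
# balanced classes.
imbalance_ratio = proportions.max() / proportions.min()
print('Imbalance ratio: {:.1f}'.format(imbalance_ratio))
```

There is no universal cut-off for this ratio; it is only a quick way to compare the classes.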

### Hold-out validation - splitting the data

Now to test the algorithm, we first need to split the data into a training set and a test set, and convert the two sets to NumPy arrays. We'll use the `train_test_split` function in scikit-learn for this. We'll then have to split the classes out of the data and store them in a separate `Y` variable. We use `Class` for `train_Y` and all remaining variables except `Origin` for `train_X`.

In [None]:
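# Randomly assign 70% of the rows to the training set and the rest to the
# test set.
train, test = train_test_split(data, train_size=0.7)

# Column 0 holds `Class`, column 1 holds `Origin`; the features start at
# column 2.
train_X = np.array(train)[:, 2:].astype(float)
train_Y = np.array(train)[:, 0].astype(float)

test_X = np.array(test)[:, 2:].astype(float)
test_Y = np.array(test)[:, 0].astype(float)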
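Because `train_test_split` shuffles the rows at random, the accuracy figures in this notebook will change from run to run. As a hedged sketch on made-up data, `random_state` fixes the shuffle and `stratify` keeps the class proportions the same in both sets. The seed value 0 and the toy arrays are arbitrary choices, not part of the original exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, two classes of 10 samples each.
toy_X = np.arange(40, dtype=float).reshape(20, 2)
toy_Y = np.array([1] * 10 + [2] * 10)

# random_state makes the split reproducible; stratify keeps class shares equal.
toy_train_X, toy_test_X, toy_train_Y, toy_test_Y = train_test_split(
    toy_X, toy_Y, train_size=0.7, random_state=0, stratify=toy_Y)

print(toy_train_X.shape, toy_test_X.shape)
# Samples per class (classes 1 and 2) in each set.
print(np.bincount(toy_train_Y)[1:], np.bincount(toy_test_Y)[1:])
```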

Great, now that our data is split into train and test data, let's run the kNN algorithm on our data using scikit-learn:

In [None]:
K = 1
clf = KNeighborsClassifier(K)
clf.fit(train_X, train_Y)

predictions = clf.predict(test_X)

accuracy = sklearn.metrics.accuracy_score(test_Y, predictions) * 100
print('Accuracy: {:.1f}%'.format(accuracy))

### Normalizing data

If we look at the values in the data, we can see that they have different orders of magnitude for different features. Let's work our feature scaling magic and do a short exercise to scale our `train_X` set to make the different features better aligned.

In [None]:
data.head()
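To see why scale matters for a distance-based method, consider two made-up features, one in the thousands and one below ten. The values are illustrative only, and the standardisation is done by hand with NumPy. Before scaling, the Euclidean distance is driven almost entirely by the large feature.

```python
import numpy as np

# Three toy samples: column 0 is in the thousands, column 1 is below ten.
toy = np.array([[1000.0, 1.0],
                [1010.0, 9.0],
                [1200.0, 1.5]])


def toy_distance(a, b):
    # Plain Euclidean distance between two 1-D arrays.
    return np.sqrt(np.sum((a - b) ** 2))


# Raw distances from sample 0 to samples 1 and 2.
raw_01 = toy_distance(toy[0], toy[1])
raw_02 = toy_distance(toy[0], toy[2])

# Standardise each column to zero mean and unit standard deviation.
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled_01 = toy_distance(toy_scaled[0], toy_scaled[1])
scaled_02 = toy_distance(toy_scaled[0], toy_scaled[2])

print('Raw:    {:.2f} vs {:.2f}'.format(raw_01, raw_02))
print('Scaled: {:.2f} vs {:.2f}'.format(scaled_01, scaled_02))
```

In the raw data, sample 2 looks far more distant than sample 1, although sample 1 differs strongly in the small feature. After standardising, both features contribute on a comparable scale.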

**Exercise:**
Can you scale the `train_X` feature set and the `test_X` set using the `StandardScaler` class, like we did before in the $k$-means clustering exercise? Hint: if you get stuck, the answer is at the bottom of this notebook.

In [None]:
scaler1  = # Write here
scaler2  = # Write here

train_Xscaled = # Write here
test_Xscaled= # Write here

Let's put the scaled data into the kNN algorithm and see how it performs.

In [None]:
K = 1
clf = KNeighborsClassifier(K)
clf.fit(train_Xscaled, train_Y)

predictions = clf.predict(test_Xscaled)

accuracy = sklearn.metrics.accuracy_score(test_Y, predictions) * 100
print('Accuracy: {:.1f}%'.format(accuracy))

### Model evaluation

We can also investigate other metrics, such as:

In [None]:
print(sklearn.metrics.classification_report(test_Y, predictions))

**Question:** Do you understand what these metrics mean? Is our model any good at predicting where wines come from? How does the relatively small size of our dataset affect these results?
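The numbers in the report can be traced back to a confusion matrix, which counts how often each true class (rows) was predicted as each class (columns). The sketch below uses made-up labels, not the wine predictions, and recomputes per-class precision and recall by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
toy_true = np.array([1, 1, 1, 2, 2, 3, 3, 3])
toy_pred = np.array([1, 1, 2, 2, 2, 3, 1, 3])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred)
print(cm)

# Precision: of everything predicted as class c, the fraction that truly is c.
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: of everything that truly is class c, the fraction predicted as c.
recall = np.diag(cm) / cm.sum(axis=1)

print('Precision per class:', precision)
print('Recall per class:   ', recall)
```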

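We can also try setting weights (i.e. using the distance of each neighbour to weight its vote), and see if our performance increases. Note that with `K = 1` there is only one vote, so weighting cannot change the prediction here; it only starts to matter for larger values of $k$.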

In [None]:
K = 1
clf = KNeighborsClassifier(K, weights='distance')
clf.fit(train_X, train_Y)

predictions = clf.predict(test_X)

accuracy = sklearn.metrics.accuracy_score(test_Y, predictions) * 100
print('Accuracy: {:.1f}%'.format(accuracy))
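The toy example below (made-up one-dimensional points, with three neighbours) shows a case where the two weighting schemes disagree. The test point sits close to a single class-0 point, while the two class-1 points are farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-0 point at 0.0 and two class-1 points at 2.0 and 2.2.
toy_train_X = np.array([[0.0], [2.0], [2.2]])
toy_train_Y = np.array([0, 1, 1])
toy_test_X = np.array([[0.5]])

# Uniform weights: every neighbour gets one vote, so class 1 wins 2 to 1.
uniform_clf = KNeighborsClassifier(n_neighbors=3, weights='uniform')
uniform_pred = uniform_clf.fit(toy_train_X, toy_train_Y).predict(toy_test_X)[0]

# Distance weights: each vote counts 1/distance, so the close class-0 point
# (weight 1/0.5) outweighs the two class-1 points (1/1.5 + 1/1.7).
distance_clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
distance_pred = distance_clf.fit(toy_train_X, toy_train_Y).predict(toy_test_X)[0]

print('uniform:', uniform_pred, 'distance:', distance_pred)
```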

What happens if we increase the number of neighbours taken into account? We can plot the accuracy accordingly.

In [None]:
def plot_vector(train_X, train_Y, test_X, test_Y, weights, upperLim=100):
    results = []
    # Try k = 1, 5, 9, ... below upperLim and record the test accuracy in %.
    for k in range(1, upperLim, 4):
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        clf.fit(train_X, train_Y)
        accuracy = clf.score(test_X, test_Y)
        results.append([k, accuracy * 100])

    # One row per k: column 0 is k, column 1 is the accuracy.
    return np.array(results)

plt_vector1 = plot_vector(train_X, train_Y, test_X, test_Y, weights='uniform')
plt_vector2 = plot_vector(train_X, train_Y, test_X, test_Y, weights='distance')
plt.plot(plt_vector1[:, 0], plt_vector1[:, 1], label='uniform')
plt.plot(plt_vector2[:, 0], plt_vector2[:, 1], label='distance')
plt.legend(loc='best')
plt.ylim(60, 80)
plt.title('Accuracy with increasing $k$')
plt.show()

This graph looks quite funky, but that's again mostly because we have a very small dataset. Our results tend to jump around quite a bit because of it. Also, using a very high value of $k$ on a very small dataset makes very little sense.
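A single small test set gives a noisy picture. One common alternative, sketched here and not part of the plot above, is to average the accuracy over several cross-validation folds of the training set for each candidate $k$. The sketch uses scikit-learn's bundled wine dataset as a stand-in, which is an assumption: it may not match `data/wine_data.csv`. The seed and the range of $k$ are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled wine dataset.
wine_X, wine_y = load_wine(return_X_y=True)
cv_train_X, cv_test_X, cv_train_y, cv_test_y = train_test_split(
    wine_X, wine_y, train_size=0.7, random_state=0, stratify=wine_y)

candidate_ks = list(range(1, 30, 4))
cv_means = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training set; the test set is untouched.
    fold_scores = cross_val_score(knn, cv_train_X, cv_train_y, cv=5)
    cv_means.append(fold_scores.mean())

# Candidate with the highest mean score (the first one on ties).
best_k = candidate_ks[int(np.argmax(cv_means))]
print('Best k by 5-fold cross validation: {}'.format(best_k))
```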

We can also do a step of feature selection, in order to maintain only the most descriptive features. More specifically, the `sklearn.feature_selection` module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets. Univariate feature selection works by selecting the best features based on univariate statistical tests.
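As a minimal sketch of univariate selection, the toy data below has one feature that tracks the label closely and two features of pure noise. All names and values are made up for illustration. `f_classif` scores each feature with an ANOVA F-test, and `SelectPercentile` keeps the top-scoring share.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
toy_labels = np.array([0] * 30 + [1] * 30)

# Feature 0 follows the label closely; features 1 and 2 are pure noise.
informative = toy_labels + rng.normal(scale=0.1, size=60)
noise = rng.normal(size=(60, 2))
toy_features = np.column_stack([informative, noise])

# With three features, keeping the top 34% retains only the best-scoring one.
selector = SelectPercentile(f_classif, percentile=34)
toy_selected = selector.fit_transform(toy_features, toy_labels)

print('F-scores:', selector.scores_)
print('Kept features:', selector.get_support())
```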

First, we select the optimal number of features, through cross validation:

In [None]:
percentiles = range(1, 100, 5)
results = []

for percentile in percentiles:
    # Keep the top `percentile` percent of features, ranked by ANOVA F-score.
    fs = feature_selection.SelectPercentile(
        feature_selection.f_classif, percentile=percentile)

    X_train_fs = fs.fit_transform(train_Xscaled, train_Y)

    # Mean 5-fold cross-validation accuracy on the reduced training set.
    scores = sklearn.model_selection.cross_val_score(
        clf, X_train_fs, train_Y, cv=5)

    results.append(scores.mean())

results = np.array(results)

# Index into `percentiles` of the best mean score (the first one on ties).
# It is used as `percentiles[optimal_percentile]`.
optimal_percentile = int(np.argmax(results))

print('Optimal percentile: {}'.format(percentiles[optimal_percentile]))

# Plot percentile of features kept vs. cross-validation scores.
plt.figure()
plt.xlabel('Percentile of features selected [%]')
plt.ylabel('Cross validation accuracy')
plt.plot(percentiles, results)
print('Mean scores: {}'.format(results))

Then, we select the relevant features and we repeat the kNN algorithm with the transformed data:

In [None]:
fs = sklearn.feature_selection.SelectPercentile(
    feature_selection.f_classif,
    percentile=percentiles[optimal_percentile])
X_train_fs = fs.fit_transform(train_Xscaled, train_Y)

clf = KNeighborsClassifier(5)

clf.fit(X_train_fs, train_Y)
X_test_fs = fs.transform(test_Xscaled)
predictions = clf.predict(X_test_fs)

accuracy = sklearn.metrics.accuracy_score(test_Y, predictions) * 100
print('Accuracy: {:.1f}%'.format(accuracy))

What happens to accuracy if we change the ratio between training and test set?

# Appendix: Designing the kNN algorithm yourself

Since the kNN algorithm is very straightforward, you can easily set it up yourself. We would need to define all the functions we need to implement kNN. We'll do that below, to show you step by step how the kNN algorithm works.

Let's start by calculating the distance between two data instances.

#### Define Euclidean distance

In [None]:
def euclidean_distance(instance1, instance2):
    length = len(instance1)
    # You can also check if instance1 and instance2 have the same length.
    distance = 0
    for l in range(length):
        distance += (instance1[l] - instance2[l])**2
    return math.sqrt(distance)

For example:

In [None]:
data1 = [0, 1, 2]
data2 = [0, 2, 4]
distance = euclidean_distance(data1, data2)
print('Distance: {:.2f}'.format(distance))
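For comparison, the same distance can be computed in one call with NumPy. This is an equivalent shortcut rather than the appendix's own implementation; `np.linalg.norm` returns the Euclidean (L2) norm of the difference vector by default.

```python
import numpy as np

point_a = np.array([0, 1, 2])
point_b = np.array([0, 2, 4])

# Euclidean norm of the difference vector, i.e. sqrt(0**2 + 1**2 + 2**2).
distance_np = np.linalg.norm(point_a - point_b)
print('Distance: {:.2f}'.format(distance_np))
```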

Let's define a function to get the $k$ nearest neighbours of a point in a set:

In [None]:
def get_neighbours(data, labels, test_instance, K):
    # Distance from the test instance to every point in the training data.
    # Only the distances are stored, so the list converts to a flat array.
    distances = []
    for i in range(len(data)):
        distances.append(euclidean_distance(test_instance, data[i, :]))

    # Indices that sort the training points from nearest to farthest.
    idx = np.argsort(np.array(distances))
    neighbours_data = data[idx]
    neighbours_label = labels[idx]

    # Keep only the K nearest points and their labels.
    neighbours = {'data': neighbours_data[:K], 'labels': neighbours_label[:K]}
    return neighbours

For example:

In [None]:
# Define the training set: 2 points and 2 labels.
data = np.array([[2, 2, 2], [4, 4, 4]])
labels = np.array([0, 1])

# Define the test instance
test_instance = [5, 5, 5]

# Choose the number of neighbours.
K = 1

# Find and retrieve the k nearest points to the test instance, sorted by the
# distance.
neighbours = get_neighbours(data, labels, test_instance, K)
print(neighbours)
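The loop over training points can also be written without an explicit Python loop. The helper below is a hypothetical vectorised variant (its name and the extra `distances` key are not part of the original notebook). It relies on NumPy broadcasting to subtract the test instance from every row at once.

```python
import numpy as np


def get_neighbours_vectorised(data, labels, test_instance, K):
    # Hypothetical helper: distances to all rows in one call (row-wise norm).
    distances = np.linalg.norm(data - np.asarray(test_instance), axis=1)
    # Indices of the K smallest distances, nearest first.
    idx = np.argsort(distances)[:K]
    return {'data': data[idx], 'labels': labels[idx],
            'distances': distances[idx]}


toy_data = np.array([[2, 2, 2], [4, 4, 4]])
toy_labels = np.array([0, 1])

result = get_neighbours_vectorised(toy_data, toy_labels, [5, 5, 5], 1)
print(result)
```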

Let's define a response function: this counts the number of times a certain class appears in the set of neighbours that we've found with the previous function. The class with the highest frequency will be the label assigned to the test instance.

In [None]:
def get_response(neighbours):
    class_votes = {}
    # Assign the votes for every class.
    for i in range(len(neighbours)):
        response = neighbours[i]
        if response in class_votes:
            class_votes[response] += 1
        else:
            class_votes[response] = 1
    
    # Sort the dictionary items by vote count to find the class with the most votes.
    sorted_votes = sorted(class_votes.items(), key=operator.itemgetter(1),
                          reverse=True)
    return sorted_votes[0][0]

For example:

In [None]:
# In this case we have two 1s and one 0: class 1 wins.
neighbours['labels'] = np.array([1, 1, 0])
response = get_response(neighbours['labels'])
print(response)
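The same majority vote can be sketched with `collections.Counter` from the standard library. The function name below is hypothetical and the snippet is an alternative, not the notebook's own method.

```python
from collections import Counter


def get_response_counter(labels):
    # most_common(1) returns a one-element list holding the (label, count)
    # pair with the highest count.
    return Counter(labels).most_common(1)[0][0]


# Two 1s and one 0: class 1 wins.
print(get_response_counter([1, 1, 0]))
```

On Python 3.7 and later, labels with equal counts are returned in the order they were first seen, so a tie goes to the label that appears first among the neighbours.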

Let's calculate the accuracy of our model. This should be a familiar concept by now: it is the percentage of test instances that the model labels correctly.

In [None]:
def get_accuracy(test_set, predictions):
    correct = 0
    for i in range(len(test_set)):
        # If the label of the test_set and the prediction are the same add one.
        if test_set[i] == predictions[i]:
            correct += 1
    return (float(correct) / float(len(test_set))) * 100.0

For example:

In [None]:
# True labels.
test_set = np.array(['a','a','b'])

# Predicted labels.
predictions = ['a', 'a', 'a']

accuracy = get_accuracy(test_set, predictions)
print(accuracy)
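With NumPy arrays, the same percentage can be obtained by comparing the arrays element by element and averaging the resulting booleans. This is a compact equivalent, shown with the same toy labels as above under different variable names.

```python
import numpy as np

true_labels = np.array(['a', 'a', 'b'])
predicted_labels = np.array(['a', 'a', 'a'])

# The comparison yields booleans; their mean is the fraction of matches.
accuracy_np = np.mean(true_labels == predicted_labels) * 100.0
print('Accuracy: {:.1f}%'.format(accuracy_np))
```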

**Exercise:**
Assign a label to the test instance, based on the following training set:

In [None]:
training_set = np.array([[1, 1, 1], [1, 3, 5], [7, 5, 4], [9, 5, 3]])
training_labels = np.array([1, 2, 1, 2])
test_instance = np.array([4, 4, 4])

# Get K neighbours.
K = # Type here
neighbours = # Type here

# Get the label
label = # Type here

print(label)

# What about the accuracy?

## Answers to the Exercises

**Exercise:**
Can you scale the `train_X` feature set and the `test_X` set using the `StandardScaler` class, like we did before in the $k$-means clustering exercise?

In [None]:
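# Fit the scaler on the training set only, then reuse it for the test set so
# that both sets are scaled with the same mean and standard deviation.
scaler1 = sklearn.preprocessing.StandardScaler().fit(train_X)
scaler2 = scaler1

train_Xscaled = scaler1.transform(train_X)
test_Xscaled = scaler2.transform(test_X)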

**Exercise:** Assign a label to the test instance, based on the following training set:

In [None]:
# Get k neighbours.
K = 1
neighbours = get_neighbours(training_set, training_labels, test_instance, K)

# get the label
label = get_response(neighbours['labels'])

print(label)

Copyright © ASI 2017 All rights reserved