# K-nearest neighbors algorithm

As the name implies, the k-nearest neighbors (KNN) algorithm works by finding the nearest neighbors of a given data point. For instance, let’s say we have a binary classification problem. If we set k to 10, the KNN model will look for the 10 points nearest to the data point presented. If, among those 10 neighbors, 8 have the label 0 and 2 have the label 1, the KNN algorithm will conclude that the label of the provided data point is most likely 0. As we can see, the KNN algorithm is extremely simple, but given enough data it can produce highly accurate predictions.

It can also be used for regression: for example, we can take the k nearest neighbors and average their target values to predict the value for the data point presented.
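As a rough illustration of the regression variant, the sketch below averages the targets of the k nearest training points. The function name `knn_regress` and the toy data are made up for this example, and Euclidean distance is assumed as the distance measure; other choices are possible.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict a value for `query` as the mean target of its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

# Toy one-feature data (hypothetical): targets follow y = 2x
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
prediction = knn_regress(X, y, [2.6], k=2)
print(prediction)  # mean of the targets at x=3 and x=2, i.e. 5.0
```

With k=2 the two closest points to 2.6 are x=3 and x=2, so the prediction is the mean of their targets.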


In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("seaborn")



## Distance metrics

We need to define a metric that tells us how similar or different two data points are. One of the most popular distance metrics is the Euclidean distance. It is defined as the square root of the sum of the squared differences between the corresponding components of the two vectors. For two vectors $x$ and $y$, the Euclidean distance between them is:

$$
d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}
$$

where $n$ is the number of features in the vectors.
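
As a quick check of the formula, take two hypothetical two-feature vectors $x = (1, 2)$ and $y = (4, 6)$ (chosen only because the arithmetic is easy):

$$
d(x,y) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5
$$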

In [2]:
def distance(instance1, instance2):
    """Euclidean distance between two equal-length feature vectors."""
    instance1, instance2 = np.array(instance1), np.array(instance2)
    return np.sqrt(np.sum((instance1 - instance2)**2))

dataset = [[2.7810836,2.550537003],
           [1.465489372,2.362125076],
           [3.396561688,4.400293529],
           [1.38807019,1.850220317],
           [3.06407232,3.005305973],
           [7.627531214,2.759262235],
           [5.332441248,2.088626775],
           [6.922596716,1.77106367],
           [8.675418651,-0.242068655],
           [7.673756466,3.508563011]]

label = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

assert len(dataset) == len(label)

for data in dataset:
    print(distance(dataset[0], data))


0.0
1.3290173915275787
1.9494646655653247
1.5591439385540549
0.5356280721938492
4.850940186986411
2.592833759950511
4.214227042632867
6.522409988228337
4.985585382449795


## Neighbors selection

Next, we need to select the neighbors. Given an instance, we compute its distance to every training point, sort the points by distance, and keep the indices of the k closest ones.

In [3]:
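def get_neighbors(training_set, test_instance, k):
    # Pair each training index with its distance to the test instance
    distances = [(i, distance(test_instance, instance)) for i, instance in enumerate(training_set)]
    # Sort by distance, closest first
    distances.sort(key=lambda pair: pair[1])
    # Keep only the indices of the k closest points
    return [index for index, _ in distances[:k]]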

In [4]:
get_neighbors(dataset, dataset[0], k=7)

[0, 4, 1, 3, 2, 6, 7]
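
The sort-based selection above loops over the training set in Python. One possible alternative, sketched here under the assumption that the training set fits in a single NumPy array, computes all distances at once with broadcasting and picks the closest indices with `np.argsort`. The name `get_neighbors_vectorized` and the sample points are hypothetical.

```python
import numpy as np

def get_neighbors_vectorized(training_set, test_instance, k):
    """Return the indices of the k training points closest to test_instance."""
    X = np.asarray(training_set, dtype=float)
    q = np.asarray(test_instance, dtype=float)
    # Broadcasting subtracts q from every row; the norm along axis 1
    # yields one Euclidean distance per training point
    dists = np.linalg.norm(X - q, axis=1)
    # A stable sort keeps the original order among equal distances
    return np.argsort(dists, kind="stable")[:k].tolist()

# Hypothetical points at distances 0, 5, 1 and 2 from the origin
points = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
nearest = get_neighbors_vectorized(points, [0.0, 0.0], k=3)
print(nearest)  # [0, 2, 3]
```

For small datasets the difference is negligible; the vectorized form mainly helps once the training set grows to thousands of points.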

## Making predictions

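To make a prediction, we take the top k neighbors of an instance and use majority voting to select the class. The helper below returns the share of votes each class received; the classifier then picks the class with the largest share.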

In [5]:
def make_prediction(neighbor_index, label):
    label = np.array(label)
    neighbor_label = label[neighbor_index]
    prediction = {}
    for x in neighbor_label:
        if x in prediction:
            prediction[x] += 1
        else:
            prediction[x] = 1
    total = sum(prediction.values())
    probability_prediction = {k: v/total for k, v in prediction.items()}
    return probability_prediction

In [6]:
make_prediction([0, 4, 1, 3, 2, 6, 7], label)

{0: 0.7142857142857143, 1: 0.2857142857142857}
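
The vote counting can also be written with `collections.Counter` from the standard library. The sketch below is one way to do it, not a replacement for the function above; the name `majority_vote` is made up for this example. It takes the neighbor labels directly, here the labels of neighbors `[0, 4, 1, 3, 2, 6, 7]`.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return (winning label, share of votes) for a list of neighbor labels."""
    counts = Counter(neighbor_labels)
    # most_common(1) gives the single (label, count) pair with the highest count
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(neighbor_labels)

# Labels of the 7 nearest neighbors: five 0s and two 1s
winner, share = majority_vote([0, 0, 0, 0, 0, 1, 1])
print(winner, round(share, 4))  # 0 0.7143
```

When two classes tie, `most_common` returns the one encountered first, so choosing an odd k for binary problems avoids depending on neighbor order.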

In [7]:
def knn_classifier(training_set, label, test_set, k):
    result = []
    for instance in test_set:
        neighbor_index = get_neighbors(training_set, instance, k)
        prediction = make_prediction(neighbor_index, label)
        result.append(max(prediction, key=prediction.get))
    return np.array(result)

In [8]:
knn_classifier(dataset[1:], label[1:], [dataset[0]], 3)

array([0])
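
The call above holds out the first point and predicts it from the remaining ones. Repeating that for every point gives a leave-one-out accuracy estimate. The sketch below is a self-contained, assumption-laden version of that idea: `loo_accuracy` is a hypothetical helper, the toy data is made up, and Euclidean distance with plain majority voting is assumed.

```python
import numpy as np

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of a majority-vote k-NN classifier."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    correct = 0
    for i in range(len(X)):
        # Hold out point i and use every other point as training data
        mask = np.arange(len(X)) != i
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]
        # Majority vote among the labels of the k nearest points
        values, counts = np.unique(y[mask][nearest], return_counts=True)
        correct += int(values[np.argmax(counts)] == y[i])
    return correct / len(X)

# Toy data (hypothetical): two well-separated clusters
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
acc = loo_accuracy(X, y, k=1)
print(acc)  # 1.0
```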