## What is it?

K-Nearest Neighbors (KNN) is a machine learning algorithm that classifies a new data point, i.e. assigns it a label,
based on the classes or labels of its $k$ nearest neighbors. The goal of the algorithm is therefore, given $k$, to find the class or label of a new point $X$.

## How it works.

1. Calculate the __Euclidean distance__ between the new point, e.g. $(x, y)$, and every other data point, e.g. $(a, b)$. This example assumes that the dataset has only two features. The formula for the __Euclidean distance__ is $dist((x, y), (a, b)) = \sqrt{(x-a)^2 + (y-b)^2}$.
2. From the calculated distances, take the $k$ smallest (shortest) ones and their respective classes or labels. The class that occurs most often among these $k$ neighbors is the class of the new data point.
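
With more than two features the same idea applies. As a sketch in general notation (the symbols $p$, $\mathbf{x}$, $\mathbf{a}$, $x_j$ and $a_j$ are introduced here for illustration only), for a new point $\mathbf{x} = (x_1, \dots, x_p)$ and a data point $\mathbf{a} = (a_1, \dots, a_p)$ with $p$ features:

$$dist(\mathbf{x}, \mathbf{a}) = \sqrt{\sum_{j=1}^{p} (x_j - a_j)^2}$$

For $p = 2$ this reduces to the two-feature formula in step 1. For example, the distance between $(4.5, 6.1)$ and $(5, 6)$ is $\sqrt{(4.5-5)^2 + (6.1-6)^2} = \sqrt{0.26} \approx 0.51$.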

## Use Cases.

1. Classification on small datasets.
2. Classification on datasets with minimal noise and few classes or labels, e.g. 2.

## Selecting k.

- A common starting point is $k = \sqrt{n} \pm 1$, where $n = size(dataset)$.
- $k$ should not be a multiple of the number of classes, so that votes are less likely to tie.
- $k$ should be odd when the number of classes is even.
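
The heuristics above can be combined into a small helper. This is only a sketch: the name `suggest_k` and the choice to step upward from the integer square root of $n$ are assumptions made for illustration, not a standard rule.

```python
import math


def suggest_k(n_samples, n_classes):
    """Hypothetical helper: pick k near sqrt(n) that avoids obvious ties."""
    # Start from the integer square root of the dataset size (at least 1).
    k = max(1, math.isqrt(n_samples))
    # With fewer than two classes there is nothing to tie-break.
    if n_classes < 2:
        return k
    # Step upward until k is not a multiple of the number of classes and,
    # when the number of classes is even, k is odd.
    while k % n_classes == 0 or (n_classes % 2 == 0 and k % 2 == 0):
        k += 1
    return k
```

For a dataset with 4 rows and 2 classes, `suggest_k(4, 2)` returns 3.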


## Implementation.

In [3]:
import numpy as np
import pandas as pd

In [7]:
# simple dataset
dataset = np.array([[1, 2, 0],
          [3, 4, 1],
          [5, 6, 1],
          [7, 8, 0]])

# a look at the dataset
data = pd.DataFrame(dataset, columns=["height", "weight", "class"], dtype=float)

In [8]:
data.head()

| | height | weight | class |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 0.0 |
| 1 | 3.0 | 4.0 | 1.0 |
| 2 | 5.0 | 6.0 | 1.0 |
| 3 | 7.0 | 8.0 | 0.0 |


In [35]:
# new point
point = np.array([4.5, 6.1])
point

array([4.5, 6.1])

In [46]:
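# 1. Calculate the Euclidean distance
def eucledian_distance(point, dataset):
    # one accumulated squared distance per row of the dataset
    eucledian = np.zeros(dataset.shape[0])

    # add the squared difference for each feature of the point;
    # the class column is never reached because point has no entry for it
    for i in range(point.shape[0]):
        eucledian += np.power(point[i] - dataset[:, i], 2)
    eucledian = np.sqrt(eucledian)

    return eucledian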
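
The loop above adds one feature at a time. As a sketch of an equivalent vectorized form, the version below relies on NumPy broadcasting instead. The name `euclidean_distance_vectorized` is hypothetical, and, like the loop, it assumes that the first `len(point)` columns of `dataset` are the features.

```python
import numpy as np


def euclidean_distance_vectorized(point, dataset):
    # Assumption: the first len(point) columns are features; any later
    # columns (such as the class label) are ignored.
    features = dataset[:, :point.shape[0]].astype(float)
    # Broadcasting subtracts the point from every row at once; summing the
    # squared differences along axis=1 gives one value per row.
    return np.sqrt(((features - point) ** 2).sum(axis=1))


dataset = np.array([[1, 2, 0], [3, 4, 1], [5, 6, 1], [7, 8, 0]])
point = np.array([4.5, 6.1])
distances = euclidean_distance_vectorized(point, dataset)
```

On a dataset this small the two forms are interchangeable; the vectorized one mainly avoids the explicit Python loop over features.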

In [56]:
e_dist = eucledian_distance(point, dataset)
data["Euclidean Distance"] = e_dist
data.head()

| | height | weight | class | Euclidean Distance |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 0.0 | 5.390733 |
| 1 | 3.0 | 4.0 | 1.0 | 2.580698 |
| 2 | 5.0 | 6.0 | 1.0 | 0.509902 |
| 3 | 7.0 | 8.0 | 0.0 | 3.140064 |


In [55]:
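# 2. Get suitable k
k = int(np.sqrt(dataset.shape[0]))
# with an even number of classes, make k odd to reduce the chance of ties
n_classes = len(set(dataset[:, -1]))
if n_classes % 2 == 0 and k % 2 == 0:
    k += 1
print(k)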

3


In [69]:
# 3. Get the class
def classify(k, e_dist, dataset):
    # counts for classes
    k_class_counts = {i: 0 for i in set(dataset[:, -1])}
    # class dist pairs sorted by dist
    class_ed_pairs = sorted(zip(dataset[:, -1], e_dist), key=lambda x: x[1])
    # get k counts for the classes
    for i in range(k):
        k_class_counts[class_ed_pairs[i][0]] += 1
    
    return max(k_class_counts, key=k_class_counts.get)

In [70]:
output = classify(k, e_dist, dataset)
print(output)

1
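
Putting the steps together, the whole prediction can be sketched as one function. This is a minimal illustration rather than a replacement for the cells above: the name `knn_predict` is hypothetical, it assumes that the last column of `dataset` is the class, and ties go to the smallest label because `np.unique` returns the labels sorted and `np.argmax` returns the first maximum.

```python
import numpy as np


def knn_predict(point, dataset, k):
    # Assumption: the last column is the class, all others are features.
    features, labels = dataset[:, :-1], dataset[:, -1]
    distances = np.linalg.norm(features - point, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count how often each class appears among those neighbors.
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[np.argmax(counts)]


dataset = np.array([[1, 2, 0], [3, 4, 1], [5, 6, 1], [7, 8, 0]])
prediction = knn_predict(np.array([4.5, 6.1]), dataset, k=3)
print(prediction)  # 1
```

The three nearest rows have classes 1, 1 and 0, so the majority vote yields class 1, matching the output above.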
