# Video 20. K-Nearest Neighbors Algorithm
***

- Because KNN is based on feature similarity, we can do classification using KNN Classifier
- K Nearest Neighbors is one of the simplest **Supervised** ML algorithms mostly used for classification
- It classifies a data point based on how its neighbors are classified
- Stores all available cases and classifies new cases based on a similarity measure
- *k* in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process, for example:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/15.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- A data point is classified by majority votes from its 5 nearest neighbors
- Here the unknown point would be classified as red, since 4 out of 5 neighbors are red
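
The voting step can be sketched in a few lines of Python. This is a minimal illustration, not a full classifier: the helper name `majority_vote` is made up for this example, and `"blue"` is an assumed name for the minority class, since the notes only say that 4 of the 5 neighbors are red.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors."""
    label, _count = Counter(neighbor_labels).most_common(1)[0]
    return label

# Labels of the 5 nearest neighbors; "blue" is an assumed minority class name.
neighbors = ["red", "red", "blue", "red", "red"]
predicted = majority_vote(neighbors)
print(predicted)  # red
```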


### How do we choose "k"?
- The KNN algorithm is based on **feature similarity**. Choosing the right value of k is a process called parameter tuning, and it is important for accuracy

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/16.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- So at `k=3` we can classify `?` as a square
- One common heuristic for choosing k:
    - Start from sqrt(n), where n is the total number of data points
    - Pick an odd value of k to avoid tied votes between two classes
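
As a rough sketch, the square-root heuristic with the odd-value adjustment could look like this. The function name `choose_k` is hypothetical, and stepping down (rather than up) to the nearest odd number is one possible convention, not a fixed rule. The result is only a starting point for tuning.

```python
import math

def choose_k(n_samples):
    """Heuristic starting point: the odd integer at or just below sqrt(n_samples)."""
    k = int(math.sqrt(n_samples))  # truncate the square root
    if k % 2 == 0:
        k -= 1                     # step down to the nearest odd number
    return max(k, 1)               # never go below one neighbor

print(choose_k(154))  # 11
print(choose_k(100))  # 9
```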
    
    
### When do we use KNN Algorithm?
- When data is labeled
- Data is noise free
- Dataset is small
    - Because KNN is a "lazy learner", i.e. it doesn't learn a discriminative function from the training set and instead compares every new point against the stored data at prediction time
    
    
### How does KNN Algorithm work?
- Consider a dataset with two variables, height (cm) and weight (kg), where each point is classified as Normal or Underweight:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/17.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- On the basis of the given data, we have to classify the point below as Normal or Underweight using KNN:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/18.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- To find the nearest neighbors, we will calculate Euclidean distance
    - According to the Euclidean distance formula, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by:
    
$$ \large dist(d) = \sqrt{(x - a)^2 + (y - b)^2} $$

- Let's calculate it to understand clearly:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/19.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
$$ dist(d1) = \sqrt{(170 - 167)^2 + (57 - 51)^2} = 6.7 $$<br>
$$ dist(d2) = \sqrt{(170 - 182)^2 + (57 - 62)^2} = 13 $$<br>
$$ dist(d3) = \sqrt{(170 - 176)^2 + (57 - 69)^2} = 13.4 $$<br>
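
The three distances above can be checked with a short script. The helper name `euclidean` is chosen for this sketch. The coordinates are the ones used in the formulas, given as (height, weight).

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

unknown = (170, 57)  # (height in cm, weight in kg)
for point in [(167, 51), (182, 62), (176, 69)]:
    print(point, round(euclidean(unknown, point), 1))
# (167, 51) 6.7
# (182, 62) 13.0
# (176, 69) 13.4
```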

- Similarly, we calculate the Euclidean distance from the unknown data point to every point in the dataset, as shown:

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/20.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- Here (x1, y1) = (57, 170), i.e. weight 57 kg and height 170 cm, is the point whose class we have to determine
- Now, let's find the nearest neighbors at k=3

<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
    <img src="img/220319/21.png"><br><a href="https://youtu.be/RmajweUFKvM?list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy"><b>Image Source</b></a></div>
    
- So, the majority of the nearest neighbors are "Normal"
    - Hence, as per the KNN algorithm, the class of (57, 170) should be "Normal"
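
Putting the distance and voting steps together, a minimal from-scratch sketch might look like the following. The training points and their labels are hypothetical placeholders, not the values from the table in the figures, and `knn_predict` is a name chosen for this example. Points are written as (height, weight).

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _point, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: ((height cm, weight kg), label)
train = [
    ((168, 52), "Underweight"),
    ((175, 55), "Underweight"),
    ((169, 58), "Normal"),
    ((172, 59), "Normal"),
    ((171, 55), "Normal"),
    ((180, 63), "Normal"),
]

print(knn_predict(train, (170, 57), k=3))  # Normal
```

Sorting all training points for every query is the simplest approach and is fine for small datasets, which is the setting these notes recommend for KNN.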
    
***
## Use Case: Predict Diabetes
- Objective: Predict whether a person will be diagnosed with diabetes or not

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

In [5]:
data = pd.read_csv("data/220319/diabetes.csv")
data.head()

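| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |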


In [6]:
print(len(data))

768


In columns like "Glucose" and "BloodPressure", a value of zero is not physiologically plausible and most likely marks a missing measurement, so leaving the zeros in would distort the outcome.

We can replace such values with the mean of the respective column:

In [7]:
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Insulin"]

In [8]:
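for column in zero_not_accepted:
    # Treat zeros as missing values (np.NaN was removed in NumPy 2.0; np.nan works everywhere)
    data[column] = data[column].replace(0, np.nan)
    # int() truncates the column mean, which is computed without the NaNs
    mean = int(data[column].mean(skipna=True))
    data[column] = data[column].replace(np.nan, mean)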
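
To make the logic of the loop above explicit, here is a plain-Python sketch of the same idea on a single list. The function name `impute_zeros` is made up for this example. The fill value comes out as 131 rather than the 155 seen in the notebook, because the notebook takes the mean over the full column and this sketch only sees five values.

```python
def impute_zeros(values):
    """Replace zeros with the truncated mean of the non-zero entries."""
    observed = [v for v in values if v != 0]
    fill = int(sum(observed) / len(observed))  # int() truncates, as in the loop above
    return [fill if v == 0 else v for v in values]

insulin = [0, 0, 0, 94, 168]  # the five Insulin values shown by data.head()
print(impute_zeros(insulin))  # [131, 131, 131, 94, 168]
```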

In [9]:
data.head()

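| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 155.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |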


In [10]:
X = data.iloc[:, 0:8]
y = data.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

**RULE OF THUMB**

Algorithms that compute distances or assume normally distributed features generally need the data to be scaled, otherwise features with large numeric ranges dominate the distance:

In [11]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # isn't part of training, so we don't fit it
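
What "fit on train, transform test" means can be sketched without scikit-learn. This is a simplified single-column illustration, assuming standardization with the population standard deviation. The helper names and the two test values are hypothetical. The training values are the Glucose entries shown by `data.head()`.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in column]

train_glucose = [148.0, 85.0, 183.0, 89.0, 137.0]
test_glucose = [100.0, 160.0]  # hypothetical unseen values

mean, std = fit_scaler(train_glucose)              # statistics come from training data
train_scaled = transform(train_glucose, mean, std)
test_scaled = transform(test_glucose, mean, std)   # test data reuses them, no refit
```

Fitting the scaler on the test set as well would leak information about unseen data into preprocessing, which is why the test set is only transformed.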



Choosing the right K value:

In [12]:
import math
math.sqrt(len(y_test))

12.409673645990857

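The cell above uses the size of the test set (154 rows) as n. Truncated, the square root is 12, which is even, so we step down to the nearest odd number, 11.

`p = 2` selects the Euclidean distance: `p` is the power parameter of the Minkowski metric, not the number of classes. Since `metric="euclidean"` is passed explicitly, `p` is redundant here.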

In [13]:
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric="euclidean")
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=11, p=2,
           weights='uniform')
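
For reference, the `p` argument in the output above is the exponent of the Minkowski distance. The sketch below is a plain-Python illustration of that formula, not scikit-learn's implementation, and it reuses one pair of points from the worked example earlier in these notes.

```python
def minkowski(u, v, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

a, b = (170, 57), (167, 51)
print(minkowski(a, b, p=1))            # 9.0 (Manhattan: 3 + 6)
print(round(minkowski(a, b, p=2), 1))  # 6.7 (Euclidean)
```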

In [14]:
y_pred = classifier.predict(X_test)

In [15]:
# Model evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[94 13]
 [15 32]]


In [16]:
print(f1_score(y_test, y_pred))  # harmonic mean of precision and recall, so it reflects both false positives and false negatives

0.6956521739130436


In [17]:
print(accuracy_score(y_test, y_pred))

0.8181818181818182
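
Both scores can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout for binary labels 0 and 1, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
cm = [[94, 13],
      [15, 32]]
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 3), round(recall, 3))  # 0.711 0.681
print(round(f1, 4), round(accuracy, 4))       # 0.6957 0.8182
```

The F1 score is noticeably lower than the accuracy because the positive class is the minority here and the model misses 15 of the 47 actual positives.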
