<div style="display: block;margin-left: auto;margin-right: auto;width: 100%;text-align: center;">
<br><a href="https://www.youtube.com/watch?v=4HKqjENq9OU&list=PLEiEAq2VkUULYYgj13YHUWmRePqiu8Ddy&index=20"><b> Source</b></a></div>

# 1. K-Nearest Neighbor Analysis
***

### <center> Theoretical part
    
<b>K-NEAREST NEIGHBORS</b>
- is one of the simplest SUPERVISED machine learning algorithms, mostly used for:
![](pic/v20/1s.png)

- KNN stores all available cases and classifies new cases based on a similarity measure
- <b>K</b> in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process
![](pic/v20/2s.png)

- here the unknown point would be classified as <font color="red">RED</font> since 4 out of 5 neighbors are red
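
The voting step can be sketched in a few lines of Python. This is only an illustration of majority voting, not a full KNN implementation: the function name `majority_vote` and the label list are assumptions made for the example, mirroring the "4 out of 5 neighbors are red" case.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors."""
    # most_common(1) returns a one-element list: [(label, count)]
    label, _count = Counter(neighbor_labels).most_common(1)[0]
    return label

# Hypothetical labels of the 5 nearest neighbors (4 red, 1 of another class)
neighbors = ["red", "red", "other", "red", "red"]
print(majority_vote(neighbors))  # red
```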

<center> <b>EXAMPLE</b></center> 

![](pic/v20/3s.png)

### How do we choose the factor "K"?
- KNN algorithm is based on feature similarity
- choosing the right value of k is a process called parameter tuning, and is important for better accuracy
![](pic/v20/5s.png)

- the class of the unknown data point was <font color="red">RED SQUARE</font> at k=3 but changed to <font color="purple">PURPLE TRIANGLE</font> at k=7, so which k should we choose?

<b>TO CHOOSE A VALUE OF K:</b>
- sqrt(n), where n is the total number of data points
- an odd value of K is preferred so that a vote between two classes cannot end in a tie
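
As a rough sketch, the two rules of thumb above can be combined into a small helper. The name `suggest_k` and the choice to round down to the nearest odd number are assumptions for illustration; in practice k is usually also checked against validation accuracy.

```python
import math

def suggest_k(n_samples):
    """Rule-of-thumb k: take floor(sqrt(n)), then step down to an odd number."""
    k = max(1, int(math.sqrt(n_samples)))
    if k % 2 == 0:
        k -= 1  # an odd k cannot tie when there are only two classes
    return k

# For example, with 154 points: sqrt(154) is about 12.4 -> 12 -> 11
print(suggest_k(154))  # 11
```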

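#### When do we use the KNN algorithm:
- when data is labeled
- when data is "noise" free
- when the dataset is small (KNN is a "lazy learner": it stores the training data and defers all computation to prediction time, which becomes slow on large datasets)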

### How does the KNN algorithm work?
- consider a data set having two variables:
    - height(cm)
    - weight(kg)
- and each point of that data is classified as Normal or Underweight
![](pic/v20/6s.png)

- on the basis of the given data we have to classify the below set as Normal or Underweight using KNN
![](pic/v20/7s.png)

- according to the <b>EUCLIDEAN DISTANCE</b> formula, the distance between two points in the plane with coordinates (x,y) and (a,b) is given by:
![](pic/v20/8s.png)
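
For reference, the standard form of the Euclidean distance between the points $(x, y)$ and $(a, b)$, which the figure above presumably shows, is:

$$d = \sqrt{(x - a)^2 + (y - b)^2}$$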

- let's calculate it to understand clearly:
![](pic/v20/9s.png)

- we have calculated the Euclidean distance of unknown data point from all the points as shown:
![](pic/v20/10s.png)
![](pic/v20/11s.png)

- majority of neighbors are pointing towards "normal"
- as per KNN algorithm the class of (57,170) should be "Normal"
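
The whole procedure (compute distances, keep the k closest points, take the majority label) can be sketched in plain Python. This is a minimal sketch under stated assumptions: `knn_classify` is a hypothetical helper, and the training samples are illustrative values rather than the table shown in the figures. Only the query point (57, 170) comes from the example above. `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((weight, height), label) pairs.
    """
    # Sort training points by Euclidean distance to the query
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Keep the labels of the k closest points and take the majority
    k_labels = [label for _point, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Illustrative (weight kg, height cm) samples, NOT taken from the figures
train = [
    ((51, 167), "Underweight"),
    ((62, 182), "Normal"),
    ((69, 176), "Normal"),
    ((64, 173), "Normal"),
    ((65, 172), "Normal"),
    ((56, 174), "Underweight"),
    ((58, 169), "Normal"),
    ((57, 173), "Normal"),
    ((55, 170), "Normal"),
]

print(knn_classify(train, (57, 170), k=3))  # Normal
```

With these sample values the three nearest points to (57, 170) are (58, 169), (55, 170) and (57, 173), all labeled "Normal".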

#### Recap of KNN
![](pic/v20/12s.png)

# 2. K-Nearest Neighbor - practical example
***

### TASK: to predict whether a person will be diagnosed with diabetes or not

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

In [2]:
data = pd.read_csv("Data/diabetes.csv")
data.head()

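|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |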


In [3]:
len(data)

768

In [4]:
# Zeros in columns like "Glucose" or "BloodPressure" are not valid measurements and would distort the outcome
# We replace them with the (integer-truncated) mean of the non-zero values in the respective column:
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Insulin"]

for column in zero_not_accepted:
    # np.nan is used instead of np.NaN, which was removed in NumPy 2.0
    data[column] = data[column].replace(0, np.nan)
    mean = int(data[column].mean(skipna=True))
    data[column] = data[column].replace(np.nan, mean)

In [5]:
data.head()

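|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 155.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |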


In [6]:
# Split dataset
X = data.iloc[:, 0:8]
y = data.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

#### Rule of thumb: for any algorithm that computes distances or assumes normality, SCALE YOUR FEATURES!
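
To see why scaling matters for a distance-based method, here is a small standard-library sketch. The feature values are hypothetical and `standardize` is a helper written for this example. It mimics what standardization does (zero mean, unit standard deviation) but is not scikit-learn's implementation.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical values for two features on very different scales
insulin = [94.0, 168.0, 155.0, 88.0]
pedigree = [0.167, 2.288, 0.351, 0.627]

raw = list(zip(insulin, pedigree))
scaled = list(zip(standardize(insulin), standardize(pedigree)))

# Distance between the first two samples before and after scaling
print(math.dist(raw[0], raw[1]))        # driven almost entirely by the insulin gap
print(math.dist(scaled[0], scaled[1]))  # both features now contribute
```

Before scaling, the insulin difference (74 units) swamps the pedigree difference (about 2 units), so the nearest neighbors would effectively be chosen by insulin alone.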

In [7]:
# Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) 



In [8]:
# choosing k value
import math
math.sqrt(len(y_test))

# sqrt(154) is about 12.4; rounding down gives 12, which is even, so we take the next lower odd number, 11

12.409673645990857

In [9]:
# define the model: init KNN
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric="euclidean")
classifier.fit(X_train, y_train)

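# p=2 is the power parameter of the Minkowski metric and corresponds to Euclidean distance;
# it is redundant here because metric="euclidean" is set explicitly (it is not the number of classes)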

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=11, p=2,
           weights='uniform')

In [10]:
y_pred = classifier.predict(X_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [11]:
# Model evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[94 13]
 [15 32]]


In [12]:
print(f1_score(y_test, y_pred))  # combines precision and recall, so it accounts for both false positives and false negatives

0.6956521739130436


In [13]:
print(accuracy_score(y_test, y_pred))

0.8181818181818182
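
As a cross-check, both scores can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout for `confusion_matrix` (rows are true classes, columns are predicted classes); the variable names are chosen for this example.

```python
# Confusion matrix from the output above: [[94 13], [15 32]]
tn, fp = 94, 13   # true class 0: predicted 0, predicted 1
fn, tp = 15, 32   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))  # 0.8182
print(round(f1, 4))        # 0.6957
```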


#### An accuracy of about 82% suggests that the model fits this data reasonably well.