<a href="https://colab.research.google.com/github/hl105/deep-learning-practice/blob/main/KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**K Nearest Neighbors (KNN)**
- Supervised, simple ML algorithm
- Classifies a data point based on how its neighbors are classified

K: parameter that refers to the number of nearest neighbors to include in the majority voting process

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [4]:
df = load_iris(as_frame = True).frame
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [5]:
# map the target to iris names
df['target'] = df['target'].map({0:"setosa",1:"versicolor",2:"virginica"})
df['target']

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: target, Length: 150, dtype: object

In [6]:
# check for null values
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [7]:
#split into training and test set

'''
first option:
training_data = df.sample(frac = 0.8, random_state = 25)
testing_data = df.drop(training_data.index)
'''

# using train_test_split
X = df.drop('target',axis = 1) # upper case to denote multi-dimensional
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 25)

**Feature Scaling:**

Scale features for any algorithm that computes distance or assumes normality!

Normalization should be done AFTER splitting data becuase using info from test set is a potential bias.

When normalizing test set, we should use the normalization parameters from the training set instead of recalculating them on the test set becasue it would be inconsistent with the model.

In [9]:
scaler = StandardScaler()

#the scaler object learns the mean & sd needed for scaling
X_train = scaler.fit_transform(X_train)

## we use same scaler object to transform X_test
X_test = scaler.fit_transform(X_test)



In [11]:
# define model
knn = KNeighborsClassifier(n_neighbors=7)

In [15]:
# train the model
knn.fit(X_train, y_train)

#predict
y_pred = knn.predict(X_test)


**Confusion Matrix**

example: target has 3 values 0,1,2. Then we have a 3*3 matrix

> Diagonal: where the algorithm got correct results.


In [16]:
#Evaluate Model
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[ 9  0  0]
 [ 0 10  3]
 [ 0  1  7]]


In [18]:
print(accuracy_score(y_test,y_pred))

0.8666666666666667


**Now, let's try defining KNN ourselves instead of using KNeighborsClassifier**

In [None]:
from collections import Counter
import math

In [None]:
# gets an "unknown" point and classifies it