# Witful ML 05 - Classification with KNN
by Kaan Kabalak, Editor In Chief @ witfuldata.com

# The Data Frame

The dataset is from https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

In [2]:
#Data frame from .csv
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head(5)

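|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |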


In [3]:
#Info
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


# Instantiate & Fit

In [4]:
#Defining X-y variables
X = diabetes_df.drop("Outcome",axis=1).values
y = diabetes_df["Outcome"].values

In [5]:
#Instantiating the model
k_model = KNeighborsClassifier(n_neighbors=3)

The n_neighbors argument sets the number of closest data points (neighbors) that the model takes into consideration when predicting the class of an observation.
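
To make the role of n_neighbors more concrete, here is a minimal sketch of the idea behind the prediction step, written in plain NumPy on a made-up toy dataset. It assumes Euclidean distance and a simple majority vote, which corresponds to the classifier's default settings, but it is only an illustration and not how scikit-learn implements it internally. The function name `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, n_neighbors=3):
    """Predict the class of x_new by majority vote among its nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the n_neighbors closest training points
    nearest = np.argsort(distances)[:n_neighbors]
    # Count the votes per class label and return the most common one
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (made up for illustration): two features, two classes
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([2.0, 2.0]), n_neighbors=3))
print(knn_predict(X_toy, y_toy, np.array([6.0, 6.5]), n_neighbors=3))
```

The first point sits next to the class 0 cluster and the second next to the class 1 cluster, so the three nearest neighbors vote for 0 and 1 respectively. An odd n_neighbors value is a common choice for two-class problems because it avoids tied votes.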

In [6]:
#Fitting
k_model.fit(X,y)

KNeighborsClassifier(n_neighbors=3)

# Prediction & Evaluation

In [7]:
#Saving predictions to a variable
pred = k_model.predict(X)

In [8]:
k_model.score(X,y)

0.859375

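This 0.859375 is our model's score. It means that our model got about 86% of the predictions right. Keep in mind that the score here is measured on the same data the model was fitted on, so it is likely to be higher than what the model would achieve on data it has not seen.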

## Understanding how the score is measured

I want to explain how the score is actually measured. As the lines below show, it is a very simple calculation: for each observation we check whether the predicted value matches the real one (that is, whether their difference is zero), and then we divide the number of matches by the total number of observations.

In [9]:
#Forming a data frame with real and predicted values
perf = pd.DataFrame({"Real":y,"Predicted":pred})
perf.head(10)

|   | Real | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 1 | 0 |
| 7 | 0 | 1 |
| 8 | 1 | 1 |
| 9 | 1 | 0 |


In [10]:
#Labeling each prediction as correct or incorrect
perf_list = []
for i,row in perf.iterrows():
    #A difference of zero means the prediction matches the real value
    if row["Real"]-row["Predicted"] == 0:
        perf_list.append("correct")
    else:
        perf_list.append("incorrect")

#Ratio of correct predictions to all observations
print(perf_list.count("correct")/len(perf_list))

0.859375
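
The same ratio can also be computed without an explicit loop. The sketch below uses only the ten rows printed in the real vs. predicted table above, so its result differs from the full-data score of 0.859375. The array names `real` and `predicted` are chosen here for illustration.

```python
import numpy as np

# First ten real and predicted labels, copied from the perf.head(10) output
real = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1])
predicted = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0])

# Element-wise comparison gives True for each correct prediction
correct = real == predicted

# The mean of a boolean array is the fraction of True values, i.e. the accuracy
accuracy = correct.mean()

print(correct.sum(), "correct out of", len(real))
print(accuracy)
```

Seven of these ten predictions are correct, which gives 0.7. Applied to the full arrays, `(y == pred).mean()` should return the same value as `k_model.score(X,y)`, because for classifiers the score method reports the mean accuracy.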
