# Classification Using KNN

The KNN algorithm is a classic nonparametric supervised machine learning model. Unlike unsupervised machine learning algorithms like K-Means, KNN requires labeled data. The abbreviation stands for "K Nearest Neighbors," and the algorithm predicts the labels of the test data set by looking at the labels of its closest neighbors in the feature space of the training data set. Typically, KNN uses Euclidean distance, though other distance metrics like the Manhattan distance can be used as well. The “K” is the most important hyperparameter that can be tuned to optimize the performance of the model.
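
To make the voting idea concrete, here is a minimal from-scratch sketch of the prediction step for a single query point, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy arrays are made up for illustration; the rest of this notebook uses scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (made up): two well-separated groups in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(["neg", "neg", "neg", "pos", "pos", "pos"])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3)
print(pred_a, pred_b)
```

Note that there is no training step beyond storing the data: all of the work happens at prediction time, when distances to every stored point are computed.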

KNN is a comparatively simple algorithm that provides good results for a wide range of classification problems. It can be applied to both small and large data sets, but it has drawbacks: prediction can be very computationally expensive for large data sets or when the feature space has a high number of dimensions.

## Import Libraries

In [281]:
from tqdm import tqdm
import numpy as np
import pandas as pd
from itertools import accumulate
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix
from sklearn.feature_selection import f_classif
from sklearn.utils import resample

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

sns.set_context('notebook')
sns.set_style('white')

# 1. Diabetes classification

### Load data

The data set you'll use classifies patients as diabetes positive or negative based on medical information such as cholesterol, glucose levels, age, gender, height, waist, and hip measurements.

In [285]:
df = pd.read_excel('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/X4i8vXLw81g4wEH473zIFA/Diabetes-Classification.xlsx')

### Analyse data

In [287]:
df.head()

Unnamed: 0,Patient number,Cholesterol,Glucose,HDL Chol,Chol/HDL ratio,Age,Gender,Height,Weight,BMI,Systolic BP,Diastolic BP,waist,hip,Waist/hip ratio,Diabetes,Unnamed: 16,Unnamed: 17
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes,6.0,6.0
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes,,
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes,,
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes,,
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes,,


### Show statistics about the dataset

In [289]:
df.describe()

Unnamed: 0,Patient number,Cholesterol,Glucose,HDL Chol,Chol/HDL ratio,Age,Height,Weight,BMI,Systolic BP,Diastolic BP,waist,hip,Waist/hip ratio,Unnamed: 16,Unnamed: 17
count,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,1.0,1.0
mean,195.5,207.230769,107.338462,50.266667,4.524615,46.774359,65.951282,177.407692,28.775641,137.133333,83.289744,37.869231,42.992308,0.881385,6.0,6.0
std,112.727548,44.666005,53.798188,17.279069,1.736634,16.435911,3.918867,40.407824,6.600915,22.859528,13.498192,5.760947,5.664342,0.073212,,
min,1.0,78.0,48.0,12.0,1.5,19.0,52.0,99.0,15.2,90.0,48.0,26.0,30.0,0.68,6.0,6.0
25%,98.25,179.0,81.0,38.0,3.2,34.0,63.0,150.25,24.1,122.0,75.0,33.0,39.0,0.83,6.0,6.0
50%,195.5,203.0,90.0,46.0,4.2,44.5,66.0,173.0,27.8,136.0,82.0,37.0,42.0,0.88,6.0,6.0
75%,292.75,229.0,107.75,59.0,5.4,60.0,69.0,200.0,32.275,148.0,90.0,41.0,46.0,0.93,6.0,6.0
max,390.0,443.0,385.0,120.0,19.3,92.0,76.0,325.0,55.8,250.0,124.0,56.0,64.0,1.14,6.0,6.0


Drop the last two columns: each contains only a single non-null value, so they are not relevant.

In [291]:
# drop() returns a new DataFrame, so assign the result back to df
df = df.drop(columns=['Unnamed: 16', 'Unnamed: 17'])
df.head(5)

Unnamed: 0,Patient number,Cholesterol,Glucose,HDL Chol,Chol/HDL ratio,Age,Gender,Height,Weight,BMI,Systolic BP,Diastolic BP,waist,hip,Waist/hip ratio,Diabetes
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes


Produce a frequency table by calculating the proportion of each unique value in the 'Diabetes' column of the df DataFrame.

In [293]:
frequency_table = df['Diabetes'].value_counts()
props = frequency_table.apply(lambda x: x / len(df['Diabetes']))
props

Diabetes
No diabetes    0.846154
Diabetes       0.153846
Name: count, dtype: float64
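
As a side note, pandas can compute these proportions directly. The sketch below uses a small made-up Series in place of `df['Diabetes']` and compares the manual division with `value_counts(normalize=True)`; both approaches should give the same proportions.

```python
import pandas as pd

# Toy labels (made up) standing in for df['Diabetes']
labels = pd.Series(["No diabetes"] * 3 + ["Diabetes"], name="Diabetes")

# Manual approach: counts divided by the number of rows
props_manual = labels.value_counts().apply(lambda x: x / len(labels))
# Built-in approach: normalize=True returns proportions instead of counts
props_builtin = labels.value_counts(normalize=True)
print(props_builtin)
```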

In [294]:
# Selecting relevant columns from the dataset
df_reduced = df[["Diabetes", "Cholesterol", "Glucose", "BMI", "Waist/hip ratio", "HDL Chol", "Chol/HDL ratio", "Systolic BP", "Diastolic BP", "Weight"]]

# Extracting only numerical columns (excluding the categorical "Diabetes" column)
numerical_columns = df_reduced.iloc[:, 1:10]

## Standardizing the Data

In [296]:
# Applying scaling
scaler = StandardScaler()
preproc_reduced = scaler.fit(numerical_columns)

df_standardized = preproc_reduced.transform(numerical_columns)

Since <code>transform()</code> returns a NumPy array, we convert it back into a Pandas DataFrame with the original column names.

In [298]:
# Converting the standardized array back to DataFrame
df_standardized = pd.DataFrame(df_standardized, columns=numerical_columns.columns)

After scaling, each column has a mean of approximately 0 and a standard deviation of approximately 1:

In [300]:
df_standardized.describe()

Unnamed: 0,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
count,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0
mean,7.287618000000001e-17,-1.457524e-16,2.2773810000000003e-17,-6.741046e-16,4.3270230000000006e-17,-6.376666000000001e-17,2.915047e-16,-3.006142e-16,-1.867452e-16
std,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285
min,-2.896986,-1.104399,-2.059272,-2.754229,-2.21747,-1.743891,-2.064517,-2.617764,-1.942901
25%,-0.6328534,-0.4902078,-0.7092421,-0.7027598,-0.7108267,-0.7637287,-0.6628646,-0.6149262,-0.6729533
50%,-0.09484179,-0.3227011,-0.1479938,-0.01893664,-0.2472441,-0.1871623,-0.04964184,-0.0956721,-0.1092203
75%,0.4880041,0.007659498,0.5308134,0.6648866,0.5060777,0.5047173,0.4759777,0.4977612,0.5598254
max,5.285274,5.167799,4.099291,3.536944,4.040895,8.51899,4.943744,3.019853,3.657259
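
The standard deviation in the table is 1.001285 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`); with 390 rows the ratio is sqrt(390/389), which is about 1.001285. The sketch below checks this on random stand-in data (not the diabetes set).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 390  # same row count as the diabetes data; the values themselves are random
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(loc=100, scale=15, size=n)})

z_sklearn = StandardScaler().fit_transform(toy)[:, 0]
# Manual z-score using the population standard deviation (ddof=0)
z_manual = (toy["x"] - toy["x"].mean()) / toy["x"].std(ddof=0)

sample_std = pd.Series(z_sklearn).std()  # pandas default is ddof=1
expected = np.sqrt(n / (n - 1))
print(round(sample_std, 6), round(expected, 6))
```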


After standardizing the numerical columns, we need to reattach the "Diabetes" column, which was left out of the scaling step.

In [383]:
df_stdize = pd.concat([df_reduced['Diabetes'], df_standardized], axis=1)
df_stdize.head(5)

Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
0,No diabetes,-0.319013,-0.564655,-0.951944,-0.565995,-0.073401,-0.360132,-0.838071,-0.985822,-1.447312
1,No diabetes,-1.372619,-0.527432,-0.360358,-0.70276,-0.536983,-0.533102,-1.276087,-1.875972,-1.05084
2,No diabetes,0.218998,-0.601879,0.079539,0.117828,0.216339,-0.302476,-1.188484,-0.837464,0.237692
3,No diabetes,0.420753,-0.192418,-1.391841,-1.249818,1.143504,-0.763729,-0.662865,-1.430897,-1.571209
4,No diabetes,-0.969111,-0.304089,-1.300828,-0.839524,0.96966,-1.224982,-0.662865,0.201045,-0.902163


## Split dataset

Before training a machine learning model, we need to split the dataset into features (X) and target labels (y).

In [305]:
X = df_stdize.drop(columns=['Diabetes'])
y = df_stdize['Diabetes']

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
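
Because the classes are imbalanced (about 15% positive), a purely random split can leave the test set with a different class ratio than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up toy labels with a similar imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (made up) with roughly the imbalance seen above: 15% positive
y_toy = np.array([1] * 15 + [0] * 85)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```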

Since the <code>Diabetes</code> column is categorical (e.g., <code>No diabetes</code> / <code>Diabetes</code>), many machine learning models require it to be converted into numerical values. We use LabelEncoder() from sklearn.preprocessing to achieve this.

In [307]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

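# Fit the encoder on the training labels only
y_train_encoded = label_encoder.fit_transform(y_train)
# Reuse the same mapping for the test labels instead of refitting
y_test_encoded = label_encoder.transform(y_test)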
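
It helps to know which integer each label receives. `LabelEncoder` stores the classes in sorted order, so here 'Diabetes' maps to 0 and 'No diabetes' maps to 1. A minimal sketch with made-up label lists:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the mapping and encodes in one step
train_codes = le.fit_transform(["No diabetes", "Diabetes", "No diabetes"])
# transform reuses the mapping that was already learned
test_codes = le.transform(["Diabetes", "No diabetes"])

print(list(le.classes_))             # classes are stored in sorted order
print(train_codes, test_codes)
print(le.inverse_transform([0, 1]))  # map codes back to the original labels
```

This ordering matters later when reading a confusion matrix built from the encoded labels: row and column 0 correspond to 'Diabetes'.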

## Fit KNN

- Initializes and trains a KNN classifier using the training data.
- Predicts labels for the test set.
- Calculates and prints the accuracy of the model.

In [310]:
# Create a KNN classifier
knn = KNeighborsClassifier()

knn.fit(X_train, y_train_encoded)

#calculate overall accuracy
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f'Accuracy: {accuracy:.2%}')

Accuracy: 88.46%


## Hyperparameter tuning

- Initialize the <code>KNeighborsClassifier</code> without specifying the hyperparameter k; <code>GridSearchCV</code> will be used to find the best value.
- The <code>param_grid</code> dictionary contains the range of candidate values for <code>n_neighbors</code> (the number of neighbors in KNN).

In [313]:
# Create a KNN classifier
knn = KNeighborsClassifier()

param_grid = {'n_neighbors': range(1, 12)}

- GridSearchCV automatically tests different values of k (from 1 to 11) and selects the best one.
- <code>cv=10</code>: Uses 10-fold cross-validation to ensure reliable performance estimates.
- <code>.fit(X_train, y_train_encoded)</code>: Trains the model using different values of k and evaluates performance.

In [315]:
# Perform grid search with cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=10)
grid_search.fit(X_train, y_train_encoded)

In [316]:
# Best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print(f"Best accuracy score: {grid_search.best_score_:.3f}")

# Full results
results = grid_search.cv_results_
for mean_score, std_score, params in zip(results['mean_test_score'], results['std_test_score'], results['params']):
    print(f"Mean accuracy: {mean_score:.3f} (std: {std_score:.3f}) with: {params}")

Best parameters found:  {'n_neighbors': 7}
Best accuracy score: 0.917
Mean accuracy: 0.875 (std: 0.053) with: {'n_neighbors': 1}
Mean accuracy: 0.820 (std: 0.047) with: {'n_neighbors': 2}
Mean accuracy: 0.901 (std: 0.037) with: {'n_neighbors': 3}
Mean accuracy: 0.897 (std: 0.038) with: {'n_neighbors': 4}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 5}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 6}
Mean accuracy: 0.917 (std: 0.043) with: {'n_neighbors': 7}
Mean accuracy: 0.917 (std: 0.043) with: {'n_neighbors': 8}
Mean accuracy: 0.917 (std: 0.036) with: {'n_neighbors': 9}
Mean accuracy: 0.917 (std: 0.036) with: {'n_neighbors': 10}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 11}


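The best value found is k = 7, with a mean cross-validated accuracy of 0.917. Note that k = 8, 9, and 10 show the same mean accuracy at three decimal places, so the choice among them is not clear-cut.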
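
For intuition, the search can be approximated by hand: run cross-validation for each candidate k and keep the k with the highest mean score. The sketch below does this with `cross_val_score` on synthetic data from `make_classification` (a stand-in, not the diabetes set), so its numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the diabetes set)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

mean_scores = {}
for k in range(1, 12):
    # 10-fold cross-validated accuracy for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=10)
    mean_scores[k] = scores.mean()

# Keep the k with the highest mean accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```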

## ANOVA for feature selection

- <code>f_classif(X, y)</code> runs a one-way ANOVA F-test between each feature in X and the target variable y.
    - It returns:
        - <code>fs_score</code>: F-statistic for each feature (higher values mean the class means differ more relative to the within-class spread).
        - <code>fs_p_value</code>: P-values showing the statistical significance of each feature.
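
The F-statistic is the ratio of between-class variance to within-class variance. The sketch below computes it by hand for a single made-up feature with two classes and compares the result with `f_classif`; the toy values are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Toy data (made up): one feature, two classes
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])

groups = [x[y == c] for c in np.unique(y)]
grand_mean = x.mean()
# Between-group sum of squares: group size times squared distance of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of each value around its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(x) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_sklearn, p_sklearn = f_classif(x.reshape(-1, 1), y)
print(f_manual, f_sklearn[0], p_sklearn[0])
```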

In [320]:
fs_score, fs_p_value = f_classif(X, y)

# Combine scores with feature names
fs_scores = pd.DataFrame({'Feature': X.columns, 'F-Score': fs_score, 'P-Value': fs_p_value})
fs_scores = fs_scores.sort_values(by='F-Score', ascending=False)

fs_scores

Unnamed: 0,Feature,F-Score,P-Value
1,Glucose,350.809177,3.205119e-56
5,Chol/HDL ratio,31.242678,4.298115e-08
0,Cholesterol,16.89338,4.827353e-05
6,Systolic BP,15.931795,7.853024e-05
3,Waist/hip ratio,12.348083,0.0004935038
8,Weight,10.588454,0.001237749
2,BMI,8.365055,0.004040512
4,HDL Chol,5.973355,0.01496812
7,Diastolic BP,0.947292,0.331016


You see that 'Glucose' is the most important feature in your data set for predicting diabetes because its p-value is the smallest (and its F-Score is the highest). You'll do two things here to improve your performance:

1. You'll balance your data set so that you have equal numbers of records with and without diabetes. 

2. You'll see how you do when you use **just** the Glucose readings to calculate the distance to your 'neighbors' in KNN.


## Downsampling

- <code>np.where(condition, 1, 0)</code> replaces "Diabetes" with 1 and everything else with 0.
- Now, Diabetes = 1 represents patients with diabetes, and Diabetes = 0 represents those without.

In [324]:
# Converting Diabetes column into binary (0 for No Diabetes and 1 for Diabetes)
df_stdize['Diabetes'] = np.where(df_stdize['Diabetes'] == 'Diabetes', 1, 0)
df_stdize

Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
0,0,-0.319013,-0.564655,-0.951944,-0.565995,-0.073401,-0.360132,-0.838071,-0.985822,-1.447312
1,0,-1.372619,-0.527432,-0.360358,-0.702760,-0.536983,-0.533102,-1.276087,-1.875972,-1.050840
2,0,0.218998,-0.601879,0.079539,0.117828,0.216339,-0.302476,-1.188484,-0.837464,0.237692
3,0,0.420753,-0.192418,-1.391841,-1.249818,1.143504,-0.763729,-0.662865,-1.430897,-1.571209
4,0,-0.969111,-0.304089,-1.300828,-0.839524,0.969660,-1.224982,-0.662865,0.201045,-0.902163
...,...,...,...,...,...,...,...,...,...,...
385,0,0.443170,-0.043523,-0.542385,-0.018937,-0.363140,0.389404,0.563581,0.497761,-1.298635
386,1,0.420753,3.194941,1.323387,-0.429231,0.100443,-0.129506,0.300771,0.349403,0.361590
387,0,2.102039,-0.322701,-1.073295,-1.660112,3.924999,-1.109668,3.542092,0.497761,-1.546430
388,1,0.555256,1.426814,-0.724411,0.528122,3.693208,-1.455608,1.439613,-0.095672,-1.249076


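- **Problem**: The dataset is **imbalanced**: about 85% of the records are `No diabetes` and only about 15% are `Diabetes`, as the frequency table showed.
- **Solution**: We **downsample** the `No diabetes` cases to match the number of `Diabetes` cases.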
    - The function `resample()` is used to randomly select **`n_samples=positive_diabetes`** (the same count as the positive diabetes cases).
    - The parameter `replace=False` ensures that the selected cases are **unique**, meaning no duplicates in the downsampled data.
    - The parameter `random_state=42` ensures that the results are **reproducible**, so if the code is run again, the exact same downsampling will occur.
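
The effect of these parameters can be checked on a small made-up frame before applying them to the real data. The sketch below draws four rows without replacement twice with the same seed; the column names `id` and `label` are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Toy majority-class frame (made up): 10 rows with distinct ids
majority = pd.DataFrame({"id": range(10), "label": 0})

# Same seed, same call: the two samples should be identical
sample_a = resample(majority, replace=False, n_samples=4, random_state=42)
sample_b = resample(majority, replace=False, n_samples=4, random_state=42)

print(sample_a["id"].tolist())
print(sample_a["id"].is_unique, sample_a.equals(sample_b))
```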

In [326]:
# Number of rows for positive diabetes
positive_diabetes = df_stdize[df_stdize['Diabetes'] == 1].shape[0]
print('Number of rows for positive diabetes: ', positive_diabetes)

# Select only the negative diabetes cases (Diabetes = 0)
negative_diabetes = df_stdize[df_stdize['Diabetes'] == 0]
# Downsample the negative cases to match the number of positive cases
negative_diabetes_downsampled = resample(
    negative_diabetes, 
    replace=False,  # No replacement, so we sample without duplicates
    n_samples=positive_diabetes,  # Match the number of positive cases
    random_state=42  # For reproducibility
)

Number of rows for positive diabetes:  60


- Concatenates the downsampled No Diabetes cases with all Diabetes cases.
- Now, both classes have equal representation in the dataset.
- <code>.sample(5)</code>: Displays 5 random samples from the new balanced dataset.

In [328]:
# Put positive and negative diabetes case into one df -> balanced
balanced = pd.concat([negative_diabetes_downsampled, df_stdize[df_stdize['Diabetes'] == 1]])
balanced.sample(5)

Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
327,1,0.936347,1.668768,-0.087318,-1.113054,-0.36314,0.677687,0.607383,0.126865,-0.456133
188,1,-0.386265,2.245736,0.625619,-0.839524,0.390182,-0.706072,-1.188484,-1.430897,0.510267
234,1,-1.507122,5.167799,-0.512047,0.254593,-1.116461,-0.014192,0.037961,-1.282539,-0.134
298,1,-1.776128,2.152677,0.291904,1.34871,-1.522096,0.447061,-1.188484,-1.13418,0.460708
144,0,-1.238116,-0.080747,3.007132,-1.386583,-1.058514,0.158778,0.475978,-0.095672,0.237692


In [329]:
balanced['Diabetes'].value_counts()

Diabetes
0    60
1    60
Name: count, dtype: int64

## Fitting a simpler model

Select only the Glucose column from the balanced dataset as the feature to be used for prediction.

In [332]:
X_simple = balanced[['Glucose']]
y = balanced['Diabetes']

# Split the data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple, y, test_size=0.2, random_state=42)

#### Evaluating the model with only <code>Glucose</code> feature.

In [334]:
knn_simple = KNeighborsClassifier()
knn_simple.fit(X_train_simple, y_train_simple)
y_pred_simple = knn_simple.predict(X_test_simple)
accuracy = accuracy_score(y_test_simple, y_pred_simple)
print(f'Accuracy: {accuracy:.2%}')

Accuracy: 91.67%


This time, the accuracy is 91.67% (22 of the 24 test samples), which is good considering that you are fitting on only the Glucose column instead of all of the columns.

In [336]:
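# Evaluate confusion matrix
# Note: y_test_encoded and y_pred come from the earlier all-features model (In [310]),
# not from the Glucose-only model. Labels follow LabelEncoder's sorted order:
# 0 = 'Diabetes', 1 = 'No diabetes'.
cm = confusion_matrix(y_test_encoded, y_pred)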

# Print confusion matrix
print("Confusion Matrix:")
print(cm)
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f'Accuracy: {accuracy:.2%}')

Confusion Matrix:
[[ 8  8]
 [ 1 61]]
Accuracy: 88.46%
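
Overall accuracy hides how differently the two classes are handled. The sketch below recomputes accuracy from the matrix printed above and adds per-class recall and Cohen's kappa, assuming the LabelEncoder ordering (row and column 0 = 'Diabetes', 1 = 'No diabetes'). Under that assumption, only 8 of the 16 actual diabetes cases (50%) are detected.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm = np.array([[8, 8],
               [1, 61]])

n = cm.sum()
accuracy = np.trace(cm) / n
# Recall per class: correct predictions divided by actual cases in each row
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_observed = accuracy
p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={accuracy:.4f}, recall={recall_per_class.round(3)}, kappa={kappa:.3f}")
```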


# 2. Poisonous mushrooms classification

Use a mushroom data set from Kaggle that classifies mushrooms based on their characteristics. A mushroom is edible if <code>class == 1</code> and is poisonous if <code>class == 0</code>.

In [339]:
df_mushrooms = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/3qPv1_g8n6KvWjyLOrjXyw/mushroom-cleaned.csv')
df_mushrooms.sample(5)

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
13367,463,6,0,11,0.56857,424,6,0.943195,0
53476,508,1,3,2,0.422199,1772,6,0.027372,0
43000,333,0,0,4,1.563317,557,11,0.943195,1
38284,106,0,0,5,0.800202,92,6,0.027372,1
23020,663,2,6,10,0.80333,1697,11,0.88845,0


In [340]:
df_mushrooms_reduced= df_mushrooms.drop(columns=['class'])

## Standardizing data

In [342]:
# Applying scaling
scaler_2 = StandardScaler()
preproc_reduced_2 = scaler_2.fit(df_mushrooms_reduced)

df_standardized_2 = preproc_reduced_2.transform(df_mushrooms_reduced)

In [343]:
# Converting the standardized array back to DataFrame
df_mushrooms_standardized = pd.DataFrame(df_standardized_2, columns=df_mushrooms_reduced.columns)

In [344]:
df_stdize = pd.concat([df_mushrooms['class'], df_mushrooms_standardized], axis=1)
df_stdize

Unnamed: 0,class,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season
0,1,2.236139,-0.925864,-0.063737,0.834467,4.682845,0.631570,0.791508,2.788402
1,1,2.483444,-0.925864,-0.063737,0.834467,4.682845,0.646914,0.791508,2.788402
2,1,2.233361,-0.925864,-0.063737,0.834467,4.383334,0.658423,0.791508,2.788402
3,1,1.927704,0.925572,-0.063737,0.834467,4.652283,0.658423,0.791508,2.788402
4,1,2.049966,0.925572,-0.063737,0.834467,4.536146,0.527996,0.791508,-0.029348
...,...,...,...,...,...,...,...,...,...
54030,1,-1.373393,0.462713,0.384935,-1.665349,0.197600,-0.616434,1.098064,-0.029348
54031,1,-1.348385,-0.925864,0.384935,-1.665349,0.656036,-0.717450,1.098064,-0.029348
54032,1,-1.348385,0.462713,0.384935,-1.665349,0.240388,-0.597253,1.098064,-0.208490
54033,1,-1.356721,-0.925864,0.384935,-1.665349,0.423762,-0.716172,1.098064,-0.208490


## Splitting data

In [346]:
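# Note: X is built from the raw df_mushrooms, not from the standardized df_stdize above,
# so the model below is trained on unscaled features.
X = df_mushrooms.drop(columns=['class'])
Y = df_mushrooms["class"]
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.2)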

In [347]:
# Fit KNN with default settings and evaluate accuracy on the test set
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

y_pred= knn.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy is {accuracy:.2%}")

Accuracy is 71.64%
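
The cell above builds `X` from `df_mushrooms` rather than from the standardized frame, so this accuracy comes from unscaled features. Since KNN relies on distances, features with large numeric ranges can dominate the result. One way to make sure scaling is always applied, sketched here on synthetic data rather than the mushroom set, is to wrap the scaler and the classifier in a pipeline. Which variant scores higher depends on the data, so no outcome is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the mushroom set); one column is inflated to mimic mixed scales
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# KNN on the raw features
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
# The pipeline fits the scaler on the training rows only, then applies it before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

acc_raw = raw_knn.score(X_te, y_te)
acc_scaled = scaled_knn.score(X_te, y_te)
print(f"raw: {acc_raw:.2%}, scaled: {acc_scaled:.2%}")
```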


In [348]:
# ANOVA F-test for each mushroom feature
fs_score, fs_p_value = f_classif(X, Y)

fs_scores= pd.DataFrame({'Feature': X.columns , 'F-Score': fs_score, 'P-Value': fs_p_value})
fs_scores= fs_scores.sort_values(by='F-Score', ascending=False)
fs_scores

Unnamed: 0,Feature,F-Score,P-Value
4,stem-height,1879.70974,0.0
5,stem-width,1869.166385,0.0
0,cap-diameter,1524.991095,0.0
1,cap-shape,978.047168,8.49613e-213
6,stem-color,904.881142,3.6433200000000003e-197
7,season,374.07857,4.629833e-83
3,gill-color,221.857661,4.474142999999999e-50
2,gill-attachment,149.575557,2.383446e-34


From the table, stem-height has the highest F-score, closely followed by stem-width, so by this ANOVA ranking they are the most important features for classifying whether a mushroom is edible.