## Predict Diabetes with K-NN

### Objective: Predict whether a person will be diagnosed with diabetes or not

- The dataset contains 768 patient records, each labeled as diabetic (1) or non-diabetic (0)

In [16]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [17]:
data = pd.read_csv("diabetes.csv")
print("Length of DF: ", len(data))
print(data.head())

Length of DF:  768
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


### Data Pre-processing

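- Zeros in 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI' and 'Insulin' cannot be accepted: a living person cannot have a glucose level, blood pressure, skin thickness or BMI of zero, so these zeros most likely stand for missing measurements (the notebook treats 'Insulin' the same way)
- Replace the zeros in these columns with the column mean, computed over the non-zero values only
- The mean is used because it is a simple stand-in for a typical value in that column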
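
The replacement rule can be sketched in plain Python, without pandas. This is only an illustration: the helper `impute_zeros` and the sample readings are made up for the example and are not part of the notebook. Unlike the notebook's cell, the sketch keeps the fractional part of the mean instead of truncating it to an integer.

```python
from statistics import mean

def impute_zeros(values):
    """Replace zeros with the mean of the non-zero values (hypothetical helper)."""
    observed = [v for v in values if v != 0]
    fill = mean(observed)  # zeros are excluded so they do not drag the mean down
    return [fill if v == 0 else v for v in values]

# Made-up glucose readings; the 0 stands for a missing measurement
glucose = [148, 85, 0, 89, 137]
imputed = impute_zeros(glucose)
print(imputed)
```

Excluding the zeros before averaging matters: including them would pull the fill value toward zero, which is exactly the distortion the cleaning step tries to avoid.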

In [18]:
print(data['Glucose'])

0      148
1       85
2      183
3       89
4      137
      ... 
763    101
764    122
765    121
766    126
767     93
Name: Glucose, Length: 768, dtype: int64


In [19]:
# Replace the zeroes
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [20]:
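# Treat zeros in these columns as missing values and fill them with the column mean
for column in zero_not_accepted:
    data[column] = data[column].replace(0, np.nan)  # np.nan: the np.NaN alias was removed in NumPy 2.0
    mean = int(data[column].mean(skipna=True))  # mean of the non-missing values, truncated to an integer
    data[column] = data[column].replace(np.nan, mean)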

In [21]:
print(data['Glucose'])

0      148.0
1       85.0
2      183.0
3       89.0
4      137.0
       ...  
763    101.0
764    122.0
765    121.0
766    126.0
767     93.0
Name: Glucose, Length: 768, dtype: float64


In [22]:
# Split the dataset into train and test sets
x = data.iloc[:, 0:8] # Features: all rows, columns 0-7
y = data.iloc[:, 8] # Label: all rows, column 8 ('Outcome')
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.2)

In [26]:
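# Attribute scaling
sc_x = StandardScaler() # standardizes each feature to zero mean and unit variance (values are not bounded)
x_train = sc_x.fit_transform(x_train) # learn mean and standard deviation from the training data, then scale it
x_test = sc_x.transform(x_test) # scale the test data with the training statistics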
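
What standardization computes can be sketched in plain Python. The sketch below assumes the usual definition z = (x - mean) / std with the population standard deviation, which is what `StandardScaler` uses by default. The helper names and the age values are made up for the illustration.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)

def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in column]

# Made-up ages, split into a training part and a test part
train_ages = [21, 31, 32, 33, 50]
test_ages = [25, 60]

# The statistics come from the training data only and are reused for the test data
mu, sigma = fit_standardizer(train_ages)
train_scaled = standardize(train_ages, mu, sigma)
test_scaled = standardize(test_ages, mu, sigma)

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1, but individual values can fall outside the range -1 to 1, as the largest ages in this example do.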

### Determining the model variables

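- The hyperparameter K should be an odd number, so that a majority vote between the two classes cannot end in a tie
- As a rule of thumb, this notebook takes the square root of the length of the y_test set (√154 ≈ 12.4) and picks the nearest odd number below it, K = 11
- p is the power parameter of the Minkowski distance; p=2 corresponds to the Euclidean distance
- Euclidean distance is the method used to measure how far apart two samples are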
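
To make the role of p concrete, here is a minimal sketch of the Minkowski distance in plain Python. The function name and the two sample points are chosen for the illustration only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0, 0)
point_b = (3, 4)

euclidean = minkowski_distance(point_a, point_b, p=2)  # straight-line distance
manhattan = minkowski_distance(point_a, point_b, p=1)  # sum of absolute differences

print(euclidean)
print(manhattan)
```

With p=2 the two points are 5 units apart (the classic 3-4-5 triangle); with p=1 the distance is 3 + 4 = 7.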


In [35]:
# Determine the hyperparameter, K
import math
print(len(y_test))
math.sqrt(len(y_test))

154


12.409673645990857
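
The step from 12.4 to K = 11 can be written down as a small helper. This is a sketch of the rule of thumb used here, not a scikit-learn function; the name `odd_k_from_sqrt` is hypothetical.

```python
import math

def odd_k_from_sqrt(n_samples):
    """Largest odd integer not above sqrt(n_samples), with a minimum of 1."""
    k = math.isqrt(n_samples)  # integer part of the square root
    if k % 2 == 0:
        k -= 1  # step down to the next odd number to avoid tied votes
    return max(k, 1)

k = odd_k_from_sqrt(154)
print(k)
```

For the 154 test samples this gives 11, the value passed to `n_neighbors` in the next cell.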

In [36]:
# Define the model: Init K-NN
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier.fit(x_train, y_train)
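
For intuition about what the classifier does at prediction time, here is a brute-force sketch of K-NN in plain Python: measure the Euclidean distance from the query to every training point, keep the K closest, and take a majority vote. It is a simplified illustration with made-up points, not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training label with its Euclidean distance to the query
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two made-up clusters: class 0 near the origin, class 1 near (5, 5)
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
near_five = knn_predict(train_points, train_labels, (5.5, 5.5), k=3)
print(near_origin, near_five)
```

Because every prediction compares the query against all training points, feature scaling matters: an unscaled feature with a large range would dominate the distance.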

In [37]:
# Predict the test results
y_pred = classifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

### The Confusion Matrix Analysis Tool

In machine learning, the confusion matrix is a table used to evaluate the performance of a classification model, such as a K-Nearest Neighbors (KNN) model. It summarizes the predictions made by the model on a test dataset, compared to the actual ground truth labels. For a binary problem the confusion matrix has four cells, each holding a count: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

1. True Positives (TP): The number of instances that are correctly predicted as positive (belong to the positive class).

2. False Positives (FP): The number of instances that are incorrectly predicted as positive but actually belong to the negative class.

3. True Negatives (TN): The number of instances that are correctly predicted as negative (belong to the negative class).

4. False Negatives (FN): The number of instances that are incorrectly predicted as negative but actually belong to the positive class.

<table>
    <tr>
        <th>Actual</th>
        <th>Predicted Positive</th>
        <th>Predicted Negative</th>
    </tr>
    <tr>
        <td>Actual Positive </td>
        <td>TP</td>
        <td>FN</td>
    </tr>
    <tr>
        <td>Actual Negative </td>
        <td>FP</td>
        <td>TN</td>
    </tr>
</table>

Here's how to interpret the confusion matrix:

1. TP (True Positives): The model correctly predicted instances that belong to the positive class.

2. FN (False Negatives): The model incorrectly predicted instances as negative when they actually belong to the positive class.

3. FP (False Positives): The model incorrectly predicted instances as positive when they actually belong to the negative class.

4. TN (True Negatives): The model correctly predicted instances that belong to the negative class.

Using these values, we can compute various evaluation metrics, such as **accuracy**, precision, recall (sensitivity), specificity, and **F1-score**, which help us assess the performance of the KNN model on the test data.

- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity) = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

A good KNN model will have a high accuracy, precision, recall, specificity, and F1-Score, indicating that it is making correct predictions for both positive and negative classes.
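
The formulas above translate directly into code. The sketch below uses made-up counts to show the arithmetic; the function name is hypothetical, and it does not guard against division by zero (for example, when the model never predicts the positive class).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard binary classification metrics from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Made-up counts for illustration only
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```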


In [41]:
# Evaluate the model with the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix")
print(cm)
print(f1_score(y_test, y_pred))

Confusion Matrix
[[94 13]
 [15 32]]
0.6956521739130436


In [42]:
print(accuracy_score(y_test, y_pred))

0.8181818181818182
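
As a sanity check, the reported scores can be recomputed by hand from the printed matrix. scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, ordered by label, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`. The sketch below copies the four counts from the output above and uses plain Python lists instead of the NumPy array.

```python
# Counts copied from the printed confusion matrix
cm = [[94, 13],
      [15, 32]]

# Row 0 / column 0 is class 0 (non-diabetic), row 1 / column 1 is class 1 (diabetic)
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(round(f1, 4))
```

Rounded to four decimals, the two values agree with the `accuracy_score` and `f1_score` outputs above, which confirms that 32 (not 94) is the true-positive count.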


### Interpreting the results in the confusion matrix
[[94, 13], 
[15, 32]]

scikit-learn sorts the class labels, so the first row and column belong to the negative class (Outcome = 0) and the second row and column to the positive class (Outcome = 1):

```
                 Predicted Negative    Predicted Positive
Actual Negative         94                   13
Actual Positive         15                   32
```

Here's how to interpret the values:

1. True Positives (TP): The number of instances that are correctly predicted as positive. In this case, there are 32 true positives, meaning 32 diabetic patients were correctly classified as diabetic.

2. False Positives (FP): The number of instances that are incorrectly predicted as positive but actually belong to the negative class. In this case, there are 13 false positives, meaning 13 non-diabetic patients were incorrectly classified as diabetic.

3. True Negatives (TN): The number of instances that are correctly predicted as negative. In this case, there are 94 true negatives, meaning 94 non-diabetic patients were correctly classified as non-diabetic.

4. False Negatives (FN): The number of instances that are incorrectly predicted as negative but actually belong to the positive class. In this case, there are 15 false negatives, meaning 15 diabetic patients were incorrectly classified as non-diabetic.

Based on these values, we can calculate various evaluation metrics to assess the performance of the classification model:

1. Accuracy: It is the overall correctness of the model's predictions and is calculated as (TP + TN) / (TP + FP + TN + FN). In this case, accuracy = (32 + 94) / (32 + 13 + 94 + 15) = 126 / 154 ≈ 0.8182 or 81.82%, which matches the `accuracy_score` output.

2. Precision: It measures the accuracy of the positive predictions and is calculated as TP / (TP + FP). In this case, precision = 32 / (32 + 13) ≈ 0.7111 or 71.11%.

3. Recall (Sensitivity): It measures the ability of the model to correctly identify positive instances and is calculated as TP / (TP + FN). In this case, recall = 32 / (32 + 15) ≈ 0.6809 or 68.09%.

4. F1-Score: It is the harmonic mean of precision and recall and is given by 2 * (Precision * Recall) / (Precision + Recall). In this case, F1-Score = 2 * (0.7111 * 0.6809) / (0.7111 + 0.6809) ≈ 0.6957 or 69.57%, which matches the `f1_score` output.

Our model has a reasonably good accuracy of about 82%, but its precision (about 71%) and recall (about 68%) on the diabetic class are noticeably lower: it misses 15 of the 47 diabetic patients in the test set. Further analysis and comparison with other models may be necessary for a comprehensive evaluation.