
# **Diabetes Prediction with KNN**
**Description:** This project uses a diabetes dataset (768 records, eight clinical features) to predict the presence of diabetes with the K-Nearest Neighbors (KNN) classification algorithm. The workflow covers data exploration, an 80/20 split into training and test sets, and feature standardization. A KNN classifier with five neighbors is trained, and its predictions are evaluated with accuracy and a confusion matrix. Rates derived from the confusion matrix, such as True Positive Rate, False Positive Rate, and Precision, show where the model succeeds and where it fails. Mean Squared Error, Precision Score, and Recall Score are also computed with scikit-learn.

**Key Steps:**

* Data Loading and Exploration
* Feature and Target Variable Separation
* Data Standardization with StandardScaler
* Model Training using K-Nearest Neighbors Classifier
* Prediction and Evaluation using Accuracy and Confusion Matrix
* Analysis of True Positive Rate, False Positive Rate, and Precision
* Mean Squared Error and Precision Score Calculation

**Results and Insights:**

* Accuracy: 82%
* Misclassification Rate: 18%
* True Positive Rate (Recall): 64%
* False Positive Rate: 10%
* True Negative Rate (Specificity): 90%
* Precision: 73%
* Prevalence: 31%

These figures come from the 154-record test set. The recall of 64% means the model misses roughly one in three diabetic cases, so accuracy alone overstates how well it would serve as a screening tool.

In [1]:
import pandas as pd
data=pd.read_csv('/content/drive/MyDrive/data science /Machine Learning /Copy of diabetes (1) (2) (1) (2).csv')

In [2]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [5]:
x=data.iloc[:,:8]
y=data.iloc[:,-1]

In [6]:
x.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [7]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

In [8]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [9]:
x_train.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
603,7,150,78,29,126,35.2,0.692,54
118,4,97,60,23,0,28.2,0.443,22
247,0,165,90,33,680,52.3,0.427,23
157,1,109,56,21,135,25.2,0.833,23
468,8,120,0,0,0,30.0,0.183,38


In [10]:
y_train.head()

603    1
118    0
247    0
157    0
468    1
Name: Outcome, dtype: int64

In [11]:
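from sklearn.preprocessing import StandardScaler
sd=StandardScaler()
# Fit on the training split and scale it with the training mean and variance.
sd.fit(x_train)
x_train=sd.transform(x_train)
# Note: this refits the scaler on the test split, so the test features are scaled
# with their own statistics. Fitting once on x_train and only calling
# sd.transform(x_test) is the usual practice.
sd.fit(x_test)
x_test=sd.transform(x_test)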
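
The cell above calls `fit` a second time on the test split, so the test features are scaled with their own mean and variance rather than the training statistics. A common alternative, sketched here on synthetic data (the arrays and the `_demo` names are illustrative assumptions, not the notebook's data), fits the scaler once on the training split and reuses it for the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: not the real diabetes data).
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
x_test_demo = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training split only.
x_train_scaled = scaler.fit_transform(x_train_demo)
# Reuse the training statistics for the test split; fit is not called again.
x_test_scaled = scaler.transform(x_test_demo)

print("train column means:", x_train_scaled.mean(axis=0))
print("test column means: ", x_test_scaled.mean(axis=0))
```

With this pattern the training columns have zero mean and unit variance, while the test columns generally do not, which is expected: they are measured on the scale the model was trained on.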

In [12]:
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=5)
model.fit(x_train,y_train)

In [13]:
y_pred=model.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [14]:
y_test

661    1
122    0
113    0
14     1
529    0
      ..
476    1
482    0
230    1
527    0
380    0
Name: Outcome, Length: 154, dtype: int64

In [15]:
from sklearn.metrics import accuracy_score
score=accuracy_score(y_test,y_pred)
score

0.8181818181818182

In [16]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
cm

array([[96, 11],
       [17, 30]])

**List of Rates**

---

From the confusion matrix above: TN = 96, FP = 11, FN = 17, TP = 30, for a total of 154 test records (107 actual no, 47 actual yes).

- **Accuracy**: Overall, how often is the classifier correct?
  - (TP+TN)/total = (30+96)/154 = 0.82
- **Misclassification Rate**: Overall, how often is it wrong?
  - (FP+FN)/total = (11+17)/154 = 0.18
  - equivalent to 1 minus Accuracy
  - also known as "Error Rate"
- **True Positive Rate**: When it's actually yes, how often does it predict yes?
  - TP/actual yes = 30/47 = 0.64
  - also known as "Sensitivity" or "Recall"
- **False Positive Rate**: When it's actually no, how often does it predict yes?
  - FP/actual no = 11/107 = 0.10
- **True Negative Rate**: When it's actually no, how often does it predict no?
  - TN/actual no = 96/107 = 0.90
  - equivalent to 1 minus False Positive Rate
  - also known as "Specificity"
- **Precision**: When it predicts yes, how often is it correct?
  - TP/predicted yes = 30/41 = 0.73
- **Prevalence**: How often does the yes condition actually occur in our sample?
  - actual yes/total = 47/154 = 0.31
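
The rates in this list can be computed directly from the confusion matrix printed by `confusion_matrix(y_test, y_pred)`. The sketch below hard-codes that matrix and relies on scikit-learn's layout for binary labels 0 and 1, `[[TN, FP], [FN, TP]]`; the `rates` dictionary and its key names are illustrative choices.

```python
import numpy as np

# Confusion matrix reported by the notebook: rows are actual, columns are predicted.
cm = np.array([[96, 11],
               [17, 30]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

rates = {
    "accuracy": (tp + tn) / total,
    "misclassification_rate": (fp + fn) / total,
    "true_positive_rate": tp / (tp + fn),
    "false_positive_rate": fp / (fp + tn),
    "true_negative_rate": tn / (tn + fp),
    "precision": tp / (tp + fp),
    "prevalence": (tp + fn) / total,
}

for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```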



In [17]:
from sklearn.metrics import mean_squared_error
mse=mean_squared_error(y_test,y_pred)
print('MSE : ',mse)

MSE :  0.18181818181818182
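
For 0/1 class labels, each squared difference is either 0 (correct) or 1 (wrong), so the MSE is the misclassification rate; the value 0.1818 above equals 1 minus the accuracy of 0.8182. A minimal sketch of that identity, using small hypothetical label arrays rather than the notebook's `y_test` and `y_pred`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical labels for illustration only (assumption: not the notebook's data).
y_true_demo = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 1, 0, 0, 1, 0, 0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Two of eight predictions are wrong, so both quantities are the error fraction.
print("MSE:", mse_demo)
print("1 - accuracy:", 1 - acc_demo)
```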


In [18]:
from sklearn.metrics import precision_score
prec=precision_score(y_test,y_pred)
print('Precision : %3f ' %prec)

Precision : 0.731707 


In [19]:
prec=precision_score(y_test,y_pred, labels=[0,1],average='micro')
print('Precision : %3f ' %prec)

Precision : 0.818182 
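
The two precision values differ because of the `average` argument. The default (`'binary'`) reports precision for the positive class only, while `'micro'` pools the decisions for both classes, which for single-label classification equals accuracy. The sketch below rebuilds label arrays that reproduce the confusion matrix `[[96, 11], [17, 30]]`; only the counts match the notebook, the row order is an assumption, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Label arrays with 96 TN, 11 FP, 17 FN, and 30 TP, in that order.
y_true_demo = np.array([0] * 96 + [0] * 11 + [1] * 17 + [1] * 30)
y_pred_demo = np.array([0] * 96 + [1] * 11 + [0] * 17 + [1] * 30)

# Positive class only: TP / (TP + FP).
binary = precision_score(y_true_demo, y_pred_demo)
# Pooled over both classes: all correct predictions / all predictions.
micro = precision_score(y_true_demo, y_pred_demo, average="micro")
# One precision value per label, in label order [0, 1].
per_class = precision_score(y_true_demo, y_pred_demo, average=None)
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(f"binary:   {binary:.6f}")
print(f"micro:    {micro:.6f}")
print(f"accuracy: {accuracy:.6f}")
print("per class:", per_class)
```

For an imbalanced medical dataset, the binary value or the per-class breakdown is usually the more informative figure, since the micro average hides how the positive class is handled.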


In [20]:
from sklearn.metrics import recall_score
rec=recall_score(y_test,y_pred,average='binary')
print('Recall : %3f ' %rec)

Recall : 0.638298 
