# Objective

* Develop a model using KNeighborsClassifier to predict whether a person is diabetic or not

KNN (k-nearest neighbors) predicts the class of a data point from the classes of the closest data points in a multidimensional feature space. In other words, it assigns a new sample to the class held by the majority of its k nearest neighbors.

**Common use cases:**
- Data classification (example: predicting whether an e-mail is spam or not).

- Pattern recognition (example: image or text recognition).

- Recommendation systems (example: recommending similar products).
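
The majority-vote idea described above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the scikit-learn implementation used later in the notebook; the function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, label 1 = diabetic, 0 = not
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, (0.2, 0.2), k=3))    # 0
print(knn_predict(train_points, train_labels, (0.85, 0.85), k=3))  # 1
```

In the second call, two of the three nearest neighbours carry label 1, so the vote returns 1 even though one neighbour belongs to the other class.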

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing dataset
dataset = pd.read_csv('diabetes.csv')
print(dataset.shape)

(768, 9)


In [3]:
# Preview data
dataset.head(3)

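| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |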


In [4]:
# Features data-type
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
dataset.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [None]:

dataset.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [7]:
# Work on a copy so the original DataFrame stays unchanged
dataset_new = dataset.copy()

In [None]:
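# A zero in these columns is not a plausible measurement, so mark it as missing
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
dataset_new[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]] = dataset_new[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]].replace(0, np.nan)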

In [None]:
dataset_new.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Replace the missing values with the column mean

In [None]:
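# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not modify the DataFrame
dataset_new["Glucose"] = dataset_new["Glucose"].fillna(dataset_new["Glucose"].mean())
dataset_new["BloodPressure"] = dataset_new["BloodPressure"].fillna(dataset_new["BloodPressure"].mean())
dataset_new["SkinThickness"] = dataset_new["SkinThickness"].fillna(dataset_new["SkinThickness"].mean())
dataset_new["Insulin"] = dataset_new["Insulin"].fillna(dataset_new["Insulin"].mean())
dataset_new["BMI"] = dataset_new["BMI"].fillna(dataset_new["BMI"].mean())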
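
To make the two cleaning steps concrete, here is a minimal plain-Python sketch of what they do to a single column. It uses the first three SkinThickness values from the preview above; the mean is taken over these three rows only, which is an assumption made for illustration (the notebook uses the mean of the whole column).

```python
# First three SkinThickness values shown by dataset.head(3); 0 marks a missing measurement
skin_thickness = [35, 29, 0]

# Step 1: turn the placeholder zeros into explicit missing values
with_missing = [value if value != 0 else None for value in skin_thickness]

# Step 2: compute the mean over the observed values only
observed = [value for value in with_missing if value is not None]
mean_observed = sum(observed) / len(observed)

# Step 3: fill the gaps with that mean
imputed = [value if value is not None else mean_observed for value in with_missing]

print(imputed)  # [35, 29, 32.0]
```

Because the zeros are excluded before the mean is computed, they do not drag the fill value down, which is why the zeros are converted to missing values first.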

In [None]:
dataset_new.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 121.686763 | 30.435949 | 44.0 | 99.75 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 72.405184 | 12.096346 | 24.0 | 64.0 | 72.202592 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 29.15342 | 8.790942 | 7.0 | 25.0 | 29.15342 | 32.0 | 99.0 |
| Insulin | 768.0 | 155.548223 | 85.021108 | 14.0 | 121.5 | 155.548223 | 155.548223 | 846.0 |
| BMI | 768.0 | 32.457464 | 6.875151 | 18.2 | 27.5 | 32.4 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


* Scale the data to a common range using MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
dataset_scaled = sc.fit_transform(dataset_new)

In [13]:
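# Keep the original column names so the scaled columns remain identifiable
dataset_scaled = pd.DataFrame(dataset_scaled, columns=dataset_new.columns)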
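
With `feature_range = (0, 1)`, min-max scaling maps each column through x' = (x - min) / (max - min). The sketch below applies that formula by hand to the Glucose range reported by `describe()` after imputation (min 44, max 199); the helper name `min_max_scale` is hypothetical and only illustrates the arithmetic.

```python
def min_max_scale(value, col_min, col_max):
    """Map `value` from [col_min, col_max] onto [0, 1]."""
    return (value - col_min) / (col_max - col_min)


# Glucose range after imputation, as reported by describe(): min 44, max 199
glucose_min, glucose_max = 44.0, 199.0

for glucose in (44, 148, 199):
    print(glucose, round(min_max_scale(glucose, glucose_min, glucose_max), 3))
# 44 0.0
# 148 0.671
# 199 1.0
```

Putting every feature on the same 0 to 1 range matters for KNN because the distance computation would otherwise be dominated by the columns with the largest raw values, such as Insulin.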

* Use the Glucose, Insulin, BMI, and Age columns to predict whether a person is diabetic or not

In [None]:
X = dataset_scaled.iloc[:, [1, 4, 5, 7]].values
Y = dataset_scaled.iloc[:, 8].values

* Split the dataset into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42, stratify = dataset_new['Outcome'] )

In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

X_train shape: (614, 4)
X_test shape: (154, 4)
Y_train shape: (614,)
Y_test shape: (154,)


* Initialize the KNeighborsClassifier and train it

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 24, metric = 'minkowski', p = 2)
knn.fit(X_train, Y_train)

- n_neighbors=24: the number of neighbors (k) considered for classification. Here the model looks at the 24 closest points to make its decision.

- metric='minkowski': the metric used to compute the distance between points. The Minkowski distance generalizes the Euclidean and Manhattan distances.

- p=2: with p=2, the Minkowski distance reduces to the Euclidean distance (the "ordinary" straight-line distance). With p=1 it would be the Manhattan distance (distance measured in "blocks", as on a grid).
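
The effect of `p` can be checked with a short sketch of the Minkowski formula, d(a, b) = (sum of |a_i - b_i|^p)^(1/p). The function `minkowski_distance` below is a hypothetical helper written for this illustration, not a scikit-learn call.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan: 3 + 4)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```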

* Make predictions on the test set

In [18]:
Y_pred_knn = knn.predict(X_test)

* Evaluate the model

In [19]:
from sklearn.metrics import accuracy_score
accuracy_knn = accuracy_score(Y_test, Y_pred_knn)

In [20]:
print("The accuracy of K Nearest Neighbors is: " + str(accuracy_knn * 100))

The accuracy of K Nearest Neighbors is: 78.57142857142857


* Generate the classification report

In [22]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred_knn))

              precision    recall  f1-score   support

         0.0       0.81      0.87      0.84       100
         1.0       0.72      0.63      0.67        54

    accuracy                           0.79       154
   macro avg       0.77      0.75      0.76       154
weighted avg       0.78      0.79      0.78       154
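
The figures in the report above can be reproduced by hand from the confusion counts. The notebook does not print those counts; the values below are derived from the reported recalls and supports (87 of 100 for class 0, 34 of 54 for class 1), so treat them as a reconstruction rather than notebook output.

```python
# Confusion counts implied by the report above (derived, not printed by the notebook)
tn, fp = 87, 13   # true class 0: predicted 0 / predicted 1
fn, tp = 20, 34   # true class 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of all predicted diabetics, how many really are
recall_1 = tp / (tp + fn)      # of all real diabetics, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.4f}")     # 0.7857
print(f"precision 1 = {precision_1:.2f}")  # 0.72
print(f"recall 1    = {recall_1:.2f}")     # 0.63
print(f"f1 1        = {f1_1:.2f}")         # 0.67
```

The recall of 0.63 for class 1 means that roughly one diabetic patient in three is missed, which the overall accuracy alone does not show.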



* Save the model

In [23]:
import joblib

# Save the trained model to a file
joblib.dump(knn, 'knn_model.pkl')


['knn_model.pkl']
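
A saved model is typically reloaded with `joblib.load`. The sketch below is a self-contained round trip under stated assumptions: it trains a small stand-in classifier on hypothetical data and writes it to a temporary file named `knn_demo.pkl`, so it does not touch the `knn_model.pkl` created above.

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the trained model: four scaled features per row
X_demo = [[0.1, 0.1, 0.2, 0.1], [0.2, 0.1, 0.1, 0.2],
          [0.8, 0.9, 0.7, 0.8], [0.9, 0.8, 0.9, 0.7]]
y_demo = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_demo.pkl")
    joblib.dump(model, path)      # same call as in the cell above
    restored = joblib.load(path)  # rebuild the estimator from disk

# The restored model should give the same predictions as the original
new_patient = [[0.85, 0.8, 0.8, 0.75]]
print(restored.predict(new_patient))
```

Since the notebook's classifier was trained on min-max scaled features, any new input passed to a reloaded model would presumably need the same scaling applied first.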