### Data Collection and Analysis
Dataset collected from Kaggle.
#### PIMA Diabetes Dataset

In [1]:
import pandas as pd
# loading the diabetes dataset into a pandas DataFrame
diabetes_dataset = pd.read_csv('diabetes.csv')

In [2]:
# printing the first 5 rows of the dataset
diabetes_dataset.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
# number of rows and columns in this dataset
diabetes_dataset.shape

(768, 9)

In [4]:
# getting the statistical measures of the data
diabetes_dataset.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [5]:
# counting the number of samples in each class
diabetes_dataset['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

0 --> Non-Diabetic

1 --> Diabetic

In [6]:
# mean of every feature, computed separately for each class
diabetes_dataset.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


### Setting Up the Classification Environment

In [7]:
from pycaret.classification import *

In [10]:
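# NOTE: PyCaret documents `preprocess` as a boolean flag and exposes feature scaling
# through `normalize=True`; the string 'scale' is kept here to match the output below.
clf_setup = setup(data=diabetes_dataset,
                  target='Outcome',
                  preprocess='scale',
                  pca=True)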

,Description,Value
0,Session id,4424
1,Target,Outcome
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,scale
9,Imputation type,simple


### Soft Voting Ensemble

A soft voting ensemble is an ensemble learning method for classification in which several base models are trained on the same dataset and each produces a probability estimate for every class.

To make a prediction, the ensemble averages the predicted probabilities for each class across all base models (optionally as a weighted average) and selects the class with the highest average probability.
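
The averaging step described above can be sketched in plain Python. This is only an illustration: the probabilities are made-up numbers, and `soft_vote` is a hypothetical helper, not part of PyCaret or of this notebook's pipeline.

```python
# Hypothetical class-probability outputs from three base models for two samples.
# Each inner list is [P(class 0), P(class 1)]; the numbers are made up.
model_probs = [
    [[0.60, 0.40], [0.30, 0.70]],  # model A
    [[0.45, 0.55], [0.20, 0.80]],  # model B
    [[0.40, 0.60], [0.55, 0.45]],  # model C
]


def soft_vote(model_probs, weights=None):
    """Average per-class probabilities across models and pick the most probable class."""
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models  # plain (unweighted) average
    total = sum(weights)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])

    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)

    # The predicted label is the index of the largest averaged probability.
    labels = [max(range(n_classes), key=lambda c: row[c]) for row in averaged]
    return averaged, labels


averaged, labels = soft_vote(model_probs)
weighted_avg, weighted_labels = soft_vote(model_probs, weights=[2, 1, 1])
print(labels)           # unweighted soft vote
print(weighted_labels)  # model A counted twice
```

With equal weights both samples are assigned class 1. Doubling the weight of model A flips the first sample to class 0, which shows how a weighted average lets a more trusted model dominate.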

In [11]:
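# train and rank the available classifiers, keeping the 5 best;
# 'ridge' is excluded (scikit-learn's RidgeClassifier has no predict_proba,
# which soft voting needs)
top5 = compare_models(n_select=5,
                      exclude=['ridge'])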

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7673,0.0,0.5509,0.709,0.6159,0.4549,0.4641,0.24
rf,Random Forest Classifier,0.7672,0.0,0.5772,0.701,0.6316,0.4642,0.4698,0.03
ada,Ada Boost Classifier,0.7671,0.0,0.5708,0.7098,0.6273,0.4616,0.4711,0.014
lda,Linear Discriminant Analysis,0.7654,0.0,0.5404,0.7099,0.6105,0.449,0.4587,0.004
xgboost,Extreme Gradient Boosting,0.7522,0.0,0.5763,0.6684,0.6165,0.4353,0.4395,0.014
gbc,Gradient Boosting Classifier,0.7466,0.0,0.5655,0.6609,0.6059,0.4215,0.4266,0.031
lightgbm,Light Gradient Boosting Machine,0.7465,0.0,0.5708,0.6596,0.6094,0.4236,0.4278,0.153
knn,K Neighbors Classifier,0.7394,0.0,0.5357,0.6637,0.5873,0.4008,0.4096,0.006
nb,Naive Bayes,0.7375,0.0,0.5564,0.64,0.5898,0.3999,0.4056,0.004
qda,Quadratic Discriminant Analysis,0.7374,0.0,0.5561,0.6448,0.5911,0.4008,0.4075,0.004



In [12]:
blend_soft = blend_models(estimator_list=top5, method='soft')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8148,0.0,0.6842,0.7647,0.7222,0.584,0.586
1,0.7963,0.0,0.5789,0.7857,0.6667,0.5248,0.5375
2,0.7222,0.0,0.4737,0.6429,0.5455,0.352,0.3605
3,0.7963,0.0,0.6842,0.7222,0.7027,0.5479,0.5484
4,0.7407,0.0,0.5263,0.6667,0.5882,0.4028,0.4088
5,0.7778,0.0,0.6316,0.7059,0.6667,0.5008,0.5025
6,0.7778,0.0,0.6842,0.6842,0.6842,0.5128,0.5128
7,0.717,0.0,0.4444,0.6154,0.5161,0.3234,0.332
8,0.7925,0.0,0.6667,0.7059,0.6857,0.531,0.5315
9,0.7925,0.0,0.5556,0.7692,0.6452,0.5038,0.5172



In [13]:
tuned = tune_model(blend_soft)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8148,0.0,0.6842,0.7647,0.7222,0.584,0.586
1,0.7963,0.0,0.5789,0.7857,0.6667,0.5248,0.5375
2,0.7407,0.0,0.5263,0.6667,0.5882,0.4028,0.4088
3,0.7963,0.0,0.6842,0.7222,0.7027,0.5479,0.5484
4,0.7407,0.0,0.5789,0.6471,0.6111,0.4176,0.419
5,0.7778,0.0,0.6316,0.7059,0.6667,0.5008,0.5025
6,0.7593,0.0,0.6316,0.6667,0.6486,0.4658,0.4661
7,0.717,0.0,0.4444,0.6154,0.5161,0.3234,0.332
8,0.7925,0.0,0.6667,0.7059,0.6857,0.531,0.5315
9,0.7925,0.0,0.5556,0.7692,0.6452,0.5038,0.5172



Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [14]:
pred = predict_model(tuned)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.7792,0.8402,0.6173,0.7143,0.6623,0.4996,0.5025


### Calibrating the Model

Calibrating a model means adjusting its predicted probabilities so that they better reflect the true probabilities of the outcomes.

Ideally, the probability a classifier assigns to a class represents how likely that outcome actually is: among samples given a predicted probability of 0.8, about 80% should belong to the positive class. In practice, predicted probabilities are often not well calibrated and can over- or understate the true likelihood.

Calibration makes the probabilities more reliable, so they can be interpreted meaningfully.
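
One way to check how well calibrated a set of probabilities is, sketched below under stated assumptions, is to group predictions into probability bins and compare the mean predicted probability in each bin with the observed fraction of positives. The labels and probabilities are made-up values, and `reliability_table` is a hypothetical helper, not a PyCaret function.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Compare mean predicted probability with the observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Equal-width bins over [0, 1]; a probability of exactly 1.0 goes in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((label, prob))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(l for l, _ in members) / len(members)
        rows.append((idx, len(members), mean_pred, observed))
    return rows


# Made-up true labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.7, 0.7, 0.9, 0.9]

rows = reliability_table(y_true, y_prob)
for idx, count, mean_pred, observed in rows:
    print(f"bin {idx}: n={count}, mean predicted={mean_pred:.2f}, observed={observed:.2f}")
```

For a well-calibrated model the two values in each row would be close. Large gaps, such as a bin with a mean prediction near 0.9 but an observed rate of 0.5, indicate overconfident probabilities.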

In [15]:
cali_model = calibrate_model(tuned)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8148,0.0,0.6842,0.7647,0.7222,0.584,0.586
1,0.8333,0.0,0.6316,0.8571,0.7273,0.6112,0.626
2,0.7222,0.0,0.4737,0.6429,0.5455,0.352,0.3605
3,0.7963,0.0,0.6842,0.7222,0.7027,0.5479,0.5484
4,0.7222,0.0,0.4737,0.6429,0.5455,0.352,0.3605
5,0.7593,0.0,0.5789,0.6875,0.6286,0.4524,0.4561
6,0.7593,0.0,0.6316,0.6667,0.6486,0.4658,0.4661
7,0.6981,0.0,0.4444,0.5714,0.5,0.2886,0.2933
8,0.7925,0.0,0.6667,0.7059,0.6857,0.531,0.5315
9,0.7736,0.0,0.5556,0.7143,0.625,0.4664,0.474



In [16]:
pred = predict_model(cali_model)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.7922,0.8409,0.6296,0.7391,0.68,0.5276,0.5313


### Saving the Trained Model

In [17]:
import pickle

In [18]:
filename = 'diabetes_model_new1.sav'
# use a context manager so the file handle is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(cali_model, f)
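
A saved file can later be read back with `pickle.load`. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for the trained model and a temporary path, both of which are assumptions made so the example runs on its own. Loading the real model would use the filename from the cell above and requires the libraries that define the model to be installed.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable Python object behaves the same way.
stand_in_model = {"name": "voting_classifier", "threshold": 0.5}

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_model.sav")

# Write the object to disk.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back into a new variable.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.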