# Diabetes Feature Engineering and Modeling

## Contents

- Importing the libraries
- Data Preprocessing
    - Importing the dataset
    - Handling missing values
    - Creating features and target
    - Splitting the dataset into the Training set and Test set
    - Feature Scaling
- Building the ANN
    - Building the Initial Model (first model)
      - Training the model
      - Predicting the test result in the model
      - Making the Confusion Matrix and evaluating the model
      - Creating the KerasClassifier
      - Applying K-Fold Cross Validation
      - Applying Grid Search to find the best model and best parameters
    - Building the Best Model (second model)
      - Training the model with EarlyStopping
      - Evaluating the model
      - Predicting the test result in the model
      - Making Confusion Matrix and evaluating the model
    - Building the Improved Model (third model)
      - Applying KerasClassifier to the model
      - Applying K-Fold Cross Validation
      - Training the model with EarlyStopping and ReduceLROnPlateau
      - Making the Confusion Matrix and evaluating the model
- Conclusion Report

## Introduction

The dataset records 9 variables for each patient:

- Pregnancies: number of pregnancies
- Glucose: glucose level in the blood
- BloodPressure: blood pressure measurement
- SkinThickness: thickness of the skin
- Insulin: insulin level in the blood
- BMI: body mass index
- DiabetesPedigreeFunction: a score for the likelihood of diabetes based on family history (not a percentage)
- Age: age in years
- Outcome: the final result, where 1 is diabetic (yes) and 0 is non-diabetic (no)

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

### Importing the libraries

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from keras.models import Sequential
from keras.layers import Dense
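# Note: this wrapper is deprecated and absent from newer Keras releases; SciKeras provides a replacement.
from keras.wrappers.scikit_learn import KerasClassifier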
from keras.callbacks import EarlyStopping
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Dropout
from tensorflow.keras.callbacks import ReduceLROnPlateau
import warnings
warnings.filterwarnings('ignore')

## Data Preprocessing

### Importing the dataset

In [2]:
df = pd.read_csv('../data/diabetes.csv')

### Handling missing values

In [3]:
df['SkinThickness'] = df['SkinThickness'].replace(0, np.nan)
df['Insulin'] = df['Insulin'].replace(0, np.nan)
df.fillna(df.mean(), inplace=True)

In [4]:
df.head()

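| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35.0 | 155.548223 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29.0 | 155.548223 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 29.15342 | 155.548223 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |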
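
The imputation step above can be illustrated on a single column. This is a minimal sketch, assuming that a zero marks a missing reading; the helper name `impute_zeros_with_mean` and the sample values are hypothetical and not part of the notebook. It does not handle a column that consists only of zeros.

```python
import numpy as np

def impute_zeros_with_mean(column):
    """Treat zeros as missing and replace them with the mean of the non-zero entries."""
    col = np.asarray(column, dtype=float)
    missing = col == 0
    filled = col.copy()
    # The mean is taken over observed (non-zero) values only, which matches
    # replacing 0 with NaN and then filling with the NaN-skipping column mean.
    filled[missing] = col[~missing].mean()
    return filled

# Hypothetical skin-thickness readings; the third one is missing.
skin = [35, 29, 0, 23, 35]
imputed = impute_zeros_with_mean(skin)
print(imputed)
```

Note that the notebook computes the mean over the whole dataset before the train/test split; computing it on the training rows only would avoid leaking test-set information into the imputed values.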


### Creating features and target

In [5]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [6]:
print(X)

[[  6.    148.     72.    ...  33.6     0.627  50.   ]
 [  1.     85.     66.    ...  26.6     0.351  31.   ]
 [  8.    183.     64.    ...  23.3     0.672  32.   ]
 ...
 [  5.    121.     72.    ...  26.2     0.245  30.   ]
 [  1.    126.     60.    ...  30.1     0.349  47.   ]
 [  1.     93.     70.    ...  30.4     0.315  23.   ]]


In [7]:
print(y)

[1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0
 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1
 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0
 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 0
 1 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1
 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 1
 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 1 0 1
 0 1 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1
 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 

### Splitting the dataset into Training and Test set

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
df.shape

(768, 9)

In [10]:
X_train.shape, y_train.shape

((614, 8), (614,))

In [11]:
X_test.shape, y_test.shape

((154, 8), (154,))
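
The split above is random but not stratified, so the class balance in the test set can drift from that of the full dataset. The sketch below shows the `stratify` option of `train_test_split` on hypothetical data (`X_demo` and `y_demo` are made up for illustration); it is an optional variation, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with a 65/35 class balance.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 65 + [1] * 35)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset
```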

### Feature Scaling

In [12]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
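
To make the scaling step concrete, the sketch below reproduces by hand what standardization does: subtract the training mean and divide by the training standard deviation, then reuse those same statistics for the test data. The small arrays `X_tr` and `X_te` are hypothetical.

```python
import numpy as np

# Hypothetical train/test feature matrices (rows = patients, columns = features).
X_tr = np.array([[1.0, 100.0], [3.0, 140.0], [5.0, 180.0]])
X_te = np.array([[3.0, 120.0]])

mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std  # test data reuses the training statistics

print(X_tr_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_tr_scaled.std(axis=0))   # approximately 1 for every feature
print(X_te_scaled)
```

Fitting the scaler on the training set only, as the notebook does, keeps test-set information out of the preprocessing.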

## Building the ANN

### Building the Initial Model (first model)

In [13]:
def create_model(optimizer='adam', init='uniform'):
    initial_model = Sequential()
    initial_model.add(Dense(12, input_dim=X_train.shape[1], kernel_initializer=init, activation='relu'))
    initial_model.add(Dense(8, kernel_initializer=init, activation='relu'))
    initial_model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    initial_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return initial_model
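
As a quick check on the size of this network, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The sketch assumes 8 input features, matching the shape of `X_train` reported above; the helper `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 8 input features -> Dense(12) -> Dense(8) -> Dense(1)
layer_sizes = [8, 12, 8, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [108, 104, 9] 221
```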


In [14]:
initial_model = create_model()

#### Training the model

In [15]:
initial_model.fit(X_train, y_train, batch_size=10, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 78

<keras.callbacks.History at 0x17bae0390>

#### Predicting the test result in the model

In [16]:
y_pred = initial_model.predict(X_test)
y_pred = (y_pred > 0.5)



#### Making the Confusion Matrix and evaluating the model

In [17]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[81 18]
 [20 35]]


0.7532467532467533

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.82      0.81        99
           1       0.66      0.64      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154
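
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand from the matrix printed above, using scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
# Confusion matrix of the initial model: rows = true class, columns = predicted class.
cm = [[81, 18],
      [20, 35]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of all predicted diabetic, how many were diabetic
recall_1 = tp / (tp + fn)      # of all diabetic patients, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of 0.64 for class 1 means that 20 of the 55 diabetic patients in the test set were missed, which matters more in a screening setting than the overall accuracy suggests.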



#### Creating the KerasClassifier

In [19]:
# No epochs/batch_size are set, so each fit uses the Keras defaults (1 epoch, batch size 32).
initial_model = KerasClassifier(build_fn=create_model, verbose=0)

#### Applying K-Fold Cross Validation

In [20]:
# k-Fold Cross-Validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_val_score(initial_model, X_train, y_train, cv=kfold)
print("Accuracy: {:.2f} %".format(results.mean()*100))
print("Standard Deviation: {:.2f} %".format(results.std()*100))


Accuracy: 65.31 %
Standard Deviation: 0.34 %
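
A useful reference point for this score is the majority-class baseline, the accuracy of always predicting "no diabetes". The sketch below computes it from the test-set class counts in the classification report above (99 and 55); the training folds may have a slightly different balance, so this is only an approximate comparison.

```python
# Class counts taken from the "support" column of the classification report.
support = {0: 99, 1: 55}

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(majority_class, round(baseline_accuracy, 4))  # 0 0.6429
```

A cross-validation accuracy of 65.31% is close to this baseline. One plausible explanation, assuming the wrapper falls back to the Keras default of a single epoch when no `epochs` argument is given, is that each fold is barely trained and mostly predicts the majority class.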


#### Applying Grid Search to find the best model and best parameters

In [23]:
# Grid Search with Cross-Validation
parameters = [{'batch_size':[10,20,40], 'epochs':[50,100,150],'optimizer': ['SGD', 'RMSprop', 'Adam'], 'init' : ['uniform', 'normal', 'he_uniform']}]
grid_search = GridSearchCV(estimator=initial_model,
                          param_grid=parameters,
                          scoring='accuracy',
                          cv=kfold,
                          n_jobs= -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters: ", best_parameters)


Best Accuracy: 78.83 %
Best Parameters:  {'batch_size': 10, 'epochs': 150, 'init': 'he_uniform', 'optimizer': 'Adam'}
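
To get a sense of how expensive this search is, the sketch below enumerates the grid with the standard library and counts the model fits. The grid values are copied from the cell above; the counting itself is illustrative and separate from `GridSearchCV`.

```python
from itertools import product

param_grid = {
    'batch_size': [10, 20, 40],
    'epochs': [50, 100, 150],
    'optimizer': ['SGD', 'RMSprop', 'Adam'],
    'init': ['uniform', 'normal', 'he_uniform'],
}
n_folds = 5

# Every combination of one value per parameter is a candidate model.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(combinations)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 81 405
```

With that many candidates compared on 5 folds of 614 training rows, the winning score is an optimistic estimate, so a gap between the best cross-validation accuracy and the test accuracy is to be expected.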


### Building the Best Model (second model)

In [25]:
best_model = create_model(optimizer=best_parameters['optimizer'], init=best_parameters['init'])

#### Training the model with EarlyStopping

In [26]:
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = best_model.fit(X_train, y_train, validation_split=0.2, epochs=best_parameters['epochs'], batch_size=best_parameters['batch_size'], callbacks=[early_stopping], verbose=1)

Epoch 1/150
Epoch 2/150
...
Epoch 78
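
The patience rule used by early stopping can be sketched in a few lines. This is a simplified model of the behaviour, not the Keras implementation: it ignores options such as `min_delta`, and the function name and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-based, under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improve for 5 epochs, then plateau.
losses = [0.70, 0.65, 0.60, 0.58, 0.57] + [0.59] * 20
stop, best = early_stopping_epoch(losses, patience=10)
print(stop, best)  # 15 5
```

With `restore_best_weights=True`, as in the cell above, the model is rolled back to the weights from the best epoch rather than keeping those from the epoch where training stopped.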

#### Evaluating the model

In [27]:
loss, accuracy = best_model.evaluate(X_test, y_test)
print(f'Final model - Loss: {loss}, Accuracy: {accuracy}')

Final model - Loss: 0.5649195909500122, Accuracy: 0.7207792401313782


#### Predicting the test result in the model

In [28]:
y_best_pred = (best_model.predict(X_test) > 0.5)



#### Making the Confusion Matrix and evaluating the model

In [29]:
cm_best = confusion_matrix(y_test, y_best_pred)
print(cm_best)
accuracy_score(y_test, y_best_pred)

[[76 23]
 [20 35]]


0.7207792207792207

### Building the Improved Model (third model)

In [53]:

def create_improved_model(optimizer='adam', init='uniform'):
    model = Sequential()
    model.add(Dense(12, input_dim=X_train.shape[1], kernel_initializer=init, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(8, kernel_initializer=init, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
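
To show what a `Dropout(0.5)` layer does, the sketch below implements inverted dropout with NumPy: during training each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the input passes through unchanged. The function and the sample array are illustrative, not the Keras source.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of entries and rescale the rest."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 12))  # hypothetical output of a 12-unit layer

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))  # surviving units are scaled up, dropped units are 0
print(train_out.mean())      # stays close to the original mean of 1
```

A rate of 0.5 on layers with only 12 and 8 units removes half of an already small network at each step, so a lower rate may be worth trying if the model underfits.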

#### Applying KerasClassifier to the model

In [54]:
# Wrap the Keras model
keras_clf = KerasClassifier(build_fn=create_improved_model, 
                            optimizer=best_parameters['optimizer'], 
                            init=best_parameters['init'], 
                            epochs=best_parameters['epochs'], 
                            batch_size=best_parameters['batch_size'], 
                            verbose=0)

#### Applying K-Fold Cross Validation

In [56]:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_val_score(keras_clf, X_train, y_train, cv=kfold, scoring='accuracy')
print("CV Accuracy: {:.2f} %".format(cv_results.mean()*100))
print("CV Standard Deviation: {:.2f} %".format(cv_results.std()*100))

CV Accuracy: 75.40 %
CV Standard Deviation: 2.24 %


#### Training the Improved Model with EarlyStopping and ReduceLROnPlateau

In [57]:
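# Build the improved network; assumed to mirror how best_model was created (it is not instantiated above).
best_model_improved = create_improved_model(optimizer=best_parameters['optimizer'], init=best_parameters['init'])
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
# Adam starts at a learning rate of 0.001 by default, so min_lr=0.001 leaves no room for a reduction.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.001)
history = best_model_improved.fit(X_train, y_train, validation_split=0.2, epochs=best_parameters['epochs'], 
                         batch_size=best_parameters['batch_size'], callbacks=[early_stopping, reduce_lr], verbose=1)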

Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
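
The interaction between `factor` and `min_lr` is easy to miss, so here is a small sketch of a single reduction step. It is a simplified model of the scheduler's update rule rather than the Keras code, and it assumes training starts at 0.001, the default learning rate of the Adam optimizer in Keras.

```python
import math

def reduce_on_plateau(lr, factor, min_lr):
    """One reduction step: multiply by `factor`, but never go below `min_lr`."""
    if lr > min_lr:
        return max(lr * factor, min_lr)
    return lr

start_lr = 0.001  # assumed starting point (Adam default)

same = reduce_on_plateau(start_lr, factor=0.2, min_lr=0.001)
lower = reduce_on_plateau(start_lr, factor=0.2, min_lr=1e-5)

print(same)   # unchanged: the floor equals the starting learning rate
print(lower)  # reduced to one fifth of the starting learning rate
```

Under that assumption, `min_lr=0.001` means the scheduler in the cell above never lowers the learning rate; a smaller floor such as `1e-5` would let it take effect.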


#### Making the Confusion Matrix and evaluating the model

In [58]:
# Evaluate the model
y_best_pred = (best_model_improved.predict(X_test) > 0.5)
cm_best = confusion_matrix(y_test, y_best_pred)
print(cm_best)
print(accuracy_score(y_test, y_best_pred))

[[85 14]
 [21 34]]
0.7727272727272727
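
With all three confusion matrices available, the models can be compared on more than accuracy. The sketch below recomputes accuracy and per-class recall from the matrices printed in this notebook; the dictionary keys are just labels chosen for illustration.

```python
# Confusion matrices as printed above: rows = true class, columns = predicted class.
matrices = {
    "initial": [[81, 18], [20, 35]],
    "grid-search": [[76, 23], [20, 35]],
    "improved": [[85, 14], [21, 34]],
}

def summarize(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_0": tn / (tn + fp),
        "recall_1": tp / (tp + fn),
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, s in summary.items():
    print(f"{name:12s} acc={s['accuracy']:.3f} "
          f"recall_0={s['recall_0']:.3f} recall_1={s['recall_1']:.3f}")
```

The improved model gains its accuracy from fewer false positives on class 0, while recall for diabetic patients (class 1) is essentially unchanged across the three models.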


## Conclusion Report

The code provided implements several machine learning models to predict diabetes based on various patient metrics. Here's a step-by-step analysis of each model and the results obtained:

### Data Preprocessing

The dataset is loaded, and zeros in the 'SkinThickness' and 'Insulin' columns are treated as missing values and replaced with the column mean. This keeps every row usable without letting implausible zero readings distort those two features. The dataset is then split into features (X) and target (y): X contains all columns except the diabetes outcome, and y contains the outcome.

The data is then split into training and test sets using an 80-20 split (614 and 154 rows), so the model is trained on one portion and evaluated on another. The features are standardized with StandardScaler, which is fitted on the training set only and then applied to the test set; putting features on a common scale generally helps neural networks train.

### Building the ANN

A simple neural network is created using Keras. It is a stack of three Dense layers: two hidden layers with 12 and 8 ReLU units and a single sigmoid output unit. The model is compiled with the 'adam' optimizer and the 'binary_crossentropy' loss, which suit binary classification. It is trained on the training data for 100 epochs with a batch size of 10 and reaches an accuracy of 75.32% on the test data. The confusion matrix and classification report show higher precision for non-diabetic patients (class 0, 0.80) than for diabetic patients (class 1, 0.66).

Next, a KerasClassifier is created to enable scikit-learn's cross-validation and grid search. The model undergoes stratified 5-fold cross-validation on the training set to estimate how well it generalizes. The initial cross-validation accuracy is 65.31% with a standard deviation of 0.34%. This is well below the 75.32% test accuracy, most likely because the wrapper is created without an epochs argument and therefore trains each fold for only one epoch by default. The low spread suggests that every fold lands on nearly the same undertrained result rather than that the model is stable.

A grid search is then performed over batch size, epochs, optimizer, and weight initialization. The best combination is a batch size of 10, 150 epochs, 'he_uniform' initialization, and the 'Adam' optimizer, with a mean cross-validation accuracy of 78.83%. Using these parameters, the model is retrained with early stopping to limit overfitting. This second model achieves an accuracy of 72.08% on the test set.

The second model therefore scores lower on the test set than the first (72.08% versus 75.32%), even though both share the same architecture. Possible reasons include the 20% of the training data held out for validation during early stopping, variance from the single train/test split, and noise in the hyperparameter search; the gap amounts to 5 of the 154 test patients. This highlights the importance of cross-validation and careful interpretation of hyperparameter optimization results.

To further improve the model, an enhanced network is created with batch normalization and dropout after each hidden layer. These techniques are intended to stabilize training and reduce overfitting. The improved model is also wrapped in a KerasClassifier and evaluated using cross-validation, achieving a cross-validation accuracy of 75.40% with a standard deviation of 2.24%, so its accuracy varies by a couple of percentage points from fold to fold.

Finally, early stopping and a learning rate scheduler are used during training. The early stopping callback monitors the validation loss and stops training if it does not improve for 10 consecutive epochs, while the scheduler is configured to multiply the learning rate by 0.2 when the validation loss has not improved for 5 epochs. The improved model achieves an accuracy of 77.27% on the test set, the highest of the three. Its confusion matrix shows more correct predictions for class 0 (85, versus 81 and 76) and about the same number for class 1 (34, versus 35 for both earlier models).

In conclusion, the process iteratively refines the model by handling missing values, standardizing the data, experimenting with different neural network architectures, and using techniques like grid search, cross-validation, early stopping, and learning rate scheduling. The final model with batch normalization, dropout, and tuned hyperparameters reaches the best test accuracy (77.27% versus 75.32% for the initial model), a modest gain on a 154-patient test set, which demonstrates the importance of systematic model tuning and evaluation in machine learning.
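
When comparing accuracies measured on a test set of this size, it helps to estimate how much they could vary by chance. The sketch below uses the normal approximation to the binomial distribution, which is a rough but common choice; the counts come from the confusion matrices above.

```python
import math

n_test = 154
correct_improved = 119  # 85 + 34 from the improved model's confusion matrix
correct_initial = 116   # 81 + 35 from the initial model's confusion matrix

p = correct_improved / n_test
standard_error = math.sqrt(p * (1 - p) / n_test)
half_width = 1.96 * standard_error  # approximate 95% interval half-width

difference = (correct_improved - correct_initial) / n_test

print(f"accuracy={p:.4f} +/- {half_width:.4f}")
print(f"difference to initial model={difference:.4f}")
```

The difference between the initial and improved models is about 2 percentage points, well inside the roughly 6.6-point half-width, so a larger test set or repeated cross-validation would be needed to confirm that the improvement is real.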

