<a href="https://colab.research.google.com/github/prsdm/Diabetes-Detection/blob/main/Diabetes_predict_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Diabetes Detection Using TensorFlow and Keras 💊🏥🩺

This notebook will walk you through the process of creating a classification model using TensorFlow to predict if a person has diabetes or not, based on a dataset. You'll learn how to import data, preprocess features, build a machine learning model, and assess its accuracy.

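The diabetes dataset comes from Kaggle, where it is available at '/kaggle/input/diabetes-prediction-dataset/diabetes_prediction_dataset.csv'. The code in this notebook reads a copy of that CSV file from the working directory.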

Let's get started!

In [1]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout


- **sklearn.preprocessing.LabelEncoder** is used to encode categorical labels into numeric values


- **sklearn.preprocessing.StandardScaler** is used to standardize the features (zero mean, unit variance)


- **sklearn.model_selection.train_test_split** is used to split the dataset into training and test sets


- **tensorflow** is a popular framework for building and training machine learning models


- **tensorflow.keras.Sequential** is a sequential model where the layers are linearly stacked


- **tensorflow.keras.layers.Dense** defines a dense (fully connected) layer of the neural network


- **tensorflow.keras.layers.Dropout** is used to add dropout layers to avoid overfitting

In [2]:
# Loading the dataset
df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()

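|   | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|--------|-----|--------------|---------------|-----------------|-----|-------------|---------------------|----------|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |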


In [63]:
# Counting the categories in the 'smoking_history' column
df['smoking_history'].value_counts()

No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: smoking_history, dtype: int64

In [3]:
# Showing the information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


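**df.info()** gives information about the dataset such as the number of rows and columns, the data type of each column, and the number of non-null values in each column. Here every column has 100000 non-null values out of 100000 rows, so the dataset has no missing values.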
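Since `df.info()` reports non-null counts rather than missing values directly, an explicit check can be easier to read. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and only stands in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with two deliberately missing entries
toy = pd.DataFrame({
    'age': [80.0, np.nan, 28.0],
    'bmi': [25.19, 27.32, np.nan],
    'diabetes': [0, 0, 1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the real dataset should report 0 for every column, in line with the non-null counts shown by `df.info()`.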

In [4]:
# Showing the statistical description of the dataset
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| age | 100000.0 | 41.885856 | 22.51684 | 0.08 | 24.0 | 43.0 | 60.0 | 80.0 |
| hypertension | 100000.0 | 0.07485 | 0.26315 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| heart_disease | 100000.0 | 0.03942 | 0.194593 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| HbA1c_level | 100000.0 | 5.527507 | 1.070672 | 3.5 | 4.8 | 5.8 | 6.2 | 9.0 |
| blood_glucose_level | 100000.0 | 138.05806 | 40.708136 | 80.0 | 100.0 | 140.0 | 159.0 | 300.0 |
| diabetes | 100000.0 | 0.085 | 0.278883 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


**df.describe().T** calculates descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column of the dataset. The `.T` transposes the result so that each row corresponds to one column of the dataset.

In [6]:
df['diabetes'].value_counts()

0    91500
1     8500
Name: diabetes, dtype: int64

The data is imbalanced, as the number of people who have diabetes (8.5%) is significantly less compared to the number of people who don't have diabetes (91.5%).
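To see why this imbalance matters, consider a trivial classifier that always predicts "no diabetes". The sketch below uses only the class counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_negative = 91500  # diabetes == 0
n_positive = 8500   # diabetes == 1
total = n_negative + n_positive

# Accuracy of a trivial classifier that always predicts the majority class (0)
baseline_accuracy = n_negative / total
# Share of actual diabetes cases such a classifier would detect
baseline_recall = 0 / n_positive

print(f'Baseline accuracy: {baseline_accuracy:.3f}')
print(f'Baseline recall:   {baseline_recall:.3f}')
```

Such a model would reach 91.5% accuracy without detecting a single diabetes case, which is why accuracy on the raw, imbalanced data can be misleading.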

In [15]:
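# Undersampling the majority class (no diabetes) to the size of the minority class
df_no_diabetes = df[df['diabetes'] == 0]
df_has_diabetes = df[df['diabetes'] == 1]
df_undersampling = df_no_diabetes.sample(df_has_diabetes.shape[0])
df_undersampling['diabetes'].value_counts()

0    8500
Name: diabetes, dtype: int64

In [17]:
# Combining the undersampled majority class with all minority-class rows
df = pd.concat([df_undersampling, df_has_diabetes], axis=0)
df['diabetes'].value_counts()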

0    8500
1    8500
Name: diabetes, dtype: int64

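In this part of the code, we undersample the majority class so that the dataset contains the same number of people with and without diabetes (8500 each). Because `sample` is called without a `random_state`, a different subset of the majority class is drawn on each run.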
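If reproducibility matters, the same idea can be wrapped in a small helper with a fixed seed. This is a sketch rather than the notebook's own method: the `undersample` function, its `random_state` default, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def undersample(frame, target, random_state=0):
    """Downsample every class to the size of the smallest class."""
    smallest = frame[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Hypothetical toy data: 9 negatives, 3 positives
toy = pd.DataFrame({'x': range(12), 'diabetes': [0] * 9 + [1] * 3})
balanced = undersample(toy, 'diabetes')
print(balanced['diabetes'].value_counts())
```

Undersampling discards most of the majority-class rows; class weights or oversampling are common alternatives when that loss of data is a concern.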

In [18]:
# Label encoding the 'gender' and 'smoking_history' columns
df[['gender', 'smoking_history']] = df[['gender', 'smoking_history']].apply(LabelEncoder().fit_transform)

Here, we use LabelEncoder, a class in the sklearn.preprocessing module, to turn the categorical 'gender' and 'smoking_history' columns of the df dataframe into numerical values. This transformation is necessary because neural networks, like many machine learning algorithms, only work with numerical data. Each category is replaced by an integer, assigned in sorted order of the category names.
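It can help to inspect which integer each category receives. The sketch below fits a LabelEncoder on the six `smoking_history` categories listed earlier and prints the resulting mapping; the `mapping` dictionary is just an illustrative helper.

```python
from sklearn.preprocessing import LabelEncoder

# The six smoking_history categories that appear in the dataset
categories = ['No Info', 'never', 'former', 'current', 'not current', 'ever']

encoder = LabelEncoder().fit(categories)
# classes_ holds the sorted categories; the position is the assigned integer
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

The integers follow the sort order of the strings and carry no medical meaning, so the model may treat them as an ordering that does not really exist. One-hot encoding is a common alternative for this reason.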

In [19]:
# Splitting the dataset into features and target
X = df.drop('diabetes', axis = 1)
y = df['diabetes']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Here, we are splitting the data into training and testing sets using the train_test_split function of the sklearn.model_selection module. This function splits the data at a specific ratio (in this case, 80% for training and 20% for testing) and ensures that the split is done randomly, using the value of random_state to control randomness.

X contains all the columns of the dataframe except the 'diabetes' column, which is the target variable we want to predict. y contains only the 'diabetes' column. X_train and y_train are used to train the model, while X_test and y_test are used to evaluate its performance.
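A random split does not guarantee the same class ratio in both parts. As a hedged variation on the call above, `train_test_split` accepts a `stratify` argument that preserves the label proportions; the toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 8 features, perfectly balanced labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))
y_toy = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print('Positives in test set:', int(y_te.sum()))
```

With stratification, the 20-row test set contains exactly 10 positives and 10 negatives, mirroring the balance of the full data.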

In [20]:
# Standardizing the data before training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Here, we use the StandardScaler from the sklearn.preprocessing module to standardize the numerical data. Standardization is a common technique when preparing data for training machine learning models. It transforms each feature so that its mean is 0 and its standard deviation is 1, ensuring that all features have the same scale. This is important because many machine learning algorithms, including neural networks, are sensitive to the scale of the data.

First, we create an instance of StandardScaler called scaler. The fit_transform method computes the standardization statistics (mean and standard deviation) from the X_train training set and applies them to it in one step. The transform method then applies those same statistics to X_test. This way both sets are standardized identically, and no information from the test set leaks into the preprocessing.
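The same computation can be written out by hand to show what the scaler does. This is a NumPy sketch with made-up rows (two features only), not a replacement for the cell above.

```python
import numpy as np

# Hypothetical toy features (age, bmi) split into train and test rows
train = np.array([[80.0, 25.19], [54.0, 27.32], [28.0, 27.32], [36.0, 23.45]])
test = np.array([[76.0, 20.14]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
```

The scaled test rows generally do not have mean 0 and standard deviation 1, which is expected: they are expressed in units of the training distribution.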

In [27]:
# Building the model using Keras sequential API
model = Sequential([
    Dense(32, activation = 'relu', input_shape = (X_train.shape[1],)),
    Dense(16, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

In this piece of code, we create a neural network with the Keras Sequential API. The model is a stack of three dense layers:

- **Dense(32, activation='relu', input_shape=(X_train.shape[1],)):** The first layer has 32 units (neurons) and the ReLU activation function. Because it is the first layer of the model, we specify the input shape `(X_train.shape[1],)`, that is, one value per feature column (8 here).


- **Dense(16, activation='relu'):** The second layer has 16 units and the ReLU activation function. There is no need to specify the input shape, as the output of the previous layer is used as input.


- **Dense(1, activation='sigmoid'):** The output layer has a single neuron with a sigmoid activation function. It produces a value between 0 and 1 that can be read as the predicted probability of diabetes; thresholding this value gives the binary class (0 or 1).

Dropout is imported at the top of the notebook but is not used in this version of the model. A Dropout layer (for example `Dropout(0.1)`) could be placed between the dense layers as a regularization technique: it randomly deactivates a fraction of units during training, which helps prevent overfitting.
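To make the layer stack in the model cell concrete, here is a NumPy sketch of one forward pass through an 8 → 32 → 16 → 1 network. The weights are random placeholders, not the trained model's values, and the helper names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features = 8  # number of input columns after dropping 'diabetes'

# Random placeholder weights; a trained model would hold learned values
W1, b1 = rng.normal(size=(n_features, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(16, 1)), np.zeros(1)

x = rng.normal(size=(4, n_features))  # a batch of 4 standardized rows

h1 = relu(x @ W1 + b1)      # Dense(32, activation='relu')
h2 = relu(h1 @ W2 + b2)     # Dense(16, activation='relu')
p = sigmoid(h2 @ W3 + b3)   # Dense(1, activation='sigmoid')

print(p.shape)
```

Each row of the batch ends up as a single number between 0 and 1, which is what the sigmoid output layer of the Keras model produces as well.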

In [28]:
# Compiling the model using 'adam' optimizer and 'binary_crossentropy' loss function
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 32)                288       
                                                                 
 dense_7 (Dense)             (None, 16)                528       
                                                                 
 dense_8 (Dense)             (None, 1)                 17        
                                                                 
Total params: 833 (3.25 KB)
Trainable params: 833 (3.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


After creating the model, we need to compile it before training. The first line sets the compile options:

- **loss='binary_crossentropy':** We use the binary cross entropy as the loss function. This loss function is suitable for binary classification problems, where we are trying to predict one of two classes.


- **optimizer='adam':** The Adam optimizer will be used to adjust the model weights during training. Adam is a popular optimization algorithm that extends stochastic gradient descent with adaptive, per-parameter learning rates.


- **metrics=['accuracy']:** In addition to the loss function, we also want to track the accuracy metric during model training and evaluation. Accuracy is a common measure for evaluating classification model performance.

The second line prints a model summary, which displays the architecture of the neural network in tabular form: the type and output shape of each layer, the number of parameters per layer, and the total number of trainable and non-trainable parameters.
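The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch, where `dense_params` is a hypothetical helper and 8 is the number of feature columns:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 8  # input columns after dropping 'diabetes'
layer_sizes = [32, 16, 1]

counts = []
n_inputs = n_features
for n_units in layer_sizes:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units

print(counts)       # [288, 528, 17]
print(sum(counts))  # 833
```

These values match the `Param #` column and the total of 833 in the summary above.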

In [29]:
# Training the model with 20 epochs
model.fit(X_train, y_train, epochs = 20, batch_size = 16, validation_data = (X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7b346fb19420>

In this part of the code, we are training the neural network model. Here is an explanation of the different parts:

- **X_train** and **y_train** are the training data, where X_train contains the features (inputs) and y_train contains the corresponding labels (outputs). This data is used to adjust the model weights during training.


- **epochs** is the number of times the model will go through the entire training set. Each epoch consists of a cycle of going through the training data and adjusting the model weights.


- **batch_size** is the number of training examples used in a single iteration. The training set is divided into smaller batches and adjustment of model weights is performed after each batch.


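- **validation_data = (X_test, y_test)** specifies the data on which the loss and accuracy are evaluated at the end of each epoch. This data is not used to update the weights. X_test holds the test features and y_test the corresponding labels. Note that the test set doubles as the validation set here, so any tuning decision based on the validation metrics would make the final test score optimistic; a separate validation split avoids this.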
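The relationship between dataset size, batch size, and epochs can be worked out with simple arithmetic. The sketch below uses the numbers from this notebook (17000 balanced rows, an 80/20 split, batches of 16, 20 epochs); the variable names are illustrative.

```python
import math

n_rows = 17000        # balanced dataset: 8500 + 8500
test_size = 0.2
batch_size = 16
epochs = 20

n_train = round(n_rows * (1 - test_size))          # 13600 rows
steps_per_epoch = math.ceil(n_train / batch_size)  # 850 batches
total_updates = steps_per_epoch * epochs           # 17000 weight updates

print(n_train, steps_per_epoch, total_updates)
```

A smaller batch size means more weight updates per epoch, each computed from fewer examples.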

In [30]:
# Evaluating the model on test data
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test loss: {loss:.4f}')
print(f'Test accuracy: {accuracy:.4f}')

Test loss: 0.1917
Test accuracy: 0.9071


In this part of the code, we are evaluating the performance of the trained model using the test data. Here is an explanation of the different parts:

- **model.evaluate(X_test, y_test)** calculates the loss and accuracy of the model on the test data. Loss measures how far the predicted probabilities are from the true labels (lower is better), while accuracy is the proportion of test examples correctly classified by the model.


- **loss** is the loss calculated by the model on the test data.


- **accuracy** is the accuracy calculated by the model on the test data.


- **print(f'Test loss: {loss:.4f}')** prints the loss calculated on the test data, rounded to four decimal places.


- **print(f'Test accuracy: {accuracy:.4f}')** prints the accuracy calculated on the test data, rounded to four decimal places.

This information is useful for understanding the performance of the trained model and evaluating its ability to generalize to previously unseen data.
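To make the two reported numbers less abstract, both metrics can be computed by hand. This is a NumPy sketch on made-up labels and probabilities; the clipping constant `eps` is an assumption chosen here, not necessarily what Keras uses internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions that match the labels after thresholding."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.6, 0.8, 0.9])

print(f'loss: {binary_crossentropy(y_true, y_prob):.4f}')
print(f'accuracy: {accuracy(y_true, y_prob):.2f}')
```

In this toy case three of the four thresholded predictions match the labels, so the accuracy is 0.75, while the loss also penalizes how confident the one wrong prediction was.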

In [32]:
# Saving the model
model.save('diabetes_model')
#tf.keras.models.save_model(model, 'diabetes_model.hdf5')

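In this code, we save the trained model under the name 'diabetes_model' in the current directory using the model.save method. With the TensorFlow version used here, a path without a file extension is written in the SavedModel format, which is a directory rather than a single file. The commented-out line shows the alternative of saving to a single HDF5 file. Note that newer Keras versions (Keras 3) require the path to end in `.keras` or `.h5`.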

In [57]:
# Create a DataFrame with input values
input_data = pd.DataFrame({
    'gender': ['Male'],
    'age': [28],
    'hypertension': [0],
    'heart_disease': [0],
    'smoking_history': ['never'],
    'bmi': [27.32],
    'HbA1c_level': [5.7],
    'blood_glucose_level': [158],
})

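# Convert categorical columns to numerical using LabelEncoder
# NOTE: this fits a new encoder on the input row, so every value is encoded as 0;
# reuse the encoders fitted on the training data to get matching codes
input_data[['gender', 'smoking_history']] = input_data[['gender', 'smoking_history']].apply(LabelEncoder().fit_transform)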

# Standardize the input data
input_data_scaled = scaler.transform(input_data)

# Make predictions using the trained model
prediction_probability = model.predict(input_data_scaled)

# Convert probability to binary prediction
binary_prediction = (prediction_probability > 0.5).astype(int)

# Display the results
print("Prediction Probability:", prediction_probability)
print("Binary Prediction:", binary_prediction)


Prediction Probability: [[0.08983746]]
Binary Prediction: [[0]]


Let's go through each part of the code:

```python
input_data[['gender', 'smoking_history']] = input_data[['gender', 'smoking_history']].apply(LabelEncoder().fit_transform)
```
- Here, the 'gender' and 'smoking_history' columns are label-encoded using `LabelEncoder()`. Note that a new encoder is fitted on the input row itself, so with a single row every category is mapped to 0. These codes do not necessarily match the ones assigned during training; for consistent results, the encoders fitted on the training data should be reused.

```python
input_data_scaled = scaler.transform(input_data)
```

- The input data is standardized using the same `scaler` that was fitted during the training phase. This puts the features on the same scale as the training data.

```python
prediction_probability = model.predict(input_data_scaled)
```

- The trained model is used to make a prediction on the standardized input using the `predict` method. The result is the output of the sigmoid layer, a value between 0 and 1.

```python
binary_prediction = (prediction_probability > 0.5).astype(int)
```

- The predicted probability is converted into a binary prediction by applying a threshold (0.5 in this case). Values above the threshold are classified as 1, and values below or equal to the threshold are classified as 0.

```python
print("Prediction Probability:", prediction_probability)
print("Binary Prediction:", binary_prediction)
```

- Finally, the predicted probability and the binary prediction are displayed to the user. This provides the predicted outcome based on the input data.
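For the encoding step, the codes used at prediction time should match those used in training. One way to achieve this is to fit one encoder per column on the training data, keep it, and only call `transform` on new rows. The sketch below is an assumption about how this could be done, not the notebook's own code; the `encoders` dictionary and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical = ['gender', 'smoking_history']

# Hypothetical raw training rows (before any encoding)
train_raw = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'smoking_history': ['never', 'No Info', 'current', 'former'],
})

# Fit one encoder per column on the training data and keep it
encoders = {col: LabelEncoder().fit(train_raw[col]) for col in categorical}

new_row = pd.DataFrame({'gender': ['Male'], 'smoking_history': ['never']})

# Reuse the fitted encoders: transform only, never fit on new data
consistent = new_row.copy()
for col in categorical:
    consistent[col] = encoders[col].transform(new_row[col])

# For comparison: a fresh encoder fitted on the single row
refit = new_row[categorical].apply(LabelEncoder().fit_transform)

print(consistent.values.tolist())
print(refit.values.tolist())
```

The reused encoders give the codes learned from the training rows, whereas the freshly fitted encoder maps both values to 0. In practice the fitted encoders would be saved alongside the model and the scaler.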