# Multilabel Classification on a Diabetes Dataset

Goal statement: I use the diabetes_data.csv dataset from Kaggle.com [1] to predict whether a patient has diabetes, has had a stroke, and has high blood pressure, using Keras (the model below is built with `keras.models.Sequential`). Each of the three targets is binary, so this is a multilabel, binary classification problem.

In [36]:
# Libraries and imports
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import multilabel_confusion_matrix

In [2]:
# Read data
data = pd.read_csv(r'C:\Users\linda\OneDrive\Desktop\diabetes_data.csv')

# data.head()
# data.shape

I already created an artificial neural network model that keeps BMI as a continuous feature. It will be loaded for comparison against a model that uses BMI as categories.

Feature Engineering:
 - Cut the BMI feature into categories
  - BMI was converted into categories based on the CDC article "Defining Adult Overweight & Obesity" [2].
 - Convert the categories into One-Hot encoded features

In [None]:
# Cut BMI feature into categories
BMI_cuts = pd.cut(x=data['BMI'], bins = [0,18.5,24.9,29.9,34.9,39.9,np.inf],
             labels = list(range(6)))
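
The binning and one-hot steps listed above can be sketched on a few toy BMI values. This is only an illustration: the toy values, the category names, and the alternative bin edges (left-closed intervals at 18.5, 25, 30, 35, 40, following the CDC ranges) are assumptions of mine and are not what the notebook cell uses. The notebook's bins rely on the `pd.cut` default of right-closed intervals, so a BMI of exactly 18.5 lands in the underweight bin and 24.95 lands in the overweight bin. If BMI is always a whole number, as the sample values in this dataset suggest, both binnings should give the same categories.

```python
import numpy as np
import pandas as pd

toy_bmi = pd.Series([18.0, 18.5, 24.95, 26.0, 37.0, 45.0])  # made-up values

# Binning as in the notebook: right-closed intervals, integer labels 0..5
notebook_cat = pd.cut(toy_bmi, bins=[0, 18.5, 24.9, 29.9, 34.9, 39.9, np.inf],
                      labels=list(range(6)))

# Alternative (assumption): left-closed intervals with named labels
names = ["underW", "healthyW", "overW", "obsC1", "obsC2", "obsC3"]
named_cat = pd.cut(toy_bmi, bins=[0, 18.5, 25, 30, 35, 40, np.inf],
                   labels=names, right=False)

# One-hot encode; drop_first=True drops the first category (underW)
named_dummies = pd.get_dummies(named_cat, drop_first=True)

print(list(notebook_cat))
print(list(named_cat))
print(list(named_dummies.columns))
```

Passing the names directly as `labels` means the dummy columns are already readable, so no separate renaming step is needed.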




In [3]:
# Convert dataframe subsets to numpy arrays
X = data.drop(['Stroke','HighBP','Diabetes'], axis = 1).to_numpy() # Predictors
y = data[['Stroke','HighBP','Diabetes']].to_numpy() # 3 targets, binary


In [38]:
TARGET_LABELS = ['Stroke','HighBP','Diabetes']

Recall that the three targets indicate whether the patient has had a stroke, has high blood pressure, and has diabetes.

In [4]:
# Split into training and testing sets
# The training set will be further split into a validation set, next
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split training set into validation set
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]


# print(X_valid.shape)
# print(X_train.shape)
# print(y_valid.shape)
# print(y_train.shape)
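
As a sanity check on the split sizes, the arithmetic can be worked out by hand. This sketch assumes that `train_test_split` rounds the test set size up, and that `fit` uses the Keras default batch size of 32; the row count of 70692 is the length of the dataset printed further down.

```python
import math

n_samples = 70692                          # rows in the dataset
n_test = math.ceil(0.2 * n_samples)        # test_size=0.2, rounded up (assumption)
n_train_full = n_samples - n_test          # remaining rows before the validation split
n_valid = 5000                             # first 5000 rows become the validation set
n_train = n_train_full - n_valid           # rows actually used for training

batch_size = 32                            # Keras default when none is given
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_test, n_train_full, n_train, steps_per_epoch)
```

Under these assumptions the result is 1612 steps per epoch, which agrees with the `164/1612` progress counter in the training log.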

In [5]:
# Create a Sequential model
m = keras.models.Sequential([
 keras.Input(shape=(15,)),                     # One input per feature; shape given as a tuple
 keras.layers.Dense(3375, activation="relu"),  # 15^3 nodes
 keras.layers.Dense(225, activation="relu"),   # 15^2 nodes
 keras.layers.Dense(3, activation="sigmoid")   # One output node per label
])

In [6]:
# Model summary to inspect number of connections
m.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 3375)              54000     
_________________________________________________________________
dense_1 (Dense)              (None, 225)               759600    
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 678       
Total params: 814,278
Trainable params: 814,278
Non-trainable params: 0
_________________________________________________________________
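
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A short check, using only the layer sizes from the model above:

```python
# (number of inputs, number of units) for each Dense layer
layer_shapes = [(15, 3375), (3375, 225), (225, 3)]

# weights (n_in * n_out) plus biases (n_out) per layer
param_counts = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total_params = sum(param_counts)

print(param_counts, total_params)
```

Almost all of the parameters sit in the second Dense layer, between the 3375-unit and 225-unit layers.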


In [7]:
m.compile(loss="binary_crossentropy", # For binary classification
 optimizer="sgd",                     # SGD
 metrics=["accuracy"])                # Display accuracy
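
One caveat about the metric, stated as an assumption about Keras behavior rather than something verified in this notebook: when the output layer has more than one unit, the string "accuracy" may be resolved to categorical accuracy (comparing the argmax of each row), which is not meaningful for multilabel targets. Requesting "binary_accuracy" explicitly would avoid the ambiguity. The difference between the two can be shown with NumPy on made-up values:

```python
import numpy as np

# Made-up labels and predicted probabilities for 4 patients and 3 labels
y_true = np.array([[0, 1, 1],
                   [0, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.1, 0.8, 0.7],
                   [0.2, 0.3, 0.4],
                   [0.6, 0.9, 0.2],
                   [0.1, 0.4, 0.3]])

# Binary accuracy: threshold each label independently, average over all entries
binary_acc = np.mean((y_prob >= 0.5).astype(int) == y_true)

# Categorical accuracy: compare only the position of the largest value per row
categorical_acc = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1))

print(binary_acc, categorical_acc)
```

Here 11 of the 12 individual label decisions are correct, yet the argmax comparison agrees on only 2 of the 4 rows, so the two numbers can tell very different stories about the same predictions.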

In [8]:
history = m.fit(X_train, y_train, epochs=30,
                     validation_data=(X_valid, y_valid))

Epoch 1/30
Epoch 2/30
 164/1612 [==>...........................] - ETA: 7s - loss: 0.4846 - accuracy: 0.4173

KeyboardInterrupt: 

In [95]:
m.save('multilabel_clf_unengineered.h5')

m2 = keras.models.load_model('multilabel_clf_unengineered.h5')

In [None]:
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()

We see that the loss and accuracy converge without reaching a good solution. We will convert the `Age` and `BMI` features into categories after looking at predictions.

In [23]:
# Predictions, outputs probabilities
predictions_matrix_probs = m.predict(X_valid)

In [32]:
# Convert probabilities into binary decisions:
# 1 where the predicted probability is >= 0.5, else 0.
# The comparison is applied element-wise to the whole prediction matrix.
predictions_matrix_bin = (predictions_matrix_probs >= 0.5).astype(int)

In [50]:
# Print the confusion matrices, one for each label.
# scikit-learn expects (y_true, y_pred): rows are the true classes and
# columns are the predicted classes, i.e. [[TN, FP], [FN, TP]].
confusion_matrices = multilabel_confusion_matrix(y_valid, predictions_matrix_bin)
for label, matrix in zip(TARGET_LABELS, confusion_matrices):
    print(f"CONFUSION MATRIX for {label}:")
    print("____________________________")
    print(matrix, end='\n'*2)

CONFUSION MATRIX for Stroke:
____________________________
[[4701    0]
 [ 299    0]]

CONFUSION MATRIX for HighBP:
____________________________
[[1609  633]
 [1133 1625]]

CONFUSION MATRIX for Diabetes:
____________________________
[[1834  654]
 [ 886 1626]]
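
To read these matrices, it helps to compute a per-label accuracy, which is the sum of the diagonal divided by the number of validation rows. The sketch below copies the diagonal entries from the printed matrices; the diagonal does not depend on which argument is treated as the true labels.

```python
n_valid = 5000  # size of the validation set

# (correct negatives, correct positives) taken from the diagonal of each matrix
diagonals = {
    "Stroke": (4701, 0),
    "HighBP": (1609, 1625),
    "Diabetes": (1834, 1626),
}

per_label_accuracy = {
    label: (correct_neg + correct_pos) / n_valid
    for label, (correct_neg, correct_pos) in diagonals.items()
}

for label, acc in per_label_accuracy.items():
    print(f"{label}: {acc:.4f}")
```

The Stroke accuracy looks high only because the model never predicts a stroke: all 299 stroke cases in the validation set are missed, and the score equals the share of patients without a stroke. HighBP and Diabetes are predicted correctly for roughly two thirds of the patients.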



In [53]:
data['BMI']

0        26.0
1        26.0
2        26.0
3        28.0
4        29.0
         ... 
70687    37.0
70688    29.0
70689    25.0
70690    18.0
70691    25.0
Name: BMI, Length: 70692, dtype: float64

In [82]:
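# Cut BMI into six categories, then one-hot encode them.
# drop_first=True drops category 0 (underweight) as the reference level.
BMI_cuts = pd.cut(x=data['BMI'], bins = [0,18.5,24.9,29.9,34.9,39.9,np.inf],
             labels = list(range(6)))

BMI_dummies = pd.get_dummies(BMI_cuts, drop_first=True)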

In [91]:
data = pd.concat([data.drop('BMI', axis = 1), BMI_dummies], axis=1)

KeyError: "['BMI'] not found in axis"
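
The KeyError above most likely comes from running the cell a second time, after `BMI` had already been dropped from `data`. One way to make the step safe to re-run is to guard it. A minimal sketch on toy data, where `replace_with_dummies` is a hypothetical helper of mine and not part of pandas:

```python
import pandas as pd

def replace_with_dummies(df, column, dummies):
    """Hypothetical helper: swap `column` for its dummy columns.

    Returns `df` unchanged when `column` is already gone, so the
    step can be re-run without raising a KeyError.
    """
    if column not in df.columns:
        return df
    return pd.concat([df.drop(columns=column), dummies], axis=1)

# Toy stand-ins for the notebook's `data` and `BMI_dummies`
toy = pd.DataFrame({"Age": [4.0, 12.0], "BMI": [26.0, 31.0]})
toy_dummies = pd.DataFrame({"overW": [1, 0], "obsC1": [0, 1]})

once = replace_with_dummies(toy, "BMI", toy_dummies)
twice = replace_with_dummies(once, "BMI", toy_dummies)  # no error on the second run

print(list(once.columns))
```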

In [93]:
# Rename the BMI dummy columns for clarity.
# Category 0 (underweight) was dropped by drop_first=True, so only 1-5 remain.
# rename returns a new DataFrame, so the result is assigned back.
data = data.rename(columns = {1:'healthyW', 2:'overW', 3:'obsC1',
                              4:'obsC2', 5:'obsC3'})
data

Unnamed: 0,Age,Sex,HighChol,CholCheck,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,...,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes,healthyW,overW,obsC1,obsC2,obsC3
0,4.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,...,30.0,0.0,0.0,1.0,0.0,0,1,0,0,0
1,12.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0,1,0,0,0
2,13.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0,1,0,0,0
3,11.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,...,3.0,0.0,0.0,1.0,0.0,0,1,0,0,0
4,8.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70687,6.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,1,0
70688,10.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0,1,0,0,0
70689,13.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0,1,0,0,0
70690,11.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0,0,0,0,0


In [87]:
data2.rename(columns = {0:'Code-Name', 
                       1:'Weight in kgs'})

Unnamed: 0,Age,Sex,HighChol,CholCheck,BMI,Smoker,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,PhysHlth,DiffWalk,Stroke,HighBP,Diabetes,Weight in kgs,2,3,4,5
0,4.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,0.0,1.0,...,30.0,0.0,0.0,1.0,0.0,0,1,0,0,0
1,12.0,1.0,1.0,1.0,26.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0,1,0,0,0
2,13.0,1.0,0.0,1.0,26.0,0.0,0.0,1.0,1.0,1.0,...,10.0,0.0,0.0,0.0,0.0,0,1,0,0,0
3,11.0,1.0,1.0,1.0,28.0,1.0,0.0,1.0,1.0,1.0,...,3.0,0.0,0.0,1.0,0.0,0,1,0,0,0
4,8.0,0.0,0.0,1.0,29.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70687,6.0,0.0,1.0,1.0,37.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0,0,0,1,0
70688,10.0,1.0,1.0,1.0,29.0,1.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0,1,0,0,0
70689,13.0,0.0,1.0,1.0,25.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0,1,0,0,0
70690,11.0,0.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0,0,0,0,0


References:

[1] https://www.kaggle.com/datasets/prosperchuks/health-dataset

[2] https://www.cdc.gov/obesity/basics/adult-defining.html