# Multilabel Classification on a Diabetes Dataset

Goal statement: I use the diabetes_data.csv dataset from Kaggle.com [1] to perform multilabel classification by predicting whether the patient has diabetes, stroke, and high blood pressure using Keras's Functional API. This is a multilabel, binary classification problem.

In [1]:
# Libraries and imports
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import multilabel_confusion_matrix

In [2]:
# Read data
data = pd.read_csv(r'C:\Users\linda\OneDrive\Desktop\diabetes_data.csv')

# data.head()
# data.shape

I already created an Artificial Neural Network model that keeps the BMI as a continuous feature. It will be loaded for comparison against model with BMI as categories.

Feature Engineering:
 - Cut the BMI feature into categories
  - BMI was converted into categories based on the CDC article "Defining Adult Overweight & Obesity" [2].
 - Convert the categories into One-Hot encoded features
 - Add to new dataset and rename columns

In [3]:
# Cut BMI feature into categories for One-Hot encoding
BMI_cuts = pd.cut(x=data['BMI'], bins = [0,18.5,24.9,29.9,34.9,39.9,np.inf],
             labels = list(range(6)))

# One hot encoding with .get_dummies() function
BMI_dummies = pd.get_dummies(BMI_cuts, drop_first=True)


# Add the one-hot encoded features to the original dataset
# with the original BMI feature dropped
data = pd.concat([data.drop('BMI', axis = 1), BMI_dummies], axis=1)


# Renaming BMI weight for clarity
data.rename(columns = {0:'underW', 1:'healthyW', 2:'overW', 3:'obsC1',
                      4:'obsC1', 5:'obsC2'}, inplace = True)

# Create lists of target labels for usage later
TARGET_LABELS = ['Stroke','HighBP','Diabetes']

In [4]:
# Convert dataframe subsets to numpy arrays
X = data.drop(['Stroke','HighBP','Diabetes'], axis = 1).to_numpy() # Predictors
y = data[['Stroke','HighBP','Diabetes']].to_numpy()                # 3 targets, binary

In [5]:
# Split into training and testing sets
# Sets will by further split into validation, next
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split training set into validation set
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]


# print(X_valid.shape)
# print(X_train.shape)
# print(y_valid.shape)
# print(y_train.shape)

Because I'm creating and evaluating models, then engineering features, then repeating, the ANN hyperparameters will be stored in variables as the layer sizes will change accordingly to the number of features.

In [6]:
# Number of input features; the number of features for any observation
n = X[0].shape[0]


# Creating sequential model
m = keras.models.Sequential([
 keras.layers.InputLayer(n),                   # One input node per feature, n
 keras.layers.Dense(n^3, activation="relu"),   # n^3 nodes
 keras.layers.Dense(n^2, activation="relu"),   # n^2 nodes
 keras.layers.Dense(3, activation="sigmoid")   # one output node per label
])

In [7]:
# Model summary to inspect number of connections
m.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 3375)              54000     
_________________________________________________________________
dense_1 (Dense)              (None, 225)               759600    
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 678       
Total params: 814,278
Trainable params: 814,278
Non-trainable params: 0
_________________________________________________________________


In [8]:
m.compile(loss="binary_crossentropy", # For binary classification
 optimizer="sgd",                     # SGD
 metrics=["accuracy"])                # Display accuracy

In [9]:
history = m.fit(X_train, y_train, epochs=30,
                     validation_data=(X_valid, y_valid))

Epoch 1/30


ValueError: in user code:

    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\training.py:853 train_function  *
        return step_function(self, iterator)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\training.py:842 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1286 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2849 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:3632 _call_for_each_replica
        return fn(*args, **kwargs)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\training.py:835 run_step  **
        outputs = model.train_step(data)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\training.py:787 train_step
        y_pred = self(x, training=True)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\base_layer.py:1020 __call__
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    C:\Users\linda\anaconda3\envs\tf\lib\site-packages\keras\engine\input_spec.py:250 assert_input_compatibility
        raise ValueError(

    ValueError: Input 0 of layer sequential is incompatible with the layer: expected axis -1 of input shape to have value 15 but received input with shape (None, 19)


In [None]:
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()

We see that the loss and accuracy converge without reaching a good solution. We will convert the `Age` and `BMI` features into categories after looking at predictions.

In [None]:
# Predictions, outputs probabilities
predictions_matrix_probs = m.predict(X_valid)

In [None]:
# LIST COMPREHENSION
# to convert probabilities into binary decision
# 1 if the vector index is >= .5, else put 0
# For each vector in the prediction matrix
predictions_matrix_bin = np.array([[1 if i >= .5 else 0 for i in vector] \
                          for vector in predictions_matrix_probs])

In [None]:
# Printing the confusion matrices, one for each label
for i in range(len(TARGET_LABELS)):
    print(f"CONFUSION MATRIX for {TARGET_LABELS[i]}:")
    print("____________________________")
    print(multilabel_confusion_matrix(predictions_matrix_bin, y_valid)[i],end = '\n'*2)

In [None]:
m2 = keras.models.load_model('multilabel_clf_unengineered.h5')

References:

[1] https://www.kaggle.com/datasets/prosperchuks/health-dataset

[2] https://www.cdc.gov/obesity/basics/adult-defining.html