# Multilabel Classification on a Diabetes Dataset

Goal statement: I use the `diabetes_data.csv` dataset from Kaggle.com [1] to perform multilabel classification with Keras's Functional API, predicting whether a patient has diabetes, has had a stroke, and has high blood pressure. Each of the three targets is binary, so this is a multilabel, binary classification problem.

In [36]:
# Libraries and imports
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import multilabel_confusion_matrix

In [2]:
# Read data
data = pd.read_csv(r'C:\Users\linda\OneDrive\Desktop\diabetes_data.csv')

# data.head()
# data.shape

In [3]:
# Convert dataframe subsets to numpy arrays
X = data.drop(['Stroke','HighBP','Diabetes'], axis = 1).to_numpy() # Predictors
y = data[['Stroke','HighBP','Diabetes']].to_numpy() # 3 targets, binary


In [38]:
TARGET_LABELS = ['Stroke','HighBP','Diabetes']

Recall that the three targets indicate whether the patient has had a stroke, has high blood pressure, and has diabetes.

In [4]:
# Split into training and testing sets
# The training set is split further into a validation set below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split training set into validation set
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]


# print(X_valid.shape)
# print(X_train.shape)
# print(y_valid.shape)
# print(y_train.shape)
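Slicing off the first 5000 rows is a fair validation split only because `train_test_split` shuffles the rows by default. The sketch below illustrates the same three-way partition with NumPy alone; the helper `three_way_split` and the toy row count are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def three_way_split(n_rows, test_frac=0.2, n_valid=5000, seed=42):
    """Shuffle row indices once, then cut test, validation and training parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    test_idx = idx[:n_test]
    valid_idx = idx[n_test:n_test + n_valid]
    train_idx = idx[n_test + n_valid:]
    return train_idx, valid_idx, test_idx

# Toy example: 100 rows, 20% test, 10 validation rows (hypothetical sizes)
train_idx, valid_idx, test_idx = three_way_split(100, n_valid=10)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because every row index lands in exactly one partition, no validation or test row can leak into training.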

In [5]:
# Creating sequential model
m = keras.models.Sequential([
 keras.layers.InputLayer(input_shape=(15,)),   # One input node per feature
 keras.layers.Dense(3375, activation="relu"),  # 15^3 nodes
 keras.layers.Dense(225, activation="relu"),   # 15^2 nodes
 keras.layers.Dense(3, activation="sigmoid")   # one output node per label
])

In [6]:
# Model summary to inspect the number of parameters (weights and biases) per layer
m.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 3375)              54000     
_________________________________________________________________
dense_1 (Dense)              (None, 225)               759600    
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 678       
Total params: 814,278
Trainable params: 814,278
Non-trainable params: 0
_________________________________________________________________
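The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. A small sketch of that arithmetic (the helper name `dense_params` is mine):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths of the model above: 15 inputs -> 3375 -> 225 -> 3 outputs
layer_sizes = [15, 3375, 225, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [54000, 759600, 678]
print(total)      # 814278
```

Almost all of the parameters sit in the connection between the two hidden layers.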


In [7]:
m.compile(loss="binary_crossentropy", # For binary classification
 optimizer="sgd",                     # SGD
 metrics=["accuracy"])                # Display accuracy
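How Keras resolves the string `"accuracy"` depends on the Keras version and on the output shape, so with three sigmoid outputs the reported number may not be the per-label accuracy one expects. Passing `keras.metrics.BinaryAccuracy()` explicitly is one way to remove the ambiguity. As a rough, NumPy-only sketch of what two unambiguous multilabel accuracies look like (function names and toy arrays are assumptions for illustration):

```python
import numpy as np

def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct 0/1 decisions, computed separately for each label."""
    y_pred = (y_prob >= threshold).astype(int)
    return (y_pred == y_true).mean(axis=0)

def exact_match_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows where all labels are predicted correctly at once."""
    y_pred = (y_prob >= threshold).astype(int)
    return (y_pred == y_true).all(axis=1).mean()

# Toy targets and predicted probabilities for 4 patients and 3 labels
y_true_demo = np.array([[0, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]])
y_prob_demo = np.array([[0.1, 0.8, 0.7],
                        [0.2, 0.6, 0.9],
                        [0.4, 0.9, 0.2],
                        [0.1, 0.3, 0.4]])

label_acc = per_label_accuracy(y_true_demo, y_prob_demo)
subset_acc = exact_match_accuracy(y_true_demo, y_prob_demo)
print(label_acc)   # one accuracy per label
print(subset_acc)  # a single, stricter number
```

Exact-match accuracy is never higher than the lowest per-label accuracy, so it is worth being explicit about which of the two a plot or log line shows.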

In [8]:
history = m.fit(X_train, y_train, epochs=30,
                     validation_data=(X_valid, y_valid))

Epoch 1/30
Epoch 2/30
 164/1612 [==>...........................] - ETA: 7s - loss: 0.4846 - accuracy: 0.4173

KeyboardInterrupt: 

In [None]:
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()

We see that the loss and accuracy curves flatten out without reaching a good solution. After looking at the predictions, we will convert the `Age` and `BMI` features into categories.
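One possible way to turn a numeric feature into categories is to bin it and then one-hot encode the bin codes. The sketch below does this for a handful of made-up BMI values; the bin edges are hypothetical placeholders, not values chosen in this notebook, and the same pattern would apply to `Age`.

```python
import numpy as np

# Hypothetical BMI values and bin edges, for illustration only
bmi_demo = np.array([17.0, 22.0, 27.5, 31.0, 25.0])
bmi_edges = np.array([18.5, 25.0, 30.0])  # 3 edges -> 4 bins

# np.digitize returns the bin index of each value (0 .. len(edges))
bmi_codes = np.digitize(bmi_demo, bmi_edges)

# One-hot encode: row i of the identity matrix represents bin i
bmi_onehot = np.eye(len(bmi_edges) + 1, dtype=int)[bmi_codes]

print(bmi_codes)
print(bmi_onehot)
```

Each one-hot column would replace the original numeric column, which increases the input width of the network accordingly.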

In [23]:
# Predictions, outputs probabilities
predictions_matrix_probs = m.predict(X_valid)

In [32]:
# LIST COMPREHENSION
# to convert probabilities into binary decisions:
# for each row (one patient) in the prediction matrix,
# emit 1 where the predicted probability is >= .5, else 0
predictions_matrix_bin = np.array([[1 if p >= .5 else 0 for p in row]
                                   for row in predictions_matrix_probs])
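The same thresholding can also be written as a single vectorized NumPy comparison, which avoids the Python-level loops. A minimal sketch on a toy probability matrix (the `*_demo` names are mine):

```python
import numpy as np

# Toy probabilities: 2 patients, 3 labels
probs_demo = np.array([[0.91, 0.49, 0.50],
                       [0.02, 0.75, 0.30]])

# Nested list comprehension, as in the cell above
loop_version = np.array([[1 if p >= .5 else 0 for p in row]
                         for row in probs_demo])

# Vectorized equivalent: elementwise comparison, then cast booleans to ints
vectorized = (probs_demo >= 0.5).astype(int)

print(vectorized)
```

Both versions give the same matrix; note that a probability of exactly 0.5 counts as a positive prediction.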

In [50]:
# Printing the confusion matrices, one for each label
# scikit-learn expects (y_true, y_pred); each matrix is [[TN, FP], [FN, TP]]
confusion_matrices = multilabel_confusion_matrix(y_valid, predictions_matrix_bin)
for i in range(len(TARGET_LABELS)):
    print(f"CONFUSION MATRIX for {TARGET_LABELS[i]}:")
    print("____________________________")
    print(confusion_matrices[i], end='\n'*2)

CONFUSION MATRIX for Stroke:
____________________________
[[4701    0]
 [ 299    0]]

CONFUSION MATRIX for HighBP:
____________________________
[[1609  633]
 [1133 1625]]

CONFUSION MATRIX for Diabetes:
____________________________
[[1834  654]
 [ 886 1626]]



In [30]:
predictions_matrix_bin.shape

(5000, 3)

In [31]:
y_valid.shape

(5000, 3)
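Per-label precision and recall follow directly from a confusion matrix laid out as `[[TN, FP], [FN, TP]]`, which is the layout scikit-learn uses when called as `multilabel_confusion_matrix(y_true, y_pred)`. The sketch below rebuilds that layout with NumPy on toy data; the helper names are mine, and the toy labels are not the notebook's results. It also shows what happens for a label the model never predicts, as appears to be the case for Stroke above.

```python
import math
import numpy as np

def label_confusion(y_true, y_pred):
    """One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]]."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tn = (~t & ~p).sum(axis=0)
    fp = (~t & p).sum(axis=0)
    fn = (t & ~p).sum(axis=0)
    tp = (t & p).sum(axis=0)
    return np.stack([tn, fp, fn, tp], axis=1).reshape(-1, 2, 2)

def precision_recall(cm):
    """Precision and recall from one [[TN, FP], [FN, TP]] matrix."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp) if (tp + fp) else float("nan")  # undefined if never predicted
    recall = tp / (tp + fn) if (tp + fn) else float("nan")     # undefined if label never occurs
    return precision, recall

# Toy data: 4 patients, 2 labels; label 0 is never predicted
y_true_toy = np.array([[0, 1], [0, 1], [1, 0], [0, 1]])
y_pred_toy = np.array([[0, 1], [0, 0], [0, 0], [0, 1]])

cms = label_confusion(y_true_toy, y_pred_toy)
p0, r0 = precision_recall(cms[0])
p1, r1 = precision_recall(cms[1])
print(cms)
print(p0, r0, p1, r1)
```

A label with zero recall can still show high accuracy when positives are rare, so recall is the more informative number for an imbalanced target.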