# Neural Network Multilabel Classification using TensorFlow

This notebook contains two code snippets that each define and evaluate a neural network for multilabel classification using TensorFlow. The two snippets differ in structure, methodology, and evaluation technique.

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score, hamming_loss, f1_score
from sklearn.model_selection import RepeatedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense

In [2]:
import os
# Set the working directory
os.chdir(r'/Users/saram/Desktop/Erdos_Institute/project/Data')

In [3]:
# Read train features
mars_data = pd.read_csv("../Data/train_features_new_with_PCA.csv")
mars_data.set_index(mars_data.sample_id, inplace=True)
mars_data

sample_id (index),sample_id,basalt,carbonate,chloride,iron_oxide,oxalate,oxychlorine,phyllosilicate,silicate,sulfate,...,2.12,0.13,1.13,2.13,0.14,1.14,2.14,0.15,1.15,2.15
S0000,S0000,0,0,0,0,0,0,0,0,1,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0001,S0001,0,1,0,0,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0002,S0002,0,0,0,0,0,1,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0003,S0003,0,1,0,1,0,0,0,0,1,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0004,S0004,0,0,0,1,0,1,1,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
S0749,S0749,0,0,0,0,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0750,S0750,0,0,0,0,0,0,1,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0751,S0751,0,0,0,0,0,0,0,1,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15
S0752,S0752,0,0,0,1,0,0,0,0,0,...,-0.09684,-0.983755,-0.177357,-0.178857,-0.559546,-0.15498,-0.039571,-0.362594,2.270000e-15,1.300000e-15


In [4]:
print(mars_data.columns)

Index(['sample_id', 'basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate',
       'oxychlorine', 'phyllosilicate', 'silicate', 'sulfate', 'sulfide', '0',
       '1', '2', '0.1', '1.1', '2.1', '0.2', '1.2', '2.2', '0.3', '1.3', '2.3',
       '0.4', '1.4', '2.4', '0.5', '1.5', '2.5', '0.6', '1.6', '2.6', '0.7',
       '1.7', '2.7', '0.8', '1.8', '2.8', '0.9', '1.9', '2.9', '0.10', '1.10',
       '2.10', '0.11', '1.11', '2.11', '0.12', '1.12', '2.12', '0.13', '1.13',
       '2.13', '0.14', '1.14', '2.14', '0.15', '1.15', '2.15'],
      dtype='object')


In [36]:
# Data preprocessing 
# Drop 'sample_id' and separate features and target labels
X = mars_data.drop(columns=['sample_id', 'basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate', 'oxychlorine',
                       'phyllosilicate', 'silicate', 'sulfate', 'sulfide'])
y = mars_data[['basalt', 'carbonate', 'chloride', 'iron_oxide', 'oxalate', 'oxychlorine',
          'phyllosilicate', 'silicate', 'sulfate', 'sulfide']]

In [37]:
# Ensure we have correct dimensions
print(X.shape)
print(y.shape)

(754, 48)
(754, 10)


In [14]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit the scaler on the training data only
X_test = scaler.transform(X_test)  # Reuse the training statistics on the test data

## First Model

1. Model Definition

    The model is defined directly using `tf.keras.Sequential()`.
    The architecture consists of **two hidden layers** (both with 64 units and ReLU activation), followed by an output layer with 10 units and a sigmoid activation function (for multilabel classification).

    It uses `model.compile()` and `model.fit()` to compile and train the model.

2. Training

    The model is trained with a fixed dataset `(X_train, y_train)` for 20 `epochs` with a `batch size` of 32.
    
    There is no cross-validation involved in this training process.

3. Evaluation Metrics

    After training, the model is evaluated on the test set `(X_test, y_test)` for loss and accuracy using `model.evaluate()`.

    Predictions are converted to binary labels using a threshold of 0.5, and the macro-averaged F1-score and Hamming loss are then computed separately with scikit-learn.

4. Model Evaluation and Reporting

    After training, the model is evaluated directly on the test set and reports loss, accuracy, macro F1-score, and Hamming loss.
    It uses `.predict()` for predictions and thresholds the sigmoid outputs at 0.5 to obtain binary labels.

    Hamming loss is a natural metric for multilabel classification because it measures error at the level of individual labels. It compares each predicted label with the actual label for every class of every sample, counts the mismatches, and divides by the total number of label entries (samples × labels). Lower is better, and 0 means every label was predicted correctly.
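
The thresholding and Hamming loss computation described above can be sketched on a small example. The arrays below are hypothetical toy values (they are not taken from the Mars dataset), chosen only so that the count of mismatches is easy to check by hand against scikit-learn's `hamming_loss()`.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical toy data: 3 samples, 4 labels (not from the Mars dataset)
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])

# Hypothetical sigmoid outputs: one independent probability per label
y_prob = np.array([[0.9, 0.2, 0.1, 0.4],
                   [0.3, 0.8, 0.6, 0.1],
                   [0.1, 0.2, 0.7, 0.9]])

# Threshold each label independently at 0.5
y_pred = (y_prob > 0.5).astype(int)

# Count mismatched label entries and divide by the total number of entries
mismatches = int(np.sum(y_true != y_pred))
manual_hamming = mismatches / y_true.size

print(f"Mismatched entries: {mismatches} of {y_true.size}")
print(f"Manual Hamming loss:  {manual_hamming:.4f}")
print(f"sklearn Hamming loss: {hamming_loss(y_true, y_pred):.4f}")
```

Two of the twelve label entries are wrong here, so both computations give 2/12. Note that two of the three samples contain an error, which is why a low Hamming loss can coexist with a much lower exact-match accuracy.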

In [None]:
# Creating a multilabel neural network using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(48,)),  # First hidden layer; expects 48 input features
    tf.keras.layers.Dense(64, activation='relu'),  # Second hidden layer
    tf.keras.layers.Dense(10, activation='sigmoid')  # Output layer with 10 nodes (one per label) and sigmoid activation for multilabel classification
])

# Compiling the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(X_train, y_train, epochs=20, batch_size=32)

In [47]:
# Making predictions
y_pred = model.predict(X_test)
y_pred_binary = (y_pred > 0.5).astype(int)  # Convert to binary predictions

# Evaluating accuracy
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

# Calculating F1-Score (Macro average)
f1 = f1_score(y_test, y_pred_binary, average='macro')
print(f"F1-Score: {f1}")

# Calculating Hamming Loss
hamming = hamming_loss(y_test, y_pred_binary)
print(f"Hamming Loss: {hamming}")

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 839us/step - accuracy: 0.3261 - loss: 0.3424
Loss: 0.34491172432899475
Accuracy: 0.33112582564353943
F1-Score: 0.25167838932462117
Hamming Loss: 0.1337748344370861
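
The macro-averaged F1-score printed above is the unweighted mean of the per-label F1-scores, so a label that is never predicted pulls the average down regardless of how rare it is. The sketch below illustrates this on hypothetical toy labels (not the Mars data); `zero_division=0` is an explicit choice made here so that a label with no positive predictions scores 0 without raising a warning.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy data: 4 samples, 3 labels (not from the Mars dataset)
y_true = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 0],
                   [1, 0, 0]])

# One F1-score per label (column)
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)

# Macro average: plain mean of the per-label scores
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print("Per-label F1:", np.round(per_label, 3))
print(f"Macro F1: {macro:.3f}")
print(f"Mean of per-label F1: {per_label.mean():.3f}")
```

In this toy case the third label is never predicted, so its F1 is 0 and the macro average drops accordingly. This is one possible reason a macro F1 can look poor even when the Hamming loss is low; whether it applies to the model above would need a per-label breakdown on the real test set.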


## Second Model

1. Model Definition

    The model is defined as a function `get_model()` that takes `n_inputs` (number of input features) and `n_outputs` (number of output labels) as arguments. 
    
    Similar to the first, the architecture consists of two hidden layers and a sigmoid output layer.

2. Training

    The model is evaluated using Repeated K-Fold Cross-Validation with 10 splits and 3 repeats, which gives 30 train-test splits in total. Training and evaluation are therefore repeated on different splits, providing a more robust performance assessment than a single split.

    A fresh model is built and trained for 100 epochs in each fold of the cross-validation process.

3. Evaluation Metrics

    Each fold of the cross-validation process is evaluated with three metrics:
    - Accuracy using `accuracy_score()`, which for multilabel targets is the subset accuracy (every label of a sample must match).
    - Hamming loss using `hamming_loss()`.
    - F1-score (macro average) using `f1_score()`.

    All three come from scikit-learn. The average of each metric is printed at the end, giving a more comprehensive performance overview across the different splits.

4. Model Evaluation and Reporting

    The results are aggregated over multiple cross-validation splits, and it reports average accuracy, average Hamming loss, and average F1-score.
    
    Individual metrics for each fold are printed, making it possible to assess performance variability across different splits.
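
The split structure described above can be checked without training anything. The sketch below uses a placeholder array with the same shape as `X` (754 rows, 48 columns); `X_demo` is a hypothetical stand-in, not the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Placeholder with the same shape as X (754 samples, 48 features); values are irrelevant here
X_demo = np.zeros((754, 48))

# Same settings as the evaluation procedure in this notebook
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Record the (train size, test size) of every split
fold_sizes = [(len(train_ix), len(test_ix)) for train_ix, test_ix in cv.split(X_demo)]

print(f"Number of splits: {cv.get_n_splits()}")
print(f"First split (train, test): {fold_sizes[0]}")
print(f"Distinct test sizes: {sorted({test for _, test in fold_sizes})}")
```

Each repeat reshuffles the data before splitting it into 10 folds, so every sample appears in a test fold exactly once per repeat and three times overall.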



In [None]:
# Define the model
def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(64, input_dim=n_inputs, activation='relu'))  # First hidden layer (64 units, ReLU)
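    # Second hidden layer (64 units). As written, this layer uses softmax rather than the ReLU
    # used in the first model; ReLU is the more conventional hidden-layer activation.
    model.add(Dense(64, activation='softmax'))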
    model.add(Dense(n_outputs, activation='sigmoid'))  # Output layer with sigmoid activation for multilabel classification
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Evaluate the model using repeated k-fold cross-validation
def evaluate_model(X, y):
    results = {
        'accuracy': [],
        'hamming_loss': [],
        'f1_score': []
    }
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    
    # Define evaluation procedure
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    
    # Enumerate folds
    for train_ix, test_ix in cv.split(X):
        # Prepare data
        X_train, X_test = X[train_ix], X[test_ix]
        y_train, y_test = y[train_ix], y[test_ix]
        
        # Define the model
        model = get_model(n_inputs, n_outputs)
        
        # Fit the model
        model.fit(X_train, y_train, verbose=0, epochs=100)
        
        # Make predictions
        yhat = model.predict(X_test)
        
        # Round probabilities to class labels
        yhat = yhat.round()
        
        # Calculate accuracy
        acc = accuracy_score(y_test, yhat)
        results['accuracy'].append(acc)
        
        # Calculate Hamming loss
        hamming = hamming_loss(y_test, yhat)
        results['hamming_loss'].append(hamming)
        
        # Calculate F1-Score (Macro average)
        f1 = f1_score(y_test, yhat, average='macro')
        results['f1_score'].append(f1)
        
        # Print individual metrics for the fold
        print(f'> Accuracy: {acc:.3f}, Hamming Loss: {hamming:.3f}, F1-Score: {f1:.3f}')
    
    return results

# Ensure X and y are numpy arrays
if not isinstance(X, np.ndarray):
    X = X.to_numpy()

if not isinstance(y, np.ndarray):
    y = y.to_numpy()

# Evaluate the model
results = evaluate_model(X, y)

# Summarize performance
print(f'Average Accuracy: {np.mean(results["accuracy"]):.3f}')
print(f'Average Hamming Loss: {np.mean(results["hamming_loss"]):.3f}')
print(f'Average F1-Score: {np.mean(results["f1_score"]):.3f}')

3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
> Accuracy: 0.592, Hamming Loss: 0.076, F1-Score: 0.707
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
> Accuracy: 0.487, Hamming Loss: 0.096, F1-Score: 0.656
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
> Accuracy: 0.447, Hamming Loss: 0.074, F1-Score: 0.664
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
> Accuracy: 0.513, Hamming Loss: 0.083, F1-Score: 0.721
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step
> Accuracy: 0.533, Hamming Loss: 0.079, F1-Score: 0.737
3/3 ━━━━━━━━━━━━━━━━━━━━ 1s 137ms/step
> Accuracy: 0.467, Hamming Loss: 0.079, F1-Score: 0.699
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
> Accuracy: 0.480, Hamming Loss: 0.113, F1-Score: 0.584
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
> Accuracy: 0.387, Hamming Loss: 0.123, F1-Score: 0.578
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
> Accuracy: 0.533, Hamming Loss: 0.073, F1-Score: 0.767
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
> Accuracy: 0.613, Hamming Loss: 0.069, F1-Score: 0.793
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
> Accuracy: 0.513, Hamming Loss: 0.095, F1-Score: 0.744
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
> Accuracy: 0.434, Hamming Loss: 0.107, F1-Score: 0.557
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
> Accuracy: 0.474, Hamming Loss: 0.084, F1-Score: 0.616
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
> Accuracy: 0.526, Hamming Loss: 0.068, F1-Score: 0.719
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 85ms/step
> Accuracy: 0.520, Hamming Loss: 0.081, F1-Score: 0.718
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
> Accuracy: 0.520, Hamming Loss: 0.081, F1-Score: 0.736
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
> Accuracy: 0.533, Hamming Loss: 0.080, F1-Score: 0.675
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
> Accuracy: 0.547, Hamming Loss: 0.077, F1-Score: 0.714
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
> Accuracy: 0.480, Hamming Loss: 0.087, F1-Score: 0.711
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
> Accuracy: 0.587, Hamming Loss: 0.085, F1-Score: 0.682
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
> Accuracy: 0.382, Hamming Loss: 0.137, F1-Score: 0.604
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
> Accuracy: 0.421, Hamming Loss: 0.105, F1-Score: 0.533
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
> Accuracy: 0.461, Hamming Loss: 0.093, F1-Score: 0.712
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
> Accuracy: 0.487, Hamming Loss: 0.074, F1-Score: 0.695
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step
> Accuracy: 0.453, Hamming Loss: 0.093, F1-Score: 0.676
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
> Accuracy: 0.587, Hamming Loss: 0.063, F1-Score: 0.734
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
> Accuracy: 0.467, Hamming Loss: 0.093, F1-Score: 0.646
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
> Accuracy: 0.627, Hamming Loss: 0.060, F1-Score: 0.795
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
> Accuracy: 0.613, Hamming Loss: 0.059, F1-Score: 0.756
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
> Accuracy: 0.560, Hamming Loss: 0.087, F1-Score: 0.690
Average Accuracy: 0.508
Average Hamming Loss: 0.086
Average F1-Score: 0.687
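
Since the per-fold metrics are collected in a dictionary, the variability across splits can be summarized alongside the means. The helper below is a minimal sketch: `summarize` is a hypothetical name, and `demo_results` holds only the first three per-fold values printed above (rounded to three decimals) so that the example stays short. In the notebook, the full `results` dictionary returned by `evaluate_model()` could be passed instead.

```python
import numpy as np

def summarize(results):
    """Return {metric: (mean, standard deviation)} for a dict of per-fold score lists."""
    return {name: (float(np.mean(scores)), float(np.std(scores)))
            for name, scores in results.items()}

# First three folds from the printed output above, used purely for illustration
demo_results = {
    'accuracy': [0.592, 0.487, 0.447],
    'hamming_loss': [0.076, 0.096, 0.074],
    'f1_score': [0.707, 0.656, 0.664],
}

summary = summarize(demo_results)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} (+/- {std:.3f})")
```

Reporting the spread next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.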


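Comparing the two models directly is not entirely fair because they are evaluated differently. The first model uses a single train-test split, so its scores depend on which samples happen to land in the test set. The second model uses Repeated K-Fold cross-validation and averages its metrics over 30 splits, which gives a more stable performance estimate. The training setups also differ: the first model is trained for 20 epochs on standardized features, whereas the second, as written, is trained for 100 epochs per fold on `X` without the `StandardScaler` step.

The accuracy figures are not the same quantity either. The first model's accuracy comes from Keras via `model.evaluate()`, while the second uses scikit-learn's `accuracy_score()`, which for multilabel targets is the subset (exact-match) accuracy. The first model's reported accuracy (0.331) is also far from one minus its Hamming loss (1 - 0.134 = 0.866), so it is not a per-label accuracy. Hamming loss and macro F1-score are computed the same way in both snippets and are the safer basis for comparison. Overall, the second evaluation gives a more reliable picture of performance and generalization, and comparing the headline numbers directly is misleading.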
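
How much the reported number depends on the definition of accuracy can be seen on a toy example. The sketch below uses hypothetical labels and probabilities (not the Mars data) and computes three quantities that are all commonly called "accuracy". The argmax-based variant is included only as one candidate for what a framework-level `accuracy` metric might compute on a multi-unit output; which metric Keras actually selects depends on the version and configuration, so that part is an assumption rather than a statement about this notebook's run.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical toy data: 3 samples, 4 labels (not from the Mars dataset)
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])
y_prob = np.array([[0.9, 0.2, 0.1, 0.4],
                   [0.3, 0.8, 0.6, 0.1],
                   [0.1, 0.2, 0.7, 0.9]])
y_pred = (y_prob > 0.5).astype(int)

# Subset (exact-match) accuracy: a sample counts only if every label is correct
subset_acc = accuracy_score(y_true, y_pred)

# Label-wise accuracy: fraction of individual label entries that are correct
label_acc = 1 - hamming_loss(y_true, y_pred)

# Argmax-style accuracy (assumed variant): compares only the highest-scoring position
argmax_acc = float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

print(f"Subset accuracy:     {subset_acc:.3f}")
print(f"Label-wise accuracy: {label_acc:.3f}")
print(f"Argmax accuracy:     {argmax_acc:.3f}")
```

The same predictions yield three different values, which is why accuracy figures from different sources should only be compared once it is clear that they measure the same thing.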

## References:

- [Multi-Label Classification in Python](http://scikit.ml/index.html)
- [Multi-label deep learning with scikit-multilearn](http://scikit.ml/multilabeldnn.html#Multi-class-Keras-classifier)
- [Multi-Class Classification Tutorial with the Keras Deep Learning Library](https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/)
- [Hamming Loss](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.hamming_loss.html)
- [Multi-Label Classification with Deep Learning](https://machinelearningmastery.com/multi-label-classification-with-deep-learning/)
