In [52]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score, f1_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform, loguniform
from scipy.stats import randint
import pandas as pd
import scipy as sp
import seaborn as sns

# Problem transformation

## Difference Between Binary Relevance Approach and Classifier Chains Approach

In the binary relevance approach, a multi-label classification problem is broken down into multiple independent binary classification sub-problems, with one classifier trained per label. In the classifier chains approach, a sequence of binary classifiers is also created, but the classifiers are not independent: the predictions of the previous classifiers in the chain are used as additional inputs for the next classifier. This means that the binary relevance approach is simpler and typically less computationally expensive than the classifier chains approach. However, the classifier chains approach can capture correlations between labels, which the binary relevance approach ignores.
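
The difference in data flow can be sketched by hand. The example below is a minimal illustration and not how `skmultilearn` implements either transformation internally. It assumes a small synthetic dataset and uses `LogisticRegression` as a stand-in base classifier; both choices are assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with correlated labels (an assumption; not the yeast dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = np.zeros((200, 3), dtype=int)
Y[:, 0] = X[:, 0] > 0
Y[:, 1] = X[:, 1] + Y[:, 0] > 0.5    # label 1 depends on label 0
Y[:, 2] = X[:, 2] - Y[:, 1] > -0.5   # label 2 depends on label 1

# Binary relevance: one independent classifier per label, each trained on X only
br_models = [LogisticRegression(max_iter=1000).fit(X, Y[:, j]) for j in range(Y.shape[1])]
br_pred = np.column_stack([model.predict(X) for model in br_models])

# Classifier chain: classifier j is trained on X plus labels 0..j-1
cc_models = []
for j in range(Y.shape[1]):
    X_aug = np.hstack([X, Y[:, :j]])  # true earlier labels at training time
    cc_models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j]))

# At prediction time the chain feeds its own earlier predictions forward
cc_pred = np.zeros_like(Y)
for j, model in enumerate(cc_models):
    X_aug = np.hstack([X, cc_pred[:, :j]])
    cc_pred[:, j] = model.predict(X_aug)

br_n_features = [model.coef_.shape[1] for model in br_models]
cc_n_features = [model.coef_.shape[1] for model in cc_models]
print(br_n_features)  # every classifier sees the 10 original features
print(cc_n_features)  # the input grows by one feature per link in the chain
```

The growing input width in the chain is what lets later classifiers exploit label correlations, and it is also why the chain must be evaluated in order.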

## Obtaining a Multi-Label Classifier Using the Binary Relevance Approach

In [53]:
# Read the data
data = pd.read_csv('yeast.csv')

# The first 103 columns are the input features; the remaining columns are the labels
X = data.iloc[:, :103].values
y = data.iloc[:, 103:].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    random_state=42,
    shuffle=True
)

# Create a base classifier
base_classifier = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',             # ReLU activation function
    solver='adam',                 # Adam optimizer
    max_iter=300,                  # Maximum iterations
    random_state=42,               # For reproducibility
    early_stopping=True,           # Enable early stopping
    validation_fraction=0.1        # Use 10% of training data for validation
)

# Create and train the binary relevance classifier
binary_relevance = BinaryRelevance(
    classifier=base_classifier,
    require_dense=[True, True]     # Both X and y should be dense matrices
)

# Train the model
binary_relevance.fit(X_train, y_train)

# Make predictions
y_pred = binary_relevance.predict(X_test)

# Convert sparse matrix predictions to dense array for evaluation
y_pred_dense = y_pred.toarray()
y_test_dense = y_test

# Calculate performance metrics
accuracy = accuracy_score(y_test_dense, y_pred_dense)
hamming = hamming_loss(y_test_dense, y_pred_dense)
# Store the result under a new name so the imported jaccard_score function is not overwritten
jaccard = jaccard_score(y_test_dense, y_pred_dense, average='samples')

print(f"Subset Accuracy: {accuracy:.4f}")
print(f"Hamming Loss: {hamming:.4f}")
print(f"Jaccard Score: {jaccard:.4f}")

Subset Accuracy: 0.1405
Hamming Loss: 0.1926
Jaccard Score: 0.5087
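
To help interpret these three numbers, the sketch below computes each metric by hand with NumPy on a toy example. The two-sample label matrices are made up for illustration, and the Jaccard computation assumes every sample has at least one true or predicted label, so the union is never zero.

```python
import numpy as np

# Toy ground truth and predictions (hypothetical values, 2 samples x 3 labels)
toy_true = np.array([[1, 0, 1],
                     [0, 1, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 0]])

# Subset accuracy: a sample counts only if every label is predicted correctly
toy_subset_accuracy = np.mean(np.all(toy_true == toy_pred, axis=1))

# Hamming loss: fraction of individual label entries that are wrong
toy_hamming = np.mean(toy_true != toy_pred)

# Sample-averaged Jaccard: |intersection| / |union| per sample, then the mean
intersection = np.sum((toy_true == 1) & (toy_pred == 1), axis=1)
union = np.sum((toy_true == 1) | (toy_pred == 1), axis=1)
toy_jaccard = np.mean(intersection / union)

print(f"Subset Accuracy: {toy_subset_accuracy:.4f}")  # 0.5000
print(f"Hamming Loss: {toy_hamming:.4f}")             # 0.1667
print(f"Jaccard Score: {toy_jaccard:.4f}")            # 0.7500
```

One wrong label out of six is enough to halve the subset accuracy here, which is why subset accuracy tends to be much lower than the other two metrics suggest.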


In [54]:
# Create a base classifier
base_classifier = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',             # ReLU activation function
    solver='adam',                 # Adam optimizer
    max_iter=300,                  # Maximum iterations
    random_state=42,               # For reproducibility
    early_stopping=True,           # Enable early stopping
    validation_fraction=0.1        # Use 10% of training data for validation
)

# Create and train the classifier chain
classifier_chain = ClassifierChain(
    classifier=base_classifier,
    require_dense=[True, True],    # Both X and y should be dense matrices
)

# Train the model
classifier_chain.fit(X_train, y_train)

# Make predictions
y_pred = classifier_chain.predict(X_test)

# Convert sparse matrix predictions to dense array for evaluation
y_pred_dense = y_pred.toarray()
y_test_dense = y_test

from sklearn.metrics import jaccard_score as js_metric

# Calculate performance metrics
accuracy = accuracy_score(y_test_dense, y_pred_dense)
hamming = hamming_loss(y_test_dense, y_pred_dense)
js = js_metric(y_test_dense, y_pred_dense, average='samples')

print(f"Subset Accuracy: {accuracy:.4f}")
print(f"Hamming Loss: {hamming:.4f}")
print(f"Jaccard Score: {js:.4f}")

Subset Accuracy: 0.2314
Hamming Loss: 0.2050
Jaccard Score: 0.5205


# Adapted algorithm

## Determining Hyperparameters to Optimize

The first hyperparameters chosen for optimization are the number of hidden layers and the number of neurons in each layer. The hidden layer sizes determine the model's capacity, which is especially important for a multi-label dataset because the model needs to learn the relationships between the input features and several labels at once. Finding a suitable number of hidden layers and neurons is important because too few neurons can lead to underfitting, while too many neurons can lead to overfitting.
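
As a rough illustration of how capacity grows with the hidden layer sizes, the sketch below counts the weights and biases of a fully connected network with 103 inputs and 14 outputs, matching the feature and label counts used in this notebook. The helper `count_mlp_parameters` and the candidate architectures are hypothetical and only serve the example.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected feed-forward network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )

# Candidate architectures (hypothetical choices for illustration)
for sizes in [(50,), (100, 50), (200, 100, 50)]:
    print(sizes, count_mlp_parameters(103, sizes, 14))
# (50,) 5914
# (100, 50) 16164
# (200, 100, 50) 46664
```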

Another hyperparameter selected for optimization is the learning rate, which controls the size of each update to the model's weights. If the learning rate is too high, the updates may overshoot the optimal values and fail to converge. On the other hand, if the learning rate is too low, the model may take a long time to converge or may get stuck in a local minimum. Therefore, optimizing the learning rate can help to find a value at which training converges reliably for all the labels.
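
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. It is a deliberately simplified stand-in for network training, and the three learning rates are arbitrary values chosen to show each regime.

```python
def gradient_descent(learning_rate, start=1.0, n_steps=50):
    """Minimize f(w) = w**2 with fixed-step gradient descent."""
    w = start
    for _ in range(n_steps):
        gradient = 2.0 * w            # derivative of w**2
        w -= learning_rate * gradient
    return w

w_too_high = gradient_descent(1.1)    # each step overshoots, so |w| grows
w_too_low = gradient_descent(0.001)   # barely moves away from the start
w_moderate = gradient_descent(0.3)    # converges very close to 0

print(w_too_high, w_too_low, w_moderate)
```

In `MLPClassifier` the corresponding setting is `learning_rate_init`, which is used by the `sgd` and `adam` solvers.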

Alpha is another hyperparameter selected for optimization. Alpha is the strength of the L2 regularization term, which penalizes large weights in order to prevent overfitting. This is especially important for datasets with a large number of features, which are more prone to overfitting, and for datasets with multiple labels, where it helps to prevent the model from overfitting to one label.
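
The shrinking effect of an L2 penalty is easiest to see in a linear model, where the penalized solution has a closed form. The sketch below uses ridge regression on synthetic data as an analogy; it is not the loss that `MLPClassifier` optimizes, and the data and alpha values are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic regression data with few samples relative to features (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 20))
true_w = rng.normal(size=20)
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=30)

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# A larger alpha pulls the weight vector toward zero
alphas = [1e-4, 1e-1, 1e1, 1e3]
weight_norms = [float(np.linalg.norm(ridge_weights(X_demo, y_demo, a))) for a in alphas]
for alpha, norm in zip(alphas, weight_norms):
    print(f"alpha={alpha:g}  ||w||={norm:.4f}")
```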

Lastly, the batch size is selected for optimization. The batch size is the number of samples used for each update of the model's weights. Small batch sizes can lead to better generalization but can be computationally expensive, since each epoch requires more weight updates. On the other hand, large batch sizes can lead to faster training times but can lead to poor generalization. Finding a suitable batch size is important for a multi-label dataset because each batch needs to be large enough to contain a variety of label combinations while also being small enough to limit overfitting.
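
The computational side of this trade-off comes down to simple arithmetic: the number of weight updates per epoch is the training set size divided by the batch size, rounded up. The sketch below assumes a hypothetical training set of 1000 samples and three arbitrary batch sizes.

```python
import math

n_train = 1000  # hypothetical training set size

# Number of weight updates needed to pass over the data once
updates_per_epoch = {
    batch_size: math.ceil(n_train / batch_size)
    for batch_size in (16, 64, 256)
}
print(updates_per_epoch)  # {16: 63, 64: 16, 256: 4}
```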



## Selecting an HPO Technique Supported By the `scikit-learn` Toolkit

The chosen HPO technique is random search. Random search is chosen because it is more computationally efficient than grid search, especially for large datasets, since it randomly samples the search space instead of evaluating every possible parameter combination. Additionally, random search allows continuous hyperparameters to be sampled from distributions. This is useful when optimizing hyperparameters such as the learning rate and alpha, as their values are not restricted to a predefined grid. Lastly, when only a few hyperparameters strongly affect performance, random search tries more distinct values of each of them than grid search does for the same number of evaluations.
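
A minimal sketch of such a search is shown below, assuming a small synthetic multi-label dataset so that it runs quickly. The search ranges, `n_iter=4`, `max_iter=50`, and the use of Hamming loss as the scoring function are all illustrative assumptions, not the settings used for the yeast data. `MLPClassifier` accepts a label indicator matrix directly, so no problem transformation is needed.

```python
import warnings

from scipy.stats import loguniform
from sklearn.datasets import make_multilabel_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import hamming_loss, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# The short max_iter below is only for speed, so convergence warnings are expected
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic multi-label data (an assumption; not the yeast dataset)
X_demo, y_demo = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=5, random_state=0
)

# Discrete options are given as lists, continuous ones as distributions
param_distributions = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "alpha": loguniform(1e-5, 1e-1),
    "batch_size": [16, 32, 64],
}

search = RandomizedSearchCV(
    estimator=MLPClassifier(max_iter=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                 # number of sampled configurations
    cv=5,                     # 5-fold cross-validation
    scoring=make_scorer(hamming_loss, greater_is_better=False),
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean Hamming loss: {-search.best_score_:.4f}")
```

Sampling the learning rate and alpha from `loguniform` spreads the trials evenly across orders of magnitude, which a short hand-picked grid cannot do.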

## Choosing a Suitable Value of 𝐾 for Cross-Validation

The value of 𝐾 chosen for 𝐾-fold cross-validation is 5. This value was chosen as it ensures we have enough data for both training and validation: each model is trained on 80% of the training set and validated on the remaining 20%. While choosing a larger value of 𝐾 could be more beneficial, since each model would be trained on more data, it would also be more computationally expensive because 𝐾 models are fitted for every candidate configuration. Additionally, the dataset is large, and therefore each fold will have enough training samples and will contain sufficient examples of all 14 labels despite the smaller value of 𝐾.
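
The fold sizes and label coverage can be checked directly. The sketch below assumes a synthetic indicator matrix with 1000 samples and 14 labels, each label being active in roughly 30% of samples. These numbers are hypothetical and do not describe the yeast data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic label matrix (hypothetical: 1000 samples, 14 labels, ~30% positives)
rng = np.random.default_rng(0)
n_samples, n_labels = 1000, 14
y_demo = (rng.random((n_samples, n_labels)) < 0.3).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_summary = []
for train_idx, val_idx in kfold.split(y_demo):
    # Smallest number of positive examples of any label in the validation fold
    min_positives = int(y_demo[val_idx].sum(axis=0).min())
    fold_summary.append((len(train_idx), len(val_idx), min_positives))

for n_train, n_val, min_positives in fold_summary:
    print(f"train={n_train}  val={n_val}  fewest positives for any label={min_positives}")
```

Running the same check on the real label matrix would confirm whether every label is represented in every fold.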