In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score, f1_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Problem transformation

## Difference Between Binary Relevance Approach and Classifier Chains Approach

In the binary relevance approach, a multi-label classification problem is broken down into multiple binary classification sub-problems by creating one classifier for each label. In the classifier chains approach, a sequence of binary classifiers is also created but, unlike in the binary relevance approach, the classifiers are not independent: the predictions of the previous classifiers in the chain are used as additional inputs for the next classifier. This means that the binary relevance approach is simpler and less computationally expensive than the classifier chains approach. However, the classifier chains approach can capture label correlations, which binary relevance ignores because it treats every label independently.
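The difference in what each per-label classifier sees during training can be sketched in plain Python. This is a minimal illustration, not the `skmultilearn` implementation: the toy data and the helper names `binary_relevance_inputs` and `classifier_chain_inputs` are hypothetical, and the chain is assumed to follow the label order and to append the earlier labels to the feature vector.

```python
def binary_relevance_inputs(X, label_index):
    """Every label's classifier is trained on the original features only."""
    return [list(row) for row in X]


def classifier_chain_inputs(X, Y, label_index):
    """The classifier for label j sees the features plus labels 0..j-1."""
    return [list(x_row) + list(y_row[:label_index]) for x_row, y_row in zip(X, Y)]


# Toy data: 3 samples, 2 features, 3 labels (values are illustrative only).
X = [[0.1, 0.7], [0.4, 0.2], [0.9, 0.5]]
Y = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

for j in range(3):
    br_width = len(binary_relevance_inputs(X, j)[0])
    cc_width = len(classifier_chain_inputs(X, Y, j)[0])
    print(f"label {j}: BR input width = {br_width}, CC input width = {cc_width}")
```

The binary relevance inputs never change, so the per-label classifiers can be trained in any order. In the chain, the input grows by one column per label, which is how information about earlier labels reaches later classifiers.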

## Obtaining a Multi-Label Classifier Using the Binary Relevance Approach

For this assignment, the data is split into 80% for training and 20% for testing (`test_size=0.2`). This ratio leaves enough data for the model to train on while still holding out enough data to test the model and verify its performance.

In [2]:
# Read the data
data = pd.read_csv('yeast.csv')

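# NOTE: drop() returns a new DataFrame; the result is discarded here,
# so this call has no effect and no row is actually removed.
data.drop(0)
X = data.iloc[:, :103].values  # first 103 columns: input features
y = data.iloc[:, 103:].values  # remaining columns: the 14 label indicators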

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    random_state=42,
    shuffle=True
)

# Create a base classifier
base_classifier = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',             # ReLU activation function
    solver='adam',                 # Adam optimizer
    max_iter=300,                  # Maximum iterations
    random_state=42,               # For reproducibility
    early_stopping=True,           # Enable early stopping
    validation_fraction=0.1        # Use 10% of training data for validation
)

# Create the binary relevance classifier
binary_relevance = BinaryRelevance(
    classifier=base_classifier,
    require_dense=[True, True]     
)

# Train the model
binary_relevance.fit(X_train, y_train)

# Make predictions
y_pred = binary_relevance.predict(X_test)

# Convert sparse matrix predictions to dense array for evaluation
y_pred_dense = y_pred.toarray()
y_test_dense = y_test

# Calculate performance metrics
accuracy_binary = accuracy_score(y_test_dense, y_pred_dense)
hamming_binary = hamming_loss(y_test_dense, y_pred_dense)
jaccard_score_binary = jaccard_score(y_test_dense, y_pred_dense, average='samples')
f1_score_binary = f1_score(y_test_dense, y_pred_dense, average='samples')

print(f"Subset Accuracy: {accuracy_binary:.4f}")
print(f"Hamming Loss: {hamming_binary:.4f}")
print(f"Jaccard Score: {jaccard_score_binary:.4f}")
print(f"F1 Score: {f1_score_binary:.4f}")

Subset Accuracy: 0.1405
Hamming Loss: 0.1926
Jaccard Score: 0.5087
F1 Score: 0.6210


In [3]:
# Create a base classifier
base_classifier = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',             # ReLU activation function
    solver='adam',                 # Adam optimizer
    max_iter=300,                  # Maximum iterations
    random_state=42,              # For reproducibility
    early_stopping=True,          # Enable early stopping
    validation_fraction=0.1       # Use 10% of training data for validation
)

# Create the classifier chain
classifier_chain = ClassifierChain(
    classifier=base_classifier,
    require_dense=[True, True],    
)

# Train the model
classifier_chain.fit(X_train, y_train)

# Make predictions
y_pred = classifier_chain.predict(X_test)

# Convert sparse matrix predictions to dense array for evaluation
y_pred_dense = y_pred.toarray()
y_test_dense = y_test

# Calculate performance metrics
accuracy_chains = accuracy_score(y_test_dense, y_pred_dense)
hamming_chains = hamming_loss(y_test_dense, y_pred_dense)
jaccard_score_chains = jaccard_score(y_test_dense, y_pred_dense, average='samples')
f1_score_chains = f1_score(y_test_dense, y_pred_dense, average='samples')

print(f"Subset Accuracy: {accuracy_chains:.4f}")
print(f"Hamming Loss: {hamming_chains:.4f}")
print(f"Jaccard Score: {jaccard_score_chains:.4f}")
print(f"F1 Score: {f1_score_chains:.4f}")



Subset Accuracy: 0.2314
Hamming Loss: 0.2050
Jaccard Score: 0.5205
F1 Score: 0.6136


# Adapted algorithm

## Determining Hyperparameters to Optimize

The first hyperparameters chosen for optimization are the number of hidden layers and the number of neurons in each layer. The hidden layer sizes determine the model's capacity to fit the data, which is especially important for a multi-label dataset because the model needs to learn the relationships between the input features and multiple labels. Finding a suitable number of hidden layers and neurons is important because too few neurons can lead to underfitting, while too many neurons can lead to overfitting.

Another hyperparameter selected for optimization is the learning rate. The learning rate is important as it determines how large a step the model takes each time it updates its weights. If the learning rate is too high, the model may overshoot the optimal values and fail to converge. On the other hand, if the learning rate is too low, the model may take a long time to converge or may get stuck in a local minimum. Performing hyperparameter optimization on the learning rate can therefore help to find a value at which learning is balanced across all the labels.

Alpha is another hyperparameter selected for optimization. Alpha is the strength of the L2 regularization term, which is used to prevent overfitting. This is especially important for datasets with a large number of features, as they are more prone to overfitting, and for datasets with multiple labels, where it helps to prevent the model from overfitting to one label.

Lastly, the batch size is selected for optimization. The batch size is the number of samples used for each update of the model's weights. Small batch sizes can lead to better generalization but can be computationally expensive. Large batch sizes, on the other hand, can lead to faster training but poorer generalization. Finding a suitable batch size is important for a multi-label dataset because each batch needs to be large enough to contain a variety of label combinations while also being small enough to prevent overfitting.
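The cost side of the batch-size trade-off can be made concrete by counting weight updates per pass over the data. The sketch below is only an illustration: `n_train` is a hypothetical training-set size and `updates_per_epoch` is a helper introduced here, not part of scikit-learn.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)


n_train = 1000  # hypothetical number of training samples, for illustration only

# Numeric batch sizes from the search space used in this notebook.
# ('auto' is documented by scikit-learn as min(200, n_samples).)
for batch_size in [32, 64, 128, 256]:
    n_updates = updates_per_epoch(n_train, batch_size)
    print(f"batch_size={batch_size:>3}: {n_updates} weight updates per epoch")
```

Halving the batch size roughly doubles the number of updates per epoch, which is why small batches are slower per epoch even though each update is cheaper.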



## Selecting an HPO Technique Supported By the `scikit-learn` Toolkit

The chosen HPO technique is random search. Random search is chosen because it is more computationally efficient than grid search, especially for large datasets, since it samples a fixed number of points from the search space instead of evaluating every possible parameter combination. Additionally, random search allows continuous hyperparameters to be sampled from a distribution. This is useful when optimizing hyperparameters such as the learning rate and alpha, as the search is not restricted to a few pre-selected values. Lastly, random search is better able to explore the search space and find the important hyperparameters when compared to grid search.
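The point about continuous hyperparameters can be illustrated with a small standard-library sketch. It mimics log-uniform sampling over the same learning-rate range used later in this notebook (`1e-4` to `1e-1`); the helper `sample_loguniform` is hypothetical and stands in for `scipy.stats.loguniform`, it is not how `RandomizedSearchCV` is implemented.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Grid search can only try the values listed in advance.
grid_learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]

# Random search draws a fresh value from the whole interval on every trial.
random_learning_rates = [sample_loguniform(1e-4, 1e-1, rng) for _ in range(5)]

print("grid  :", grid_learning_rates)
print("random:", [f"{lr:.5f}" for lr in random_learning_rates])
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, which suits parameters such as the learning rate and alpha whose useful values span several decades.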

## Choosing a Suitable Value of 𝐾 for Cross-Validation

The value of 𝐾 chosen for 𝐾-fold cross-validation is 5. This value ensures that there is enough data for both training and validation in every fold. While a larger value of 𝐾 could be more beneficial, since each model would be trained on more data, it would also be more computationally expensive. Additionally, the dataset is large enough that each fold will have enough training samples and will contain sufficient examples of all 14 labels despite the smaller value of 𝐾.
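The cost and data trade-off behind the choice of 𝐾 can be sketched by computing fold sizes directly. This is a simplified illustration: `n_train` is a hypothetical training-set size, and `k_fold_indices` is a helper written for this example that makes contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds


n_train = 1000  # hypothetical training-set size, for illustration only
for k in (5, 10):
    folds = k_fold_indices(n_train, k)
    val_size = len(folds[0])
    print(f"K={k}: {k} model fits, {n_train - val_size} train / {val_size} validation per fold")
```

Moving from 5 to 10 folds gives each model somewhat more training data but doubles the number of fits per candidate, and that cost is multiplied by the number of sampled hyperparameter settings.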

## Obtaining a Neural Network Multi-Label Classifier Using HPO

In [4]:
# Define parameter distributions 
param_distributions = {
    'hidden_layer_sizes': [(50,), (100,), (50,25), (100,50), (100,50,25)],
    'learning_rate_init': loguniform(1e-4, 1e-1),
    'alpha': loguniform(1e-4, 1e-2),
    'batch_size': [32, 64, 128, 256, 'auto']
}

# Create base neural network 
base_nn = MLPClassifier(
    activation='relu',           # ReLU for hidden layers
    solver='adam',              
    max_iter=300,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42
)

# Configure RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=base_nn,
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,                      # 5-fold cross-validation
    scoring='f1_samples',      # F1 score for multi-label classification
    n_jobs=-1,
    random_state=42,
    verbose=0
)

# Perform hyperparameter optimization
random_search.fit(X_train, y_train)

# Print the best parameters and score
print("\nBest parameters found:")
print(random_search.best_params_)
print(f"\nBest cross-validation score: {random_search.best_score_:.4f}")

# Get the best model
best_model = random_search.best_estimator_

# Make predictions
y_pred = best_model.predict(X_test)

# Calculate performance metrics
accuracy_hpo = accuracy_score(y_test, y_pred)
hamming_hpo = hamming_loss(y_test, y_pred)
jaccard_score_hpo = jaccard_score(y_test, y_pred, average='samples')
f1_score_hpo = f1_score(y_test, y_pred, average='samples')

print(f"\nTest Set Performance:")
print(f"Subset Accuracy: {accuracy_hpo:.4f}")
print(f"Hamming Loss: {hamming_hpo:.4f}")
print(f"Jaccard Score: {jaccard_score_hpo:.4f}")
print(f"F1 Score: {f1_score_hpo:.4f}")

# Print information about the neural network's last layer
print("\nOutput layer activation function:")
print(best_model.out_activation_)




Best parameters found:
{'alpha': np.float64(0.006187670675880952), 'batch_size': 32, 'hidden_layer_sizes': (100, 50), 'learning_rate_init': np.float64(0.004895834359555106)}

Best cross-validation score: 0.6233

Test Set Performance:
Subset Accuracy: 0.1839
Hamming Loss: 0.1942
Jaccard Score: 0.5323
F1 Score: 0.6367

Output layer activation function:
logistic


The network's outputs are suitable for multi-label classification because `MLPClassifier` automatically adapts to multi-label targets by using a sigmoid function at the output layer. This can be seen in the output above, where the output layer activation function is printed as `logistic` (sigmoid). The sigmoid function is important as it allows each output neuron to produce its own value between 0 and 1, which is what multi-label classification requires, as each label can then be predicted independently of the others.
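Why a sigmoid output suits multi-label targets can be seen by comparing it with a softmax on the same scores. The sketch below uses hypothetical raw output-layer scores and assumes a 0.5 decision threshold; it illustrates the idea and does not reproduce the internals of `MLPClassifier`.

```python
import math


def sigmoid(z):
    """Logistic function: maps one score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Normalises all scores jointly so that they sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw output-layer scores for three labels.
scores = [2.0, 1.5, -3.0]

sigmoid_probs = [sigmoid(s) for s in scores]
softmax_probs = softmax(scores)

# Thresholding each sigmoid output lets several labels be switched on at once.
predicted_labels = [int(p >= 0.5) for p in sigmoid_probs]

print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("softmax:", [round(p, 3) for p in softmax_probs])
print("labels :", predicted_labels)
```

With the sigmoid, the first two labels both exceed the threshold. A softmax would force the outputs to share a total of 1, so the labels would compete with each other, which suits single-label problems rather than multi-label ones.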

# Performance Evaluation

## Choosing Suitable Evaluation Metrics

The performance metrics chosen are subset accuracy, Hamming loss, Jaccard score and F1 score. The Jaccard and F1 scores are averaged per sample (`average='samples'`).

Subset accuracy is chosen as it shows how many predictions matched all 14 output labels exactly. Because a prediction counts as wrong if even one label is wrong, it can yield low scores and hence is not a suitable evaluation metric on its own.

The next metric chosen is Hamming loss, which gives a better picture of how accurate the model is on individual labels. It measures the fraction of labels that are incorrectly predicted, so it shows how the model performs on a label-by-label basis. Lower values are better.

The third metric chosen is the Jaccard score. It measures the similarity between the predicted labels and the true labels as the size of their intersection divided by the size of their union. This means that it gives credit for both partial and complete matches between the predicted and the true functional classes.

Lastly, the F1 score is chosen as it is the harmonic mean of precision and recall. It reflects both how many of the predicted labels were correct and how many of the actual labels were predicted, and hence shows how well the model balances the two.
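The four metrics can be worked through by hand on a tiny example. The functions below are plain-Python sketches of the sample-averaged definitions, written for illustration rather than taken from scikit-learn; the toy label matrices are made up, and samples with no true or predicted labels are assumed to score 0.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def jaccard_samples(y_true, y_pred):
    """Mean over samples of |true AND pred| / |true OR pred|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(a and b for a, b in zip(t, p))
        union = sum(a or b for a, b in zip(t, p))
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores)


def f1_samples(y_true, y_pred):
    """Mean over samples of 2 * |true AND pred| / (|true| + |pred|)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(a and b for a, b in zip(t, p))
        denom = sum(t) + sum(p)
        scores.append(2 * inter / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 3 samples, 4 labels (illustrative only).
y_true = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")  # 0.3333
print(f"Hamming loss:    {hamming(y_true, y_pred):.4f}")          # 0.2500
print(f"Jaccard score:   {jaccard_samples(y_true, y_pred):.4f}")  # 0.6111
print(f"F1 score:        {f1_samples(y_true, y_pred):.4f}")       # 0.7222
```

Only the first sample is an exact match, so subset accuracy is 1/3, yet 9 of the 12 individual label slots are correct. The Jaccard and F1 scores fall in between because they give partial credit for the second and third samples.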

## Comparing Evaluation Metrics

When comparing the subset accuracy of the three models, the classifier chains model had the best score at 0.2314, followed by the adapted algorithm model at 0.1839 and lastly the binary relevance model at 0.1405. This is expected, as the classifier chains model is able to capture the relationships between the labels better since each classifier learns from the predictions of the previous classifiers. The binary relevance model's low score is also expected, since it treats the labels as independent of each other. It is also expected that the scores are low overall, as subset accuracy requires every label of a sample to be predicted correctly, which is difficult in a multi-label classification problem.

Looking at the Hamming loss, the binary relevance model had the best score at 0.1926, followed closely by the adapted algorithm at 0.1942 and lastly the classifier chains model at 0.2050. All three approaches had similar scores, indicating that they performed similarly in predicting the individual labels. These scores also highlight the importance of using multiple evaluation metrics: despite the low subset accuracy scores, all three models predicted around 80% of the individual labels correctly. It can also be observed that while the classifier chains model achieved the best subset accuracy, it had the worst Hamming loss. Conversely, the binary relevance model had the best Hamming loss but the worst subset accuracy. This indicates a possible trade-off between the two metrics.
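That subset accuracy and Hamming loss can rank two predictors in opposite orders is easy to reproduce on toy data. The example below is constructed purely to show the effect; the predictors `all_or_nothing` and `nearly_right` are hypothetical and say nothing about why the models in this notebook behaved as they did.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


y_true = [[1, 1, 0, 0], [0, 0, 1, 1]]

# Gets one sample exactly right and the other completely wrong.
all_or_nothing = [[1, 1, 0, 0], [1, 1, 0, 0]]

# Misses exactly one label on every sample.
nearly_right = [[1, 0, 0, 0], [0, 0, 1, 0]]

for name, y_pred in [("all_or_nothing", all_or_nothing), ("nearly_right", nearly_right)]:
    print(f"{name:>14}: subset accuracy = {subset_accuracy(y_true, y_pred):.2f}, "
          f"Hamming loss = {hamming(y_true, y_pred):.2f}")
```

The first predictor wins on subset accuracy (0.50 against 0.00) but loses on Hamming loss (0.50 against 0.25), which is the same pattern as the classifier chains and binary relevance results above.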

The adapted algorithm model had the best Jaccard score at 0.5323, followed by the classifier chains model at 0.5205 and lastly the binary relevance model at 0.5087. This suggests that all three models achieved a reasonable overlap between the predicted and true labels. The higher score of the adapted algorithm model suggests that hyperparameter optimization helped the model to predict the labels better.

Finally, considering the F1 score, the adapted algorithm model had the best score at 0.6367, followed by the binary relevance model at 0.6210 and lastly the classifier chains model at 0.6136. The adapted algorithm model having the best score is expected, as the F1 score was used as the scoring metric during hyperparameter optimization, and hence the model was able to maintain the best balance between precision and recall.

Overall, the adapted algorithm performed the best, as it achieved a good balance of scores and maintained consistency across all the evaluation metrics. This is expected, as the model was tuned using hyperparameter optimization with cross-validation and hence was able to find better hyperparameters for the model. However, all three models achieved reasonable scores across all the evaluation metrics, indicating that they all performed reasonably well.
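The comparison can be summarised by ranking the three models on each metric, using the test-set scores printed in the cell outputs of this notebook. The mean rank computed at the end is an assumption introduced here as one simple way to express "balance across metrics"; it is not a standard measure and weights all four metrics equally.

```python
# Test-set scores as reported in the cell outputs of this notebook.
results = {
    "Binary relevance": {"subset_accuracy": 0.1405, "hamming_loss": 0.1926,
                         "jaccard": 0.5087, "f1": 0.6210},
    "Classifier chains": {"subset_accuracy": 0.2314, "hamming_loss": 0.2050,
                          "jaccard": 0.5205, "f1": 0.6136},
    "Adapted algorithm": {"subset_accuracy": 0.1839, "hamming_loss": 0.1942,
                          "jaccard": 0.5323, "f1": 0.6367},
}
LOWER_IS_BETTER = {"hamming_loss"}


def rank_models(scores, metric):
    """Return model names ordered from best to worst for one metric."""
    higher_is_better = metric not in LOWER_IS_BETTER
    return sorted(scores, key=lambda name: scores[name][metric], reverse=higher_is_better)


metrics = ["subset_accuracy", "hamming_loss", "jaccard", "f1"]
mean_rank = {name: 0.0 for name in results}
for metric in metrics:
    ordering = rank_models(results, metric)
    print(f"{metric:>15}: " + " > ".join(ordering))
    for position, name in enumerate(ordering, start=1):
        mean_rank[name] += position / len(metrics)

print("mean rank:", mean_rank)
```

The adapted algorithm is never ranked last and has a mean rank of 1.5, while binary relevance and classifier chains each win one metric but finish last on two others, giving both a mean rank of 2.25.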

# Conclusion

One of the main challenges encountered was trying to achieve good scores across all the evaluation metrics. This was likely due to the dataset containing multiple labels. It was particularly challenging during hyperparameter optimization, as it was important to choose an evaluation metric that could guide the search without sacrificing the performance of the model on the other evaluation metrics. Ultimately, the F1 score was used as it balances precision and recall.

Each approach had its own strengths and weaknesses when it came to predicting the labels. The binary relevance model predicted the individual labels best but failed to capture the relationships between the labels, which led to the model having the lowest subset accuracy. The classifier chains model captured the relationships between the labels best and hence had the best subset accuracy, but it also had the worst Hamming loss, indicating a trade-off between global and local prediction accuracy. The adapted algorithm model achieved a good balance between the two, which resulted in it having the best Jaccard and F1 scores and a good balance across all scores. However, hyperparameter optimization can be computationally expensive and time-consuming, especially on large datasets.

Based on these results, one potential improvement could be to combine the binary relevance and classifier chains approaches. This could be done by using the binary relevance model to predict the individual labels and then using the classifier chains model to predict the relationships between the labels. Another possible improvement could be to use a different HPO technique such as Bayesian optimization. Bayesian optimization builds a model of the search space from the trials already evaluated and hence might be able to find better hyperparameters than random search.

# References

- https://scikit-learn.org/stable/

- http://scikit.ml

- J.D. Kelleher, B. Mac Namee and A. D’Arcy, “Fundamentals of Machine Learning for Predictive Data Analytics”, MIT Press.

- https://en.wikipedia.org/wiki/Multi-label_classification

- https://www.geeksforgeeks.org/comparing-randomized-search-and-grid-search-for-hyperparameter-estimation-in-scikit-learn/

- https://medium.com/biased-algorithms/evaluation-metrics-for-classification-models-b995f9980716

- https://www.kdnuggets.com/hyperparameter-tuning-gridsearchcv-and-randomizedsearchcv-explained   

- AI tools were used to aid in clarification of concepts, help with debugging (e.g. aiding in understanding what was causing an error) and further explain how the scikit-learn methods used for this assignment work (e.g. what parameters they take and what they do).