# Modeling Notebook

Welcome to the Modeling Notebook. This notebook focuses on building and evaluating machine learning models based on the insights gained from EDA.

## Objectives:
- **Model Selection**: Trying out different models to see which one performs the best.
- **Model Evaluation**: Assessing the performance of the model using appropriate metrics.
- **Model Tuning**: Optimizing the performance of the selected model.

## Dataset:
In this notebook, we will be working with the cleaned dataset located at `./data/features/movies_dataset.parquet`, which is the result of the feature engineering process in the preceding EDA Notebook.

I'll begin by setting the stage with a knowledge section to explain my approach and choices.

**Knowledge Section for Readers:**

I'm exploring various models, each with unique strengths in multilabel classification:

1. **RidgeClassifier**: Well suited to high-dimensional datasets; supports multilabel classification out of the box.
2. **RandomForestClassifier**: Handles complex datasets well and provides feature importance insights; supports multilabel classification out of the box.
3. **CatBoostClassifier**: Specializes in categorical data and complex data scenarios; supports multilabel classification out of the box.

I also apply problem-transformation strategies that wrap a base classifier:

1. **OneVsRestClassifier**: Trains a separate model for each label. In this notebook it is applied with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.
2. **LabelPowerset**: Considers each label combination as a unique class. In this notebook it is used with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.
3. **BinaryRelevance**: Treats each label as a separate binary problem. In this notebook it is used with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.

In the section on model selection, note that LinearSVC is included because it optimizes a (squared) hinge loss, a convex surrogate for the per-label 0-1 error that Hamming Loss averages (read below about this metric), which makes it a reasonable choice for this analysis.

Additionally, multilabel support can be added to any classifier with:

1. **MultiOutputClassifier**: Treats each label as an independent binary classification. In this notebook it is used with LogisticRegression, RidgeClassifier, RandomForestClassifier, XGBClassifier, and LinearSVC.
2. **ClassifierChain**: Combines binary classifiers into a single multilabel model, leveraging correlations between targets. In this notebook it is used with LogisticRegression, RidgeClassifier, RandomForestClassifier, XGBClassifier, and LinearSVC.

You can read about multiclass and multioutput algorithms [here](https://scikit-learn.org/stable/modules/multiclass.html).

In addition to the model selection, it's important to understand that evaluation metrics for multilabel classification, especially with an imbalanced dataset, differ from standard metrics. I'll be focusing on Hamming Loss, F1 Score (Micro), and F1 Score (Macro) for evaluation. Here's what each of them means and why they are important:

1. **Hamming Loss**: This measures the fraction of labels that are predicted incorrectly out of the total number of labels. It's especially useful in multilabel classification because it accounts for prediction errors across all labels. The lower the value, the better the model.

2. **F1 Score (Micro)**: This calculates the F1 Score globally by counting the total true positives, false negatives, and false positives. It's beneficial when dealing with imbalanced datasets, as it aggregates the contributions of all classes to compute the average.

3. **F1 Score (Macro)**: This computes the F1 Score for each label and takes their unweighted mean. Every label counts equally regardless of how frequent it is, which makes it useful for understanding performance on rare labels as well as common ones.
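To make the three definitions above concrete, here is a minimal hand-rolled sketch in plain Python. The toy labels and the helper names (`hamming_loss_manual`, `f1_micro_manual`, `f1_macro_manual`) are made up for illustration; in the notebook itself I rely on `hamming_loss` and `f1_score` from scikit-learn.

```python
# Toy data (made up for illustration): 5 samples, 3 binary labels.
# Label 0 is frequent, label 2 is rare and never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0]]


def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))


def counts_for_label(y_true, y_pred, j):
    """True positives, false positives and false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro_manual(y_true, y_pred):
    """Pool the counts of all labels first, then compute a single F1."""
    totals = [0, 0, 0]
    for j in range(len(y_true[0])):
        for i, count in enumerate(counts_for_label(y_true, y_pred, j)):
            totals[i] += count
    return f1_from_counts(*totals)


def f1_macro_manual(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    per_label = [f1_from_counts(*counts_for_label(y_true, y_pred, j))
                 for j in range(n_labels)]
    return sum(per_label) / n_labels


print(f"Hamming Loss: {hamming_loss_manual(y_true, y_pred):.3f}")
print(f"F1 (Micro):   {f1_micro_manual(y_true, y_pred):.3f}")
print(f"F1 (Macro):   {f1_macro_manual(y_true, y_pred):.3f}")
```

On this toy data the macro score comes out well below the micro score, because the rare label that is never predicted contributes an F1 of 0 with the same weight as the frequent label.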

It's important to note why the accuracy score is not the best choice for multilabel classification and imbalanced datasets. Accuracy can be misleading in these contexts, as it might show high values even when the model is performing poorly on minority classes. In imbalanced datasets, a model could skew towards the majority class, leading to high accuracy but poor model performance in terms of real-world applicability. Therefore, relying on Hamming Loss and F1 Scores provides a more realistic evaluation of the model's effectiveness in handling multilabel, imbalanced data.
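A small illustration of this point, using made-up numbers rather than the movie data: for a single rare label, a model that always predicts the majority class looks accurate while never finding a positive.

```python
# Hypothetical rare label: 1 positive among 10 samples.
y_true_rare = [0] * 9 + [1]
# A degenerate model that always predicts the majority class (0).
y_pred_rare = [0] * 10

# Plain accuracy on this label.
accuracy = sum(t == p for t, p in zip(y_true_rare, y_pred_rare)) / len(y_true_rare)

# F1 on the same predictions.
tp = sum(1 for t, p in zip(y_true_rare, y_pred_rare) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true_rare, y_pred_rare) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true_rare, y_pred_rare) if t == 1 and p == 0)
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0

print(f"accuracy: {accuracy:.2f}")  # high, despite missing every positive
print(f"F1:       {f1:.2f}")        # exposes that no positive was found
```

Note also that scikit-learn's `accuracy_score` on multilabel targets computes subset accuracy (a sample counts as correct only if every label matches), so the values printed in this notebook are not per-label accuracies.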

You can read about metrics for multilabel classification [here](https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics)

Since my dataset isn't evenly distributed across labels, I'll use [iterative_train_test_split](http://scikit.ml/api/skmultilearn.model_selection.iterative_stratification.html). This method keeps a similar mix of labels in the training and test sets, so the evaluation results are more reliable and fair. The idea behind this stratification method is to assign label combinations to folds based on how much each fold still needs a given combination. As more assignments are made, some folds fill up and the remaining positive evidence is directed to other folds. In the end, negative evidence is distributed according to each fold's remaining desired size.
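One simple way to sanity-check any multilabel split is to compare the per-label positive rate between the two parts. The sketch below is only a check, not the stratification algorithm itself; the helper names and the toy label matrices are hypothetical.

```python
def positive_rate_per_label(y):
    """Share of samples that carry each label (one value per label column)."""
    n_samples = len(y)
    n_labels = len(y[0])
    return [sum(row[j] for row in y) / n_samples for j in range(n_labels)]


def max_rate_gap(y_a, y_b):
    """Largest absolute difference in per-label positive rate between two splits."""
    rates_a = positive_rate_per_label(y_a)
    rates_b = positive_rate_per_label(y_b)
    return max(abs(a - b) for a, b in zip(rates_a, rates_b))


# Toy label matrices (made up): 8 training rows and 2 test rows, 3 labels.
y_train_toy = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0],
               [0, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0]]
y_test_toy = [[1, 0, 0], [0, 1, 0]]

print("train rates:", positive_rate_per_label(y_train_toy))
print("test rates: ", positive_rate_per_label(y_test_toy))
print("max gap:    ", max_rate_gap(y_train_toy, y_test_toy))
```

The same helpers should work on the NumPy label arrays returned by the split, where a small gap suggests that the stratification kept the label mix similar.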

In [121]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, hamming_loss
from skmultilearn.model_selection import iterative_train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier, Pool, cv
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.model_selection import GridSearchCV
from skmultilearn.model_selection import IterativeStratification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from xgboost import XGBClassifier

#### Prepare training, validation, test sets

I'll split my data into training, validation, and test sets. I'll use [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to turn any string-valued (categorical) features into numeric columns and [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to make sure all features are on a similar scale, which helps the models work better.
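The key pattern here is to fit the vectorizer and the scaler on the training set only and then reuse them on the validation and test sets. The sketch below mimics that pattern in plain Python; it is a simplified stand-in for DictVectorizer and StandardScaler, not their implementation, and the records and function names are made up.

```python
from statistics import mean, pstdev

# Made-up records: one string-valued feature and one numeric feature.
train_records = [{"genre": "drama", "budget": 10.0},
                 {"genre": "comedy", "budget": 30.0}]
val_records = [{"genre": "horror", "budget": 20.0}]  # "horror" is unseen in training


def column_name(key, value):
    """String values become one-hot columns named 'key=value'; numbers keep their key."""
    return f"{key}={value}" if isinstance(value, str) else key


def fit_columns(records):
    """Learn the output columns from the training records only."""
    return sorted({column_name(k, v) for r in records for k, v in r.items()})


def to_matrix(records, columns):
    """Encode records as rows; columns never seen during fitting are ignored."""
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for record in records:
        row = [0.0] * len(columns)
        for key, value in record.items():
            name = column_name(key, value)
            if name in index:
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows


def fit_scaler(rows):
    """Per-column mean and population standard deviation (1.0 if the column is constant)."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(rows, means, stds):
    """Standardize rows with statistics learned elsewhere (here: the training set)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


columns = fit_columns(train_records)
X_train_toy = to_matrix(train_records, columns)
X_val_toy = to_matrix(val_records, columns)

means, stds = fit_scaler(X_train_toy)            # statistics from training data only
X_train_scaled = scale(X_train_toy, means, stds)
X_val_scaled = scale(X_val_toy, means, stds)     # reuse training statistics

print(columns)
print(X_train_scaled)
print(X_val_scaled)
```

Fitting on the training set only avoids leaking information from the validation and test sets into preprocessing.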


In [173]:
dataset_df = pd.read_parquet("../data/cleaned/selected_features.parquet")
labels = dataset_df[["numerical_ROI_category", 'numerical_rating_category', 'numerical_award_category']]
dataset_df.drop(["numerical_ROI_category", 'numerical_rating_category', 'numerical_award_category'], axis=1, inplace=True)

X_np = dataset_df.to_numpy()
y_np = labels.to_numpy()

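# Hold out 20% of the data as the test set, then 25% of the remainder as the
# validation set, which gives a 60/20/20 train/validation/test split.
X_full_train, y_full_train, X_test, y_test = iterative_train_test_split(X_np, y_np, test_size = 0.2)
X_train, y_train, X_val, y_val = iterative_train_test_split(X_full_train, y_full_train, test_size = 0.25)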

X_train_df = pd.DataFrame(X_train, columns=dataset_df.columns)  
X_val_df = pd.DataFrame(X_val, columns=dataset_df.columns) 
X_test_df = pd.DataFrame(X_test, columns=dataset_df.columns) 

y_train_df = pd.DataFrame(y_train, columns=labels.columns)  
y_val_df = pd.DataFrame(y_val, columns=labels.columns) 
y_test_df = pd.DataFrame(y_test, columns=labels.columns) 

dv = DictVectorizer(sparse=False)
X_train_df_t = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val_df_t = dv.transform(X_val_df.to_dict(orient='records'))
X_test_df_t = dv.transform(X_test_df.to_dict(orient='records'))

scaler = StandardScaler()
X_train_df_t = scaler.fit_transform(X_train_df_t)
X_val_df_t = scaler.transform(X_val_df_t)
X_test_df_t = scaler.transform(X_test_df_t)


In [174]:
print(X_train_df.shape)
print(y_train.shape)
print(X_val_df.shape)
print(y_val.shape)
print(X_test_df.shape)
print(y_test.shape)

(4491, 20)
(4491, 3)
(1497, 20)
(1497, 3)
(1497, 20)
(1497, 3)


#### Models

#### RidgeClassifier

In [177]:
ridge_classifier = RidgeClassifier(random_state=42)
ridge_classifier.fit(X_train_df_t, y_train)

y_pred = ridge_classifier.predict(X_val_df_t)
print('Ridge Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, y_pred):.2f}')
print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')
print('-' * 40)

Ridge Classifier Metrics:
F1 Score (Micro): 0.34
F1 Score (Macro): 0.34
Hamming Loss: 0.23
accuracy_score: 0.56
----------------------------------------


#### RandomForestClassifier

In [176]:
random_forest_classifier = RandomForestClassifier(random_state=42)
random_forest_classifier.fit(X_train_df_t, y_train)
rf_y_pred = random_forest_classifier.predict(X_val_df_t)

print('Random Forest Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, rf_y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, rf_y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, rf_y_pred):.2f}')
print(f'accuracy_score: {accuracy_score(y_val, rf_y_pred):.2f}')

Random Forest Classifier Metrics:
F1 Score (Micro): 0.42
F1 Score (Macro): 0.41
Hamming Loss: 0.21
accuracy_score: 0.56


#### CatBoostClassifier

For the CatBoostClassifier, I'm using `loss_function='MultiLogloss'` and `eval_metric='HammingLoss'` because my outputs are multilabel.

MultiLogloss: This is a loss function suited to multilabel classification tasks. It calculates the logarithmic loss for each label and then averages these values. It works in situations where each instance can carry several labels at once, and it penalizes confident wrong predictions especially heavily.
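As a rough sketch of that idea, the function below averages the binary log loss over every sample and label. It is an unweighted simplification written for illustration (the name `multi_logloss` and the toy probabilities are my own), so treat it as my reading of the loss rather than CatBoost's implementation.

```python
import math


def multi_logloss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss over all samples and all labels (no sample weights)."""
    total = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / (len(y_true) * len(y_true[0]))


# One sample with two labels: the first is present, the second is absent.
y_true_toy = [[1, 0]]

unsure = multi_logloss(y_true_toy, [[0.5, 0.5]])             # no information
mildly_wrong = multi_logloss(y_true_toy, [[0.4, 0.4]])       # first label slightly off
confidently_wrong = multi_logloss(y_true_toy, [[0.01, 0.4]]) # first label badly off

print(f"unsure:            {unsure:.4f}")
print(f"mildly wrong:      {mildly_wrong:.4f}")
print(f"confidently wrong: {confidently_wrong:.4f}")
```

The confidently wrong prediction receives by far the largest loss, which is the behaviour described above.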

You can read about multilabel classification objectives and metrics [here](https://catboost.ai/en/docs/concepts/loss-functions-multilabel-classification).

In [175]:
train_pool = Pool(X_train_df_t, y_train)
val_pool = Pool(X_val_df_t, y_val)

catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss',
    iterations=500, random_state=42)
catboost_classifier.fit(train_pool, eval_set=val_pool, metric_period=10, plot=True, verbose=50)

val_predict = catboost_classifier.predict(X_val_df_t)
from catboost.utils import eval_metric
accuracy = eval_metric(y_val, val_predict, 'Accuracy')[0]
print(f'Accuracy: {accuracy}')

accuracy_per_class = eval_metric(y_val, val_predict, 'Accuracy:type=PerClass')
for cls, value in zip(catboost_classifier.classes_, accuracy_per_class):
    print(f'Accuracy for class {cls}: {value}')

hamming = eval_metric(y_val, val_predict, 'HammingLoss')[0]
print(f'HammingLoss: {hamming:.4f}')
mean_accuracy_per_class = sum(accuracy_per_class) / len(accuracy_per_class)
print(f'MeanAccuracyPerClass: {mean_accuracy_per_class:.4f}')
print(f'HammingLoss + MeanAccuracyPerClass = {hamming + mean_accuracy_per_class}')

for metric in ('Precision', 'Recall', 'F1'):
    print(metric)
    values = eval_metric(y_val, val_predict, metric)
    for cls, value in zip(catboost_classifier.classes_, values):
        print(f'class={cls}: {value:.4f}')
    print()



Learning rate set to 0.062085
0:	learn: 0.2303867	test: 0.2382543	best: 0.2382543 (0)	total: 52.2ms	remaining: 26s
50:	learn: 0.1879314	test: 0.2126475	best: 0.2126475 (50)	total: 449ms	remaining: 3.96s
100:	learn: 0.1649224	test: 0.2086395	best: 0.2084168 (90)	total: 789ms	remaining: 3.12s
150:	learn: 0.1482966	test: 0.2066355	best: 0.2052995 (140)	total: 1.11s	remaining: 2.58s
200:	learn: 0.1319676	test: 0.2081942	best: 0.2046315 (160)	total: 1.48s	remaining: 2.2s
250:	learn: 0.1158614	test: 0.2064128	best: 0.2046315 (160)	total: 1.79s	remaining: 1.78s
300:	learn: 0.1031693	test: 0.2048542	best: 0.2046315 (160)	total: 2.11s	remaining: 1.4s
350:	learn: 0.0923328	test: 0.2024048	best: 0.2024048 (350)	total: 2.44s	remaining: 1.03s
400:	learn: 0.0820159	test: 0.2030728	best: 0.2019595 (370)	total: 2.78s	remaining: 687ms
450:	learn: 0.0731834	test: 0.2019595	best: 0.2008461 (430)	total: 3.12s	remaining: 339ms
499:	learn: 0.0652416	test: 0.2039635	best: 0.2008461 (430)	total: 3.45s	remaini

- Class 0 (ROI): This label is the hardest to predict and scores lower than the others.
- Classes 1 and 2 (Rating and Awards): These labels are somewhat easier to predict than ROI, hence their higher scores.

#### OneVsRestClassifier, LabelPowerset, BinaryRelevance

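All of these classifiers transform the multilabel problem into one or more single-label problems: OneVsRestClassifier and BinaryRelevance into one binary problem per label, and LabelPowerset into a single multiclass problem.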

[One-vs-Rest Strategy](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#:~:text=One%2Dvs%2Dthe%2Drest,against%20all%20the%20other%20classes.)

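This involves training a separate classifier for each class, using data from that class as positive samples and all other data as negatives. The strategy is designed for multiclass problems with mutually exclusive classes, but when it is given a multilabel indicator matrix, as here, it fits one binary classifier per label and so behaves like Binary Relevance.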

[Label Powerset](http://scikit.ml/api/skmultilearn.problem_transform.lp.html)

This method considers correlations between class labels. It converts each combination of labels into a single class and then trains one multiclass classifier. The number of possible label combinations grows exponentially with the number of labels (up to 2^q for q labels).
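The label transformation at the heart of this method can be sketched in a few lines. This is an illustration with made-up data and a hypothetical helper name, not the scikit-multilearn implementation.

```python
def label_powerset_encode(y):
    """Map each distinct label combination to its own class id."""
    combo_to_class = {}
    classes = []
    for row in y:
        combo = tuple(row)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)  # next unused class id
        classes.append(combo_to_class[combo])
    return classes, combo_to_class


# Toy label matrix (made up): 4 samples, 3 binary labels.
y_toy = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1]]

classes, combo_to_class = label_powerset_encode(y_toy)
print(classes)          # one multiclass target per sample
print(combo_to_class)   # which label combination each class id stands for

# After a multiclass model predicts class ids, map them back to label vectors.
class_to_combo = {cls: combo for combo, cls in combo_to_class.items()}
decoded = [list(class_to_combo[c]) for c in classes]
print(decoded == y_toy)
```

A consequence of this encoding is that the model can only ever predict label combinations that appeared in the training data.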

[Binary Relevance](http://scikit.ml/api/skmultilearn.problem_transform.br.html)

This approach trains an ensemble of single-label binary classifiers independently on the original dataset for each class. For example, with q labels, it creates q new datasets, each focusing on one label, and trains a single-label classifier on each.
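Structurally, Binary Relevance is a loop over label columns. The sketch below shows that structure with a deliberately trivial base learner (`MajorityClassifier`, a made-up stand-in for something like LogisticRegression); all names and data are hypothetical.

```python
class MajorityClassifier:
    """Toy base learner: always predicts the most frequent training class."""

    def fit(self, X, y):
        self.majority_ = 1 if 2 * sum(y) > len(y) else 0
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def binary_relevance_fit(X, Y, make_base):
    """Train one independent binary model per label column of Y."""
    n_labels = len(Y[0])
    return [make_base().fit(X, [row[j] for row in Y]) for j in range(n_labels)]


def binary_relevance_predict(models, X):
    """Stack the per-label predictions back into one row per sample."""
    per_label = [model.predict(X) for model in models]
    return [list(sample) for sample in zip(*per_label)]


# Toy data (made up): 4 samples, 1 feature, 2 labels.
X_toy = [[0.1], [0.4], [0.8], [0.9]]
Y_toy = [[1, 0], [1, 0], [1, 1], [0, 0]]

models = binary_relevance_fit(X_toy, Y_toy, MajorityClassifier)
print(len(models))                                        # one model per label
print(binary_relevance_predict(models, [[0.2], [0.7]]))
```

Because each label is fitted in isolation, this approach cannot exploit correlations between labels, which is the gap that Label Powerset and classifier chains try to close.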

In [178]:
models = {
    "OneVsRestClassifier-LogisticRegression": OneVsRestClassifier(LogisticRegression(random_state=42)),
    "OneVsRestClassifier-RidgeClassifier": OneVsRestClassifier(RidgeClassifier(random_state=42)),
    "OneVsRestClassifier-LinearSVC": OneVsRestClassifier(LinearSVC(random_state=42, max_iter=30000)),
    "OneVsRestClassifier-RandomForestClassifier": OneVsRestClassifier(RandomForestClassifier(random_state=42)),
    "LabelPowerset-LogisticRegression": LabelPowerset(LogisticRegression(random_state=42)),
    "LabelPowerset-RidgeClassifier": LabelPowerset(RidgeClassifier(random_state=42)),
    "LabelPowerset-LinearSVC": LabelPowerset(LinearSVC(random_state=42, max_iter=30000)),
    "LabelPowerset-RandomForestClassifier": LabelPowerset(RandomForestClassifier(random_state=42)),
    "BinaryRelevance-LogisticRegression": BinaryRelevance(LogisticRegression(random_state=42)),
    "BinaryRelevance-RidgeClassifier": BinaryRelevance(RidgeClassifier(random_state=42)),
    "BinaryRelevance-LinearSVC": BinaryRelevance(LinearSVC(random_state=42, max_iter=30000)),
    "BinaryRelevance-RandomForestClassifier": BinaryRelevance(RandomForestClassifier(random_state=42)),
}

for name, model in models.items():
    model.fit(X_train_df_t, y_train)
    y_pred = model.predict(X_val_df_t)

    f1_micro = f1_score(y_val, y_pred, average='micro')
    f1_macro = f1_score(y_val, y_pred, average='macro')
    hamming = hamming_loss(y_val, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')


OneVsRestClassifier-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
OneVsRestClassifier-RidgeClassifier model metrics
F1 Score (Micro): 0.34
F1 Score (Macro): 0.34
Hamming Loss: 0.23
accuracy_score: 0.55




OneVsRestClassifier-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.35
Hamming Loss: 0.23
accuracy_score: 0.55
OneVsRestClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.43
F1 Score (Macro): 0.43
Hamming Loss: 0.21
accuracy_score: 0.56
LabelPowerset-LogisticRegression model metrics
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.23
accuracy_score: 0.56
LabelPowerset-RidgeClassifier model metrics
F1 Score (Micro): 0.26
F1 Score (Macro): 0.26
Hamming Loss: 0.23
accuracy_score: 0.56




LabelPowerset-LinearSVC model metrics
F1 Score (Micro): 0.29
F1 Score (Macro): 0.29
Hamming Loss: 0.23
accuracy_score: 0.56
LabelPowerset-RandomForestClassifier model metrics
F1 Score (Micro): 0.40
F1 Score (Macro): 0.40
Hamming Loss: 0.21
accuracy_score: 0.58
BinaryRelevance-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
BinaryRelevance-RidgeClassifier model metrics
F1 Score (Micro): 0.34
F1 Score (Macro): 0.34
Hamming Loss: 0.23
accuracy_score: 0.55




BinaryRelevance-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.35
Hamming Loss: 0.23
accuracy_score: 0.55
BinaryRelevance-RandomForestClassifier model metrics
F1 Score (Micro): 0.43
F1 Score (Macro): 0.43
Hamming Loss: 0.21
accuracy_score: 0.56


#### MultiOutputClassifier and ClassifierChain

[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)

This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

[ClassifierChain](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html)

A multi-label model that arranges binary classifiers into a chain.

Each model makes a prediction in the order specified by the chain using all of the available features provided to the model plus the predictions of models that are earlier in the chain.
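The chaining mechanism can be sketched as follows, assuming the default behaviour where each link is trained on the true values of the earlier labels and sees the predicted values at prediction time. The tiny 1-nearest-neighbour base learner and all names here are made up for illustration; this is not scikit-learn's implementation.

```python
class OneNearestNeighbour:
    """Toy base learner: predicts the label of the closest training row."""

    def fit(self, X, y):
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X_]
            predictions.append(self.y_[dists.index(min(dists))])
        return predictions


def chain_fit(X, Y, make_base):
    """Link j is trained on the features plus the true labels 0..j-1."""
    models = []
    for j in range(len(Y[0])):
        X_aug = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        models.append(make_base().fit(X_aug, [y[j] for y in Y]))
    return models


def chain_predict(models, X):
    """Link j sees the features plus the labels predicted by earlier links."""
    preds = [[] for _ in X]
    for model in models:
        X_aug = [list(x) + p for x, p in zip(X, preds)]
        for p, label in zip(preds, model.predict(X_aug)):
            p.append(label)
    return preds


# Toy data (made up): 4 samples, 1 feature, 2 labels in chain order.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
Y_toy = [[0, 0], [0, 0], [1, 1], [1, 0]]

chain = chain_fit(X_toy, Y_toy, OneNearestNeighbour)
print(chain_predict(chain, X_toy))    # reproduces the training labels
print(chain_predict(chain, [[2.2]]))  # second label uses the first prediction
```

The order of the labels in the chain matters, since an early mistake is passed on as an input feature to every later link.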

In [179]:
models = {
    "MultiOutputClassifier-LogisticRegression": MultiOutputClassifier(LogisticRegression(random_state=42)),
    "MultiOutputClassifier-RidgeClassifier": MultiOutputClassifier(RidgeClassifier(random_state=42)),
    "MultiOutputClassifier-LinearSVC": MultiOutputClassifier(LinearSVC(random_state=42, max_iter=30000)),
    "MultiOutputClassifier-RandomForestClassifier": MultiOutputClassifier(RandomForestClassifier(random_state=42)),
    "MultiOutputClassifier-XGBClassifier": MultiOutputClassifier(XGBClassifier(eval_metric='logloss', random_state=42)),
    "ClassifierChain-LogisticRegression": ClassifierChain(LogisticRegression(random_state=42)),
    "ClassifierChain-RidgeClassifier": ClassifierChain(RidgeClassifier(random_state=42)),
    "ClassifierChain-LinearSVC": ClassifierChain(LinearSVC(random_state=42, max_iter=30000)),
    "ClassifierChain-RandomForestClassifier": ClassifierChain(RandomForestClassifier(random_state=42)),
    "ClassifierChain-XGBClassifier": ClassifierChain(XGBClassifier(eval_metric='logloss', random_state=42)),
}

for name, model in models.items():
    model.fit(X_train_df_t, y_train)

    y_pred = model.predict(X_val_df_t)

    f1_micro = f1_score(y_val, y_pred, average='micro')
    f1_macro = f1_score(y_val, y_pred, average='macro')
    hamming = hamming_loss(y_val, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')


MultiOutputClassifier-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
MultiOutputClassifier-RidgeClassifier model metrics
F1 Score (Micro): 0.34
F1 Score (Macro): 0.34
Hamming Loss: 0.23
accuracy_score: 0.55




MultiOutputClassifier-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.35
Hamming Loss: 0.23
accuracy_score: 0.55
MultiOutputClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.43
F1 Score (Macro): 0.43
Hamming Loss: 0.21
accuracy_score: 0.56
MultiOutputClassifier-XGBClassifier model metrics
F1 Score (Micro): 0.49
F1 Score (Macro): 0.49
Hamming Loss: 0.21
accuracy_score: 0.53
ClassifierChain-LogisticRegression model metrics
F1 Score (Micro): 0.34
F1 Score (Macro): 0.33
Hamming Loss: 0.23
accuracy_score: 0.56
ClassifierChain-RidgeClassifier model metrics
F1 Score (Micro): 0.28
F1 Score (Macro): 0.28
Hamming Loss: 0.23
accuracy_score: 0.56




ClassifierChain-LinearSVC model metrics
F1 Score (Micro): 0.30
F1 Score (Macro): 0.30
Hamming Loss: 0.23
accuracy_score: 0.56
ClassifierChain-RandomForestClassifier model metrics
F1 Score (Micro): 0.40
F1 Score (Macro): 0.40
Hamming Loss: 0.21
accuracy_score: 0.57
ClassifierChain-XGBClassifier model metrics
F1 Score (Micro): 0.49
F1 Score (Macro): 0.49
Hamming Loss: 0.21
accuracy_score: 0.56


Looking at the metrics, it's evident that the RandomForestClassifier, along with its variants using different multilabel strategies (OneVsRestClassifier, BinaryRelevance, MultiOutputClassifier), consistently performs well and outperforms the linear models.

MultiOutputClassifier and ClassifierChain with XGBClassifier perform even better than RandomForestClassifier in terms of F1 score.

The CatBoostClassifier also shows promising results, so I'll proceed with cross-validation for these better-performing models to further evaluate their effectiveness.

#### Cross Validation

We can't use standard Stratified K-Fold Cross-Validation here because it is designed for binary and multiclass classification, where each instance is assigned to exactly one class. In multilabel classification, where each instance can belong to several classes at once, it does not directly apply: the label combinations make it hard to maintain the same proportion of each label in every fold.

For multilabel data, you can use alternative strategies such as Iterative Stratification. This is a more advanced form of stratification suitable for multilabel data. It attempts to distribute each class's instances evenly across the folds while also respecting the label combinations.

In [183]:
models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "OneVsRestClassifier-RandomForestClassifier": OneVsRestClassifier(RandomForestClassifier(random_state=42)),
    "LabelPowerset-RandomForestClassifier": LabelPowerset(RandomForestClassifier(random_state=42)),
    "BinaryRelevance-RandomForestClassifier": BinaryRelevance(RandomForestClassifier(random_state=42)),
    "MultiOutputClassifier-RandomForestClassifier": MultiOutputClassifier(RandomForestClassifier(random_state=42)),
    "MultiOutputClassifier-XGBClassifier": MultiOutputClassifier(XGBClassifier(eval_metric='logloss', random_state=42)),
    "ClassifierChain-RandomForestClassifier": ClassifierChain(RandomForestClassifier(random_state=42)),
    "ClassifierChain-XGBClassifier": ClassifierChain(XGBClassifier(eval_metric='logloss', random_state=42)),
}

for name, model in models.items():

    n_splits = 5
    iterative_stratification = IterativeStratification(n_splits=n_splits, order=1)

    f1_scores = []
    hamming_losses = []
    for train_index, test_index in iterative_stratification.split(X_train_df_t, y_train_df):
        X_train_cv, X_val_cv = X_train_df_t[train_index], X_train_df_t[test_index]
        y_train_cv, y_val_cv = y_train_df.iloc[train_index], y_train_df.iloc[test_index]
        
        model.fit(X_train_cv, y_train_cv)
        predictions = model.predict(X_val_cv)
        f1 = f1_score(y_val_cv, predictions, average='macro')
        f1_scores.append(f1)

        hamming = hamming_loss(y_val_cv, predictions)
        hamming_losses.append(hamming)
    print(f'{name} model metrics')
    print(f"Average F1-Score: {sum(f1_scores) / len(f1_scores)}")
    print(f"Average Hamming Loss: {sum(hamming_losses) / len(hamming_losses)}")

RandomForestClassifier model metrics
Average F1-Score: 0.4471038433175464
Average Hamming Loss: 0.20224971673471245
OneVsRestClassifier-RandomForestClassifier model metrics
Average F1-Score: 0.46194984202513234
Average Hamming Loss: 0.20366233534881503
LabelPowerset-RandomForestClassifier model metrics
Average F1-Score: 0.394234601352342
Average Hamming Loss: 0.21101509436307256
BinaryRelevance-RandomForestClassifier model metrics
Average F1-Score: 0.46807366878858314
Average Hamming Loss: 0.20069569966647288
MultiOutputClassifier-RandomForestClassifier model metrics
Average F1-Score: 0.47165408241706286
Average Hamming Loss: 0.20069773367055882
MultiOutputClassifier-XGBClassifier model metrics
Average F1-Score: 0.5138563266201747
Average Hamming Loss: 0.21182948983248512
ClassifierChain-RandomForestClassifier model metrics
Average F1-Score: 0.4308313106879632
Average Hamming Loss: 0.20649082088062637
ClassifierChain-XGBClassifier model metrics
Average F1-Score: 0.5121011179957432
Aver

Given these metrics, the ClassifierChain-XGBClassifier, MultiOutputClassifier-RandomForestClassifier and MultiOutputClassifier-XGBClassifier stand out. 

Let's look at cross-validation for CatBoostClassifier:

In [184]:

# CatBoostClassifier cross validation
params = {
    'loss_function': 'MultiLogloss', 
    'eval_metric': 'HammingLoss',
    'custom_metric': ['F1'],
    'iterations': 100,             
    'random_seed': 1,
    'verbose': False,
}


cv_results = cv(
    params=params,
    pool=train_pool,
    fold_count=5,     
    partition_random_seed=1,  
    shuffle=True,            
    stratified=False,         
    plot=True                
)

print(cv_results)


Training on fold [0/5]

bestTest = 0.2195031516
bestIteration = 99

Training on fold [1/5]

bestTest = 0.1974758723
bestIteration = 98

Training on fold [2/5]

bestTest = 0.2082405345
bestIteration = 95

Training on fold [3/5]

bestTest = 0.206013363
bestIteration = 98

Training on fold [4/5]

bestTest = 0.2145508537
bestIteration = 99

    iterations  test-HammingLoss-mean  test-HammingLoss-std  \
0            0               0.234986              0.010263   
1            1               0.230755              0.009198   
2            2               0.230681              0.007844   
3            3               0.229198              0.006200   
4            4               0.228678              0.006131   
..         ...                    ...                   ...   
95          95               0.210270              0.009092   
96          96               0.210344              0.008875   
97          97               0.209602              0.008531   
98          98               0.

#### Hyperparameter tuning

#### For MultiOutputClassifier with RandomForestClassifier:

In [151]:
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__max_depth': [10, 20, 30],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4],
}

classifier =  MultiOutputClassifier(RandomForestClassifier(random_state=42), n_jobs=-1)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 81 candidates, totalling 405 fits
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=100; total time=   2.6s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=100; total time=   2.7s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=100; total time=   2.7s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=100; total time=   2.7s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=100; total time=   2.7s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, estimator__min_samples_split=2, estimator__n_estimators=200; total time=   6.0s
[CV] END estimator__max_depth=10, estimator__min_samples_leaf=1, est

#### For MultiOutputClassifier with XGBClassifier:

In [192]:

param_grid = {
    'estimator__n_estimators': [100, 200, 300, 400],
    'estimator__learning_rate': [0.01, 0.1],
    'estimator__max_depth': [3, 5, 10],
    'estimator__subsample': [0.5, 0.7, 1]
}

classifier =  MultiOutputClassifier(XGBClassifier(eval_metric='logloss', random_state=42), n_jobs=-1)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.7; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.5; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.5; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.5; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.7; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsample=0.5; total time=   0.4s
[CV] END estimator__learning_rate=0.01, estimator__max_depth=3, estimator__n_estimators=100, estimator__subsam

#### For ClassifierChain with XGBClassifier:

In [160]:

param_grid = {
    'base_estimator__n_estimators': [200, 300, 400],
    'base_estimator__learning_rate': [0.01, 0.1],
    'base_estimator__max_depth': [3, 5, 10],
    'base_estimator__subsample': [0.5, 0.7, 1]
}

classifier = ClassifierChain(
    base_estimator=XGBClassifier(eval_metric='logloss', random_state=42)
)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.5; total time=   0.7s
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.5; total time=   0.9s
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.5; total time=   0.9s
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.5; total time=   1.0s
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.5; total time=   1.0s
[CV] END base_estimator__learning_rate=0.01, base_estimator__max_depth=3, base_estimator__n_estimators=200, base_estimator__subsample=0.7; total tim

#### For CatBoostClassifier

In [185]:
param_grid = {
    'depth': [6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [200, 300, 400],
}
train_pool = Pool(X_train_df_t, y_train)

catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss', random_state=42)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=catboost_classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 27 candidates, totalling 135 fits
0:	learn: 0.2327394	total: 73.6ms	remaining: 14.6s
1:	learn: 0.2287491	total: 94.4ms	remaining: 9.34s
0:	learn: 0.2313474	total: 115ms	remaining: 22.8s
2:	learn: 0.2265219	total: 115ms	remaining: 7.56s
0:	learn: 0.2296771	total: 115ms	remaining: 22.8s
0:	learn: 0.2317186	total: 115ms	remaining: 22.9s
0:	learn: 0.2302558	total: 119ms	remaining: 23.6s
0:	learn: 0.2317186	total: 116ms	remaining: 23s
0:	learn: 0.2296771	total: 120ms	remaining: 23.8s
3:	learn: 0.2241091	total: 137ms	remaining: 6.74s
1:	learn: 0.2274499	total: 133ms	remaining: 13.2s
0:	learn: 0.2327394	total: 129ms	remaining: 25.7s
1:	learn: 0.2275427	total: 141ms	remaining: 13.9s
1:	learn: 0.2270115	total: 144ms	remaining: 14.2s
1:	learn: 0.2268931	total: 148ms	remaining: 14.6s
4:	learn: 0.2280995	total: 149ms	remaining: 5.81s
1:	learn: 0.2273571	total: 151ms	remaining: 14.9s
2:	learn: 0.2239235	total: 156ms	remaining: 10.3s
2:	learn: 0.2236745	total: 164ms	remai

### Metrics on tuned models

In [193]:
models = {
    "MultiOutputClassifier-RandomForestClassifier": MultiOutputClassifier(
        RandomForestClassifier(random_state=42, max_depth=30, min_samples_leaf=1,
                               min_samples_split=5, n_estimators=300)
    ),
    "MultiOutputClassifier-XGBClassifier": MultiOutputClassifier(
        XGBClassifier(random_state=42, eval_metric='logloss', learning_rate=0.1,
                      max_depth=10, n_estimators=300, subsample=0.5)
    ),
    "ClassifierChain-XGBClassifier": ClassifierChain(
        XGBClassifier(random_state=42, eval_metric='logloss', max_depth=3,
                      learning_rate=0.1, n_estimators=200, subsample=0.5)
    ),
}


print("On validation set")
for name, model in models.items():
    model.fit(X_train_df_t, y_train_df)

    y_pred = model.predict(X_val_df_t)

    f1_micro = f1_score(y_val, y_pred, average='micro')
    f1_macro = f1_score(y_val, y_pred, average='macro')
    hamming = hamming_loss(y_val, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')

print("On test set")
for name, model in models.items():
    y_pred = model.predict(X_test_df_t)

    f1_micro = f1_score(y_test, y_pred, average='micro')
    f1_macro = f1_score(y_test, y_pred, average='macro')
    hamming = hamming_loss(y_test, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_test, y_pred):.2f}')


On validation set
MultiOutputClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.45
F1 Score (Macro): 0.44
Hamming Loss: 0.20
accuracy_score: 0.57
MultiOutputClassifier-XGBClassifier model metrics
F1 Score (Micro): 0.49
F1 Score (Macro): 0.49
Hamming Loss: 0.21
accuracy_score: 0.55
ClassifierChain-XGBClassifier model metrics
F1 Score (Micro): 0.48
F1 Score (Macro): 0.48
Hamming Loss: 0.21
accuracy_score: 0.56
On test set
MultiOutputClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.42
F1 Score (Macro): 0.42
Hamming Loss: 0.22
accuracy_score: 0.54
MultiOutputClassifier-XGBClassifier model metrics
F1 Score (Micro): 0.46
F1 Score (Macro): 0.45
Hamming Loss: 0.23
accuracy_score: 0.51
ClassifierChain-XGBClassifier model metrics
F1 Score (Micro): 0.43
F1 Score (Macro): 0.43
Hamming Loss: 0.22
accuracy_score: 0.54
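
The micro and macro F1 scores reported above are nearly identical, but the two averages can diverge when some labels are much easier than others. The sketch below is a plain-Python illustration on made-up toy data (not the movies dataset) and does not call scikit-learn's `f1_score`; the helper names are hypothetical. Micro averaging pools true positives, false positives, and false negatives over all labels before computing F1, while macro averaging computes F1 per label and takes the unweighted mean.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, j):
    """Count TP, FP, FN for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_micro_macro(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    # Macro: average the per-label F1 scores, each label weighted equally.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Toy data (assumption): 4 samples, 3 labels; the second label is never predicted.
y_true_toy = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred_toy = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]]

micro, macro = f1_micro_macro(y_true_toy, y_pred_toy)
print(f'micro={micro:.4f} macro={macro:.4f}')
```

In this toy case the label that is never predicted pulls the macro score well below the micro score. Scores as close as the ones in this notebook suggest, without proving it, that no single label dominates the results.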


In [190]:
train_pool = Pool(X_train_df_t, y_train)
val_pool = Pool(X_val_df_t, y_val)


catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss',
    iterations=400, depth=6, learning_rate=0.1, random_state=42)
catboost_classifier.fit(train_pool, eval_set=val_pool, metric_period=10, plot=True, verbose=50)

def predict(catboost_classifier, x_set, y_labels):
    """Print accuracy, Hamming loss, and per-class precision/recall/F1 for one data split."""
    from catboost.utils import eval_metric

    # Named y_pred so the local variable does not shadow this function's own name.
    y_pred = catboost_classifier.predict(x_set)

    accuracy = eval_metric(y_labels, y_pred, 'Accuracy')[0]
    print(f'Accuracy: {accuracy}')

    # Per-class accuracy is computed on the predictions for x_set.
    accuracy_per_class = eval_metric(y_labels, y_pred, 'Accuracy:type=PerClass')
    for cls, value in zip(catboost_classifier.classes_, accuracy_per_class):
        print(f'Accuracy for class {cls}: {value}')

    hamming = eval_metric(y_labels, y_pred, 'HammingLoss')[0]
    print(f'HammingLoss: {hamming:.4f}')
    mean_accuracy_per_class = sum(accuracy_per_class) / len(accuracy_per_class)
    print(f'MeanAccuracyPerClass: {mean_accuracy_per_class:.4f}')
    print(f'HammingLoss + MeanAccuracyPerClass = {hamming + mean_accuracy_per_class}')

    f1_scores = []
    for metric in ('Precision', 'Recall', 'F1'):
        print(metric)
        values = eval_metric(y_labels, y_pred, metric)
        for cls, value in zip(catboost_classifier.classes_, values):
            print(f'class={cls}: {value:.4f}')
            if metric == 'F1':
                f1_scores.append(value)
        print()

    # Unweighted mean of the per-class F1 scores (macro average).
    average_f1_score = sum(f1_scores) / len(f1_scores)
    print(f'Average F1 Score: {average_f1_score:.4f}')

print("\n-----------------------")
print("On validation set")
predict(catboost_classifier, X_val_df_t, y_val)
print("\n-----------------------")
print("On test set")
predict(catboost_classifier,X_test_df_t, y_test)


MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 0.2303867	test: 0.2382543	best: 0.2382543 (0)	total: 25.5ms	remaining: 10.2s
50:	learn: 0.1714540	test: 0.2110888	best: 0.2110888 (50)	total: 469ms	remaining: 3.21s
100:	learn: 0.1418392	test: 0.2052995	best: 0.2052995 (100)	total: 815ms	remaining: 2.41s
150:	learn: 0.1176427	test: 0.2032955	best: 0.2032955 (150)	total: 1.15s	remaining: 1.89s
200:	learn: 0.0990871	test: 0.2064128	best: 0.2026275 (160)	total: 1.47s	remaining: 1.46s
250:	learn: 0.0804572	test: 0.2015141	best: 0.2015141 (250)	total: 1.79s	remaining: 1.06s
300:	learn: 0.0679136	test: 0.2052995	best: 0.2015141 (250)	total: 2.11s	remaining: 696ms
350:	learn: 0.0555184	test: 0.2066355	best: 0.2015141 (250)	total: 2.44s	remaining: 340ms
399:	learn: 0.0461664	test: 0.2068582	best: 0.2015141 (250)	total: 2.74s	remaining: 0us

bestTest = 0.2015141394
bestIteration = 250

Shrink model to first 251 iterations.

-----------------------
On validation set
Accuracy: 0.569806279225117
Accuracy for class 0: 0.7802271209084837
A

### Conclusion:

* Accuracy & Balance: On the validation set, the CatBoostClassifier reaches a slightly higher overall accuracy (about 0.57) than the MultiOutputClassifier-XGBClassifier (0.55). The F1 scores, which balance precision and recall, are comparable, with the XGBClassifier slightly ahead.

* Hamming Loss: Both models perform similarly, but the CatBoostClassifier has a marginally lower Hamming Loss (about 0.20 vs. 0.21 on the validation set), meaning a slightly smaller fraction of individual labels is predicted incorrectly.

* Performance Across Classes: The CatBoostClassifier demonstrates more balanced performance across different classes in terms of accuracy, which might be beneficial.

Since the two models perform very similarly, I choose the one with the lower Hamming Loss, CatBoostClassifier, as my final model.
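
To make the comparison above concrete, here is a minimal plain-Python sketch of the two quantities involved, on made-up toy data rather than the movies dataset. The function names are hypothetical and do not replace scikit-learn's `hamming_loss` or `accuracy_score`. Hamming Loss counts wrong individual labels, whereas `accuracy_score` on multilabel targets is subset accuracy: a sample counts as correct only if every one of its labels matches. The sketch also shows why the `HammingLoss + MeanAccuracyPerClass` line printed by the CatBoost evaluation should come out as 1: the mean per-label accuracy is the fraction of correctly predicted cells, which is one minus the Hamming Loss.

```python
def hamming_loss_plain(y_true, y_pred):
    """Fraction of individual label cells that are predicted incorrectly."""
    n_cells = len(y_true) * len(y_true[0])
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / n_cells


def subset_accuracy_plain(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


def mean_per_label_accuracy(y_true, y_pred):
    """Accuracy computed separately for each label column, then averaged."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    per_label = [
        sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n_samples
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data (assumption): 4 samples, 3 labels, two single-label mistakes.
y_true_toy = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred_toy = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]]

hl = hamming_loss_plain(y_true_toy, y_pred_toy)
acc = subset_accuracy_plain(y_true_toy, y_pred_toy)
mla = mean_per_label_accuracy(y_true_toy, y_pred_toy)
print(f'hamming={hl:.4f} subset_accuracy={acc:.4f} mean_label_accuracy={mla:.4f}')
```

Two wrong cells out of twelve give a low Hamming Loss, yet half of the samples fail the exact-match test. This is why a model can look good on Hamming Loss while its accuracy score stays modest, as in the results above.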

In [195]:
#import pickle

#with open('../models_binary/catboost_classifier_model.pkl', 'wb') as f_model:
#    pickle.dump(catboost_classifier, f_model)

