# Modeling Notebook

Welcome to the Modeling Notebook. This notebook focuses on building and evaluating machine learning models based on the insights gained from EDA.

## Objectives:
- **Model Selection**: Trying out different models to see which one performs the best.
- **Model Tuning**: Optimizing the performance of the selected model.
- **Model Evaluation**: Assessing the performance of the model using appropriate metrics.

## Dataset:
In this notebook, we will be working with the cleaned dataset located at `./data/features/movies_dataset.parquet`, which is the result of the feature engineering process in the preceding EDA Notebook.

I'll begin by setting the stage with a knowledge section to explain my approach and choices.

**Knowledge Section for Readers:**

I'm exploring various models, each with unique strengths in multilabel classification:

1. **RidgeClassifier**: Well suited to high-dimensional datasets; supports multilabel classification out of the box.
2. **RandomForestClassifier**: Handles complex datasets well and provides feature importance insights; supports multilabel classification out of the box.
3. **CatBoostClassifier**: Specializes in categorical data and complex data scenarios; supports multilabel classification out of the box (via the `MultiLogloss` loss function).

I'm also using problem-transformation wrappers around single-label classifiers:

1. **OneVsRestClassifier**: Trains a separate model for each label. In this Notebook applied with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.
2. **LabelPowerset**: Considers each label combination as a unique class. In this Notebook used with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.
3. **BinaryRelevance**: Treats each label as a separate binary problem. In this Notebook used with LogisticRegression, RidgeClassifier, RandomForestClassifier, and LinearSVC.

In the section on model selection, LinearSVC is included as an additional linear baseline. Note that LinearSVC is trained with a (squared) hinge loss; Hamming Loss (read below about this metric) is used here only as an evaluation metric, not as its training objective.

Additionally, multilabel support can be added to any classifier with:

1. **MultiOutputClassifier**: Treats each label as an independent binary classification. In this Notebook utilized with LogisticRegression, RidgeClassifier, RandomForestClassifier, LinearSVC, and XGBClassifier.
2. **ClassifierChain**: Combines binary classifiers into a single multi-label model, leveraging target correlations. In this Notebook used with LogisticRegression, RidgeClassifier, RandomForestClassifier, LinearSVC, and XGBClassifier.

You can read about Multiclass and multioutput algorithms [here](https://scikit-learn.org/stable/modules/multiclass.html)

In addition to the model selection, it's important to understand that evaluation metrics for multilabel classification, especially with an imbalanced dataset, differ from standard metrics. I'll be focusing on Hamming Loss, F1 Score (Micro), and F1 Score (Macro) for evaluation. Here's what each of them means and why they are important:

1. **Hamming Loss**: The fraction of individual label predictions that are wrong, out of all label predictions (samples × labels). It's especially useful in multilabel classification because it counts prediction errors across all labels. The lower the value, the better the model.

2. **F1 Score (Micro)**: Computes the F1 Score globally by counting the total true positives, false negatives, and false positives across all labels. It aggregates the contributions of all labels, which makes it a useful overall summary on imbalanced datasets, although frequent labels influence it more than rare ones.

3. **F1 Score (Macro)**: Computes the F1 Score for each label and takes their unweighted mean. Every label carries equal weight regardless of how frequent it is, so poor performance on rare labels shows up clearly.
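To make the three definitions concrete, here is a minimal sketch that computes them by hand with NumPy on a made-up 4×3 label matrix. The data and the helper names (`hamming`, `f1_from_counts`, `f1_micro`, `f1_macro`) are my own illustration, not part of the notebook's pipeline; in the notebook itself the scikit-learn functions `hamming_loss` and `f1_score` are used.

```python
import numpy as np

# Toy multilabel targets: 4 samples, 3 labels (illustrative values only).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])


def hamming(y_true, y_pred):
    """Fraction of individual label predictions that are wrong."""
    return float(np.mean(y_true != y_pred))


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def f1_micro(y_true, y_pred):
    """Pool TP/FP/FN over all labels, then compute a single F1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return f1_from_counts(tp, fp, fn)


def f1_macro(y_true, y_pred):
    """Compute F1 per label column, then take the unweighted mean."""
    scores = []
    for j in range(y_true.shape[1]):
        tp = np.sum((y_true[:, j] == 1) & (y_pred[:, j] == 1))
        fp = np.sum((y_true[:, j] == 0) & (y_pred[:, j] == 1))
        fn = np.sum((y_true[:, j] == 1) & (y_pred[:, j] == 0))
        scores.append(f1_from_counts(tp, fp, fn))
    return float(np.mean(scores))


hl = hamming(y_true, y_pred)      # 3 wrong entries out of 12
micro = f1_micro(y_true, y_pred)  # pooled counts: TP=4, FP=1, FN=2
macro = f1_macro(y_true, y_pred)  # mean of the per-label F1 scores
print(hl, micro, macro)
```

Micro and macro F1 differ here because the per-label scores are unequal; when they come out nearly identical, as in several results below, the labels are being predicted about equally well.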

It's important to note why accuracy is not the best choice for multilabel classification on imbalanced datasets. A model can skew towards the majority class and still score high accuracy while performing poorly on minority classes, which says little about its real-world usefulness. In addition, for multilabel targets scikit-learn's `accuracy_score` computes subset accuracy, where a sample counts as correct only if all of its labels are predicted correctly, so it hides which labels the model gets wrong. Hamming Loss and F1 Scores therefore give a more realistic picture of how well the model handles multilabel, imbalanced data.
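The point about accuracy can be illustrated with a small NumPy sketch. The rare label below (5 positives in 100 samples) and the always-negative "model" are invented for illustration only; the second half shows subset (exact-match) accuracy next to per-label accuracy on a made-up 4×3 label matrix.

```python
import numpy as np

# A rare label: 5 positives out of 100 samples (made-up data).
rare_true = np.zeros(100, dtype=int)
rare_true[:5] = 1
rare_pred = np.zeros(100, dtype=int)  # a "model" that always predicts the majority class

rare_accuracy = float(np.mean(rare_true == rare_pred))
tp = int(np.sum((rare_true == 1) & (rare_pred == 1)))
fp = int(np.sum((rare_true == 0) & (rare_pred == 1)))
fn = int(np.sum((rare_true == 1) & (rare_pred == 0)))
rare_f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

# Multilabel case: subset accuracy vs. per-label accuracy.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# A row counts as correct only if every one of its labels matches.
subset_accuracy = float(np.mean(np.all(y_true == y_pred, axis=1)))
# Accuracy of each label column on its own.
per_label_accuracy = np.mean(y_true == y_pred, axis=0)
hamming = float(np.mean(y_true != y_pred))

print(rare_accuracy, rare_f1)
print(subset_accuracy, per_label_accuracy, hamming)
```

The always-negative predictor reaches high accuracy on the rare label while its F1 is zero, and the mean per-label accuracy is exactly one minus the Hamming Loss, which is why the two always sum to 1 in the CatBoost cell further down.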

You can read about metrics for multilabel classification [here](https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics)

Since my dataset isn't evenly distributed across different labels, I'll use [iterative_train_test_split](http://scikit.ml/api/skmultilearn.model_selection.iterative_stratification.html). This method keeps the proportion of each label similar in the training and test sets, so the evaluation is not distorted by a split that under-represents rare labels. The idea behind this stratification method is to assign label combinations to folds based on how much a given fold still needs that combination. As more and more assignments are made, some folds fill up and the remaining positive evidence is directed into other folds. In the end, the negative evidence is distributed according to how many samples each fold still needs.
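One way to sanity-check any split is to compare the positive rate of each label across the parts. The sketch below does this on synthetic labels; `label_positive_rates` is a hypothetical helper of mine, and the plain random partition is only a stand-in for the split that `iterative_train_test_split` would produce.

```python
import numpy as np


def label_positive_rates(y):
    """Fraction of positive samples per label column (hypothetical helper)."""
    return np.asarray(y, dtype=float).mean(axis=0)


rng = np.random.default_rng(0)

# Synthetic, imbalanced label matrix: 1000 samples, 3 labels with
# roughly 50%, 20% and 5% positives.
y_all = (rng.random((1000, 3)) < np.array([0.5, 0.2, 0.05])).astype(int)

# Stand-in split: a plain random 80/20 partition of the row indices.
perm = rng.permutation(len(y_all))
test_idx, train_idx = perm[:200], perm[200:]

train_rates = label_positive_rates(y_all[train_idx])
test_rates = label_positive_rates(y_all[test_idx])

# Largest per-label difference between the two parts; a well-stratified
# split should keep this small, especially for the rare label.
max_gap = float(np.max(np.abs(train_rates - test_rates)))
print(train_rates, test_rates, max_gap)
```

Applied to `y_train`, `y_val` and `y_test` from the cell below, the same helper would show how closely the iterative split preserved each label's frequency.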

In [121]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, hamming_loss
from skmultilearn.model_selection import iterative_train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier, Pool, cv
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.model_selection import GridSearchCV
from skmultilearn.model_selection import IterativeStratification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from xgboost import XGBClassifier

#### Prepare training, validation, test sets

I'll split my data into training, validation, and test sets (60/20/20: 20% is held out for testing, then 25% of the remaining 80% becomes the validation set). I'll use [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to one-hot encode string-valued features while passing numeric features through unchanged, and [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to put all features on a similar scale, which helps scale-sensitive models such as the linear ones.
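As a small illustration of what these two transformers do, here is a sketch on made-up records. The feature names `genre` and `budget` are invented for the example and are not the notebook's actual columns.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up records; string values are one-hot encoded, numbers pass through.
train_records = [
    {"genre": "drama", "budget": 10.0},
    {"genre": "comedy", "budget": 30.0},
]
# "horror" never appears in the training records.
val_records = [{"genre": "horror", "budget": 20.0}]

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_records)  # learns the feature columns
X_val = dv.transform(val_records)          # unseen categories are ignored

# Fit the scaler on the training data only, then reuse its statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(dv.feature_names_)
print(X_train)
print(X_val)
print(X_val_scaled)
```

Fitting both transformers on the training split only, as the cell below does, keeps information from the validation and test sets out of the preprocessing step.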


In [135]:
dataset_df = pd.read_parquet("../data/cleaned/selected_features.parquet")
label_cols = ["numerical_ROI_category", "numerical_rating_category", "numerical_award_category"]
labels = dataset_df[label_cols]
dataset_df = dataset_df.drop(columns=label_cols)

X_np = dataset_df.to_numpy()
y_np = labels.to_numpy()

X_full_train, y_full_train, X_test, y_test = iterative_train_test_split(X_np, y_np, test_size = 0.2)
X_train, y_train, X_val, y_val = iterative_train_test_split(X_full_train, y_full_train, test_size = 0.25)

X_train_df = pd.DataFrame(X_train, columns=dataset_df.columns)  
X_val_df = pd.DataFrame(X_val, columns=dataset_df.columns) 
X_test_df = pd.DataFrame(X_test, columns=dataset_df.columns) 

y_train_df = pd.DataFrame(y_train, columns=labels.columns)  
y_val_df = pd.DataFrame(y_val, columns=labels.columns) 
y_test_df = pd.DataFrame(y_test, columns=labels.columns) 

dv = DictVectorizer(sparse=False)
X_train_df_t = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val_df_t = dv.transform(X_val_df.to_dict(orient='records'))
X_test_df_t = dv.transform(X_test_df.to_dict(orient='records'))

scaler = StandardScaler()
X_train_df_t = scaler.fit_transform(X_train_df_t)
X_val_df_t = scaler.transform(X_val_df_t)
X_test_df_t = scaler.transform(X_test_df_t)


In [129]:
print(X_train_df.shape)
print(y_train.shape)
print(X_val_df.shape)
print(y_val.shape)
print(X_test_df.shape)
print(y_test.shape)

(4491, 20)
(4491, 3)
(1497, 20)
(1497, 3)
(1497, 20)
(1497, 3)


#### Models

#### RidgeClassifier

In [138]:
ridge_classifier = RidgeClassifier()
ridge_classifier.fit(X_train_df_t, y_train)

y_pred = ridge_classifier.predict(X_val_df_t)
print('Ridge Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, y_pred):.2f}')

print('-' * 40)

Ridge Classifier Metrics:
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.22
----------------------------------------


#### RandomForestClassifier

In [137]:
random_forest_classifier = RandomForestClassifier(random_state=1)
random_forest_classifier.fit(X_train_df_t, y_train)
rf_y_pred = random_forest_classifier.predict(X_val_df_t)

print('Random Forest Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, rf_y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, rf_y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, rf_y_pred):.2f}')
print(f'accuracy_score: {accuracy_score(y_val, rf_y_pred):.2f}')

Random Forest Classifier Metrics:
F1 Score (Micro): 0.44
F1 Score (Macro): 0.43
Hamming Loss: 0.21
accuracy_score: 0.56


#### CatBoostClassifier

For the CatBoostClassifier, I'm using `loss_function='MultiLogloss'` and `eval_metric='HammingLoss'` due to the multilabel nature of my outputs.

**MultiLogloss**: A loss function suitable for multilabel classification tasks. It calculates the logarithmic loss (binary cross-entropy) for each label and then averages these values. It's effective in situations where each instance can belong to multiple labels simultaneously, and it penalizes confident wrong predictions heavily.

You can read about multilabel classification objectives and metrics [here](https://catboost.ai/en/docs/concepts/loss-functions-multilabel-classification)
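The averaging described above can be sketched in NumPy. This is my reading of the loss with all sample weights equal to 1, not CatBoost's implementation; `multi_logloss` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def multi_logloss(y_true, proba, eps=1e-15):
    """Mean binary cross-entropy over all samples and labels (unit weights assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(proba) + (1 - y_true) * np.log(1 - proba))
    return float(per_entry.mean())


y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])

uninformed = multi_logloss(y_true, np.full(y_true.shape, 0.5))
# Correct side of 0.5 for every entry, with moderate confidence.
good = multi_logloss(y_true, np.where(y_true == 1, 0.9, 0.1))
# Wrong side of 0.5 for every entry, mildly vs. very confidently.
mildly_wrong = multi_logloss(y_true, np.where(y_true == 1, 0.4, 0.6))
confidently_wrong = multi_logloss(y_true, np.where(y_true == 1, 0.01, 0.99))

print(uninformed, good, mildly_wrong, confidently_wrong)
```

The loss grows quickly as wrong predictions become more confident, which is the "penalizes more severely" behaviour mentioned above.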

In [136]:
train_pool = Pool(X_train_df_t, y_train)
val_pool = Pool(X_val_df_t, y_val)

catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss',
    iterations=500, random_state=1)
catboost_classifier.fit(train_pool, eval_set=val_pool, metric_period=10, plot=True, verbose=50)

val_predict = catboost_classifier.predict(X_val_df_t)
from catboost.utils import eval_metric
accuracy = eval_metric(y_val, val_predict, 'Accuracy')[0]
print(f'Accuracy: {accuracy}')

accuracy_per_class = eval_metric(y_val, val_predict, 'Accuracy:type=PerClass')
for cls, value in zip(catboost_classifier.classes_, accuracy_per_class):
    print(f'Accuracy for class {cls}: {value}')

hamming = eval_metric(y_val, val_predict, 'HammingLoss')[0]
print(f'HammingLoss: {hamming:.4f}')
mean_accuracy_per_class = sum(accuracy_per_class) / len(accuracy_per_class)
print(f'MeanAccuracyPerClass: {mean_accuracy_per_class:.4f}')
print(f'HammingLoss + MeanAccuracyPerClass = {hamming + mean_accuracy_per_class}')

for metric in ('Precision', 'Recall', 'F1'):
    print(metric)
    values = eval_metric(y_val, val_predict, metric)
    for cls, value in zip(catboost_classifier.classes_, values):
        print(f'class={cls}: {value:.4f}')
    print()




Learning rate set to 0.062085
0:	learn: 0.2282342	test: 0.2324649	best: 0.2324649 (0)	total: 23.4ms	remaining: 11.7s
50:	learn: 0.1877088	test: 0.2106435	best: 0.2106435 (50)	total: 422ms	remaining: 3.71s
100:	learn: 0.1646998	test: 0.2090848	best: 0.2075262 (80)	total: 778ms	remaining: 3.07s
150:	learn: 0.1490388	test: 0.2059675	best: 0.2055222 (140)	total: 1.14s	remaining: 2.64s
200:	learn: 0.1315223	test: 0.2086395	best: 0.2055222 (140)	total: 1.48s	remaining: 2.19s
250:	learn: 0.1150449	test: 0.2079715	best: 0.2055222 (140)	total: 1.81s	remaining: 1.8s
300:	learn: 0.1034662	test: 0.2050768	best: 0.2050768 (300)	total: 2.14s	remaining: 1.42s
350:	learn: 0.0930750	test: 0.2079715	best: 0.2037408 (310)	total: 2.68s	remaining: 1.14s
400:	learn: 0.0827581	test: 0.2055222	best: 0.2037408 (310)	total: 3.01s	remaining: 744ms
450:	learn: 0.0743710	test: 0.2052995	best: 0.2037408 (310)	total: 3.33s	remaining: 362ms
499:	learn: 0.0679878	test: 0.2061902	best: 0.2037408 (310)	total: 3.65s	rema

#### OneVsRestClassifier, LabelPowerset, BinaryRelevance

All of these wrappers are problem-transformation methods: they turn a multilabel classification problem into one or more single-label (binary or multi-class) problems.

[One-vs-Rest Strategy](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#:~:text=One%2Dvs%2Dthe%2Drest,against%20all%20the%20other%20classes.)

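This involves training a separate classifier for each class, using data from that class as positive samples and all other data as negatives. The strategy was designed for multi-class problems with mutually exclusive labels, but when scikit-learn's OneVsRestClassifier is given a label indicator matrix, as in this notebook, it fits one binary classifier per label, which makes it behave like Binary Relevance.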

[Label Powerset](http://scikit.ml/api/skmultilearn.problem_transform.lp.html)

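This method considers correlations between class labels. It converts each distinct combination of labels into a single class and then trains one multi-class classifier. The number of possible label combinations grows exponentially with the number of labels (up to 2^q for q binary labels), although only the combinations that occur in the training data become classes.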
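The label-combination encoding can be sketched with NumPy alone. This is only an illustration of the transformation on a made-up label matrix, not how `skmultilearn`'s `LabelPowerset` is implemented internally.

```python
import numpy as np

# Made-up label matrix: 5 samples, 3 binary labels.
y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 0, 0],
              [0, 1, 0]])

# Each distinct row (label combination) becomes one class id.
combinations, class_ids = np.unique(y, axis=0, return_inverse=True)
class_ids = class_ids.ravel()  # guard against NumPy versions that return a 2-D inverse

# A single multi-class classifier would now be trained on class_ids.
# Decoding a predicted class id just looks the combination up again.
decoded = combinations[class_ids]

print(combinations)
print(class_ids)
```

Only three of the 2^3 = 8 possible combinations occur in this toy matrix, so the multi-class problem has three classes; a combination never seen in training can never be predicted.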

[Binary Relevance](http://scikit.ml/api/skmultilearn.problem_transform.br.html)

This approach trains an ensemble of single-label binary classifiers independently on the original dataset for each class. For example, with q labels, it creates q new datasets, each focusing on one label, and trains a single-label classifier on each.
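The per-label idea can be written out by hand and compared with scikit-learn's `OneVsRestClassifier`. The sketch below uses synthetic data from `make_multilabel_classification` and a plain `LogisticRegression`; both are stand-ins chosen for illustration rather than the notebook's dataset or settings.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data: y is a dense 0/1 indicator matrix with 3 labels.
X, y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Binary relevance by hand: one independent binary classifier per label column.
manual_pred = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X, y[:, j]).predict(X)
    for j in range(y.shape[1])
])

# OneVsRestClassifier given the same indicator matrix.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovr_pred = np.asarray(ovr.predict(X))

# True if both approaches make identical predictions on this data.
same = bool(np.array_equal(manual_pred, ovr_pred))
print(manual_pred.shape, same)
```

This is consistent with the results below, where the OneVsRestClassifier and BinaryRelevance rows report the same metrics for each base estimator.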

In [139]:
models = {
    "OneVsRestClassifier-LogisticRegression": OneVsRestClassifier(LogisticRegression(random_state=42)),
    "OneVsRestClassifier-RidgeClassifier": OneVsRestClassifier(RidgeClassifier(random_state=42)),
    "OneVsRestClassifier-LinearSVC": OneVsRestClassifier(LinearSVC(random_state=42, max_iter=30000)),
    "OneVsRestClassifier-RandomForestClassifier": OneVsRestClassifier(RandomForestClassifier(random_state=42)),
    "LabelPowerset-LogisticRegression": LabelPowerset(LogisticRegression(random_state=42)),
    "LabelPowerset-RidgeClassifier": LabelPowerset(RidgeClassifier(random_state=42)),
    "LabelPowerset-LinearSVC": LabelPowerset(LinearSVC(random_state=42, max_iter=30000)),
    "LabelPowerset-RandomForestClassifier": LabelPowerset(RandomForestClassifier(random_state=42)),
    "BinaryRelevance-LogisticRegression": BinaryRelevance(LogisticRegression(random_state=42)),
    "BinaryRelevance-RidgeClassifier": BinaryRelevance(RidgeClassifier(random_state=42)),
    "BinaryRelevance-LinearSVC": BinaryRelevance(LinearSVC(random_state=42, max_iter=30000)),
    "BinaryRelevance-RandomForestClassifier": BinaryRelevance(RandomForestClassifier(random_state=42)),
}

for name, model in models.items():
    model.fit(X_train_df_t, y_train)
    y_pred = model.predict(X_val_df_t)

    f1_micro = f1_score(y_val, y_pred, average='micro')
    f1_macro = f1_score(y_val, y_pred, average='macro')
    hamming = hamming_loss(y_val, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')


OneVsRestClassifier-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
OneVsRestClassifier-RidgeClassifier model metrics
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.22
accuracy_score: 0.56




OneVsRestClassifier-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.36
Hamming Loss: 0.22
accuracy_score: 0.55
OneVsRestClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.46
F1 Score (Macro): 0.46
Hamming Loss: 0.21
accuracy_score: 0.56
LabelPowerset-LogisticRegression model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.36
Hamming Loss: 0.23
accuracy_score: 0.56
LabelPowerset-RidgeClassifier model metrics
F1 Score (Micro): 0.27
F1 Score (Macro): 0.27
Hamming Loss: 0.23
accuracy_score: 0.56




LabelPowerset-LinearSVC model metrics
F1 Score (Micro): 0.29
F1 Score (Macro): 0.29
Hamming Loss: 0.23
accuracy_score: 0.56
LabelPowerset-RandomForestClassifier model metrics
F1 Score (Micro): 0.39
F1 Score (Macro): 0.39
Hamming Loss: 0.21
accuracy_score: 0.58
BinaryRelevance-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
BinaryRelevance-RidgeClassifier model metrics
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.22
accuracy_score: 0.56




BinaryRelevance-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.36
Hamming Loss: 0.22
accuracy_score: 0.55
BinaryRelevance-RandomForestClassifier model metrics
F1 Score (Micro): 0.46
F1 Score (Macro): 0.46
Hamming Loss: 0.21
accuracy_score: 0.56


#### MultiOutputClassifier and ClassifierChain

[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)

This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

[ClassifierChain](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html)

A multi-label model that arranges binary classifiers into a chain.

Each model makes a prediction in the order specified by the chain using all of the available features provided to the model plus the predictions of models that are earlier in the chain.
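The chaining can be made visible by checking how many input columns each link receives. This is a minimal sketch on synthetic data; the chain order `[2, 0, 1]` and the `LogisticRegression` base estimator are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multilabel data with 10 features and 3 labels.
X, y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Label 2 is predicted first, then label 0, then label 1.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[2, 0, 1])
chain.fit(X, y)
chain_pred = chain.predict(X)  # columns come back in the original label order

# Each link is fitted on the original features plus the labels of all
# earlier links, so the number of inputs grows by one per link.
n_inputs_per_link = [est.coef_.shape[1] for est in chain.estimators_]
print(n_inputs_per_link)
```

Because later links depend on earlier ones, the chain order can affect the results; the notebook cells below use the default order.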

In [140]:
models = {
    "MultiOutputClassifier-LogisticRegression": MultiOutputClassifier(LogisticRegression(random_state=42)),
    "MultiOutputClassifier-RidgeClassifier": MultiOutputClassifier(RidgeClassifier(random_state=42)),
    "MultiOutputClassifier-LinearSVC": MultiOutputClassifier(LinearSVC(random_state=42, max_iter=30000)),
    "MultiOutputClassifier-RandomForestClassifier": MultiOutputClassifier(RandomForestClassifier(random_state=42)),
    "MultiOutputClassifier-XGBClassifier": MultiOutputClassifier(XGBClassifier(eval_metric='logloss')),
    "ClassifierChain-LogisticRegression": ClassifierChain(LogisticRegression(random_state=42)),
    "ClassifierChain-RidgeClassifier": ClassifierChain(RidgeClassifier(random_state=42)),
    "ClassifierChain-LinearSVC": ClassifierChain(LinearSVC(random_state=42, max_iter=30000)),
    "ClassifierChain-RandomForestClassifier": ClassifierChain(RandomForestClassifier(random_state=42)),
    "ClassifierChain-XGBClassifier": ClassifierChain(XGBClassifier(eval_metric='logloss')),
}

for name, model in models.items():
    model.fit(X_train_df_t, y_train)

    y_pred = model.predict(X_val_df_t)

    f1_micro = f1_score(y_val, y_pred, average='micro')
    f1_macro = f1_score(y_val, y_pred, average='macro')
    hamming = hamming_loss(y_val, y_pred)
    print(f'{name} model metrics')
    print(f'F1 Score (Micro): {f1_micro:.2f}')
    print(f'F1 Score (Macro): {f1_macro:.2f}')
    print(f'Hamming Loss: {hamming:.2f}')
    print(f'accuracy_score: {accuracy_score(y_val, y_pred):.2f}')


MultiOutputClassifier-LogisticRegression model metrics
F1 Score (Micro): 0.38
F1 Score (Macro): 0.38
Hamming Loss: 0.23
accuracy_score: 0.55
MultiOutputClassifier-RidgeClassifier model metrics
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.22
accuracy_score: 0.56




MultiOutputClassifier-LinearSVC model metrics
F1 Score (Micro): 0.36
F1 Score (Macro): 0.36
Hamming Loss: 0.22
accuracy_score: 0.55
MultiOutputClassifier-RandomForestClassifier model metrics
F1 Score (Micro): 0.46
F1 Score (Macro): 0.46
Hamming Loss: 0.21
accuracy_score: 0.56
MultiOutputClassifier-XGBClassifier model metrics
F1 Score (Micro): 0.51
F1 Score (Macro): 0.51
Hamming Loss: 0.22
accuracy_score: 0.53
ClassifierChain-LogisticRegression model metrics
F1 Score (Micro): 0.35
F1 Score (Macro): 0.35
Hamming Loss: 0.22
accuracy_score: 0.56
ClassifierChain-RidgeClassifier model metrics
F1 Score (Micro): 0.30
F1 Score (Macro): 0.30
Hamming Loss: 0.23
accuracy_score: 0.57




ClassifierChain-LinearSVC model metrics
F1 Score (Micro): 0.32
F1 Score (Macro): 0.32
Hamming Loss: 0.22
accuracy_score: 0.57
ClassifierChain-RandomForestClassifier model metrics
F1 Score (Micro): 0.43
F1 Score (Macro): 0.43
Hamming Loss: 0.21
accuracy_score: 0.57
ClassifierChain-XGBClassifier model metrics
F1 Score (Micro): 0.49
F1 Score (Macro): 0.49
Hamming Loss: 0.22
accuracy_score: 0.55


#### Cross Validation

We can't use standard Stratified K-Fold Cross-Validation here because it is designed for binary and multiclass classification, where each instance is assigned to exactly one class. In multilabel classification, where each instance can belong to multiple classes simultaneously, it does not directly apply: the label combinations make it hard to maintain the same proportion of each label in every fold.

For multilabel data, you can use alternative strategies such as Iterative Stratification. This is a more advanced form of stratification suitable for multilabel data. It attempts to distribute each class's instances evenly across the folds while also respecting the label combinations.

In [82]:
rf_classifier = RandomForestClassifier(random_state=1)

n_splits = 5
iterative_stratification = IterativeStratification(n_splits=n_splits, order=1)

f1_scores = []
hamming_losses = []
for train_index, val_index in iterative_stratification.split(X_train_df_t, y_train_df):
    # Fold-local names, so the outer train/validation/test splits are not overwritten.
    X_fold_train, X_fold_val = X_train_df_t[train_index], X_train_df_t[val_index]
    y_fold_train, y_fold_val = y_train_df.iloc[train_index], y_train_df.iloc[val_index]

    rf_classifier.fit(X_fold_train, y_fold_train)
    predictions = rf_classifier.predict(X_fold_val)
    f1 = f1_score(y_fold_val, predictions, average='macro')
    f1_scores.append(f1)

    hamming = hamming_loss(y_fold_val, predictions)
    hamming_losses.append(hamming)

print(f"Average F1-Score: {sum(f1_scores) / len(f1_scores)}")
print(f"Average Hamming Loss: {sum(hamming_losses) / len(hamming_losses)}")

Average F1-Score: 0.4320879737347384
Average Hamming Loss: 0.20338095504419348


In [92]:
params = {
    'loss_function': 'MultiLogloss',
    'eval_metric': 'HammingLoss',
    'iterations': 100,
    'random_seed': 1,
    'verbose': False,
}

# Perform 5-fold cross-validation on the training pool
cv_results = cv(
    params=params,
    pool=train_pool,
    fold_count=5,
    partition_random_seed=1,
    shuffle=True,
    stratified=False,
    plot=True,
)

print(cv_results)



Training on fold [0/5]

bestTest = 0.2150537634
bestIteration = 96

Training on fold [1/5]

bestTest = 0.2130660728
bestIteration = 96

Training on fold [2/5]

bestTest = 0.1956198961
bestIteration = 95

Training on fold [3/5]

bestTest = 0.21640683
bestIteration = 89

Training on fold [4/5]

bestTest = 0.1974758723
bestIteration = 94

    iterations  test-HammingLoss-mean  test-HammingLoss-std  \
0            0               0.233577              0.007670   
1            1               0.229791              0.009493   
2            2               0.228159              0.007078   
3            3               0.227268              0.006732   
4            4               0.227565              0.005942   
..         ...                    ...                   ...   
95          95               0.208192              0.010398   
96          96               0.207747              0.009802   
97          97               0.208341              0.010164   
98          98               0.2

#### Hyperparameter tuning

In [85]:


param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

rf_classifier = RandomForestClassifier(random_state=1)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 81 candidates, totalling 405 fits
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   1.2s
[CV] END max_depth=10, min_sa

In [87]:
random_forest_classifier = RandomForestClassifier(max_depth=30, min_samples_leaf=1, min_samples_split=2, n_estimators=300, random_state=1)
random_forest_classifier.fit(X_train_df_t, y_train_df)
rf_y_pred = random_forest_classifier.predict(X_val_df_t)

print('Random Forest Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, rf_y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, rf_y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, rf_y_pred):.2f}')
print(f'accuracy_score: {accuracy_score(y_val, rf_y_pred):.2f}')

Random Forest Classifier Metrics:
F1 Score (Micro): 0.44
F1 Score (Macro): 0.43
Hamming Loss: 0.20
accuracy_score: 0.57


In [104]:
param_grid = {
    'depth': [6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [100, 200, 400],
}
train_pool = Pool(X_train_df_t, y_train)

catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss', random_state=1)

iterative_stratification = IterativeStratification(n_splits=5, order=1)

grid_search = GridSearchCV(estimator=catboost_classifier, param_grid=param_grid, 
                           cv=iterative_stratification, scoring='f1_macro', n_jobs=-1, verbose=2)

grid_search.fit(X_train_df_t, y_train_df)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 36 candidates, totalling 180 fits
0:	learn: 0.2304195	total: 25.1ms	remaining: 2.48s
1:	learn: 0.2254083	total: 45.4ms	remaining: 2.23s
0:	learn: 0.2246659	total: 39ms	remaining: 3.87s
2:	learn: 0.2273571	total: 72.8ms	remaining: 2.35s
0:	learn: 0.2314609	total: 32.4ms	remaining: 3.21s
1:	learn: 0.2226244	total: 77ms	remaining: 3.77s
3:	learn: 0.2238307	total: 94.6ms	remaining: 2.27s
1:	learn: 0.2256211	total: 55ms	remaining: 2.69s
4:	learn: 0.2244803	total: 114ms	remaining: 2.16s
2:	learn: 0.2253155	total: 101ms	remaining: 3.27s
2:	learn: 0.2295143	total: 95.8ms	remaining: 3.1s
0:	learn: 0.2278211	total: 41.1ms	remaining: 4.07s
5:	learn: 0.2227171	total: 156ms	remaining: 2.44s
3:	learn: 0.2251299	total: 142ms	remaining: 3.4s
0:	learn: 0.2276355	total: 55.6ms	remaining: 5.51s
4:	learn: 0.2247587	total: 189ms	remaining: 3.59s
3:	learn: 0.2259918	total: 182ms	remaining: 4.36s
0:	learn: 0.2304195	total: 74.2ms	remaining: 7.35s
1:	learn: 0.2255939	total: 130ms	r

In [105]:
train_pool = Pool(X_train_df_t, y_train)
val_pool = Pool(X_val_df_t, y_val)

catboost_classifier = CatBoostClassifier(loss_function='MultiLogloss',
    eval_metric='HammingLoss',
    iterations=400, depth=6, learning_rate=0.1, random_state=1)
catboost_classifier.fit(train_pool, eval_set=val_pool, metric_period=10, plot=True, verbose=50)

val_predict = catboost_classifier.predict(X_val_df_t)
from catboost.utils import eval_metric
accuracy = eval_metric(y_val, val_predict, 'Accuracy')[0]
print(f'Accuracy: {accuracy}')

accuracy_per_class = eval_metric(y_val, val_predict, 'Accuracy:type=PerClass')
for cls, value in zip(catboost_classifier.classes_, accuracy_per_class):
    print(f'Accuracy for class {cls}: {value}')

hamming = eval_metric(y_val, val_predict, 'HammingLoss')[0]
print(f'HammingLoss: {hamming:.4f}')
mean_accuracy_per_class = sum(accuracy_per_class) / len(accuracy_per_class)
print(f'MeanAccuracyPerClass: {mean_accuracy_per_class:.4f}')
print(f'HammingLoss + MeanAccuracyPerClass = {hamming + mean_accuracy_per_class}')

for metric in ('Precision', 'Recall', 'F1'):
    print(metric)
    values = eval_metric(y_val, val_predict, metric)
    for cls, value in zip(catboost_classifier.classes_, values):
        print(f'class={cls}: {value:.4f}')
    print()




0:	learn: 0.2252653	test: 0.2342463	best: 0.2342463 (0)	total: 27ms	remaining: 10.8s
50:	learn: 0.1716025	test: 0.2046315	best: 0.2046315 (50)	total: 409ms	remaining: 2.8s
100:	learn: 0.1442886	test: 0.2024048	best: 0.2024048 (70)	total: 828ms	remaining: 2.45s
150:	learn: 0.1174942	test: 0.2039635	best: 0.2024048 (70)	total: 1.22s	remaining: 2.02s
200:	learn: 0.0977511	test: 0.2024048	best: 0.2024048 (70)	total: 1.65s	remaining: 1.63s
250:	learn: 0.0824612	test: 0.2021821	best: 0.2008461 (210)	total: 2.04s	remaining: 1.21s
300:	learn: 0.0692496	test: 0.1997328	best: 0.1997328 (300)	total: 2.41s	remaining: 792ms
350:	learn: 0.0561864	test: 0.2001781	best: 0.1986195 (320)	total: 2.74s	remaining: 382ms
399:	learn: 0.0463149	test: 0.1999555	best: 0.1977288 (370)	total: 3.08s	remaining: 0us

bestTest = 0.1977287909
bestIteration = 370

Shrink model to first 371 iterations.
Accuracy: 0.5704742818971276
Accuracy for class 0: 0.7895791583166333
Accuracy for class 1: 0.8136272545090181
Accuracy