# Modeling Notebook

Welcome to the Modeling Notebook. This notebook builds and evaluates multi-label classification models using the features selected during EDA.

## Objectives:
- **Model Selection**: Comparing candidate models (a random forest and CatBoost) to see which performs best.
- **Model Tuning**: Optimizing the performance of the selected model.
- **Model Evaluation**: Assessing each model with multi-label metrics (micro and macro F1, Hamming loss, and accuracy).

## Dataset:
In this notebook, we work with the cleaned dataset located at `../data/cleaned/selected_features.parquet`, which is the result of the feature engineering process in the preceding EDA Notebook.


In [30]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, f1_score, hamming_loss
from sklearn.preprocessing import StandardScaler
from skmultilearn.model_selection import iterative_train_test_split


In [31]:
# Load the selected features and separate the three target columns.
label_columns = ["numerical_ROI_category", "numerical_rating_category", "numerical_award_category"]
dataset_df = pd.read_parquet("../data/cleaned/selected_features.parquet")
labels = dataset_df[label_columns]
dataset_df = dataset_df.drop(columns=label_columns)

X_np = dataset_df.to_numpy()
y_np = labels.to_numpy()

# Multi-label stratified splits: 20% for test, then 25% of the remainder for validation
# (roughly 60/20/20 train/validation/test overall).
X_full_train, y_full_train, X_test, y_test = iterative_train_test_split(X_np, y_np, test_size=0.2)
X_train, y_train, X_val, y_val = iterative_train_test_split(X_full_train, y_full_train, test_size=0.25)

# Restore the column names so DictVectorizer can build named features.
X_train_df = pd.DataFrame(X_train, columns=dataset_df.columns)
X_val_df = pd.DataFrame(X_val, columns=dataset_df.columns)

y_train_df = pd.DataFrame(y_train, columns=labels.columns)
y_val_df = pd.DataFrame(y_val, columns=labels.columns)

# Vectorize the records into a sparse matrix; fit on the training split only.
dv = DictVectorizer(sparse=True)
X_train_df_t = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val_df_t = dv.transform(X_val_df.to_dict(orient='records'))

# Scale to unit variance; with_mean=False keeps the matrix sparse.
scaler = StandardScaler(with_mean=False)
X_train_df_t = scaler.fit_transform(X_train_df_t)
X_val_df_t = scaler.transform(X_val_df_t)
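
The two successive splits above hold out 20% of the data for testing and then 25% of the remainder for validation. A minimal sketch of that arithmetic follows; `split_fractions` is a hypothetical helper, not part of the notebook, and iterative stratification only approximates these fractions on real data.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset for two nested splits."""
    remaining = 1.0 - test_size      # share left after the test split
    val = remaining * val_size       # validation share, taken from the remainder
    train = remaining - val          # what is left for training
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions(test_size=0.2, val_size=0.25)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# prints: train=0.60, val=0.20, test=0.20
```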


In [32]:
# Baseline: random forest with default hyperparameters, trained on all three labels at once.
random_forest_classifier = RandomForestClassifier()
random_forest_classifier.fit(X_train_df_t, y_train)
rf_y_pred = random_forest_classifier.predict(X_val_df_t)

print('Random Forest Classifier Metrics:')
print(f'F1 Score (Micro): {f1_score(y_val, rf_y_pred, average="micro"):.2f}')
print(f'F1 Score (Macro): {f1_score(y_val, rf_y_pred, average="macro"):.2f}')
print(f'Hamming Loss: {hamming_loss(y_val, rf_y_pred):.2f}')
# On multi-label targets, accuracy_score is subset accuracy: every label in a row must match.
print(f'accuracy_score: {accuracy_score(y_val, rf_y_pred):.2f}')

Random Forest Classifier Metrics:
F1 Score (Micro): 0.43
F1 Score (Macro): 0.42
Hamming Loss: 0.20
accuracy_score: 0.57
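
A Hamming loss of 0.20 and an accuracy of 0.57 are not contradictory. For multi-label targets, `hamming_loss` counts individual label errors, while `accuracy_score` only credits rows where every label is correct. The sketch below illustrates the difference on made-up labels (the arrays are illustrative, not taken from the dataset), using hypothetical pure-Python helpers that mirror those two definitions.

```python
# Toy multi-label data: 4 samples, 3 binary labels each (illustrative values only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]


def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def subset_accuracy_manual(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(tr == pr for tr, pr in zip(y_true, y_pred))
    return exact / len(y_true)


toy_hamming = hamming_loss_manual(y_true, y_pred)
toy_subset_accuracy = subset_accuracy_manual(y_true, y_pred)
print(f"Hamming loss: {toy_hamming:.2f}")             # 2 wrong labels out of 12
print(f"Subset accuracy: {toy_subset_accuracy:.2f}")  # 2 exact rows out of 4
```

Here only 2 of 12 labels are wrong, yet half of the rows count as misclassified, because a single wrong label is enough to fail the exact-match test.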


In [33]:
from catboost.utils import eval_metric

# Wrap the sparse feature matrices and multi-label targets in CatBoost pools.
train_pool = Pool(X_train_df_t, y_train)
val_pool = Pool(X_val_df_t, y_val)

# MultiLogloss trains one binary output per label; Hamming loss is tracked on the validation pool.
catboost_classifier = CatBoostClassifier(
    loss_function='MultiLogloss',
    eval_metric='HammingLoss',
    iterations=500,
    random_state=1,
)
catboost_classifier.fit(train_pool, eval_set=val_pool, metric_period=10, plot=True, verbose=50)

val_predict = catboost_classifier.predict(X_val_df_t)

# Overall accuracy on the validation split.
accuracy = eval_metric(y_val, val_predict, 'Accuracy')[0]
print(f'Accuracy: {accuracy}')

# Accuracy for each label separately.
accuracy_per_class = eval_metric(y_val, val_predict, 'Accuracy:type=PerClass')
for cls, value in zip(catboost_classifier.classes_, accuracy_per_class):
    print(f'Accuracy for class {cls}: {value}')

# Sanity check: the Hamming loss and the mean per-label accuracy should sum to 1.
hamming = eval_metric(y_val, val_predict, 'HammingLoss')[0]
print(f'HammingLoss: {hamming:.4f}')
mean_accuracy_per_class = sum(accuracy_per_class) / len(accuracy_per_class)
print(f'MeanAccuracyPerClass: {mean_accuracy_per_class:.4f}')
print(f'HammingLoss + MeanAccuracyPerClass = {hamming + mean_accuracy_per_class}')

# Per-label precision, recall, and F1.
for metric in ('Precision', 'Recall', 'F1'):
    print(metric)
    values = eval_metric(y_val, val_predict, metric)
    for cls, value in zip(catboost_classifier.classes_, values):
        print(f'class={cls}: {value:.4f}')
    print()


Learning rate set to 0.062085
0:	learn: 0.2306836	test: 0.2358049	best: 0.2358049 (0)	total: 10.9ms	remaining: 5.42s
50:	learn: 0.1887479	test: 0.2077488	best: 0.2077488 (50)	total: 454ms	remaining: 4s
100:	learn: 0.1653678	test: 0.2021821	best: 0.2012915 (90)	total: 852ms	remaining: 3.37s
150:	learn: 0.1523788	test: 0.2024048	best: 0.2012915 (90)	total: 1.24s	remaining: 2.86s
200:	learn: 0.1333779	test: 0.2019595	best: 0.2004008 (160)	total: 1.76s	remaining: 2.62s
250:	learn: 0.1166778	test: 0.2024048	best: 0.1999555 (230)	total: 2.15s	remaining: 2.13s
300:	learn: 0.1040600	test: 0.2010688	best: 0.1999555 (230)	total: 2.51s	remaining: 1.66s
350:	learn: 0.0941884	test: 0.2010688	best: 0.1999555 (230)	total: 2.86s	remaining: 1.22s
400:	learn: 0.0843910	test: 0.2001781	best: 0.1992875 (360)	total: 3.25s	remaining: 802ms
450:	learn: 0.0750390	test: 0.1997328	best: 0.1992875 (360)	total: 3.63s	remaining: 394ms
499:	learn: 0.0680621	test: 0.2004008	best: 0.1992875 (360)	total: 3.98s	remaini
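
In the log above, the training Hamming loss keeps falling while the validation loss levels off near 0.20 after about 100 iterations. The sketch below parses four of the printed lines to make that gap explicit. The regular expression and the `parse_log` helper are illustrative assumptions, not CatBoost APIs, and the timing columns are omitted.

```python
import re

# Four lines copied from the training log above (timing columns omitted).
LOG = """\
0: learn: 0.2306836 test: 0.2358049 best: 0.2358049 (0)
100: learn: 0.1653678 test: 0.2021821 best: 0.2012915 (90)
250: learn: 0.1166778 test: 0.2024048 best: 0.1999555 (230)
499: learn: 0.0680621 test: 0.2004008 best: 0.1992875 (360)
"""

PATTERN = re.compile(
    r"(\d+):\s+learn:\s+([\d.]+)\s+test:\s+([\d.]+)\s+best:\s+([\d.]+)\s+\((\d+)\)"
)


def parse_log(text):
    """Extract iteration, learn/test loss, and best-so-far loss from log lines."""
    rows = []
    for match in PATTERN.finditer(text):
        iteration, learn, test, best, best_iter = match.groups()
        rows.append({
            "iteration": int(iteration),
            "learn": float(learn),
            "test": float(test),
            "best": float(best),
            "best_iter": int(best_iter),
        })
    return rows


rows = parse_log(LOG)
for row in rows:
    gap = row["test"] - row["learn"]  # a widening gap suggests overfitting
    print(f"iter {row['iteration']:>3}: learn={row['learn']:.4f} "
          f"test={row['test']:.4f} gap={gap:.4f}")

final = rows[-1]
print(f"best validation loss {final['best']:.4f} at iteration {final['best_iter']}")
```

The gap grows from about 0.005 at iteration 0 to about 0.13 at iteration 499, while the best validation loss improves only slightly. This suggests that additional iterations mostly fit the training split.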