# Sklearn models

This notebook demonstrates how to use all the chosen scikit-learn models. In addition to the models required by the task, the following have been added:
- DecisionTreeClassifier
- QuadraticDiscriminantAnalysis

Results are not saved in this demo, to avoid overwriting the existing outputs.

## Imports

In [None]:
import os
import warnings
from typing import Dict

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, GridSearchCV

from src.models.config import best_param_grid_model
from src.models.sklearn_models import balance, preprocess, test_eval, save_fold_model
from src.utils.const import DATA_DIR, SEED, NUM_BINS
from src.utils.util_models import fix_random

In [None]:
warnings.filterwarnings('ignore', category=UserWarning)

### Useful paths to data

In [None]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
PROCESSED_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'processed')

### Fix random seed

In [None]:
fix_random(SEED)
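
The cross-validation splitters used later are created with `shuffle=True` and no `random_state`, so they draw from NumPy's global random generator. The sketch below illustrates why seeding that generator makes the folds repeatable. `fix_random_sketch` is a hypothetical stand-in written for this example; the actual `fix_random` in `src.utils.util_models` may seed additional libraries.

```python
import random

import numpy as np
from sklearn.model_selection import StratifiedKFold


def fix_random_sketch(seed: int) -> None:
    """Hypothetical stand-in for fix_random: seeds Python's and NumPy's global RNGs."""
    random.seed(seed)
    np.random.seed(seed)


def first_test_fold(seed: int) -> np.ndarray:
    # Toy data: 20 samples, two balanced classes.
    X = np.zeros((20, 1))
    y = np.array([0, 1] * 10)
    fix_random_sketch(seed)
    # random_state is left unset, so shuffling uses NumPy's global RNG.
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    _, test_idx = next(cv.split(X, y))
    return test_idx


# Re-seeding with the same value reproduces the same first test fold.
same_folds = np.array_equal(first_test_fold(42), first_test_fold(42))
print(same_folds)
```

Passing `random_state=SEED` directly to each splitter would be an alternative that does not rely on global state.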

## Import final dataset

In [None]:
final_stored = pd.read_parquet(os.path.join(PROCESSED_DIR, 'final.parquet'))
final = (final_stored
         .assign(rating_discrete=pd.cut(final_stored.loc[:, 'rating_mean'], bins=NUM_BINS, labels=False))
         .astype({'rating_discrete': 'int32'})
         .drop(columns=['rating_mean']))
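
As a small illustration of the discretization step above, the sketch below applies the same `pd.cut` chain to a toy column. The value `NUM_BINS_DEMO = 5` and the toy ratings are assumptions made for this example; the real `NUM_BINS` comes from `src.utils.const`.

```python
import pandas as pd

NUM_BINS_DEMO = 5  # illustrative value, not necessarily the project's NUM_BINS

toy = pd.DataFrame({'rating_mean': [0.5, 1.0, 2.4, 3.1, 4.9, 5.0]})

# bins=<int> asks for equal-width bins over the observed range;
# labels=False returns the integer bin index instead of an Interval.
toy_discrete = (toy
                .assign(rating_discrete=pd.cut(toy['rating_mean'], bins=NUM_BINS_DEMO, labels=False))
                .astype({'rating_discrete': 'int32'})
                .drop(columns=['rating_mean']))

print(toy_discrete['rating_discrete'].tolist())
```

Because the bins have equal width rather than equal population, some classes may end up with few or no samples, which is one reason a balancing step can be useful before training.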

## Train & Test

Training and testing are performed by the `train_test` function below. To check that a configuration generalizes across different test sets while also optimizing hyperparameters, nested 5-fold cross-validation is used: an outer loop produces the train/test splits, and an inner loop, handled by `GridSearchCV`, performs the hyperparameter search automatically on each outer training fold. Finally, the `test_eval` function loads the previously saved model and computes its performance metrics, also plotting the multiclass ROC curve when possible.
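
The nested scheme can be tried in isolation with the minimal sketch below. It is not the project's pipeline: the data is synthetic, the estimator and its grid are placeholders, and the balancing, preprocessing and model-saving steps are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption: 3 classes, 300 samples).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_acc, outer_f1 = [], []

for fold, (train_idx, test_idx) in enumerate(cv_outer.split(X, y), 1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: the hyperparameter search only sees the outer training fold.
    cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=fold)
    search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                          param_grid={'max_depth': [3, 5]},  # placeholder grid
                          scoring='f1_weighted',
                          cv=cv_inner,
                          refit=True)
    search.fit(X_train, y_train)

    # Outer loop: the refitted best model is scored on held-out data.
    y_pred = search.best_estimator_.predict(X_test)
    outer_acc.append(accuracy_score(y_test, y_pred))
    outer_f1.append(f1_score(y_test, y_pred, average='weighted'))

print(f'Mean accuracy: {np.mean(outer_acc):.3f} - Mean f1-score: {np.mean(outer_f1):.3f}')
```

Keeping the search inside each outer training fold means the outer test scores are not used to pick hyperparameters, so their mean is a less optimistic estimate of generalization.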


In [None]:
def train_test(df: pd.DataFrame, model_group: str, model_idx: int, param_grid: Dict):
    data = df.loc[:, df.columns != 'rating_discrete']
    target = df['rating_discrete']

    N_SPLITS = 5

    cv_outer = StratifiedKFold(n_splits=N_SPLITS, shuffle=True)

    correct_param_grid = [param_grid[model_group][model_idx]]

    for model_name, estimator, param_grid in correct_param_grid:
        outer_results = []
        outer_f1_results = []
        for fold, (train_idx, test_idx) in enumerate(cv_outer.split(data, y=target), 1):
            print(f'Fold {fold}')
            train_data, test_data = data.iloc[train_idx, :], data.iloc[test_idx, :]
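            # Positional indexing, so the split also works when the index is not a default RangeIndex
            train_target, test_target = target.iloc[train_idx], target.iloc[test_idx]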

            cv_inner = StratifiedKFold(n_splits=N_SPLITS, shuffle=True)

            train_data_smt, train_target_smt = balance(train_data, train_target)
            train_data_proc, test_data_proc = preprocess(train_data_smt, test_data)

            search = GridSearchCV(estimator=estimator,
                                  param_grid=param_grid,
                                  scoring='f1_weighted',
                                  cv=cv_inner,
                                  refit=True,
                                  return_train_score=True,
                                  n_jobs=-1,
                                  verbose=3)

            search.fit(train_data_proc, train_target_smt)
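            # Report the scores of the selected configuration, not of the first candidate in the grid
            best_idx = search.best_index_
            print(f"[train] f1-score={search.cv_results_['mean_train_score'][best_idx]} - [val] f1-score={search.cv_results_['mean_test_score'][best_idx]}")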

            best_model = search.best_estimator_
            save_fold_model(fold, model_name, best_model, notebook=True)

            acc, loss, f1_test = test_eval(fold, model_name, test_data_proc, test_target, notebook=True)
            outer_results.append(acc)
            outer_f1_results.append(f1_test)

            print(f'[test] loss={loss:.3f}, acc={acc:.3f}, f1-score={f1_test:.3f}, cfg={search.best_params_}')

        print(
            f'[{model_name}] [mean_test] Mean accuracy: {np.mean(outer_results):.3f} - Mean f1-score: {np.mean(outer_f1_results):.3f}')

Since this is a demo, only the best configuration found for each model is used as the hyperparameter space.
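
Judging from how `train_test` indexes and unpacks its `param_grid` argument, `best_param_grid_model` maps a group name to a list of `(model_name, estimator, param_grid)` tuples. The sketch below shows a hypothetical object with that layout; the model names and hyperparameter values are placeholders, not the project's tuned configuration.

```python
from typing import Any, Dict, List, Tuple

from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical layout: group name -> list of (model_name, estimator, param_grid).
# Every hyperparameter list holds a single value, so the "search" has one candidate.
demo_param_grid_model: Dict[str, List[Tuple[str, Any, Dict[str, list]]]] = {
    'tree_based': [
        ('decision_tree', DecisionTreeClassifier(), {'max_depth': [10], 'criterion': ['gini']}),
    ],
    'naive_bayes': [
        ('gaussian_nb', GaussianNB(), {'var_smoothing': [1e-9]}),
    ],
}

# Same lookup pattern as in train_test: group, then position within the group.
model_name, estimator, grid = demo_param_grid_model['tree_based'][0]
n_candidates = len(ParameterGrid(grid))
print(model_name, n_candidates)
```

With a single candidate per model, `GridSearchCV` still runs the inner cross-validation and refits the estimator, but no actual comparison between configurations takes place.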

## Tree methods

### RandomForestClassifier

In [None]:
train_test(final, 'tree_based', 0, best_param_grid_model)

### DecisionTreeClassifier

In [None]:
train_test(final, 'tree_based', 1, best_param_grid_model)

## Naive Bayes methods

### GaussianNB

In [None]:
train_test(final, 'naive_bayes', 0, best_param_grid_model)

### QuadraticDiscriminantAnalysis

In [None]:
train_test(final, 'naive_bayes', 1, best_param_grid_model)

## SVM

### SVC

In [None]:
train_test(final, 'svm', 0, best_param_grid_model)