# Sklearn models

This notebook demonstrates the usage of all the chosen scikit-learn models. In addition to the models required by the task, DecisionTreeClassifier has been included.

Results are not saved in this demo, to avoid overwriting the existing outputs.

## Imports

In [1]:
import os
import warnings
from typing import Dict

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler

from src.models.config import best_param_grid_model
from src.models.sklearn_models import test_eval, save_fold_model
from src.utils.const import DATA_DIR, SEED, NUM_BINS
from src.utils.util_models import fix_random

In [2]:
warnings.filterwarnings('ignore', category=UserWarning)

### Useful paths to data

In [3]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
PROCESSED_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'processed')

### Fix random seed

In [4]:
fix_random(SEED)

## Import final dataset

In [5]:
final_stored = pd.read_parquet(os.path.join(PROCESSED_DIR, 'final.parquet'))
final = (final_stored
         .assign(rating_discrete=pd.cut(final_stored.loc[:, 'rating_mean'], bins=NUM_BINS, labels=False))
         .astype({'rating_discrete': 'int32'})
         .drop(columns=['rating_mean']))
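
The discretization step above can be illustrated on a toy frame. This is a minimal sketch: the ratings and `num_bins = 3` are made-up stand-ins, since the actual value of `NUM_BINS` and the contents of `final.parquet` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical ratings; the real notebook reads them from final.parquet.
ratings = pd.DataFrame({"rating_mean": [1.0, 2.0, 3.0, 4.0, 5.0]})
num_bins = 3  # stand-in for NUM_BINS, whose actual value is not shown here

# labels=False makes pd.cut return the integer index of each equal-width bin.
binned = (ratings
          .assign(rating_discrete=pd.cut(ratings["rating_mean"], bins=num_bins, labels=False))
          .astype({"rating_discrete": "int32"})
          .drop(columns=["rating_mean"]))

print(binned["rating_discrete"].tolist())  # [0, 0, 1, 2, 2]
```

With equal-width bins, the class sizes depend on how the ratings are distributed, which is one reason the training function later rebalances the classes.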

## Train & Test

Training and testing are performed by the following function. To check that a configuration works well across different test sets while also optimizing hyperparameters, nested 5-fold cross-validation is used: an outer loop for evaluation and an inner loop for model selection. The inner loop is handled by GridSearchCV, which performs the hyperparameter search automatically. Finally, the `test_eval` function loads the previously saved model and computes its performance metrics on the outer test fold.
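
The nested scheme can be sketched in isolation before reading the full function. The example below is only an illustration under stated assumptions: it uses a synthetic dataset, a plain DecisionTreeClassifier and a two-value grid, none of which come from this project, and it scores the refitted model directly instead of saving and reloading it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []

for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: model selection on the outer training portion only.
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"max_depth": [3, 5]},
                          scoring="f1_weighted", cv=inner_cv, refit=True)
    search.fit(X[train_idx], y[train_idx])

    # Outer loop: the refitted best model is scored on data it never saw.
    predictions = search.predict(X[test_idx])
    outer_scores.append(f1_score(y[test_idx], predictions, average="weighted"))

print(f"mean outer f1-score: {np.mean(outer_scores):.3f}")
```

Because the outer test fold plays no part in the inner search, the mean outer score is not biased by hyperparameter selection.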


In [6]:
def train_test(df: pd.DataFrame, model_group: str, model_idx: int, param_grid: Dict):
    data = df.loc[:, df.columns != 'rating_discrete']
    target = df['rating_discrete']

    n_splits = 5

    cv_outer = StratifiedKFold(n_splits=n_splits, shuffle=True)

    correct_param_grid = [param_grid[model_group][model_idx]]

    for model_name, estimator, param_grid in correct_param_grid:
        outer_results = []
        outer_f1_results = []
        for fold, (train_idx, test_idx) in enumerate(cv_outer.split(data, y=target), 1):
            print(f'Fold {fold}')
            train_data, test_data = data.iloc[train_idx, :], data.iloc[test_idx, :]
            # Use positional indexing so the split also works when the index is not 0..n-1.
            train_target, test_target = target.iloc[train_idx], target.iloc[test_idx]

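            # SMOTE runs inside the inner CV, which trains on 4/5 of this fold's data,
            # so k_neighbors must stay below the minority-class count of an inner training split.
            k_neighbors = (np.min(train_target.value_counts()) * 4) / 5
            k_neighbors_approx = int(np.floor(k_neighbors)) - 1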

            steps = [
                ('over', SMOTE(k_neighbors=k_neighbors_approx)),
                ('scaling', MinMaxScaler()),
                ('model', estimator)
            ]

            pipeline = Pipeline(steps=steps)

            cv_inner = StratifiedKFold(n_splits=n_splits, shuffle=True)

            search = GridSearchCV(estimator=pipeline,
                                  param_grid=param_grid,
                                  scoring='f1_weighted',
                                  cv=cv_inner,
                                  refit=True,
                                  return_train_score=True,
                                  n_jobs=-1,
                                  verbose=3)

            search.fit(train_data, train_target)
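            # Report the scores of the selected candidate (index 0 when the grid has one entry).
            best = search.best_index_
            print(f"[train] f1-score={search.cv_results_['mean_train_score'][best]} - "
                  f"[val] f1-score={search.cv_results_['mean_test_score'][best]}")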

            best_model = search.best_estimator_
            save_fold_model(fold, model_name, best_model, notebook=True)

            acc, loss, f1_test = test_eval(fold, model_name, test_data, test_target, notebook=True)
            outer_results.append(acc)
            outer_f1_results.append(f1_test)

            print(f'[test] loss={loss:3f}, acc={acc:3f} ,f1-score={f1_test:3f}, cfg={search.best_params_}')

        print(
            f'[{model_name}] [mean_test] Mean accuracy: {np.mean(outer_results):3f} - Mean f1-score: {np.mean(outer_f1_results):3f}')
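
The `k_neighbors` computation in the function above can be read as follows: SMOTE is a pipeline step, so it is fitted on each inner training split, which holds roughly 4/5 of the minority class, and SMOTE needs `k_neighbors` other samples of that class besides the sample being interpolated. A minimal sketch of that arithmetic, where the helper name `smote_k_neighbors` and the label counts are hypothetical:

```python
import pandas as pd


def smote_k_neighbors(target: pd.Series, n_inner_splits: int = 5) -> int:
    """Largest k_neighbors usable by SMOTE in every inner training split."""
    minority_count = int(target.value_counts().min())
    # An inner training split keeps (n - 1) / n of each class, rounded down at worst.
    inner_minority = minority_count * (n_inner_splits - 1) // n_inner_splits
    # SMOTE looks for k_neighbors same-class samples other than the sample itself.
    return inner_minority - 1


# Hypothetical labels: class 2 is the minority with 10 samples.
labels = pd.Series([0] * 50 + [1] * 30 + [2] * 10)
print(smote_k_neighbors(labels))  # 7
```

The sketch does not guard against very small classes; with fewer than three minority samples the result drops below 1, which SMOTE would presumably reject.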

Since this is a demo, the hyperparameter space of each model contains only its best configuration, so GridSearchCV evaluates a single candidate per fold.
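
For orientation, a single-configuration grid might look like the sketch below. The layout is an assumption inferred from how `train_test` unpacks `param_grid[model_group][model_idx]` into a name, an estimator and a grid; the actual contents of `src.models.config` are not shown here, and the names `example_best_grid`, `random_forest` and `decision_tree` are hypothetical. The parameter values are those reported in the outputs below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Hypothetical layout: group name -> list of (model_name, estimator, param_grid).
# Keys carry the 'model__' prefix because the estimator is the 'model' pipeline step.
example_best_grid = {
    "tree_based": [
        ("random_forest", RandomForestClassifier(),
         {"model__max_depth": [10], "model__max_features": ["sqrt"],
          "model__n_estimators": [700]}),
        ("decision_tree", DecisionTreeClassifier(),
         {"model__criterion": ["gini"], "model__max_depth": [10]}),
    ],
}

model_name, estimator, grid = example_best_grid["tree_based"][0]
print(model_name, len(ParameterGrid(grid)))  # random_forest 1
```

A grid with one value per key expands to exactly one candidate, which matches the "1 candidates, totalling 5 fits" lines in the logs.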

## Tree methods

### RandomForestClassifier

In [7]:
train_test(final, 'tree_based', 0, best_param_grid_model)

Fold 1
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9647755840141343 - [val] f1-score=0.7542962549733164
[test] loss=0.242966, acc=0.757034 ,f1-score=0.756145, cfg={'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__n_estimators': 700}
Fold 2
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9601059944112533 - [val] f1-score=0.7497175598252221
[test] loss=0.240684, acc=0.759316 ,f1-score=0.758474, cfg={'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__n_estimators': 700}
Fold 3
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9628100688627288 - [val] f1-score=0.7529944664185985
[test] loss=0.245340, acc=0.754660 ,f1-score=0.754512, cfg={'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__n_estimators': 700}
Fold 4
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9621767014452969 - [val] f1-score=0.7539738827720461
[test] loss=0.25332

### DecisionTreeClassifier

In [8]:
train_test(final, 'tree_based', 1, best_param_grid_model)

Fold 1
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.7971648550017153 - [val] f1-score=0.6595023342039948
[test] loss=0.313688, acc=0.686312 ,f1-score=0.688095, cfg={'model__criterion': 'gini', 'model__max_depth': 10}
Fold 2
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.7888871206585344 - [val] f1-score=0.665936980965929
[test] loss=0.335741, acc=0.664259 ,f1-score=0.663943, cfg={'model__criterion': 'gini', 'model__max_depth': 10}
Fold 3
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.7884672932846353 - [val] f1-score=0.6730207260357671
[test] loss=0.330164, acc=0.669836 ,f1-score=0.670767, cfg={'model__criterion': 'gini', 'model__max_depth': 10}
Fold 4
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.8220645351555069 - [val] f1-score=0.6701473447726369
[test] loss=0.339673, acc=0.660327 ,f1-score=0.661487, cfg={'model__criterion': 'gini', 'model__max_depth': 10}
F

## Naive Bayes methods

### GaussianNB

In [9]:
train_test(final, 'naive_bayes', 0, best_param_grid_model)

Fold 1
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.5095377203746801 - [val] f1-score=0.4496615788650139
[test] loss=0.552471, acc=0.447529 ,f1-score=0.449454, cfg={'model__var_smoothing': 1.873817422860383e-06}
Fold 2
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.5170005290581624 - [val] f1-score=0.4551310919642086
[test] loss=0.541825, acc=0.458175 ,f1-score=0.455238, cfg={'model__var_smoothing': 1.873817422860383e-06}
Fold 3
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.5141008726457633 - [val] f1-score=0.45234112269550114
[test] loss=0.550780, acc=0.449220 ,f1-score=0.449401, cfg={'model__var_smoothing': 1.873817422860383e-06}
Fold 4
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.5061942639452535 - [val] f1-score=0.44574310242581383
[test] loss=0.564854, acc=0.435146 ,f1-score=0.436418, cfg={'model__var_smoothing': 1.873817422860383e-06}
Fold 5
Fitting 5 f

## SVM

### SVC

In [11]:
train_test(final, 'svm', 0, best_param_grid_model)

Fold 1
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9999762318968998 - [val] f1-score=0.8304085616437821
[test] loss=0.167300, acc=0.832700 ,f1-score=0.832594, cfg={'model__C': 100, 'model__gamma': 0.01, 'model__kernel': 'rbf'}
Fold 2
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=1.0 - [val] f1-score=0.8253218735548261
[test] loss=0.171483, acc=0.828517 ,f1-score=0.827915, cfg={'model__C': 100, 'model__gamma': 0.01, 'model__kernel': 'rbf'}
Fold 3
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=1.0 - [val] f1-score=0.820245231309849
[test] loss=0.156714, acc=0.843286 ,f1-score=0.842672, cfg={'model__C': 100, 'model__gamma': 0.01, 'model__kernel': 'rbf'}
Fold 4
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[train] f1-score=0.9999762318957363 - [val] f1-score=0.8282374019669956
[test] loss=0.172309, acc=0.827691 ,f1-score=0.827299, cfg={'model__C': 100, 'model__gamma': 0.01, 'model__ker