# Neural Network

In this notebook we are going to show the main components of the _Feed Forward Neural Network_.
Again, statistics will not be saved, in order to avoid overwriting.

## Imports

In [1]:
import itertools
import os
from typing import Tuple, List

import numpy as np
import pandas as pd
import torch
from sklearn import preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import normalize
from torch import nn
from torch import utils
from torch.nn import CrossEntropyLoss
from torch.utils.data import Dataset
from torchinfo import summary

from src.data.dataset import MovieDataset
from src.models.config import best_param_layers, best_param_grid_mlp
from src.models.network.mlp import execute
from src.models.network.validate import test_eval
from src.utils.const import DATA_DIR, SEED, NUM_BINS, NETWORK_RESULTS_DIR
from src.utils.util_models import fix_random, balancer

### Useful path to data

In [2]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
PROCESSED_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'processed')
INTERIM_MODEL_FOLDER = os.path.join('..', NETWORK_RESULTS_DIR, 'mlp')
if not os.path.exists(INTERIM_MODEL_FOLDER):
    os.mkdir(INTERIM_MODEL_FOLDER)

### Fix random seed

In [3]:
fix_random(SEED)

### Set device

Given PyTorch's support for the use of GPUs, we check which hardware the training will run on.

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print('Using device:', torch.cuda.get_device_name(device))

Using device: NVIDIA GeForce GTX 1070


### Import final dataset

In [5]:
final_stored = pd.read_parquet(os.path.join(PROCESSED_DIR, 'final.parquet'))
final = MovieDataset(final_stored)

## Dataset

The MovieDataset class object represents the dataset that is used within the neural network. This class implements various methods, which are useful for applying data transformations using indices.
In addition, the constructor performs the following operations:
- creating a dictionary to easily access the indices of a specific feature
- split between data and target feature
- target feature discretization
- conversion of data and target features into tensors

In [6]:
class MovieDataset(Dataset):
    def __init__(self, df: pd.DataFrame):
        self.idx_column = {}
        for idx, col_name in enumerate(df.columns):
            self.idx_column[col_name] = idx

        X, y_continuous = self.data_target_split(df)

        self.num_classes = NUM_BINS
        y = self._discretize(y_continuous)

        self.X = torch.tensor(X, dtype=torch.float)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self) -> int:
        return self.X.shape[0]

    def __getitem__(self, idx: int) -> Tuple:
        return self.X[idx, :], self.y[idx]

    @staticmethod
    def data_target_split(df: pd.DataFrame) -> Tuple:
        y = df['rating_mean']
        X = df.drop(columns='rating_mean').to_numpy()
        return X, y

    def _discretize(self, target: pd.Series) -> pd.Series:
        y = pd.cut(target, bins=self.num_classes, labels=False)
        return y

    def scale(self, train_idx, test_idx, scaler, features: List[int]):
        train_data = self.X[train_idx]
        test_data = self.X[test_idx]

        for feature in features:
            feature_train = train_data[:, feature].reshape(-1, 1)
            feature_test = test_data[:, feature].reshape(-1, 1)

            scaled_train = np.squeeze(scaler.fit_transform(feature_train))
            scaled_test = np.squeeze(scaler.transform(feature_test))

            self.X[train_idx, feature] = torch.tensor(scaled_train, dtype=torch.float)
            self.X[test_idx, feature] = torch.tensor(scaled_test, dtype=torch.float)

    def normalize(self, train_idx, test_idx, norm: str = 'l2'):
        train_data = self.X[train_idx]
        test_data = self.X[test_idx]

        norm_train = normalize(train_data, norm=norm)
        norm_test = normalize(test_data, norm=norm)

        self.X[train_idx, :] = torch.tensor(norm_train, dtype=torch.float)
        self.X[test_idx, :] = torch.tensor(norm_test, dtype=torch.float)

## Architecture

The MovieNet class defines at runtime the architecture of our Neural Network following the parameter specification, this was made possible by using ModuleList.
It was decided to keep the size of the hidden layer fixed at the nearest power of two defined by $2 \over 3$ of the number of features.
Since we have $1153$ features, ${1153 * 2 \over 3} = 769$, so the nearest power of two is $512$.

In [7]:
class MovieNet(nn.Module):
    def __init__(
            self,
            input_size: int,
            input_act: nn.Module,
            hidden_size: int,
            hidden_act: nn.Module,
            num_hidden_layers: int,
            output_fn,
            num_classes: int,
            dropout: float = 0.0,
            batch_norm: bool = False
    ) -> None:
        super(MovieNet, self).__init__()

        self.layers = nn.ModuleList([
            nn.Linear(input_size, hidden_size),
            input_act
        ])

        for _ in range(num_hidden_layers):
            self.layers.append(nn.Linear(hidden_size, hidden_size))

            if batch_norm:
                self.layers.append(nn.BatchNorm1d(hidden_size))

            self.layers.append(hidden_act)

            if dropout > 0.0:
                self.layers.append(nn.Dropout(dropout))

        self.layers.append(nn.Linear(hidden_size, num_classes))

        if output_fn:
            self.layers.append(output_fn)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def reset_weights(self):
        for layer in self.layers:
            if hasattr(layer, 'reset_parameters'):
                layer.reset_parameters()

## Train & Test

As with the sklearn models, the two cross validations were applied, one to check that the model works well on several test sets and the other to perform the optimization of the hyperparameters that is not performed using _GridSearchCV_ because of compatibility. For this reason _itertools_ has been used to calculate the Cartesian product of the hyperparameters. Scaling and balancing transformations are then applied before loading the data into the respective _DataLoader_.
The training phase also includes the following components:
- _optimizer_, whose task is to try to minimize the loss function. It has been defined of two types \[_Adam, SGD_\], both have weight decay while only the second uses momentum. These two parameters act on the updating of weights within the network to avoid overfitting (weight decay) and improve both training speed and accuracy (momentum).
- _scheduler_, which acts on the learning rate value by decreasing it every set number of epochs by a certain gamma value.
- _loss function_, only _CrossEntropy_ was used, which allows us to assess how our model is performing. It includes _SoftMax_ activation function and therefore no output layer was added to the network architecture.
- _early stopping_, which allows us to stop the training in advance if the loss does not vary for some epoch during validation.

In addition, the validate and test functions have also been implemented, both of which have the task of calculating metrics. The test additionally provides the possibility of using the classification_report function and printing the roc plot. All info outputs and plots were implemented via the tensorboard library.

In [8]:
def train_test(dataset: MovieDataset):
    features = [
        dataset.idx_column['year'],
        dataset.idx_column['title_length'],
        dataset.idx_column['tag_count'],
        dataset.idx_column['runtime'],
        dataset.idx_column['rating_count']
    ]
    num_workers = 2

    n_splits = 5
    cv_outer = StratifiedKFold(n_splits=n_splits, shuffle=True)

    for fold, (train_idx, test_idx) in enumerate(cv_outer.split(dataset.X, y=dataset.y), 1):
        hyper_parameters_model = itertools.product(
            best_param_layers['input_act'],
            best_param_layers['hidden_act'],
            best_param_layers['hidden_size'],
            best_param_layers['num_hidden_layers'],
            best_param_layers['dropout'],
            best_param_layers['batch_norm'],
            best_param_layers['output_fn'],
            best_param_grid_mlp['starting_lr'],
            best_param_grid_mlp['num_epochs'],
            best_param_grid_mlp['batch_size'],
            best_param_grid_mlp['optim'],
            best_param_grid_mlp['momentum'],
            best_param_grid_mlp['weight_decay'],
        )

        print('=' * 65)
        print(f'Fold {fold}')

        list_fold_stat = []

        data_test = utils.data.Subset(dataset, test_idx)

        loader_test = utils.data.DataLoader(data_test, batch_size=1,
                                            shuffle=False,
                                            num_workers=num_workers)

        for idx, (input_act,
                  hidden_act,
                  hidden_size,
                  num_hidden_layers,
                  dropout,
                  batch_norm,
                  _,
                  starting_lr,
                  num_epochs,
                  batch_size,
                  optimizer_class,
                  momentum,
                  weight_decay) in enumerate(hyper_parameters_model):

            best_val_network = None
            max_f1_val = 0

            cfg = (
                input_act, hidden_act, hidden_size, num_hidden_layers, dropout, batch_norm, starting_lr, num_epochs,
                batch_size, optimizer_class, momentum, weight_decay)

            cv_inner = StratifiedKFold(n_splits=n_splits, shuffle=True)

            for inner_fold, (inner_train_idx, val_idx) in enumerate(
                    cv_inner.split(dataset.X[train_idx], y=dataset.y[train_idx]), 1):

                # Balancing
                train_target = dataset.y[inner_train_idx]
                sampler = balancer(train_target)

                # Scaling
                scaler = preprocessing.MinMaxScaler()
                dataset.scale(train_idx, test_idx, scaler, features)

                data_train = utils.data.Subset(dataset, inner_train_idx)
                data_val = utils.data.Subset(dataset, val_idx)

                loader_train = utils.data.DataLoader(data_train, batch_size=batch_size,
                                                     sampler=sampler,
                                                     pin_memory=True,
                                                     num_workers=num_workers)

                loader_val = utils.data.DataLoader(data_val, batch_size=1,
                                                   shuffle=False,
                                                   num_workers=num_workers)

                input_size = dataset.X.shape[1]
                num_classes = dataset.num_classes
                network = MovieNet(input_size=input_size,
                                   input_act=input_act,
                                   hidden_size=hidden_size,
                                   hidden_act=hidden_act,
                                   num_hidden_layers=num_hidden_layers,
                                   dropout=dropout,
                                   output_fn=None,
                                   num_classes=num_classes)
                network.reset_weights()
                network.to(device)

                if fold == 1 and inner_fold == 1:
                    print('=' * 65)
                    print(f'Configuration [{idx}]: {cfg}')
                    summary(network)

                name_train = f'movie_net_experiment_{idx}'

                if optimizer_class == torch.optim.Adam:
                    optimizer = optimizer_class(network.parameters(),
                                                lr=starting_lr,
                                                weight_decay=weight_decay)
                else:
                    optimizer = optimizer_class(network.parameters(),
                                                lr=starting_lr,
                                                momentum=momentum,
                                                weight_decay=weight_decay)

                fold_stat = execute(name_train,
                                        network,
                                        optimizer,
                                        num_epochs,
                                        loader_train,
                                        loader_val,
                                        device)
                list_fold_stat.append(fold_stat)

                if fold_stat['f1_val'] >= max_f1_val:
                    max_f1_val = fold_stat['f1_val']
                    best_val_network = network

            path = os.path.join(INTERIM_MODEL_FOLDER, f'{fold}_network.pt')
            torch.save(best_val_network, path)

            criterion = CrossEntropyLoss()
            loss_test, acc_test, f1_test = test_eval(fold, loader_test, device, criterion, notebook=True)
            print(f'Test {fold}, loss={loss_test:3f}, accuracy={acc_test:3f}, f1={f1_test:3f}')

In [9]:
train_test(final)

Fold 1
Configuration [0]: (LeakyReLU(negative_slope=0.01), LeakyReLU(negative_slope=0.01), 512, 3, 0.2, True, 0.001, 50, 128, <class 'torch.optim.adam.Adam'>, 0.9, 1e-05)
Epoch: 1  Lr: 0.00100000  Loss: Train = [1.3777] - Val = [0.8493]  Accuracy: Train = [43.84%] - Val = [62.55%]  F1: Train = [0.427] - Val = [0.630]  Time one epoch (s): 9.5267 
Epoch: 2  Lr: 0.00100000  Loss: Train = [0.5713] - Val = [0.6633]  Accuracy: Train = [75.88%] - Val = [71.82%]  F1: Train = [0.757] - Val = [0.720]  Time one epoch (s): 8.0136 
Epoch: 3  Lr: 0.00100000  Loss: Train = [0.4520] - Val = [0.5585]  Accuracy: Train = [80.79%] - Val = [77.52%]  F1: Train = [0.807] - Val = [0.779]  Time one epoch (s): 7.8188 
Epoch: 4  Lr: 0.00100000  Loss: Train = [0.3601] - Val = [0.5188]  Accuracy: Train = [85.09%] - Val = [79.28%]  F1: Train = [0.850] - Val = [0.792]  Time one epoch (s): 7.9086 
Epoch: 5  Lr: 0.00100000  Loss: Train = [0.4351] - Val = [0.5405]  Accuracy: Train = [82.99%] - Val = [77.52%]  F1: Train



Epoch: 1  Lr: 0.00100000  Loss: Train = [1.3928] - Val = [0.8898]  Accuracy: Train = [43.71%] - Val = [59.55%]  F1: Train = [0.425] - Val = [0.600]  Time one epoch (s): 7.8918 
Epoch: 2  Lr: 0.00100000  Loss: Train = [0.6069] - Val = [0.7524]  Accuracy: Train = [74.19%] - Val = [65.73%]  F1: Train = [0.739] - Val = [0.629]  Time one epoch (s): 7.7353 
Epoch: 3  Lr: 0.00100000  Loss: Train = [0.4402] - Val = [0.5341]  Accuracy: Train = [81.30%] - Val = [77.38%]  F1: Train = [0.812] - Val = [0.773]  Time one epoch (s): 7.5419 
Epoch: 4  Lr: 0.00100000  Loss: Train = [0.3557] - Val = [0.5314]  Accuracy: Train = [84.99%] - Val = [77.90%]  F1: Train = [0.849] - Val = [0.781]  Time one epoch (s): 7.6036 
Epoch: 5  Lr: 0.00100000  Loss: Train = [0.3418] - Val = [0.7530]  Accuracy: Train = [85.91%] - Val = [68.77%]  F1: Train = [0.858] - Val = [0.689]  Time one epoch (s): 7.5794 
Epoch: 6  Lr: 0.00010000  Loss: Train = [0.2451] - Val = [0.4519]  Accuracy: Train = [90.03%] - Val = [81.84%]  F1: