# Stratified cross validation

The goal of this notebook is to compare the four training sets obtained so far and decide which of them to carry into the hyperparameter tuning phase.

Each dataset is tested with a default neural network whose number of hidden neurons equals two thirds of the input size plus the output size. All other parameters are kept at their defaults, and training lasts 100 epochs.
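
One reading of this sizing rule, as a minimal sketch: the function name `hidden_units` and the rounding choice are assumptions made for illustration and are not part of `src.model`. The 10 outputs are inferred from the size of the last layer used in the cells of this notebook.

```python
def hidden_units(n_inputs: int, n_outputs: int) -> int:
    """Rule of thumb: two thirds of the input size, plus the output size."""
    # Rounding to the nearest integer is an assumption; the notebook does not specify it.
    return round(2 * n_inputs / 3) + n_outputs


print(hidden_units(132, 10))  # 132 input features, 10 output classes -> 98
```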

Stratified cross-validation is performed to account for class imbalance in the training set. Class weights are also applied during training.
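
To illustrate both ideas, here is a small sketch on hypothetical toy labels (80 samples of one class, 20 of another), not on the notebook's data. It shows that `StratifiedKFold` keeps the class ratio in every test fold, and that scikit-learn's `"balanced"` class weights follow `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5)
# Count the samples of each class in every test fold.
fold_counts = [np.bincount(y[test_idx]).tolist() for _, test_idx in skf.split(X, y)]
print(fold_counts)  # every fold holds 16 samples of class 0 and 4 of class 1

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
weights_dict = dict(zip(np.unique(y).tolist(), weights.tolist()))
print(weights_dict)  # the rarer class receives the larger weight
```

The resulting dictionary has the shape Keras expects for the `class_weight` argument of `fit`.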

In [1]:
import sys
sys.path.append("..")
from src.model import NeuralNetwork
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils import class_weight
from pprint import pprint
import numpy as np
import tensorflow


from numpy.random import seed
seed(1)
tensorflow.random.set_seed(1)
import warnings  
warnings.filterwarnings("ignore")

In [2]:
def get_cross_scores(path, neurons, optimizer="sgd", epochs=100):
    data = pd.read_csv(path)
    x = data.drop("class", axis=1)
    y = data["class"]
    
    kf = StratifiedKFold(n_splits=5)
    
    # Keyword arguments are required by recent scikit-learn releases.
    class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                      classes=np.unique(y),
                                                      y=y)
    weights_dict = dict(zip(np.unique(y), class_weights))
    acc=[]
    loss=[]

    for train_index, test_index in kf.split(x, y):
        net = NeuralNetwork.create_model(neurons=neurons, optimizer=optimizer)
        net.fit(x.iloc[train_index], 
                y.iloc[train_index],
                batch_size=64, 
                epochs=epochs, 
                verbose=0, 
                class_weight=weights_dict)
        scores = net.evaluate(x.iloc[test_index], 
                              y.iloc[test_index], verbose=1)
        acc.append(scores[1])
        loss.append(scores[0])
    
    return {"Accuracy" : (np.mean(acc), np.std(acc), acc),
            "Loss" : (np.mean(loss), np.std(loss), loss)}

## First unscaled dataset
The first model is tested on the unscaled dataset, which has 132 features.

In [3]:
res_1 = get_cross_scores("../data/processed/initial/train_unscaled.csv", (132, 60, 30, 10))
pprint(res_1)

{'Accuracy': (0.11380249559879303,
              0.0038706225007881373,
              [0.1111111119389534,
               0.1111111119389534,
               0.12111110985279083,
               0.11444444209337234,
               0.11123470216989517]),
 'Loss': (nan,
          nan,
          [nan, 2.3026485443115234, 54869046067200.0, 2.300891160964966, nan])}
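
The mean and standard deviation of the loss are `nan` because `np.mean` and `np.std` propagate NaN values, and two of the five folds returned a NaN loss. The sketch below summarizes the fold losses copied from the output above without letting the NaN folds hide the rest. The comparison with `ln(10)` assumes a cross-entropy loss over 10 classes, which the notebook does not state explicitly.

```python
import numpy as np

# Per-fold losses copied from the output above.
loss = np.array([np.nan, 2.3026485443115234, 54869046067200.0,
                 2.300891160964966, np.nan])

n_nan = int(np.isnan(loss).sum())   # number of folds with a NaN loss
print(n_nan)
print(np.isnan(np.mean(loss)))      # True: np.mean propagates NaN
print(np.nanmedian(loss))           # median over the non-NaN folds only
print(np.log(10))                   # cross-entropy of a uniform guess over 10 classes
```

Under that assumption, the two well-behaved folds sit almost exactly at chance level, which agrees with an accuracy of about 11%.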


## First scaled dataset
The second model is tested on the scaled dataset, which has 132 features standardized with Standard Scaler.

In [4]:
res_2 = get_cross_scores("../data/processed/initial/train_scaled.csv", (132, 60, 30, 10))
pprint(res_2)

{'Accuracy': (0.5743455648422241,
              0.03242956403450645,
              [0.5666666626930237,
               0.5644444227218628,
               0.6377778053283691,
               0.5477777719497681,
               0.5550611615180969]),
 'Loss': (2.180751657485962,
          0.12919648526724084,
          [2.0357139110565186,
           2.1174960136413574,
           2.078993558883667,
           2.3363943099975586,
           2.335160493850708])}
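
"Standard Scaler" presumably refers to scikit-learn's `StandardScaler`, which shifts each feature to zero mean and scales it to unit variance. The following NumPy sketch reproduces that transform on hypothetical toy data. The scaled CSV used here was produced elsewhere, so this illustrates the technique and is not the project's preprocessing code.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
test = np.array([[2.0, 4000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # held-out rows reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every feature
print(train_scaled.std(axis=0))   # approximately 1 for every feature
```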


## Extended and scaled dataset
This dataset has more features: 144, standardized with Standard Scaler.

In [5]:
res_3 = get_cross_scores("../data/processed/extended/train_extended.csv", (180, 80, 46, 10))
pprint(res_3)

{'Accuracy': (0.63635174036026,
              0.049398414649388336,
              [0.6311110854148865,
               0.6555555462837219,
               0.7200000286102295,
               0.5922222137451172,
               0.582869827747345]),
 'Loss': (1.894150710105896,
          0.1574803493676554,
          [1.990844488143921,
           1.6214247941970825,
           1.9870859384536743,
           1.816994309425354,
           2.0544040203094482])}


## PCA dataset
This is the extended scaled dataset, reduced by PCA to 120 features.

In [10]:
res_4 = get_cross_scores("../data/processed/extended/train_pca.csv", (102, 45, 30, 10))
pprint(res_4)

{'Accuracy': (0.6187900066375732,
              0.041966724167351435,
              [0.6455555558204651,
               0.6288889050483704,
               0.6744444370269775,
               0.5899999737739563,
               0.5550611615180969]),
 'Loss': (2.0732010126113893,
          0.3064759477588344,
          [1.701684594154358,
           1.7803637981414795,
           2.1056747436523438,
           2.2388598918914795,
           2.539422035217285])}


## Final decision

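The last two datasets, extended scaled and PCA, are selected for hyperparameter tuning because they perform best: mean accuracy is about 0.64 and 0.62 respectively, against 0.57 for the first scaled dataset and 0.11 for the unscaled one.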
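
The comparison can be checked with a short sketch that ranks the datasets by mean fold accuracy. The numbers are copied from the cell outputs of this notebook; the dictionary keys are labels chosen here for readability.

```python
# Mean and standard deviation of fold accuracy, copied from the cell outputs.
results = {
    "unscaled": (0.11380249559879303, 0.0038706225007881373),
    "scaled": (0.5743455648422241, 0.03242956403450645),
    "extended": (0.63635174036026, 0.049398414649388336),
    "pca": (0.6187900066375732, 0.041966724167351435),
}

# Sort by mean accuracy, best first.
ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (mean, std) in ranked:
    print(f"{name:<10} {mean:.3f} +/- {std:.3f}")

top_two = [name for name, _ in ranked[:2]]
print(top_two)  # ['extended', 'pca']
```

The gap between the two best datasets is smaller than their fold-to-fold standard deviation, which supports keeping both for the tuning phase.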