# 5. Validation & Testing

Welcome to the fifth notebook of our six-part tutorial series on Deep Learning for Human Activity Recognition. In the last notebook you learned:

- How do I define a sample neural network architecture in PyTorch? 
- What additional preprocessing do I need to apply to my data to feed it into my network?
- How do I define a train loop which trains my neural network?

This notebook will teach you everything you need to know about validation and testing. When building a predictive pipeline, there are a lot of parameters which need to be set before commencing the actual training. Finding a suitable set of these hyperparameters is called hyperparameter tuning. To get feedback on whether the chosen hyperparameters are a good choice, we check the predictive performance of our model on the validation set. This is called validation.

Now you might ask yourself: if I tune solely based on the validation scores, won't my trained model end up too well optimized for the validation set and thus no longer general? If you asked yourself that question, you are 100% right! This is what we call overfitting, and it is one of the major pitfalls in Machine Learning. An overfitted model performs badly on unseen data.

We therefore need a third dataset, called the test dataset. The test dataset is a part of the initial dataset which you keep separate from all optimization steps. It is only used to gain insights into the predictive performance of the model and must not (!) be used as a reference for tuning hyperparameters. As we mentioned during the theoretical parts of this tutorial, (supervised) Deep Learning, in our opinion, is just a fancy word for function approximation. If your model performs well during both validation and testing, it is a general function which properly approximates the underlying function.
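The interplay of the three datasets can be sketched on toy data. This is only an illustration: the synthetic data from `make_classification`, the k-nearest-neighbours classifier and the candidate values for `n_neighbors` are stand-ins we chose for brevity and are not part of the DL-ARC pipeline. The hyperparameter is picked on the validation set only, and the test set is touched exactly once at the very end.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

# toy data (assumption): 500 samples, split 300 / 100 / 100 into train / valid / test
X_all, y_all = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, y_tr = X_all[:300], y_all[:300]
X_va, y_va = X_all[300:400], y_all[300:400]
X_te, y_te = X_all[400:], y_all[400:]

# hyperparameter tuning: every candidate is scored on the validation set only
valid_scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    valid_scores[k] = f1_score(y_va, model.predict(X_va), average='macro')

# keep the candidate with the best validation score
best_k = max(valid_scores, key=valid_scores.get)

# testing: the untouched test set is used once, after all tuning is finished
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_f1 = f1_score(y_te, final_model.predict(X_te), average='macro')

print('validation F1 per candidate:', valid_scores)
print('chosen n_neighbors:', best_k, '| test F1:', test_f1)
```

If the test score is clearly lower than the best validation score, that is a hint that the tuning has overfitted to the validation set.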

After completing this notebook you will be able to answer the following questions:
- How do I split my initial dataset into a train, validation and test dataset?
- What validation methods exist in Human Activity Recognition? How are they performed?
- How is testing usually performed?

## 5.1. Important Remarks

If you are accessing this tutorial via [Google Colab](https://colab.research.google.com/github/mariusbock/dl-for-har/blob/main/tutorial_notebooks/training.ipynb), first make sure to use Google Colab in English. This will help us to better assist you with issues that might arise during the tutorial. There are two ways to change the default language if it isn't English already:
1. On Google Colab, go to `Help` -> `View in English` 
2. Change the default language of your browser to `English`.

To make it easier to communicate errors, also enable line numbers in the Colab settings:

1. On Google Colab, go to `Tools` -> `Settings` -> `Editor` -> `Show line numbers`

In general, we strongly advise you to use Google Colab as it provides you with a working Python distribution as well as free GPU resources. To make Colab use GPUs, you need to change the current notebook's runtime type via:

- `Runtime` -> `Change runtime type` -> `Dropdown` -> `GPU` -> `Save`

**Hint:** you can auto-complete code in Colab via `ctrl` + `spacebar`

For the live tutorial, we require all participants to use Colab. If you decide to rerun the tutorial at a later point and would rather run it locally on your machine, feel free to clone our [GitHub repository](https://github.com/mariusbock/dl-for-har).

To get started with this notebook, you need to first run the code cell below. Please set `use_colab` to be `True` if you are accessing this notebook via Colab. If not, please set it to `False`. This code cell will make sure that imports from our GitHub repository will work.

In [None]:
import os, sys

use_colab = True

module_path = os.path.abspath(os.path.join('..'))

if use_colab:
    # move to content directory and remove directory for a clean start 
    %cd /content/         
    %rm -rf dl-for-har
    # clone package repository (will throw error if already cloned)
    !git clone https://github.com/mariusbock/dl-for-har.git
    # navigate to dl-for-har directory
    %cd dl-for-har/       
else:
    os.chdir(module_path)
    
# this statement is needed so that we can use the methods of the DL-ARC pipeline
if module_path not in sys.path:
    sys.path.append(module_path)

## 5.1. Splitting your data

Within the first part of this notebook we will split our data into the three datasets mentioned above, namely the train, validation and test dataset. There are multiple ways to split the data into these three datasets, for example:

- **Subject-wise:** split according to participants within the dataset. This means that we reserve certain subjects for the train, validation and test set respectively. For example, given a total of 10 subjects, you could use 6 subjects for training, 2 subjects for validation and 2 subjects for testing.
- **Percentage-wise:** state how large, percentage-wise, your train, validation and test dataset should be compared to the full dataset. For example, you could use 60% of your data for training, 20% for validation and 20% for testing. The three splits can also be stratified, meaning that the relative label distribution within each of the three datasets is kept the same as in the full dataset. Note that stratifying your data requires the data to be shuffled.
- **Record-wise:** state how many records your train, validation and test dataset should contain, i.e. define two cutoff points. For example, given 1 million records in your full dataset, you could put the first 600 thousand records into the train dataset, the next 200 thousand into the validation dataset and the remaining 200 thousand into the test dataset.
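The splitting strategies listed above can be sketched with plain NumPy indexing. The toy array below is made up for illustration (10 subjects with 100 records each; the column layout of subject identifier, one sensor value and a label is an assumption chosen to resemble the data used later in this notebook).

```python
import numpy as np

# toy dataset (assumption): column 0 = subject identifier, column 1 = sensor value, column 2 = label
rng = np.random.default_rng(0)
subject = np.repeat(np.arange(10), 100)
toy = np.column_stack((subject,
                       rng.normal(size=subject.size),
                       rng.integers(0, 3, size=subject.size)))

# subject-wise: 6 subjects for training, 2 for validation, 2 for testing
train_sbj = toy[np.isin(toy[:, 0], [0, 1, 2, 3, 4, 5])]
valid_sbj = toy[np.isin(toy[:, 0], [6, 7])]
test_sbj = toy[np.isin(toy[:, 0], [8, 9])]

# percentage-wise (60/20/20, unshuffled) expressed as two record-wise cutoff points
cut_1, cut_2 = int(0.6 * len(toy)), int(0.8 * len(toy))
train_rec, valid_rec, test_rec = toy[:cut_1], toy[cut_1:cut_2], toy[cut_2:]

print(train_sbj.shape, valid_sbj.shape, test_sbj.shape)
print(train_rec.shape, valid_rec.shape, test_rec.shape)
```

Because every toy subject contributes the same number of records in consecutive order, both splits happen to have identical sizes here. With real recordings of different lengths, the subject-wise and record-wise splits will generally differ.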

**WARNING:** shuffling your dataset during splitting (which is e.g. needed for stratified splits) will destroy the time-dependencies among the data records. To minimize this effect, apply a sliding window on top of your data before splitting. This way, time-dependencies will at least be preserved within the windows. While working on this notebook, we will notify you when this is necessary.
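To illustrate the warning, here is a minimal sketch of "window first, shuffle afterwards" on a made-up recording. The non-overlapping windows and the choice of labelling each window by its last sample are assumptions for this example only; they are not necessarily what the DL-ARC `apply_sliding_window()` function does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy recording (assumption): 1000 time steps, 3 channels, values increase over time
signal = np.arange(3000, dtype=float).reshape(1000, 3)
labels = np.array([0] * 800 + [1] * 200)

# 1) cut the recording into non-overlapping windows of 50 time steps
window = 50
starts = np.arange(0, len(signal) - window + 1, window)
X_win = np.stack([signal[s:s + window] for s in starts])
# label each window with the label of its last sample (an assumption)
y_win = np.array([labels[s + window - 1] for s in starts])

# 2) only now shuffle and stratify: whole windows are moved, never single records
X_tr, X_va, y_tr, y_va = train_test_split(
    X_win, y_win, test_size=0.25, stratify=y_win, shuffle=True, random_state=1)

print(X_tr.shape, X_va.shape)
# within every window the time order is untouched (values are still increasing)
print(bool(np.all(np.diff(X_tr[:, :, 0], axis=1) > 0)))
```

The order of the windows is lost, but the records inside each window still appear in their original temporal order, which is what the network sees as one input sample.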

To keep things simple and fast, we will be splitting our data subject-wise. We will use the data of the first subject for training, the data of the second subject for validation and the data of the third subject for testing. Your first task will be to perform said split. Note that we already imported the dataset for you using the `load_dataset()` function, which is part of the DL-ARC feature stack.

### Task 1: Split the data into train, validation and test data

1. Define the `train` dataset to be the data of the first subject, i.e. with `subject_identifier = 0`. (`lines 13-14`)
2. Define the `valid` dataset to be the data of the second subject, i.e. with `subject_identifier = 1`. (`lines 15-16`)
3. Define the `test` dataset to be the data of the third subject, i.e. with `subject_identifier = 2`. (`lines 17-18`)
4. Define a fourth dataset being a concatenated version of the `train` and `valid` dataset called `train_valid`. You will need this dataset for some of the validation methods. Use `np.concatenate()` in order to concat the two numpy arrays along `axis=0`. (`lines 20-21`)

In [None]:
import numpy as np
import warnings
warnings.filterwarnings("ignore")

from data_processing.preprocess_data import load_dataset


# data loading (we are using a predefined method called load_dataset, which is part of the DL-ARC feature stack)
X, y, num_classes, class_names, sampling_rate, has_null = load_dataset('rwhar_3sbjs', include_null=True)
# since the method returns features and labels separately, we need to concatenate them
data = np.concatenate((X, y[:, None]), axis=1)

# define the train data to be the data of the first subject
train_data = data[data[:, 0] == 0]
# define the valid data to be the data of the second subject
valid_data = data[data[:, 0] == 1]
# define the test data to be the data of the third subject
test_data = data[data[:, 0] == 2]

# define the train_valid_data by concatenating the train and validation dataset 
train_valid_data = np.concatenate((train_data, valid_data), axis=0)

print('\nShape of the train, validation and test dataset:')
print(train_data.shape, valid_data.shape, test_data.shape)
print('\nShape of the concatenated train_valid dataset:')
print(train_valid_data.shape)

## 5.2. Define the hyperparameters

Before we talk about how to perform validation in Human Activity Recognition, we need to define our hyperparameters again. As you know from the previous notebook, it is common practice to track all your settings and parameters in a compiled `config` object. Because we will be using pre-implemented methods from the feature stack of the DL-ARC GitHub, we now need to define a more complex `config` object.

Within the next code block we defined a sample `config` object for you. It contains some parameters which you already know from previous notebooks, but also many which you don't know yet. We will not cover all of them during this tutorial, but encourage you to check out the complete implementation of the DL-ARC. We also separated the parameters into two groups for you: ones which you can play around with, and ones which you should handle with care and rather leave as is.

In [None]:
from misc.torchutils import seed_torch

config = {
    ##### TRY AND CHANGE THESE PARAMETERS ####
    # sliding window settings
    'sw_length': 50,
    'sw_unit': 'units',
    'sampling_rate': 50,
    'sw_overlap': 30,
    # network settings
    'nb_conv_blocks': 2,
    'conv_block_type': 'normal',
    'nb_filters': 64,
    'filter_width': 11,
    'nb_units_lstm': 128,
    'nb_layers_lstm': 1,
    'drop_prob': 0.5,
    # training settings
    'epochs': 10,
    'valid_epoch': 'best',
    'batch_size': 100,
    'loss': 'cross_entropy',
    'weighted': True,
    'weights_init': 'xavier_uniform',
    'optimizer': 'adam',
    'lr': 1e-4,
    'weight_decay': 1e-6,
    'shuffling': True,
    #### FROM HERE ON YOU SHOULD RATHER NOT CHANGE THESE ####
    'no_lstm': False,
    'batch_norm': False,
    'dilation': 1,
    'pooling': False,
    'pool_type': 'max',
    'pool_kernel_width': 2,
    'reduce_layer': False,
    'reduce_layer_output': 10,
    'nb_classes': 8,
    'seed': 1,
    'gpu': 'cuda:0',
    'verbose': False,
    'print_freq': 10,
    'save_gradient_plot': False,
    'print_counts': False,
    'adj_lr': False,
    'adj_lr_patience': 5,
    'early_stopping': False,
    'es_patience': 5,
    'save_test_preds': False
}

## 5.3. Validation

Within the next segment we will explain the most prominent validation methods used in Human Activity Recognition. These are:

- Train-Valid Split
- k-Fold Cross-Validation
- Cross-Participant Cross-Validation

### 5.3.1. Train-Valid Split

The train-valid split is one of the most basic validation methods, and one which you already performed yourself. Instead of varying the validation set to get a more holistic view, we define it to be a fixed part of the data. As mentioned above, there are multiple ways to do so. For simplicity, we chose to use a subject-wise split. Within the next task you will be asked to train your network using the `train` data and obtain predictions on the `valid` data. We do not ask you to define the training loop again and allow you to use the built-in `train` function of the DL-ARC.
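Before the network can be trained, both datasets are cut into windows. The following is a minimal sketch of that idea and not the DL-ARC `apply_sliding_window()` implementation: the helper `toy_sliding_window` is hypothetical, and we assume here that the overlap is given in percent of the window length.

```python
import numpy as np

def toy_sliding_window(data, window_size, overlap_percent):
    """Hypothetical helper: cut a (time, channels) array into overlapping windows."""
    # number of time steps the window moves forward each time (integer arithmetic)
    step = window_size * (100 - overlap_percent) // 100
    starts = range(0, len(data) - window_size + 1, step)
    return np.stack([data[s:s + window_size] for s in starts])

# toy recording (assumption): 500 time steps, 3 sensor channels
recording = np.zeros((500, 3))
windows = toy_sliding_window(recording, window_size=50, overlap_percent=30)

# result has shape (number of windows, window size, number of channels)
print(windows.shape)
```

The second and third dimension of the windowed array are the values which the network needs to know about, i.e. the window size and the number of feature channels.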

#### Task 2: Implementing the train-valid split validation loop

1. As you already defined the train and valid dataset, you can go ahead and apply a sliding window on top of both datasets. You can use the predefined method `apply_sliding_window()`, which is part of the DL-ARC pipeline, to do so. It is already imported for you. We will give you hints on what to pass to the method. (`lines 22-30`)
2. (*Optional*) Omit the first feature column (subject_identifier) from the train and validation dataset. (`lines 32-34`)
3. Within the `config` object, set the parameters `window_size` and `nb_channels` accordingly. (`lines 36-40`)
4. Define the `DeepConvLSTM` object. It is already imported for you. Also define the `optimizer` being the [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and `criterion` being the [Cross-Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (`lines 42-48`)
5. Convert the feature columns of the train and validation data to `float32` and the label column to `uint8` for GPU compatibility. Since the data are NumPy arrays, use their `astype()` method, which works like the [pandas method of the same name](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html). (`lines 50-52`)
6. Use both datasets to run the `train()` function. (`lines 54-55`)

In [None]:
import time
import numpy as np
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, jaccard_score
from model.train import train
from model.DeepConvLSTM import DeepConvLSTM
from data_processing.sliding_window import apply_sliding_window
from misc.torchutils import seed_torch


# in order to get reproducible results, we need to seed torch and other random parts of our implementation
seed_torch(config['seed'])

# needed for saving results
log_date = time.strftime('%Y%m%d')
log_timestamp = time.strftime('%H%M%S')

print(train_data.shape, valid_data.shape)

# apply the sliding window on top of both the train and validation data; use the "apply_sliding_window" function
# found in data_processing.sliding_window
X_train, y_train = apply_sliding_window(train_data[:, :-1], train_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

print(X_train.shape, y_train.shape)

X_valid, y_valid = apply_sliding_window(valid_data[:, :-1], valid_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

print(X_valid.shape, y_valid.shape)

# (optional) omit the first feature column (subject_identifier) from the train and validation dataset
# you can do it if you want to as it is not a useful feature
X_train, X_valid = X_train[:, :, 1:], X_valid[:, :, 1:]

# within the config file, set the parameters 'window_size' and 'nb_channels' accordingly
# window_size = size of the sliding window in units
# nb_channels = number of feature channels
config['window_size'] = X_train.shape[1]
config['nb_channels'] = X_train.shape[2]

# define the network to be a DeepConvLSTM object; can be imported from model.DeepConvLSTM
# pass it the config object
net = DeepConvLSTM(config=config)

# defines the loss and optimizer
loss = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(net.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])

# convert the features of the train and validation to float32 and labels to uint8 for GPU compatibility 
X_train, y_train = X_train.astype(np.float32), y_train.astype(np.uint8)
X_valid, y_valid = X_valid.astype(np.float32), y_valid.astype(np.uint8)

# feed the datasets into the train function; can be imported from model.train
train_valid_net,_, val_output, train_output = train(X_train, y_train, X_valid, y_valid, network=net, optimizer=opt, loss=loss, config=config, log_date=log_date, log_timestamp=log_timestamp)

# the next bit prints out your results if you did everything correctly
cls = np.array(range(config['nb_classes']))

print('\nVALIDATION RESULTS: ')
print("\nAvg. Accuracy: {0}".format(jaccard_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Avg. Precision: {0}".format(precision_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Avg. Recall: {0}".format(recall_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Avg. F1: {0}".format(f1_score(val_output[:, 1], val_output[:, 0], average='macro')))

print("\nVALIDATION RESULTS (PER CLASS): ")
print("\nAccuracy:")
for i, rslt in enumerate(jaccard_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nPrecision:")
for i, rslt in enumerate(precision_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nRecall:")
for i, rslt in enumerate(recall_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nF1:")
for i, rslt in enumerate(f1_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
    print("   {0}: {1}".format(class_names[i], rslt))

print("\nGENERALIZATION GAP ANALYSIS: ")
print("\nTrain-Val-Accuracy Difference: {0}".format(jaccard_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                  jaccard_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Train-Val-Precision Difference: {0}".format(precision_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                   precision_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Train-Val-Recall Difference: {0}".format(recall_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                recall_score(val_output[:, 1], val_output[:, 0], average='macro')))
print("Train-Val-F1 Difference: {0}".format(f1_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                            f1_score(val_output[:, 1], val_output[:, 0], average='macro')))

### 5.3.2. K-Fold Cross-Validation

K-fold cross-validation is the most popular form of cross-validation. Instead of splitting our data only once into a train and validation dataset, as we did in the previous validation method, we average the results of k different train-valid splits. To do so, we take the concatenated train and validation set and split it into k equal-sized chunks. In each fold, we train our network on all but one of these chunks and validate it on the chunk we excluded (which is thus unseen data). The process is repeated k times, i.e. for k folds, so that each chunk serves as the validation dataset exactly once. Note that the network needs to be reinitialized for each fold, i.e. trained from scratch, to ensure that it is not predicting data it has already seen.
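The fold mechanics can be made concrete with scikit-learn's `KFold` on a handful of toy samples. The data and the number of folds are made up for illustration, and the actual training is only indicated by a comment.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data (assumption): 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
# counts how often each sample ended up in the validation chunk
times_validated = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, valid_idx) in enumerate(kf.split(X_toy)):
    # a freshly initialized network would be trained on X_toy[train_idx]
    # and validated on X_toy[valid_idx] here
    times_validated[valid_idx] += 1
    print('fold', fold + 1, '| train:', train_idx, '| valid:', valid_idx)

print(times_validated)
```

After the loop, every sample has been part of the validation chunk exactly once, and the per-fold scores can be averaged.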


**Note:** It is recommended to use stratified k-fold cross-validation, i.e. each of the k chunks of data has the same distribution of labels as the original full dataset. This avoids the risk, especially for unbalanced datasets, of having certain labels missing within the train dataset, which would cause the validation process to break. Nevertheless, as stated above, stratification requires shuffling, so one should always apply the sliding window first and split afterwards.

The next task will lead you through the implementation of the k-fold cross-validation loop. In order to chunk your data and also apply stratification, we recommend you to use the scikit-learn helper object for stratified k-fold cross-validation called `StratifiedKFold`.
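To see what stratification does, consider the following sketch with made-up, unbalanced labels (40 samples of class 0 and 10 samples of class 1). The toy arrays and the variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy data (assumption): unbalanced labels with an 80/20 class distribution
y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.zeros((50, 4))

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_counts = []
for train_idx, valid_idx in skf_toy.split(X_toy, y_toy):
    # number of samples per class within the validation chunk of this fold
    fold_counts.append(np.bincount(y_toy[valid_idx], minlength=2).tolist())

print(fold_counts)
```

Each validation chunk keeps the 80/20 class distribution of the full toy dataset, so no fold ends up without samples of the rare class.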

#### Task 3: Implementing the k-fold CV loop 

1. Define the scikit-learn helper object for stratified k-fold cross-validation called `StratifiedKFold`. It is already imported for you. We will also give you hints on what to pass it as arguments. (`lines 14-16`)
2. Apply the `apply_sliding_window()` function on top of the `train_valid_data` object which you previously defined. (`lines 20-24`)
3. (*Optional*) Omit the first feature column (subject_identifier) from the `train_valid_data` dataset. (`lines 26-28`)
4. Define the k-fold loop; use the `split()` function of the `StratifiedKFold` object to obtain indices to split the `train_valid_data` (`lines 42-49`)
5. Within the `config` object, set the parameters `window_size` and `nb_channels` accordingly. (`lines 51-55`)
6. Define the `DeepConvLSTM` object. It is already imported for you. Also define the `optimizer` being the [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and `criterion` being the [Cross-Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (`lines 57-63`)
7. Convert the feature columns of the train and validation data to `float32` and the label column to `uint8` for GPU compatibility. Since the data are NumPy arrays, use their `astype()` method, which works like the [pandas method of the same name](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html). (`lines 65-67`)
8. Use both datasets to run the `train()` function. (`lines 69-70`)

In [None]:
from sklearn.model_selection import StratifiedKFold


# number of splits, i.e. folds
config['splits_kfold'] = 10

# in order to get reproducible results, we need to seed torch and other random parts of our implementation
seed_torch(config['seed'])

# needed for saving results
log_date = time.strftime('%Y%m%d')
log_timestamp = time.strftime('%H%M%S')

# define the stratified k-fold object; it is already imported for you
# pass it the number of splits, i.e. folds, and seed as well as set shuffling to true
skf = StratifiedKFold(n_splits=config['splits_kfold'], shuffle=True, random_state=config['seed'])
    
print(train_valid_data.shape)

# apply the sliding window on top of the train_valid_data; use the "apply_sliding_window" function
# found in data_processing.sliding_window
X_train_valid, y_train_valid = apply_sliding_window(train_valid_data[:, :-1], train_valid_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

print(X_train_valid.shape, y_train_valid.shape)

# (optional) omit the first feature column (subject_identifier) from the train_valid_data
# you can do it if you want to as it is not a useful feature
X_train_valid = X_train_valid[:, :, 1:]

# result objects used for accumulating the scores across folds; add each fold result to these objects so that they
# are averaged at the end of the k-fold loop
kfold_accuracy = np.zeros(config['nb_classes'])
kfold_precision = np.zeros(config['nb_classes'])
kfold_recall = np.zeros(config['nb_classes'])
kfold_f1 = np.zeros(config['nb_classes'])
    
kfold_accuracy_gap = 0
kfold_precision_gap = 0
kfold_recall_gap = 0
kfold_f1_gap = 0

# k-fold validation loop; for each loop iteration return fold identifier and indices which can be used to split
# the train + valid data into train and validation data according to the current fold
for j, (train_index, valid_index) in enumerate(skf.split(X_train_valid, y_train_valid)):
    print('\nFold {0}/{1}'.format(j + 1, config['splits_kfold']))
    
    # split the data into train and validation data; to do so, use the indices produced by the split function
    X_train, X_valid = X_train_valid[train_index], X_train_valid[valid_index]
    y_train, y_valid = y_train_valid[train_index], y_train_valid[valid_index]
    
    # within the config file, set the parameters 'window_size' and 'nb_channels' accordingly
    # window_size = size of the sliding window in units
    # nb_channels = number of feature channels
    config['window_size'] = X_train.shape[1]
    config['nb_channels'] = X_train.shape[2]
    
    # define the network to be a DeepConvLSTM object; can be imported from model.DeepConvLSTM
    # pass it the config object
    net = DeepConvLSTM(config=config)
    
    # defines the loss and optimizer
    loss = torch.nn.CrossEntropyLoss()
    opt = torch.optim.Adam(net.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])

    # convert the features of the train and validation to float32 and labels to uint8 for GPU compatibility 
    X_train, y_train = X_train.astype(np.float32), y_train.astype(np.uint8)
    X_valid, y_valid = X_valid.astype(np.float32), y_valid.astype(np.uint8)
    
    # feed the datasets into the train function; can be imported from model.train
    kfold_net, _, val_output, train_output = train(X_train, y_train, X_valid, y_valid, network=net, optimizer=opt, loss=loss, config=config, log_date=log_date, log_timestamp=log_timestamp)
        
    # in the following validation and train evaluation metrics are calculated
    cls = np.array(range(config['nb_classes']))
    val_accuracy = jaccard_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)
    val_precision = precision_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)
    val_recall = recall_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)
    val_f1 = f1_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)
    train_accuracy = jaccard_score(train_output[:, 1], train_output[:, 0], average=None, labels=cls)
    train_precision = precision_score(train_output[:, 1], train_output[:, 0], average=None, labels=cls)
    train_recall = recall_score(train_output[:, 1], train_output[:, 0], average=None, labels=cls)
    train_f1 = f1_score(train_output[:, 1], train_output[:, 0], average=None, labels=cls)
    
    # add up the fold results
    kfold_accuracy += val_accuracy
    kfold_precision += val_precision
    kfold_recall += val_recall
    kfold_f1 += val_f1

    # add up the generalization gap results
    kfold_accuracy_gap += train_accuracy - val_accuracy
    kfold_precision_gap += train_precision - val_precision
    kfold_recall_gap += train_recall - val_recall
    kfold_f1_gap += train_f1 - val_f1
    
# the next bit prints out the average results across folds if you did everything correctly
print("\nK-FOLD VALIDATION RESULTS: ")
print("Accuracy: {0}".format(np.mean(kfold_accuracy / config['splits_kfold'])))
print("Precision: {0}".format(np.mean(kfold_precision / config['splits_kfold'])))
print("Recall: {0}".format(np.mean(kfold_recall / config['splits_kfold'])))
print("F1: {0}".format(np.mean(kfold_f1 / config['splits_kfold'])))
    
print("\nVALIDATION RESULTS (PER CLASS): ")
print("\nAccuracy:")
for i, rslt in enumerate(kfold_accuracy / config['splits_kfold']):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nPrecision:")
for i, rslt in enumerate(kfold_precision / config['splits_kfold']):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nRecall:")
for i, rslt in enumerate(kfold_recall / config['splits_kfold']):
    print("   {0}: {1}".format(class_names[i], rslt))
print("\nF1:")
for i, rslt in enumerate(kfold_f1 / config['splits_kfold']):
    print("   {0}: {1}".format(class_names[i], rslt))
    
print("\nGENERALIZATION GAP ANALYSIS: ")
print("\nAccuracy: {0}".format(kfold_accuracy_gap / config['splits_kfold']))
print("Precision: {0}".format(kfold_precision_gap / config['splits_kfold']))
print("Recall: {0}".format(kfold_recall_gap / config['splits_kfold']))
print("F1: {0}".format(kfold_f1_gap / config['splits_kfold']))

### 5.3.3. Cross-Participant Cross-Validation

Cross-participant cross-validation, also known as Leave-One-Subject-Out (LOSO) cross-validation, is the most complex, but also the most expressive validation method one can apply when dealing with multi-subject data. In general, it can be seen as a variation of k-fold cross-validation with k being the number of subjects. Within each fold, you train your network on the data of all but one subject and validate it on the left-out subject. The process is repeated as many times as there are subjects, so that each subject becomes the validation set exactly once. This way, each subject is treated as unseen data exactly once.

Leaving one subject out each fold ensures that the overall evaluation of the algorithm does not overfit on subject-specific traits, i.e. how subjects performed the activities individually. It is therefore a great method to obtain a model which is good at predicting activities no matter which person performs them, i.e. a more general model!
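For comparison, scikit-learn ships a splitter called `LeaveOneGroupOut` which produces exactly this kind of split when the subject identifiers are passed as groups. The sketch below uses made-up toy data with three subjects; it is an alternative way to obtain the split indices and not the approach used in the task of this notebook, which filters the data by subject identifier directly.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# toy data (assumption): 3 subjects with 4 records each
subjects = np.repeat([0, 1, 2], 4)
X_toy = np.arange(len(subjects) * 2).reshape(-1, 2)
y_toy = np.tile([0, 1], 6)

logo = LeaveOneGroupOut()
held_out = []
for train_idx, valid_idx in logo.split(X_toy, y_toy, groups=subjects):
    # the validation part contains one subject only, the train part all others
    valid_subjects = np.unique(subjects[valid_idx]).tolist()
    train_subjects = np.unique(subjects[train_idx]).tolist()
    held_out.append(valid_subjects)
    print('train on subjects', train_subjects, '| validate on subject', valid_subjects)
```

Each subject is held out exactly once, and its records never appear in the train data of the fold in which it is validated.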

The next task will lead you through the implementation of the cross-participant cross-validation loop.

#### Task 4: Implementing the cross-participant CV loop

1. Define a loop which iterates over the identifiers of all subjects. (`lines 8-10`)
2. Define the `train` data to be everything but the current subject's data and the `valid` data to be the current subject's data by filtering the `train_valid_data`. (`lines 12-15`)
3. Apply the `apply_sliding_window()` function on top of the filtered datasets you just defined. (`lines 19-27`)
4. (*Optional*) Omit the first feature column (subject_identifier) from the train and validation dataset. (`lines 29-31`)
5. Within the `config` object, set the parameters `window_size` and `nb_channels` accordingly. (`lines 33-37`)
6. Define the `DeepConvLSTM` object. It is already imported for you. Also define the `optimizer` being the [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and `criterion` being the [Cross-Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (`lines 39-45`)
7. Convert the feature columns of the train and validation data to `float32` and the label column to `uint8` for GPU compatibility. Since the data are NumPy arrays, use their `astype()` method, which works like the [pandas method of the same name](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html). (`lines 47-49`)
8. Use both datasets to run the `train()` function. (`lines 51-52`)

In [None]:
# needed for saving results
log_date = time.strftime('%Y%m%d')
log_timestamp = time.strftime('%H%M%S')

# in order to get reproducible results, we need to seed torch and other random parts of our implementation
seed_torch(config['seed'])

# iterate over all subjects
for sbj in np.unique(train_valid_data[:, 0]):
    print('\n VALIDATING FOR SUBJECT {0} OF {1}'.format(int(sbj) + 1, int(np.max(train_valid_data[:, 0])) + 1))
    
    # define the train data to be everything, but the data of the current subject
    train_data = train_valid_data[train_valid_data[:, 0] != sbj]
    # define the validation data to be the data of the current subject
    valid_data = train_valid_data[train_valid_data[:, 0] == sbj]
    
    print(train_data.shape, valid_data.shape)
    
    # apply the sliding window on top of both the train and validation data; use the "apply_sliding_window" function
    # found in data_processing.sliding_window
    X_train, y_train = apply_sliding_window(train_data[:, :-1], train_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

    print(X_train.shape, y_train.shape)

    X_valid, y_valid = apply_sliding_window(valid_data[:, :-1], valid_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

    print(X_valid.shape, y_valid.shape)

    # (optional) omit the first feature column (subject_identifier) from the train and validation dataset
    # you can do it if you want to as it is not a useful feature
    X_train, X_valid = X_train[:, :, 1:], X_valid[:, :, 1:]
    
    # within the config file, set the parameters 'window_size' and 'nb_channels' accordingly
    # window_size = size of the sliding window in units
    # nb_channels = number of feature channels
    config['window_size'] = X_train.shape[1]
    config['nb_channels'] = X_train.shape[2]
    
    # define the network to be a DeepConvLSTM object; can be imported from model.DeepConvLSTM
    # pass it the config object
    net = DeepConvLSTM(config=config)

    # defines the loss and optimizer
    loss = torch.nn.CrossEntropyLoss()
    opt = torch.optim.Adam(net.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])

    # convert the features of the train and validation to float32 and labels to uint8 for GPU compatibility 
    X_train, y_train = X_train.astype(np.float32), y_train.astype(np.uint8)
    X_valid, y_valid = X_valid.astype(np.float32), y_valid.astype(np.uint8)
    
    # feed the datasets into the train function; can be imported from model.train
    cross_participant_net, _, val_output, train_output = train(X_train, y_train, X_valid, y_valid, network=net, optimizer=opt, loss=loss, config=config, log_date=log_date, log_timestamp=log_timestamp)
    
    # the next bit prints the average results per subject; note that "Accuracy" below is computed with jaccard_score
    cls = np.array(range(config['nb_classes']))
    
    print('\nVALIDATION RESULTS FOR SUBJECT {0}: '.format(int(sbj) + 1))
    print("\nAvg. Accuracy: {0}".format(jaccard_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Avg. Precision: {0}".format(precision_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Avg. Recall: {0}".format(recall_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Avg. F1: {0}".format(f1_score(val_output[:, 1], val_output[:, 0], average='macro')))

    print("\nVALIDATION RESULTS (PER CLASS): ")
    print("\nAccuracy:")
    for i, rslt in enumerate(jaccard_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
        print("   {0}: {1}".format(class_names[i], rslt))
    print("\nPrecision:")
    for i, rslt in enumerate(precision_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
        print("   {0}: {1}".format(class_names[i], rslt))
    print("\nRecall:")
    for i, rslt in enumerate(recall_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
        print("   {0}: {1}".format(class_names[i], rslt))
    print("\nF1:")
    for i, rslt in enumerate(f1_score(val_output[:, 1], val_output[:, 0], average=None, labels=cls)):
        print("   {0}: {1}".format(class_names[i], rslt))

    print("\nGENERALIZATION GAP ANALYSIS: ")
    print("\nTrain-Val-Accuracy Difference: {0}".format(jaccard_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                      jaccard_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Train-Val-Precision Difference: {0}".format(precision_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                       precision_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Train-Val-Recall Difference: {0}".format(recall_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                    recall_score(val_output[:, 1], val_output[:, 0], average='macro')))
    print("Train-Val-F1 Difference: {0}".format(f1_score(train_output[:, 1], train_output[:, 0], average='macro') -
                                                f1_score(val_output[:, 1], val_output[:, 0], average='macro')))
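
The per-subject split used in the loop above can be checked in isolation. The following is a minimal sketch on toy data, assuming (as in the cell above) that column 0 holds the subject identifier and the last column holds the label; the array `toy` and the list `splits` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# toy data: column 0 = subject id, columns 1-2 = sensor features, last column = label
rng = np.random.default_rng(0)
subjects = np.repeat([0, 1, 2], 4)            # 3 subjects with 4 samples each
features = rng.normal(size=(12, 2))
labels = rng.integers(0, 2, size=12)
toy = np.column_stack([subjects, features, labels])

splits = []
for sbj in np.unique(toy[:, 0]):
    # train on everyone except the current subject, validate on the current subject
    train = toy[toy[:, 0] != sbj]
    valid = toy[toy[:, 0] == sbj]
    # the held-out subject must never appear in the training data
    assert sbj not in train[:, 0]
    splits.append((int(sbj), train.shape, valid.shape))
    print(int(sbj), train.shape, valid.shape)
```

Each subject is held out exactly once, so the number of trained networks equals the number of subjects, and the validation score of each fold reflects performance on a person the network has never seen.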

## 5.4 Testing

Now that we have implemented each of the validation techniques, we want an unbiased view of how our trained networks perform on unseen data. To get it, we use the test set which we split off from the original dataset in the first step of this notebook.

### Task 5: Testing your trained networks

1. Apply the `apply_sliding_window()` function on top of the `test` data. (`lines 7-9`)
2. (*Optional*) Omit the first feature column (subject_identifier) from the test dataset. (`lines 12-14`)
3. Convert the feature columns of the test dataset to `float32` and the label column to `uint8` for GPU compatibility. Use the NumPy array method [`astype()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html). (`lines 17-18`)
4. Use the `predict()` function of the DL-ARC GitHub to obtain results on the `test` data, passing each of the trained networks as input. The function is already imported for you. (`lines 20-29`)
5. Which model performs best? Was this expected? Can you think of a reason why?
6. What would you change about the pipeline we just created if your goal was to get the best predictions possible? Hint: think about how much data was actually used to train your model in the end!
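
Steps 1 to 3 can be tried out on a toy array before running them on the real test data. The sketch below is only an illustration: `toy_sliding_window` is a hypothetical, simplified stand-in and not the DL-ARC `apply_sliding_window()` function. It assumes the window size is given in samples and that each window takes the label of its last sample, which may differ from the actual implementation.

```python
import numpy as np

def toy_sliding_window(x, y, window_size, overlap_ratio):
    """Hypothetical stand-in: cut x into windows of window_size samples and
    label each window with the label of its last sample (an assumption)."""
    step = max(1, int(window_size * (1 - overlap_ratio)))
    starts = range(0, len(x) - window_size + 1, step)
    windows = np.stack([x[s:s + window_size] for s in starts])
    window_labels = np.array([y[s + window_size - 1] for s in starts])
    return windows, window_labels

# toy test data: 20 samples, column 0 = subject id, followed by 3 sensor channels
x = np.column_stack([np.zeros(20), np.arange(60).reshape(20, 3)])
y = np.repeat([0, 1], 10)

# step 1: window the data (8 samples per window, 50% overlap)
X_test, y_test = toy_sliding_window(x, y, window_size=8, overlap_ratio=0.5)
# step 2: drop the subject identifier channel
X_test = X_test[:, :, 1:]
# step 3: convert features to float32 and labels to uint8
X_test, y_test = X_test.astype(np.float32), y_test.astype(np.uint8)

print(X_test.shape, X_test.dtype, y_test.shape, y_test.dtype)
```

The resulting feature array has the shape (windows, window size, channels). Its last two dimensions have to match the `window_size` and `nb_channels` values the networks were trained with.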

In [None]:
from model.train import predict


# in order to get reproducible results, we need to seed torch and other random parts of our implementation
seed_torch(config['seed'])

# apply the sliding window on top of the test data; use the "apply_sliding_window" function
# found in data_processing.sliding_window
X_test, y_test = apply_sliding_window(test_data[:, :-1], test_data[:, -1], sliding_window_size=config['sw_length'], unit=config['sw_unit'], sampling_rate=config['sampling_rate'], sliding_window_overlap=config['sw_overlap'])

print(X_test.shape, y_test.shape)

# (optional) omit the first feature column (subject_identifier) from the test dataset
# you need to do it if you did so during training and validation!
X_test = X_test[:, :, 1:]

# convert the features of test to float32 and labels to uint8 for GPU compatibility 
X_test, y_test = X_test.astype(np.float32), y_test.astype(np.uint8)

# the next lines will print out the test results for each of the trained networks
print('COMPILED TEST RESULTS: ')
print('\nTest results (train-valid-split): ')
predict(X_test, y_test, train_valid_net, config, log_date, log_timestamp)

print('\nTest results (k-fold): ')
predict(X_test, y_test, kfold_net, config, log_date, log_timestamp)

print('\nTest results (cross-participant): ')
predict(X_test, y_test, cross_participant_net, config, log_date, log_timestamp)
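
To compare the three networks side by side, it can help to reduce each set of test predictions to a single score. The return value of `predict()` is not shown in this notebook, so the sketch below assumes you already have one array of predicted labels per network. The helper `macro_f1` and the dictionary `toy_predictions` are hypothetical and filled with made-up values purely for illustration.

```python
import numpy as np

def macro_f1(y_true, y_pred, nb_classes):
    """Unweighted mean of the per-class F1 scores.
    A class that is neither present nor predicted contributes 0."""
    scores = []
    for c in range(nb_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# made-up ground truth and predictions, for illustration only
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
toy_predictions = {
    'train-valid-split': np.array([0, 0, 1, 1, 2, 2]),
    'k-fold': np.array([0, 1, 1, 1, 2, 2]),
    'cross-participant': np.array([0, 0, 0, 0, 2, 2]),
}

results = {name: macro_f1(y_true_toy, preds, nb_classes=3)
           for name, preds in toy_predictions.items()}

# print the networks from best to worst macro F1
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print('{0}: {1:.3f}'.format(name, score))
```

Macro averaging weights every class equally, so a network that ignores a rare activity is penalized even if its overall hit rate looks high.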