This notebook demonstrates how the library works.


# 1. Structure 

root\
├── callbacks // here you can create your custom callbacks \
├── checkpoint // where we store the trained models \
├── data // quick data import and transformation pipeline \
│ ├── selection // data selection for train-validation-test split\
│ ├── transformation // custom transformation compatible with torchvision.transform\
│ ├── torchData // custom torch Dataset and DataLoader\
│ ├── custom_data.py // data for specific format\
│ ├── load_npy_format.py // transform and load csv files into npy files\
│ └── utils.py\
├── laboratory // notebooks for running experiments\
│ ├── saved_model \
│ └── record \
├── losses // custom losses\
├── metrics // custom metrics\
├── main.py **to be edited**\
├── models // quick default model setup \
│ ├── baseline.py // baseline models\
│ ├── cnn.py // torchvision CNN models\
│ ├── self_supervised.py // torch.nn.module for contrastive learning, **to be deprecated** \
│ ├── temproal.py // CNN-LSTM\
│ └── utils.py // utility torch.nn.module\
├── playground.ipynb // fast experiment with things\
├── README.md\
├── test // to be implemented\
└── utils.py // utility functions\

In [1]:
# In the other notebooks all required libraries are pre-loaded; this demo loads them one by one instead
import os
import sys
import numpy as np
import pandas as pd
import torch

In [None]:
# OPTIONAL: some notebooks are not located in the root, so add the parent directory to the system path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [45]:
# random seed
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

# gpu setting
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if DEVICE.type == "cuda":
    # set_device only accepts CUDA devices, so skip it on CPU-only machines
    torch.cuda.set_device(DEVICE)
device = DEVICE

# 2. Data preparation

In [3]:
# Please edit the following arguments 
data_dir = 'E:\\external_data\\opera_csi\\Session_2\\experiment_data\\experiment_data\\exp_7_amp_spec_only\\npy_format' 
readtype = 'npy' # type of file, currently supports 'csv' or 'npy'
splitchar = '\\'
fpath = '.\\laboratory' # for saving models and records 

Each row of the **filepath-dataframe** consists of the full path of a file and its corresponding classes, derived from its filename or folder level. The standard and most versatile way to build one is `filepath_dataframe` from `data.utils`.

In [5]:
from data.utils import filepath_dataframe

# this is the most versatile way to read the files, folder levels represent classes  
df = filepath_dataframe(data_dir,splitchar)
df.head()

Unnamed: 0,fullpath,class_1,class_2
0,E:\external_data\opera_csi\Session_2\experimen...,bodyrotate,nuc1
1,E:\external_data\opera_csi\Session_2\experimen...,bodyrotate,nuc1
2,E:\external_data\opera_csi\Session_2\experimen...,bodyrotate,nuc1
3,E:\external_data\opera_csi\Session_2\experimen...,bodyrotate,nuc1
4,E:\external_data\opera_csi\Session_2\experimen...,bodyrotate,nuc1
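
To make the folder-level idea concrete, here is a minimal sketch of how such a table could be built. It is not the implementation in `data.utils`; the function name `sketch_filepath_dataframe` and the `class_N` column naming are assumptions chosen to mirror the output above.

```python
import os
import pandas as pd

def sketch_filepath_dataframe(data_dir):
    """Hypothetical helper: one row per file, one class column per folder level."""
    rows = []
    for dirpath, _, filenames in os.walk(data_dir):
        # Folder names between data_dir and the file act as class labels.
        rel_dir = os.path.relpath(dirpath, data_dir)
        levels = [] if rel_dir == os.curdir else rel_dir.split(os.sep)
        for fname in sorted(filenames):
            row = {"fullpath": os.path.join(dirpath, fname)}
            for i, level in enumerate(levels, start=1):
                row[f"class_{i}"] = level
            rows.append(row)
    return pd.DataFrame(rows)
```

Using `os.sep` and `os.path.relpath` avoids passing a split character by hand, which is a design choice of this sketch rather than of the library.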


Functions in `data.custom_data` were created for the recently published data. Its `filepath_dataframe` creates a filepath-dataframe that extracts more information from the filename, and `nucPaired_fpDataframe` joins the data based on the NUC unit.

In [16]:
from data.custom_data import filepath_dataframe

# this is specifically for the recently published data
df = filepath_dataframe(data_dir,splitchar)
print(df.shape)
df.head()

(5812, 8)


Unnamed: 0,fullpath,exp,person,room,activity,index,nuc,key
0,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,10,NUC1,exp_005_person_One_room_1_bodyrotate_index_10
1,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,11,NUC1,exp_005_person_One_room_1_bodyrotate_index_11
2,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,12,NUC1,exp_005_person_One_room_1_bodyrotate_index_12
3,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,13,NUC1,exp_005_person_One_room_1_bodyrotate_index_13
4,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,14,NUC1,exp_005_person_One_room_1_bodyrotate_index_14


In [17]:
from data.custom_data import nucPaired_fpDataframe

pair_df = nucPaired_fpDataframe(df)
print(pair_df.shape)
pair_df.head()

(2906, 9)


Unnamed: 0,fullpath_x,exp,person,room,activity,index,nuc,key,fullpath_y
0,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,10,NUC1,exp_005_person_One_room_1_bodyrotate_index_10,E:\external_data\opera_csi\Session_2\experimen...
1,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,11,NUC1,exp_005_person_One_room_1_bodyrotate_index_11,E:\external_data\opera_csi\Session_2\experimen...
2,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,12,NUC1,exp_005_person_One_room_1_bodyrotate_index_12,E:\external_data\opera_csi\Session_2\experimen...
3,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,13,NUC1,exp_005_person_One_room_1_bodyrotate_index_13,E:\external_data\opera_csi\Session_2\experimen...
4,E:\external_data\opera_csi\Session_2\experimen...,5,One,1,bodyrotate,14,NUC1,exp_005_person_One_room_1_bodyrotate_index_14,E:\external_data\opera_csi\Session_2\experimen...
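
The `fullpath_x` / `fullpath_y` columns and the halved row count are what a pandas merge on `key` would produce. The sketch below illustrates that idea on toy data. It is an assumption about how such a pairing could work, not the code of `nucPaired_fpDataframe`; the function name and the `'NUC2'` label are hypothetical.

```python
import pandas as pd

def sketch_nuc_pairing(df, left="NUC1", right="NUC2"):
    """Hypothetical pairing: join rows of two NUC units that share the same key."""
    left_df = df[df["nuc"] == left]
    right_df = df.loc[df["nuc"] == right, ["key", "fullpath"]]
    # pandas appends _x/_y to the overlapping 'fullpath' column.
    return left_df.merge(right_df, on="key", how="inner")

toy = pd.DataFrame({
    "fullpath": ["a_nuc1.npy", "a_nuc2.npy", "b_nuc1.npy", "b_nuc2.npy"],
    "activity": ["walk", "walk", "sit", "sit"],
    "nuc": ["NUC1", "NUC2", "NUC1", "NUC2"],
    "key": ["rec_a", "rec_a", "rec_b", "rec_b"],
})
paired = sketch_nuc_pairing(toy)
```

An inner join silently drops recordings that exist for only one unit, so it is worth comparing row counts before and after pairing.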


I found that the biggest bottleneck in loading the data is the file format, so I advise saving the data in .npy format and using the newly generated files instead. This can speed up loading by a factor of about 100. Run the following command from the root directory:

In [None]:
!python ./data/load_npy_format.py
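
For intuition, the conversion amounts to parsing each CSV once and storing the array in NumPy's binary format, so later reads skip text parsing. The sketch below assumes a purely numeric, header-free CSV; `sketch_csv_to_npy` is a hypothetical helper and does not reproduce what `load_npy_format.py` actually does.

```python
import os
import numpy as np

def sketch_csv_to_npy(csv_path, npy_path, delimiter=","):
    """Hypothetical converter: parse a numeric CSV once, store it as binary .npy."""
    data = np.loadtxt(csv_path, delimiter=delimiter)
    # Create the target folder if it does not exist yet.
    os.makedirs(os.path.dirname(os.path.abspath(npy_path)), exist_ok=True)
    np.save(npy_path, data)
    return data.shape
```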

Now that we have a filepath-dataframe that represents the actual data, we can perform data selection by manipulating it. `data.selection` is a group of functions that split the dataframe into train, validation and test sets for the **newly published data**. `Selection` is the standardised way to split the dataset. It takes the following arguments:

- split (str): splitting method, either `'random'` or `'loov'` (leave one participant out)
- test_sub (str/float): depends on the splitting method. If `'random'`, it is the fraction of the data used as the test set; if `'loov'`, it is the `'person'` in the filepath-dataframe held out as the test subject
- val_sub (str/float): depends on the splitting method. If `'random'`, it is the fraction of the data used as the validation set; if `'loov'`, it is the `'person'` in the filepath-dataframe held out as the validation subject; if `None`, no validation set is created
- nuc (str/list): NUC units to be included
- room (int/list): rooms to be included
- sample_per_class (int/bool): number of samples to select for each class; if `None`, no per-class sampling is applied
- \*\*kwargs: torch DataLoader arguments

In [30]:
from data.selection import Selection

# standardised way to split dataset 
data_selection = Selection(split='random',test_sub=0.2,val_sub=0.1,nuc='NUC1',room=1,sample_per_class=None)
df_train,df_val,df_test = data_selection(df)
print(f"Train size: {df_train.shape}\tValidation size: {df_val.shape}\tTest size: {df_test.shape}")


Train size: (1324, 8)	Validation size: (190, 8)	Test size: (162, 8)
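
The two splitting modes can be sketched roughly as follows. This is a simplified illustration, not the code of `Selection`: it ignores the `nuc`, `room` and `sample_per_class` filters, and the function name `sketch_split` and its `seed` parameter are assumptions.

```python
import pandas as pd

def sketch_split(df, split="random", test_sub=0.2, val_sub=0.1, seed=42):
    """Hypothetical splitter returning (train, val, test) dataframes."""
    if split == "random":
        # Shuffle once, then cut the shuffled rows into test / val / train.
        shuffled = df.sample(frac=1.0, random_state=seed)
        n_test = int(round(len(df) * test_sub))
        n_val = int(round(len(df) * val_sub)) if val_sub is not None else 0
        test = shuffled.iloc[:n_test]
        val = shuffled.iloc[n_test:n_test + n_val] if val_sub is not None else None
        train = shuffled.iloc[n_test + n_val:]
    elif split == "loov":
        # Hold out whole participants so no person appears in two sets.
        test = df[df["person"] == test_sub]
        val = df[df["person"] == val_sub] if val_sub is not None else None
        held_out = [p for p in (test_sub, val_sub) if p is not None]
        train = df[~df["person"].isin(held_out)]
    else:
        raise ValueError(f"unknown split: {split}")
    return train, val, test
```

With `'loov'`, the test score reflects generalisation to an unseen person, which a random split cannot measure because samples from the same person land in both sets.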


For a quick setup, `SelectionSet_1` to `SelectionSet_5` provide several predefined settings.

In [31]:
from data.selection import SelectionSet_1

# Alternative predefined selection, total of 5 available 
data_selection = SelectionSet_1()
df_train,df_val,df_test = data_selection(df)
print(f"Train size: {df_train.shape}\tValidation size: {df_val.shape}\tTest size: {df_test.shape}")

Train size: (1433, 8)	Validation size: (173, 8)	Test size: (132, 8)


The original data may not be in the format the algorithm expects. `data.transformation` follows the **torchvision** transformation pipeline, so each sample is transformed right after it is read from file. Three transformation pipelines are currently available. You can also use your own pipeline, provided it is compatible with [torchvision](https://pytorch.org/vision/stable/transforms.html).

In [None]:
from data.transformation import Transform_CnnLstmS,Transform_CnnS,Transform_Cnn

transform = Transform_CnnS()
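
If you write your own pipeline, each step only needs to be a callable that takes a sample and returns the transformed sample, since `torchvision.transforms.Compose` simply calls its steps in order. The NumPy-only sketch below illustrates that contract. All class names are hypothetical, the steps are not the ones used by `Transform_CnnS`, and `apply_pipeline` is a stand-in for `Compose`.

```python
import numpy as np

class ToFloat32:
    """Cast the raw array to float32."""
    def __call__(self, x):
        return np.asarray(x, dtype=np.float32)

class Standardise:
    """Zero-mean, unit-variance scaling over the whole sample."""
    def __init__(self, eps=1e-8):
        self.eps = eps

    def __call__(self, x):
        return (x - x.mean()) / (x.std() + self.eps)

class AddChannel:
    """Add a leading channel axis: (H, W) -> (1, H, W)."""
    def __call__(self, x):
        return x[np.newaxis, ...]

def apply_pipeline(steps, x):
    # Same contract as torchvision.transforms.Compose: call each step in order.
    for step in steps:
        x = step(x)
    return x

sample = np.arange(12).reshape(3, 4)
out = apply_pipeline([ToFloat32(), Standardise(), AddChannel()], sample)
```

The same three step objects could be passed as a list to `torchvision.transforms.Compose` without modification.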

Now that we have three filepath-dataframes, we create a `DataLoading` object that sets up our predefined DataLoader. It takes the following arguments:

- transform (torchvision.transforms.transforms.Compose) - transformation pipeline
- batch_size (int) - batch size of train and validation set
- readtype (str) - currently supports 'csv' or 'npy'
- load_data (bool) - please set it to False

In [35]:
from data.torchData import DataLoading


batch_size = 64
num_workers = 0

data_loading = DataLoading(transform=transform,batch_size=batch_size,readtype=readtype,
                           num_workers=num_workers,drop_last=True)
test_loading = DataLoading(transform=transform,batch_size=len(df_test),readtype=readtype,
                           num_workers=num_workers,drop_last=True)

train_loader = data_loading(df_train)
val_loader   = data_loading(df_val)
test_loader  = test_loading(df_test)

In [38]:
def test_dataloading(loader):
    for x,y in loader:
        break
    return x.shape,y.shape

print(f"train_loader - X: {test_dataloading(train_loader)[0]} \t Y: {test_dataloading(train_loader)[1]}")
print(f"test_loader  - X: {test_dataloading(test_loader)[0]} \t Y: {test_dataloading(test_loader)[1]}")

train_loader - X: torch.Size([64, 1, 70, 1600]) 	 Y: torch.Size([64])
test_loader  - X: torch.Size([132, 1, 70, 1600]) 	 Y: torch.Size([132])
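
Under the hood, a loader like this needs a dataset that maps a row of the filepath-dataframe to an `(x, y)` pair. The sketch below shows the map-style protocol (`__len__` and `__getitem__`) that `torch.utils.data.DataLoader` batches from. It is an illustration only: `SketchNpyDataset` is a hypothetical class, the sorted-name label encoding is an assumption, and a real version would subclass `torch.utils.data.Dataset`.

```python
import numpy as np
import pandas as pd

class SketchNpyDataset:
    """Hypothetical map-style dataset: implements __len__ and __getitem__."""
    def __init__(self, df, transform=None, label_col="activity"):
        self.paths = df["fullpath"].tolist()
        # Map class names to integer labels in sorted order.
        classes = sorted(df[label_col].unique())
        self.class_to_idx = {name: i for i, name in enumerate(classes)}
        self.labels = [self.class_to_idx[name] for name in df[label_col]]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        x = np.load(self.paths[idx])  # each file is read lazily, per sample
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]
```

Reading lazily keeps memory use flat at the cost of disk access per sample, which is one reason the fast .npy format matters.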




`PairDataLoading` is similar to `DataLoading`, but it only takes a filepath-dataframe generated by `nucPaired_fpDataframe` (with columns `fullpath_x` and `fullpath_y`). It has an extra argument, `supervision`, which controls whether the label is returned. Because later steps need the label, please always set it to `True`.

In [39]:
from data.torchData import PairDataLoading

pdf_train,pdf_val,pdf_test = data_selection(pair_df)
pretrain_loading = PairDataLoading(transform=transform,
                                   batch_size=batch_size,
                                   readtype=readtype,
                                   supervision=True,
                                   num_workers=num_workers,
                                   drop_last=True)


pretrain_loader = pretrain_loading(pdf_train)

In [43]:
def test_dataloading(loader):
    for x1,x2,y in loader:
        break
    return x1.shape,x2.shape,y.shape

print(f"train_loader - \nX1: {test_dataloading(pretrain_loader)[0]} \nX2: {test_dataloading(pretrain_loader)[1]} \nY: {test_dataloading(pretrain_loader)[2]}")

train_loader - 
X1: torch.Size([64, 1, 70, 1600]) 
X2: torch.Size([64, 1, 70, 1600]) 
Y: torch.Size([64])


# 3. Training 

Instead of passing the model itself, we use a builder: a callable that takes no arguments and returns the model and its latent size. The easiest way to create one is with a **lambda (anonymous) function**.

In [51]:
from models.cnn import create_alexnet

builder = lambda: create_alexnet(output_size=(1,6))
model_fname = None
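
One plausible reason for passing a builder rather than a model is that every call yields a fresh, independent instance, which is handy when a new model is needed per run or per fold. The sketch below shows this with a plain Python stand-in instead of an `nn.Module`; `TinyEncoder`, `make_builder` and `builder_sketch` are hypothetical names.

```python
class TinyEncoder:
    """Stand-in for an nn.Module; holds mutable weights like a real model."""
    def __init__(self, latent_size):
        self.latent_size = latent_size
        self.weights = [0.0] * latent_size

def make_builder(latent_size):
    # Returns a zero-argument callable, like the lambda in the cell above.
    return lambda: (TinyEncoder(latent_size), latent_size)

builder_sketch = make_builder(latent_size=6)
enc_a, size_a = builder_sketch()
enc_b, size_b = builder_sketch()
enc_a.weights[0] = 1.0  # "training" one instance leaves the other untouched
```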

### a. Contrastive pretraining 

In [52]:
from training.contrastive_pretraining import Contrastive_PreTraining

supervision = True
temperature = 0.1
model_fname= os.path.join(fpath,'saved_model/test_model')

# Setup for NT-Xent
clr = Contrastive_PreTraining(
    encoder_builder=builder,
    batch_size=batch_size,
    supervision=supervision,
    temperature=temperature
)

# Pretraining with NT-Xent
encoder = clr.train(train_loader=pretrain_loader,
                    epochs=1,
                    rtn_history=False,
                    device=device)

# Save and delete model 
torch.save(encoder.state_dict(),model_fname)
del encoder, clr, pretrain_loader

Epoch 1 >>>>>>>>>>>>>>>>>>>>>> loss: 0.48631221055984497
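
For reference, here is a NumPy sketch of the standard unsupervised NT-Xent loss, showing what `temperature` controls. It is not the library's implementation and may differ from what `Contrastive_PreTraining` computes when `supervision=True`; the function name `nt_xent_sketch` is hypothetical.

```python
import numpy as np

def nt_xent_sketch(z1, z2, temperature=0.1):
    """Unsupervised NT-Xent over N positive pairs (z1[i], z2[i])."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    # Unit-normalise so the dot product equals cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # A sample is never compared with itself.
    np.fill_diagonal(sim, -np.inf)
    # Row i's positive sits n rows away: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

A lower temperature sharpens the softmax, so hard negatives that are close to the anchor dominate the loss.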


### b. Standard/Fine-Tuning

To standardise the records, we use [poutyne](https://poutyne.org/) for fine-tuning and supervised learning.

`training.finetuning.FineTuneCNN` is an encoder-decoder architecture (`nn.Module`) that serves three purposes: fine-tuning, standard training and LOOV. It:
1. Creates an empty encoder with `encoder_builder`.
2. Loads the state dictionary from `model_path` into the encoder and freezes it. If `model_path` is `None`, nothing is loaded into the encoder.
3. Creates a decoder from the latent size, `hidden_layer` and `n_classes`.

In [54]:
from training.finetuning import FineTuneCNN

hidden_layer=128

model = FineTuneCNN(model_path=model_fname,
                    encoder_builder=builder,
                    hidden_layer=hidden_layer,
                    n_classes=df.activity.nunique())
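
The three steps listed above can be sketched in plain PyTorch as follows. This is an illustration under assumptions, not the source of `FineTuneCNN`: the class name `SketchFineTune` is hypothetical, the decoder layout (one hidden layer with ReLU) is a guess, and the sketch assumes the encoder is frozen only when pretrained weights are loaded.

```python
import torch
from torch import nn

class SketchFineTune(nn.Module):
    """Hypothetical encoder-decoder wrapper following the three steps above."""
    def __init__(self, encoder_builder, hidden_layer, n_classes, model_path=None):
        super().__init__()
        # 1. Build an empty encoder; the builder also reports the latent size.
        self.encoder, latent_size = encoder_builder()
        # 2. Optionally load pretrained weights, then freeze the encoder (assumption).
        if model_path is not None:
            self.encoder.load_state_dict(torch.load(model_path, map_location="cpu"))
            for param in self.encoder.parameters():
                param.requires_grad = False
        # 3. Attach a small trainable classification head.
        self.decoder = nn.Sequential(
            nn.Linear(latent_size, hidden_layer),
            nn.ReLU(),
            nn.Linear(hidden_layer, n_classes),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy builder for illustration only: flattens an (N, 8) input to a 4-d latent.
toy_builder = lambda: (nn.Sequential(nn.Flatten(), nn.Linear(8, 4)), 4)
model_sketch = SketchFineTune(toy_builder, hidden_layer=16, n_classes=6)
logits = model_sketch(torch.zeros(2, 8))
```

With `model_path=None` the whole network trains from scratch, which corresponds to the standard training case.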

In [55]:
import poutyne
from poutyne import Model,Experiment

# train with poutyne
finetune_epochs = 1

mdl = Model(model,'adam','cross_entropy',
            batch_metrics=['accuracy'],
            epoch_metrics=[poutyne.F1('micro'),poutyne.F1('macro')]).to(device)
history = mdl.fit_generator(train_generator=train_loader,valid_generator=test_loader,epochs=finetune_epochs)

Epoch: 1/1 Step: 22/22 100.00% |█████████████████████████|38.10s loss: 2749.430311 acc: 37.215909 fscore_micro: 0.372159 fscore_macro: 0.205339 val_loss: 4002.342285 val_acc: 16.666668 val_fscore_micro: 0.166667 val_fscore_macro: 0.047619


In [56]:
pd.DataFrame(history)

Unnamed: 0,epoch,loss,time,acc,fscore_micro,fscore_macro,val_loss,val_acc,val_fscore_micro,val_fscore_macro
0,1,2749.430311,38.100532,37.215909,0.372159,0.205339,4002.342285,16.666668,0.166667,0.047619


# 4. Validation

### a. LOOV validation

`validation.loov.leaveOneOut_crossValidation` takes a `training.finetuning.FineTuneCNN` model, a filepath-dataframe and a transformation pipeline, and performs leave-one-participant-out validation. It automates the data preparation, initialises the model and trains it with `poutyne.Model`. Experimental variables can be set as **kwargs**. The keywords and their default values are:

- nuc = 'NUC1'
- room = 1
- batch_size = 128
- readtype = 'npy'
- num_workers = 0
- optimizer = 'adam'
- loss = 'cross_entropy'
- batch_metrics = \['accuracy'\]
- epoch_metrics = \[poutyne.F1('micro'),poutyne.F1('macro')\]
- epochs = 250
- device

In [57]:
from validation.loov import leaveOneOut_crossValidation

records = leaveOneOut_crossValidation(model,df,transform,verbose=True)
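
Conceptually, the procedure loops over participants and holds each one out in turn. The sketch below shows only that outer loop on toy data. `sketch_lopo_folds` and `records_sketch` are hypothetical names, and the training and evaluation steps are left as a comment because they depend on the library.

```python
import pandas as pd

def sketch_lopo_folds(df, person_col="person"):
    """Yield (held_out_person, train_df, test_df) for each participant."""
    for person in sorted(df[person_col].unique()):
        test_mask = df[person_col] == person
        yield person, df[~test_mask], df[test_mask]

toy = pd.DataFrame({
    "person": ["One", "One", "Two", "Two", "Three"],
    "activity": ["walk", "sit", "walk", "sit", "walk"],
})
records_sketch = {}
for person, train_df, test_df in sketch_lopo_folds(toy):
    # A real run would build a fresh model here, train on train_df, evaluate on test_df.
    records_sketch[person] = {"n_train": len(train_df), "n_test": len(test_df)}
```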

For individual training and validation, please follow the notebooks in `./laboratory`