<a href="https://colab.research.google.com/github/jh5723/SSE_Lab2/blob/main/notebooks/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Step 0.1: Clone the repository and install the dependencies

Select "Restart session" when prompted.

In [1]:
!rm -rf tkd-ar
!git clone https://github.com/jh5723/tkd-ar.git
%cd tkd-ar
!git pull origin main
!pip uninstall -y tkd-ar
!pip install -e .
!pip install ipywidgets

Cloning into 'tkd-ar'...
remote: Enumerating objects: 829, done.
remote: Counting objects: 100% (248/248), done.
remote: Compressing objects: 100% (162/162), done.
remote: Total 829 (delta 164), reused 160 (delta 82), pack-reused 581 (from 1)
Receiving objects: 100% (829/829), 973.12 KiB | 6.80 MiB/s, done.
Resolving deltas: 100% (477/477), done.
/content/tkd-ar
From https://github.com/jh5723/tkd-ar
 * branch            main       -> FETCH_HEAD
Already up to date.
Obtaining file:///content/tkd-ar
  Preparing metadata (setup.py) ... done
Collecting pandas~=2.2.2 (from tkd-ar==0.1)
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting numpy~=1.24.4 (from tkd-ar==0.1)
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting pykalman~=0.9.7 (from tkd-ar==0.1)
  Downloading pykalman-0.9.7-py2.py3-none-any.whl.metadata (5.5 kB)
Collecting scikit-le

Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Using cached jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Using cached jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.1


### Step 0.2: Mount Google Drive to access the training data and store model checkpoints



In [1]:
import os

from google.colab import drive
drive.mount('/content/drive') # force_remount=True if failing

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Step 1: Set Up Data Paths and Checkpoint Parameters

In this step, we will define the paths to our preprocessed training, validation, and test datasets, as well as the paths necessary to manage the training checkpoints for our model. The data and checkpoints will be stored in a structured directory that reflects the type of keypoint data used, the preprocessing methods applied, and the specific experiment configuration.

**Details:**

- **Skeleton Type (`skel`)**: Select the keypoint dataset format (e.g., AIC, MPII, COCO, COCO_25) that matches your training data. Different formats track different numbers of body keypoints, affecting both training time and model detail.

- **Experiment Code (`experiment_code`)**: This code uniquely identifies the preprocessing methods and labelling criteria applied to the keypoint data. Use the format `major.minor` (e.g., 5.1) to indicate different experiment categories and their specific variations. Refer to **[X]** for more details on the preprocessing steps.

- **Dataset Identifier (`dataset`)**: Combine the skeleton type with the keypoint detection model (e.g., 'vitpose_l') and the object detection model (e.g., 'yolo_m'). This identifier helps to organize data and checkpoints based on different model configurations.

- **Data Paths**: Set the paths to the training, validation, and test datasets, which are stored in separate directories within the structured data root.

- **Checkpoint Path (`checkpoint_path`)**: Define the directory path that combines the base checkpoint directory with the dataset identifier and experiment code. This structure ensures that all checkpoints for this specific experiment are stored in an organized manner.


Modify as appropriate:

In [2]:
import ipywidgets as widgets
from IPython.display import display, Markdown

skel_dropdown = widgets.Dropdown(
    options=['aic', 'mpii', 'coco', 'coco_25'],
    value='aic',
    description='Skeleton:',
    disabled=False,
)

major_dropdown = widgets.Dropdown(
    options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    value=7,
    description='Preprocessing settings:',
    disabled=False,
)

minor_dropdown = widgets.Dropdown(
    options=[0, 1, 2],
    value=1,
    description='Confidence threshold settings:',
    disabled=False,
)

split_type_dropdown = widgets.Dropdown(
    options=['participant', 'random'],
    value='participant',
    description='Split type:',
    disabled=False,
)

def display_selections(skel, major, minor, split_type):
    display(Markdown(f"### Current Experiment Setup"))
    display(Markdown(f"**Skeleton**: {skel}"))
    display(Markdown(f"**Preprocessing Strategy (Major)**: {major}"))
    display(Markdown(f"**Filters Applied (Minor)**: {minor}"))
    display(Markdown(f"**Split Type**: {split_type}"))

widgets.interactive(display_selections, skel=skel_dropdown, major=major_dropdown, minor=minor_dropdown, split_type=split_type_dropdown)

display(skel_dropdown)
display(major_dropdown)
display(minor_dropdown)
display(split_type_dropdown)

Dropdown(description='Skeleton:', options=('aic', 'mpii', 'coco', 'coco_25'), value='aic')

Dropdown(description='Preprocessing settings:', index=7, options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), value=7)

Dropdown(description='Confidence threshold settings:', index=1, options=(0, 1, 2), value=1)

In [9]:
skel = skel_dropdown.value
major = major_dropdown.value
minor = minor_dropdown.value
split_type = split_type_dropdown.value
if split_type == 'random':
    experiment_code = f"{major}.{minor}_random_split"
else:
    experiment_code = f"{major}.{minor}"

print(f"Selected Skeleton: {skel}")
print(f"Selected Experiment: {experiment_code}")

Selected Skeleton: aic
Selected Experiment: 7.0


This cell generates the necessary filepaths for training.

In [10]:
dataset = f'{skel}_vitpose_l_yolo_m'

# Define the paths to the preprocessed training, validation, and test datasets
data_root = f'/content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/{dataset}/experiment_{experiment_code}'
train_path = os.path.join(data_root, 'train/preprocessed_data.json')
val_path = os.path.join(data_root, 'valid/preprocessed_data.json')
test_path = os.path.join(data_root, 'test/preprocessed_data.json')

# Define the paths for model checkpoints
checkpoint_path = '/content/drive/MyDrive/tkd-ar/checkpoints'
subdirectory = f'{dataset}/experiment_{experiment_code}'
checkpoint_path = os.path.join(checkpoint_path, subdirectory)
os.makedirs(checkpoint_path, exist_ok=True)

# Temporary directory name for checkpoints which will be renamed with performance
temp_checkpoint_dir = os.path.join(checkpoint_path, 'trial4')

print(f"Training data path: {train_path}")
print(f"Validation data path: {val_path}")
print(f"Test data path: {test_path}")
print(f"Checkpoint path: {checkpoint_path}")

Training data path: /content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/aic_vitpose_l_yolo_m/experiment_7.0/train/preprocessed_data.json
Validation data path: /content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/aic_vitpose_l_yolo_m/experiment_7.0/valid/preprocessed_data.json
Test data path: /content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/aic_vitpose_l_yolo_m/experiment_7.0/test/preprocessed_data.json
Checkpoint path: /content/drive/MyDrive/tkd-ar/checkpoints/aic_vitpose_l_yolo_m/experiment_7.0
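
Because the data paths are assembled from several selections, a mistyped skeleton or experiment code only surfaces later as a file-not-found error. The sketch below shows one way to check the three dataset files up front. The helper `find_missing_files` is hypothetical (it is not part of the repository), and a temporary directory stands in for `data_root` so the example runs anywhere.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# A temporary directory stands in for `data_root` (assumption for the demo).
with tempfile.TemporaryDirectory() as demo_root:
    demo_paths = {
        split: os.path.join(demo_root, split, "preprocessed_data.json")
        for split in ("train", "valid", "test")
    }
    # Create only the train and valid files; test is deliberately left out.
    for split in ("train", "valid"):
        os.makedirs(os.path.dirname(demo_paths[split]), exist_ok=True)
        with open(demo_paths[split], "w") as f:
            f.write("{}")

    missing = find_missing_files(demo_paths.values())
    missing_splits = [os.path.basename(os.path.dirname(p)) for p in missing]
    print(f"Missing splits: {missing_splits}")
```

In the notebook, the same helper could be called as `find_missing_files([train_path, val_path, test_path])` before any dataset is loaded.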


### Step 2: Toggle Data Augmentation Transforms

In this step, you'll decide whether to apply data augmentation transforms to your input data during training. Data augmentation can help improve model generalization by introducing variability in the training data. The available transforms include geometric modifications, keypoint noise, and occlusion techniques.

**Instructions:**

- **Toggle Button (`toggle_transforms`)**: Use the toggle button to enable or disable data augmentation. When the button is toggled on, the transforms will be applied; when toggled off, no transformations will be applied. This provides a quick way to experiment with and without data augmentation.
- **Transforms Pipeline (`transforms`)**:
  - If `use_transforms` is enabled, a sequence of random transformations will be applied to the data. These include:
    - **Geometric Transformations**: Mirroring, rotation, translation, and scaling.
    - **Keypoint Noise**: Adding Gaussian noise, jittering, and swapping left and right keypoints.
    - **Occlusion**: Dropping frames or keypoints to simulate occlusion.
  - If `use_transforms` is disabled, the transform pipeline will be set to `None`, meaning no transformations will be applied.
- **Dynamic Behavior**: The `use_transforms` variable is set based on the state of the toggle button. The transforms are only applied if `use_transforms` is `True`.

This setup allows you to easily toggle between training with or without data augmentation using a simple interactive button.
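
As a rough illustration of how a probabilistic pipeline of this kind behaves, the sketch below re-creates the idea in plain Python: each transform is wrapped so that it fires with probability `p`, and the wrapped steps are chained. The names `mirror_x`, `random_apply` and `compose` are illustrative stand-ins, not the repository's or torchvision's classes, and the mirror assumes x-coordinates normalised to [0, 1].

```python
import random


def mirror_x(sequence):
    """Flip x-coordinates about 0.5 (assumes coordinates normalised to [0, 1])."""
    return [[(1.0 - x, y) for (x, y) in frame] for frame in sequence]


def random_apply(transform, p, rng=random):
    """Wrap `transform` so it runs with probability `p`; otherwise pass the input through."""
    def wrapped(sequence):
        if rng.random() < p:
            return transform(sequence)
        return sequence
    return wrapped


def compose(steps):
    """Chain the given steps so each one receives the previous step's output."""
    def pipeline(sequence):
        for step in steps:
            sequence = step(sequence)
        return sequence
    return pipeline


# One sequence of two frames, each with two (x, y) keypoints.
sequence = [[(0.25, 0.5), (0.75, 0.5)], [(0.0, 1.0), (1.0, 0.0)]]

always = compose([random_apply(mirror_x, p=1.0)])  # transform always fires
never = compose([random_apply(mirror_x, p=0.0)])   # transform never fires

# When augmentation is disabled, the pipeline is simply None.
use_transforms_demo = False
pipeline = always if use_transforms_demo else None
```

With `p=1.0` the sequence is always mirrored and with `p=0.0` it is returned untouched; intermediate values give the stochastic behaviour used during training.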


First, toggle the transformations on or off. Note that the transformations only work with the keypoint-based normalisations. Experiments in which the data has been normalised to include velocities and Euclidean distances between keypoints are not yet supported.

In [11]:
import ipywidgets as widgets
from IPython.display import display

toggle_transforms = widgets.ToggleButton(
    value=False,
    description='Use Transforms',
    disabled=False,
    button_style='',
    tooltip='Click to toggle transforms',
    icon='check'
)

display(toggle_transforms)

ToggleButton(value=False, description='Use Transforms', icon='check', tooltip='Click to toggle transforms')

In [27]:
# Set the use_transforms variable based on the button state
use_transforms = toggle_transforms.value
print(f"Transforms are {'enabled' if use_transforms else 'disabled'}")

Transforms are disabled


Here the specific transforms and their likelihood of application can be modified before loading the JSON files into the ActionDataset object.

In [28]:
import common.augmentations.frame_rate as fr
import common.augmentations.geometric as gm
import common.augmentations.keypoint_noise as kn
import common.augmentations.occlusion as oc
import torchvision.transforms as T

# Application probabilities shared by the transforms below
p_high = 0.5
p_med = 0.05
p_low = 0.01

if use_transforms:
    # The module is imported as `T` so that assigning `transforms` here does not
    # shadow it, which would break the cell when it is re-run.
    transforms = T.Compose(
        [
            T.RandomApply([gm.Mirror()], p=p_high),
            T.RandomApply([gm.Rotate(angle_range=(-10, 10))], p=p_med),
            T.RandomApply([gm.Translate(max_translation=0.05)], p=p_low),
            T.RandomApply([gm.ScaleX(scale_range=(0.9, 1.0))], p=p_low),
            T.RandomApply([gm.ScaleY(scale_range=(0.9, 1.0))], p=p_low),
            # T.RandomApply([fr.FrameRateReduction()], p=0.025),
            # T.RandomApply([fr.FrameRateIncrease()], p=0.025),
            T.RandomApply([kn.GaussianNoise(scale_factor=0.0005, p=0.25)], p=p_med),
            T.RandomApply([kn.Jittering(
                angles=(0, 360), distance=(0.0, 0.005), keypoints=None, p=0.05,)], p=p_med,
            ),
            T.RandomApply([kn.SwapLeftAndRight(
                skel_format=skel, swap_probability=0.02, max_swaps_per_sequence=3,)], p=p_high,
            ),
            T.RandomApply([oc.DropFrames(
                max_frames=3, missing_value=0.0, skip_missing_values=True)], p=p_low,
            ),
            T.RandomApply([oc.DropKeypoints(
                missing_value=0.0, skip_missing_values=True, drop_probability=0.001, max_keypoints_dropped=3,)], p=p_med,
            ),
            T.RandomApply([oc.DropLimb(
                max_frames=3, missing_value=0.0, skip_missing_values=True)], p=p_low,
            )
        ]
    )
else:
    transforms = None

### Step 3: Load the preprocessed training datasets

This step loads the training, validation and test datasets using the custom `ActionDataset` class and prints a summary of each to better understand the data composition. Additionally, it extracts key parameters, such as the input size and the number of classes, which are necessary for defining the model architecture.

**Details:**

- **Loading the Datasets**:
  - Load the training, validation and test data from their file paths using the `ActionDataset` class. Each dataset is structured according to the specified keypoint format and the path to the preprocessed data. Augmentation transforms are passed to the training dataset only.

- **Printing the Dataset Summary**:
  - After loading, we will print a summary of the dataset using the `print_summary()` method. This summary includes statistics such as the number of sequences, label distribution, and other relevant data insights.

- **Extracting Parameters**:
  - The `input_size` is the number of features per frame (the second dimension of a sequence), which depends on the keypoint format and the preprocessing applied.
  - The `num_classes` will be calculated from the dataset, representing the number of unique actions or labels present.

- **Setting Training Epochs**:
  - The number of training epochs is set later, in Step 5, alongside the number of trials.
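
To make the summary and the extracted parameters concrete, here is a small sketch on a toy dataset of `(sequence, label)` pairs. It is not the `ActionDataset` implementation; `summarise_labels` and the toy data are hypothetical and only mimic the kind of statistics described above.

```python
from collections import Counter


def summarise_labels(labels):
    """Return {label: (count, percentage)} with labels in ascending order."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: (count, round(100.0 * count / total, 2))
        for label, count in sorted(counts.items())
    }


# Toy dataset: each item is (sequence, label); a sequence is frames x features.
toy_dataset = [
    ([[0.0] * 4 for _ in range(3)], 0),
    ([[0.0] * 4 for _ in range(3)], 0),
    ([[0.0] * 4 for _ in range(3)], 1),
    ([[0.0] * 4 for _ in range(3)], 2),
]

labels = [label for _, label in toy_dataset]
summary = summarise_labels(labels)

first_sequence, _ = toy_dataset[0]
toy_input_size = len(first_sequence[0])  # features per frame
toy_num_classes = len(set(labels))       # number of distinct labels

for label, (count, pct) in summary.items():
    print(f"Label {label}: {count} samples ({pct:.2f}%)")
```

The same two quantities, features per frame and number of distinct labels, are what the model builder needs as `input_size` and `num_classes`.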

**Note:**
Data that has not been preprocessed can be loaded by setting the `preprocessed_data_path` argument to `None` and passing the location of the individual technique JSON files through the `directory_path` parameter. See **[X]** for more details.

In [29]:
from data.datasets.action_dataset import ActionDataset


print(f"Loading training data from {train_path}...")
train_dataset = ActionDataset(
    skeleton=skel,
    preprocessed_data_path=train_path,
    transform=transforms
)
train_dataset.print_summary()

print(f"Loading validation data from {val_path}...")
val_dataset = ActionDataset(
    skeleton=skel,
    preprocessed_data_path=val_path,
    transform=None,
)
val_dataset.print_summary()

print(f"Loading testing data from {test_path}...")
test_dataset = ActionDataset(
    skeleton=skel,
    preprocessed_data_path=test_path,
    transform=None,
)
test_dataset.print_summary()

Loading training data from /content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/aic_vitpose_l_yolo_m/experiment_7.0/train/preprocessed_data.json...
Preprocessed data loaded from /content/drive/MyDrive/tkd-ar/har_data/preprocessed/participant_split/aic_vitpose_l_yolo_m/experiment_7.0/train/preprocessed_data.json
Total sequences loaded: 5109
5109 in the dataset:
Technique 3 (turning kick): 439 samples (8.59%)
Technique 2 (spinning half turning kick): 390 samples (7.63%)
Technique 9 (check/side kick): 593 samples (11.61%)
Technique 5 (reverse turning kick): 412 samples (8.06%)
Technique 4 (hook kick): 371 samples (7.26%)
Technique 8 (axe kick): 407 samples (7.97%)
Technique 7 (crescent kick): 410 samples (8.03%)
Technique 6 (back kick): 427 samples (8.36%)
Technique 10 (body punch): 355 samples (6.95%)
Technique 1 (half turning kick): 453 samples (8.87%)
Technique 0 (idle): 852 samples (16.68%)
Loading validation data from /content/drive/MyDrive/tkd-ar/har_data/preprocess

In [30]:
sequence, label = train_dataset[0]
input_size = sequence.shape[1]  # features per frame (second dimension of a sequence)
num_classes = train_dataset.count_unique_labels()
print(f"Input size: {input_size}")
print(f"Number of classes: {num_classes}")

Input size: 58
Number of classes: 11


### Step 4: Set up the helper functions for training

Select the configuration for the LSTM to be trained. Note that the `'pytorch'` dropout method applies standard dropout to the output of an LSTM layer whenever it is followed by a subsequent LSTM layer. The `'gal'` method instead applies dropout within the LSTM cells themselves.
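
The sketch below illustrates one common reading of the difference, assuming `'gal'` refers to the variational dropout of Gal and Ghahramani, in which a single mask is reused at every time step. The repository's implementation was not inspected, so this is an illustration of the masking idea only; the function names are hypothetical and the usual `1 / (1 - rate)` rescaling is omitted for brevity.

```python
import random


def per_step_masks(num_steps, hidden_size, rate, rng):
    """Standard dropout: draw an independent keep/drop mask at every time step."""
    return [
        [0 if rng.random() < rate else 1 for _ in range(hidden_size)]
        for _ in range(num_steps)
    ]


def shared_mask(num_steps, hidden_size, rate, rng):
    """Variational dropout: draw one mask and reuse it at every time step."""
    mask = [0 if rng.random() < rate else 1 for _ in range(hidden_size)]
    return [list(mask) for _ in range(num_steps)]


rng = random.Random(0)
standard = per_step_masks(num_steps=5, hidden_size=8, rate=0.5, rng=rng)
variational = shared_mask(num_steps=5, hidden_size=8, rate=0.5, rng=rng)

for t, (s_row, v_row) in enumerate(zip(standard, variational)):
    print(f"t={t}  standard={s_row}  variational={v_row}")
```

Every row of `variational` is identical, so the same hidden units are dropped for the whole sequence, whereas each row of `standard` is drawn independently.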

In [31]:
model_type_dropdown = widgets.Dropdown(
    options=['standard', 'bilstm', 'attention', 'convolution', 'conv_attention'],
    value='attention',
    description='Model Type:',
)

bidirectional_dropdown = widgets.Dropdown(
    options=[True, False],
    value=True,
    description='Create bidirectional model:',
)

dropout_method_dropdown = widgets.Dropdown(
    options=['pytorch', 'gal'],
    value='pytorch',
    description='Dropout Method:',
)

layer_norm_dropdown = widgets.Dropdown(
    options=[True, False],
    value=False,
    description='Apply layer normalisation within the LSTM cells:',
)


display(model_type_dropdown)
display(bidirectional_dropdown)
display(dropout_method_dropdown)
display(layer_norm_dropdown)

Dropdown(description='Model Type:', index=2, options=('standard', 'bilstm', 'attention', 'convolution', 'conv_…

Dropdown(description='Creat bidirectional model:', options=(True, False), value=True)

Dropdown(description='Dropout Method:', options=('pytorch', 'gal'), value='pytorch')

Dropdown(description='Apply layer normalisation within the LSTM cells:', index=1, options=(True, False), value…

In [32]:
selected_model_type = model_type_dropdown.value
selected_bidirectional = bidirectional_dropdown.value
selected_dropout_method = dropout_method_dropdown.value
selected_layer_norm = layer_norm_dropdown.value

print(f"Selected Model Type: {selected_model_type}")
print(f"Selected Bidirectional: {selected_bidirectional}")
print(f"Selected Dropout Method: {selected_dropout_method}")
print(f"Selected Layer Normalisation: {selected_layer_norm}")

Selected Model Type: attention
Selected Bidirectional: True
Selected Dropout Method: pytorch
Selected Layer Normalisation: False


The functions below initialise a fresh model and learner for each trial in the hyperparameter search loop. They are defined here so that additional settings, such as the early stopping patience or the verbosity of the training loop, can be changed in one place.

In [33]:
import gc

import torch

import common.utils.training_utils as tutils
from models.model_factory import LSTMBuilder
from models.learner import ClassificationLearner


# Function to reinitialise a fresh model with each set of hyperparameters
def initialise_model(config):
    builder = LSTMBuilder(
        input_size=config['input_size'],
        hidden_size=config['hidden_size'],
        num_layers=config['num_layers'],
        num_classes=config['num_classes'],
    )

    model = builder.build(
        model_type=config['model_type'],
        bidirectional=config['bidirectional'],
        dropout_method=config['dropout_method'],
        layer_norm=config['layer_norm'],
        dropout_rate=config['dropout_rate'],
    )

    tutils.log_configuration(config, config['checkpoint_path'], 'model_config.json')

    return model


# Function to reinitialise a fresh learner with each set of hyperparameters
def initialise_learner(config, model, optimiser, train_loader, val_loader, test_loader, params):
    learner = ClassificationLearner(
        model=model,
        skel_format=config['skel_format'],
        optimiser=optimiser,
        train_loader=train_loader,
        val_loader=val_loader,
        test_loader=test_loader,
        checkpoint_dir=config['checkpoint_path'],
        early_stopping_patience=config['early_stopping_patience'],
        early_stopping_criterion=config['early_stopping_criterion'],
        checkpoint_interval=config['num_epochs'],
        use_l1=config['use_l1'],
        l1_lambda=config['l1_lambda'],
        use_l2=config['use_l2'],
        l2_lambda=config['reg_lambda'],
        loss_type=config['loss_type'],
        max_grad_norm=config['max_grad_norm'],
        lr_warmup_epochs=config['lr_warmup_epochs'],
        scheduler_type=config['scheduler_type'],
        scheduler_params=config['scheduler_params'],
        save_checkpoints=config['save_checkpoints'],
        disable_print_statements=config['disable_print_statements'],
        params=params
    )

    tutils.log_configuration(config, config['checkpoint_path'], 'learner_config.json')
    return learner


def get_scheduler_params(scheduler_type, train_loader=None):
    if scheduler_type == "StepLR":
        return {"step_size": 5, "gamma": 0.8}
    elif scheduler_type == "CosineAnnealingLR":
        return {"T_max": 40, "eta_min": 1e-6}
    elif scheduler_type == "ReduceLROnPlateau":
        return {
            "mode": "min",
            "factor": 0.8,
            "patience": 10,
            "threshold": 0.0001,
            "threshold_mode": "rel",
            "cooldown": 0,
            "min_lr": 1e-6,
            "eps": 1e-8,
        }
    elif scheduler_type == "CyclicLR":
        if train_loader is None:
            raise ValueError("train_loader must be provided for CyclicLR")
        step_size = max(len(train_loader) // 4, 1)
        return {
            "base_lr": 1e-5,
            "max_lr": 1e-3,
            "step_size_up": step_size,
            "step_size_down": step_size,
            "mode": "triangular2",
            "cycle_momentum": True,
        }
    elif scheduler_type is None:
        return None
    else:
        raise ValueError(f"Unknown scheduler type: {scheduler_type}")


# Clear the GPU cache if a trial fails due to memory constraints
def clear_gpu_cache():
    torch.cuda.empty_cache()
    gc.collect()
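
To give a feel for what the `StepLR` and `CosineAnnealingLR` settings returned by `get_scheduler_params` imply, the sketch below evaluates the standard closed-form schedules in plain Python. It is an approximation for intuition only: it ignores the learner's warm-up epochs, uses a hypothetical base learning rate of `1e-3`, and evaluates the cosine schedule only up to `T_max`.

```python
import math


def step_lr(base_lr, epoch, step_size=5, gamma=0.8):
    """Multiply the learning rate by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing_lr(base_lr, epoch, T_max=40, eta_min=1e-6):
    """Anneal from `base_lr` to `eta_min` along half a cosine over `T_max` epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2


base_lr = 1e-3  # hypothetical starting learning rate
for epoch in (0, 5, 10, 20, 40):
    print(
        f"epoch {epoch:>2}: StepLR={step_lr(base_lr, epoch):.6f}  "
        f"Cosine={cosine_annealing_lr(base_lr, epoch):.6f}"
    )
```

With these settings, `StepLR` has decayed the rate to 64% of its starting value after ten epochs, while the cosine schedule reaches `eta_min` at epoch 40.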

### Step 5: Testing combinations of hyperparameters

Select whether to use a random search over a hyperparameter grid or Bayesian optimisation:

In [34]:
approach_dropdown = widgets.Dropdown(
    options=['random search', 'bayesian optimisation'],
    value='bayesian optimisation',
    description='Tuning approach:',
)

display(approach_dropdown)

Dropdown(description='Tuning approach:', index=1, options=('random search', 'bayesian optimisation'), value='b…

In [35]:
selected_approach = approach_dropdown.value
print(f"Selected approach: {selected_approach}")

Selected approach: bayesian optimisation


In [36]:
num_epochs = 1000
num_trials = 20
num_workers = 2

### Step 5.1: First option: trial different hyperparameter combinations using Bayesian Optimisation (Optuna)

Install optuna into the environment.

In [37]:
!pip install optuna



Define the parameters for the search.

In [38]:
import torch
import gc
import optuna


def objective(trial):
    # Suggest hyperparameters here
    # If a value should be kept fixed, keep it in the list with a single value
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [8, 16, 32, 64, 128, 256, 512, 1024, 2096])
    hidden_size = trial.suggest_categorical('hidden_size', [64, 128, 256, 512])
    num_layers = trial.suggest_categorical('num_layers', [1, 2, 3, 4])
    dropout_rate = trial.suggest_float('dropout_rate', 0.0, 0.5, step=0.1)
    scheduler_type = trial.suggest_categorical('scheduler_type', ['StepLR', 'CosineAnnealingLR', 'ReduceLROnPlateau', 'CyclicLR', None])
    reg_lambda = trial.suggest_float('reg_lambda', 1e-5, 1e-3, log=True)

    try:

        train_loader, val_loader, test_loader = tutils.create_dataloaders(
            train_dataset=train_dataset,
            val_dataset=val_dataset,
            batch_size=batch_size,
            num_workers=num_workers,
        )

        model_config = {
            'input_size': input_size,
            'hidden_size': hidden_size,
            'num_layers': num_layers,
            'num_classes': num_classes,
            'model_type': model_type_dropdown.value,
            'bidirectional': bidirectional_dropdown.value,
            'dropout_method': dropout_method_dropdown.value,
            'layer_norm': layer_norm_dropdown.value,
            'dropout_rate': dropout_rate,
            'checkpoint_path': temp_checkpoint_dir,
            'use_skip_connection': True
        }

        learner_config = {
            'skel_format': skel,
            'checkpoint_path': temp_checkpoint_dir,
            'early_stopping_patience': 20,
            'early_stopping_criterion': 'loss',
            'num_epochs': num_epochs,
            'use_l1': False,
            'l1_lambda': 1e-5,
            'use_l2': True,
            'reg_lambda': reg_lambda,
            'loss_type': "weighted_focal_loss",  # options: 'standard', 'weighted_loss', 'weighted_focal_loss', 'standard_focal_loss'
            'max_grad_norm': 5.0,
            'lr_warmup_epochs': 10,
            'scheduler_type': scheduler_type,
            'scheduler_params': get_scheduler_params(scheduler_type, train_loader=train_loader),
            'save_checkpoints': False,
            'disable_print_statements': False,
        }

        os.makedirs(temp_checkpoint_dir, exist_ok=True)

        model = initialise_model(model_config)

        optimiser = tutils.initialise_optimiser(
            model=model,
            learning_rate=learning_rate,
        )

        params = (
             learning_rate,
             batch_size,
             hidden_size,
             dropout_rate,
             scheduler_type
        )

        learner = initialise_learner(learner_config, model, optimiser, train_loader, val_loader, test_loader, params)

        learner.train(num_epochs=num_epochs)

        val_loss, val_accuracy, val_f1, _, _, _, _ = learner.validate()

        clear_gpu_cache()

        return val_loss

    except RuntimeError as e:
        if 'out of memory' in str(e):
            clear_gpu_cache()
            print(f"Memory error: {e}. Skipping this configuration.")
            return float('inf')  # Ensure config not selected as the best
        else:
            raise e  # Unexpected errors
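
The learning rate and `reg_lambda` above are suggested with `log=True`, which samples on a logarithmic scale so that each order of magnitude is explored roughly equally. The sketch below shows the underlying idea with the standard library only; it is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]

# The range 1e-5 to 1e-2 spans three decades, so about one third of the
# samples should fall in the lowest decade (below 1e-4).
below_1e_4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"Fraction of samples below 1e-4: {below_1e_4:.2f}")
```

A plain uniform draw over the same range would place fewer than 1% of samples below `1e-4`, which is why log-scale sampling is the usual choice for learning rates and regularisation strengths.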

Set the path for the Optuna study database. This path puts it at the root of the experiment studies.

In [39]:
db_path = f'{checkpoint_path}/_val_loss_lstm_{skel}_study.db'
storage_name = f"sqlite:///{db_path}"

Run the trials. Make sure `num_trials` (defined above) is set to the desired number of trials before running the cell.

In [None]:
if selected_approach == 'bayesian optimisation':
    study = optuna.create_study(
        study_name=db_path,
        storage=storage_name,
        direction='minimize',  # Minimise validation loss
        load_if_exists=True,  # Resume the study if the database already exists
    )
    study.optimize(objective, n_trials=num_trials)  # Set number of trials

    best_trial = study.best_trial
    print(f"Best trial number: {best_trial.number}")
    print(f"Best trial value (loss): {best_trial.value}")
    print(f"Best parameters: {best_trial.params}")
    best_params = best_trial.params

[I 2024-08-26 20:17:21,487] Using an existing study with name '/content/drive/MyDrive/tkd-ar/checkpoints/aic_vitpose_l_yolo_m/experiment_7.0/_val_loss_lstm_aic_study.db' instead of creating a new one.


Using GPU: CUDA
Scheduler Type: ReduceLROnPlateau
Using Scheduler Params: {'mode': 'min', 'factor': 0.8, 'patience': 10, 'threshold': 0.0001, 'threshold_mode': 'rel', 'cooldown': 0, 'min_lr': 1e-06, 'eps': 1e-08}
Commencing training...
Epoch [1/1000], Epoch Time: 1.79s
Metric          Loss            Accuracy        F1 Score        Precision       Recall         
-----------------------------------------------------------------------------------------------
Train           2.4803          8.01           % 0.0194          0.0229          0.0801         

Validation      2.4538          8.30           % 0.0205          0.0128          0.0830         

Best model saved: /content/drive/MyDrive/tkd-ar/checkpoints/aic_vitpose_l_yolo_m/experiment_7.0/trial4/best_model.pt
Epoch [2/1000], Epoch Time: 1.72s
Metric          Loss            Accuracy        F1 Score        Precision       Recall         
-----------------------------------------------------------------------------------------------

[I 2024-08-26 20:19:00,668] Trial 4 finished with value: 1.8612725862434931 and parameters: {'learning_rate': 0.0004915126419919569, 'batch_size': 64, 'hidden_size': 64, 'num_layers': 3, 'dropout_rate': 0.30000000000000004, 'scheduler_type': 'ReduceLROnPlateau', 'reg_lambda': 0.0004958240504341476}. Best is trial 0 with value: 1.5572472897442904.


Using GPU: CUDA
Scheduler Type: CosineAnnealingLR
Using Scheduler Params: {'T_max': 40, 'eta_min': 1e-06}
Commencing training...
Epoch [1/1000], Epoch Time: 10.90s
Metric          Loss            Accuracy        F1 Score        Precision       Recall         
-----------------------------------------------------------------------------------------------
Train           2.3865          8.16           % 0.0486          0.0798          0.0816         

Validation      2.3816          6.00           % 0.0170          0.0132          0.0600         

Best model saved: /content/drive/MyDrive/tkd-ar/checkpoints/aic_vitpose_l_yolo_m/experiment_7.0/trial4/best_model.pt
Epoch [2/1000], Epoch Time: 10.95s
Metric          Loss            Accuracy        F1 Score        Precision       Recall         
-----------------------------------------------------------------------------------------------
Train           2.3436          7.75           % 0.0375          0.0540          0.0775         

Valida

### Step 5.2: Second option: perform a random search over a hyperparameter grid

Modify the parameter grid below before running the cell. The cell shuffles all grid combinations and evaluates the first `num_trials` of them.

In [None]:
import random
from sklearn.model_selection import ParameterGrid

if selected_approach == 'random search':
    # Define the hyperparameter grid
    param_grid = {
        'learning_rate': [1e-5, 1e-4, 1e-3],
        'batch_size': [8, 16, 32, 64],
        'hidden_size': [64, 128, 256],
        'num_layers': [1, 2, 3],
        'dropout_rate': [0.0, 0.2, 0.5],
        'scheduler_type': ['StepLR', 'CosineAnnealingLR', 'ReduceLROnPlateau', None],
        'reg_lambda': [1e-5, 1e-4, 1e-3]
    }

    param_combinations = list(ParameterGrid(param_grid))
    random.shuffle(param_combinations)

    param_combinations = param_combinations[:num_trials]

    best_val_loss = float('inf')
    best_config = None
    best_trial_number = None

    for trial_num, params in enumerate(param_combinations):
        try:
            print(f"Testing configuration: {params} (Trial {trial_num + 1}/{len(param_combinations)})")

            train_loader, val_loader, test_loader = tutils.create_dataloaders(
                train_dataset=train_dataset,
                val_dataset=val_dataset,
                batch_size=params['batch_size'],
                num_workers=num_workers,
            )

            model_config = {
                'input_size': input_size,
                'hidden_size': params['hidden_size'],
                'num_layers': params['num_layers'],
                'num_classes': num_classes,
                'model_type': model_type_dropdown.value,
                'bidirectional': bidirectional_dropdown.value,
                'dropout_method': dropout_method_dropdown.value,
                'layer_norm': layer_norm_dropdown.value,
                'dropout_rate': params['dropout_rate'],
                'checkpoint_path': temp_checkpoint_dir,
                'use_skip_connection': True,
            }

            learner_config = {
                'skel_format': skel,
                'checkpoint_path': temp_checkpoint_dir,
                'early_stopping_patience': 20,
                'early_stopping_criterion': 'loss',
                'num_epochs': num_epochs,
                'use_l1': False,
                'l1_lambda': 1e-5,
                'use_l2': True,
                'reg_lambda': params['reg_lambda'],
                'loss_type': "weighted_focal_loss",  # options: 'standard', 'weighted_loss', 'weighted_focal_loss', 'standard_focal_loss'
                'max_grad_norm': 5.0,
                'lr_warmup_epochs': 10,
                'scheduler_type': params['scheduler_type'],
                'scheduler_params': get_scheduler_params(params['scheduler_type'], train_loader),
                'save_checkpoints': False,
                'disable_print_statements': False,
            }

            os.makedirs(temp_checkpoint_dir, exist_ok=True)

            model = initialise_model(model_config)

            optimiser = tutils.initialise_optimiser(
                model=model,
                learning_rate=params['learning_rate'],
            )

            parameters = (
                params['learning_rate'],
                params['batch_size'],
                params['hidden_size'],
                params['dropout_rate'],
                params['scheduler_type'],
            )

            learner = initialise_learner(learner_config, model, optimiser, train_loader, val_loader, test_loader, params=parameters)

            # Train and validate the model, capturing the performance metric
            learner.train(num_epochs=num_epochs)

            val_loss, val_accuracy, val_f1, _, _, _, _ = learner.validate()

            # Check if this is the best configuration so far
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_config = params
                best_trial_number = trial_num + 1  # Store the trial number

            clear_gpu_cache()

        except RuntimeError as e:
            if 'out of memory' in str(e):
                clear_gpu_cache()
                print(f"Memory error: {e}. Skipping this configuration.")
                continue  # Skip to the next configuration
            else:
                raise  # Re-raise unexpected errors unchanged

    print(f"Best trial number: {best_trial_number}")
    print(f"Best parameters: {best_config}")
    print(f"Best trial value (loss): {best_val_loss}")
    best_params = best_config

### Step 6: Training on the best set of hyperparameters

Finally, a model is trained on the combined training and validation sets using the best hyperparameters found in the trials. Set the number of epochs explicitly, using the early-stopping behaviour observed during the trials as a guide.

Alternatively, the hyperparameters can be set directly instead of taking them from the trials.
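As a minimal sketch of setting the hyperparameters by hand: the dictionary below uses the keys that the later cells read from `final_config` but that `fixed_config` does not supply. Every value is an illustrative placeholder rather than a tuned result, and `REQUIRED_KEYS`, `check_params` and `manual_params` are hypothetical names that are not part of tkd-ar.

```python
# Keys the config cells read from final_config that fixed_config does not provide.
REQUIRED_KEYS = {
    'learning_rate', 'hidden_size', 'num_layers',
    'dropout_rate', 'reg_lambda', 'scheduler_type',
}


def check_params(params):
    """Return the sorted list of required keys missing from params."""
    return sorted(REQUIRED_KEYS - set(params))


# Placeholder values for illustration only; replace them with your own.
manual_params = {
    'learning_rate': 1e-3,
    'batch_size': 32,  # optional: overrides the default in fixed_config
    'hidden_size': 128,
    'num_layers': 2,
    'dropout_rate': 0.3,
    'reg_lambda': 1e-4,
    'scheduler_type': 'placeholder',  # assumption: use a name get_scheduler_params accepts
}

missing = check_params(manual_params)
if missing:
    raise KeyError(f"Missing hyperparameters: {missing}")
```

Assigning `best_params = manual_params` before the merge cell would then feed these values into `final_config` in place of the trial results.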

In [None]:
training_paths = [train_path, val_path] # Combined dataset
# Set the number of epochs directly, informed by where early stopping triggered in the trials
num_epochs = 100

Load the best set of hyperparameters from the selected trials.

In [None]:
fixed_config = {
    'input_size': input_size,
    'num_classes': num_classes,
    'model_type': model_type_dropdown.value,
    'bidirectional': bidirectional_dropdown.value,
    'dropout_method': dropout_method_dropdown.value,
    'layer_norm': layer_norm_dropdown.value,
    'checkpoint_path': temp_checkpoint_dir,
    'num_epochs': num_epochs,
    'skel_format': skel,
    'early_stopping_patience': 20,
    'early_stopping_criterion': 'loss',
    'loss_type': "weighted_focal_loss",  # Options: 'standard', 'weighted_loss', 'weighted_focal_loss', 'standard_focal_loss'
    'max_grad_norm': 5.0,
    'lr_warmup_epochs': 10,
    'save_checkpoints': False,
    'disable_print_statements': False,
    'batch_size': 32,
}


def get_final_config(params, fixed_config):
    final_config = fixed_config.copy()
    final_config.update(params)
    return final_config

# Both search approaches store their result in best_params, so the merge is identical.
if selected_approach in ('bayesian optimisation', 'random search'):
    final_config = get_final_config(best_params, fixed_config)
else:
    raise ValueError(f"Unknown approach: {selected_approach!r}")
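The merge gives the tuned parameters precedence over the fixed defaults, so a searched `batch_size` replaces the default of 32. A minimal sketch with placeholder dictionaries (`defaults` and `tuned` are illustrative names and values, not results from an actual trial):

```python
def get_final_config(params, fixed_config):
    # Copy first so the defaults are left untouched, then let params override them.
    final_config = fixed_config.copy()
    final_config.update(params)
    return final_config


defaults = {'batch_size': 32, 'max_grad_norm': 5.0}
tuned = {'batch_size': 64, 'learning_rate': 1e-3}

merged = get_final_config(tuned, defaults)
print(merged)    # {'batch_size': 64, 'max_grad_norm': 5.0, 'learning_rate': 0.001}
print(defaults)  # {'batch_size': 32, 'max_grad_norm': 5.0}
```

Because `copy()` is shallow, this pattern is safe here only while the config values are flat scalars and strings, as they are in `fixed_config`.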

Create the DataLoaders.

In [None]:
from data.datasets.action_dataset import ActionDataset, collate_fn
from torch.utils.data import DataLoader


print(f"Loading training and validation data from {training_paths}...")
combined_dataset = ActionDataset(
    skeleton=skel,
    preprocessed_data_path=training_paths,
    transform=transforms
)
combined_dataset.print_summary()


combined_loader = DataLoader(
    combined_dataset,
    batch_size=final_config['batch_size'],
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True,
    drop_last=True,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test_dataset,
    batch_size=final_config['batch_size'],
    shuffle=False,
    num_workers=num_workers,
    pin_memory=True,
    drop_last=False,
    collate_fn=collate_fn
)
test_dataset.print_summary()

Define the configs. To train on custom parameters, change the desired entries below.

In [None]:
model_config = {
    'input_size': final_config['input_size'],
    'hidden_size': final_config['hidden_size'],
    'num_layers': final_config['num_layers'],
    'num_classes': final_config['num_classes'],
    'model_type': final_config['model_type'],
    'bidirectional': final_config['bidirectional'],
    'dropout_method': final_config['dropout_method'],
    'layer_norm': final_config['layer_norm'],
    'dropout_rate': final_config['dropout_rate'],
    'checkpoint_path': final_config['checkpoint_path'],
}

learner_config = {
    'skel_format': final_config['skel_format'],
    'checkpoint_path': final_config['checkpoint_path'],
    'early_stopping_patience': final_config['early_stopping_patience'],
    'early_stopping_criterion': final_config['early_stopping_criterion'],
    'num_epochs': final_config['num_epochs'],
    'use_l1': False,
    'l1_lambda': 1e-5,
    'use_l2': True,
    'reg_lambda': final_config['reg_lambda'],
    'loss_type': final_config['loss_type'],
    'max_grad_norm': final_config['max_grad_norm'],
    'lr_warmup_epochs': final_config['lr_warmup_epochs'],
    'scheduler_type': final_config['scheduler_type'],
    'scheduler_params': get_scheduler_params(final_config['scheduler_type'], combined_loader),
    'save_checkpoints': final_config['save_checkpoints'],
    'disable_print_statements': final_config['disable_print_statements'],
}

parameters = (
    final_config['learning_rate'],
    final_config['batch_size'],
    final_config['hidden_size'],
    final_config['dropout_rate'],
    final_config['scheduler_type'],
)

Finally, run the training loop and observe the results.


In [None]:
from common.utils.learner_utils import compute_and_save_metrics


model = initialise_model(model_config)

optimiser = tutils.initialise_optimiser(
    model=model,
    learning_rate=final_config['learning_rate'],
)

learner = initialise_learner(learner_config, model, optimiser, combined_loader, None, test_loader, parameters)

learner.train(num_epochs=final_config['num_epochs'])

metrics = learner.evaluate(show_metrics=True)