## **Running the models using the 'modelling' package**

A notebook through which different modelling configurations can be run, using the ``modelling`` package. It follows the steps of:
- preparing packages;
- setting "global" variables;
- getting the data;
- defining hyperparameters;
- running a grid search and/or training a model; and
- evaluation.

In the ``modelling`` package, the models and training functions can be modified to experiment. Remember to restart the notebook kernel after making changes there, so that the updated package is re-imported.

For future models, I suggest embedding the training/testing functions in a Model class instead of keeping them separate from the models. (Ideally with inheritance from a base class, so that code duplication stays minimal.) This way, the training procedure can easily be tailored per model. In the current set-up, different functions have to be called for fully-connected networks and hierarchical networks, because they handle the data differently. The investment would also pay off when implementing physics-informed models, which require injecting physics into the training procedure itself. In that case, tight coupling between model and training code is strongly preferable to the current state of this file. I'd therefore first restructure the code so that training works per model, and so that only functionality that is truly independent of the model type stays loosely coupled from the models, which facilitates scalable experimentation.
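As a rough illustration of that suggestion, the sketch below uses plain Python (no PyTorch) to show the shape of such a design: a base class owns the shared training loop, and each subclass only supplies its model-specific step. The names `BaseForecaster`, `MeanForecaster`, `fit` and `training_step` are hypothetical and do not exist in the ``modelling`` package.

```python
from abc import ABC, abstractmethod


class BaseForecaster(ABC):
    """Hypothetical base class that owns the loop shared by all model types."""

    def fit(self, batches, epochs):
        """Run the shared training loop and return the mean loss per epoch."""
        history = []
        for _ in range(epochs):
            losses = [self.training_step(u, y) for u, y in batches]
            history.append(sum(losses) / len(losses))
        return history

    @abstractmethod
    def training_step(self, u, y):
        """Model-specific update for one batch; must return the batch loss."""


class MeanForecaster(BaseForecaster):
    """Toy subclass standing in for a GRU- or HGRU-specific trainer."""

    def __init__(self, lr=0.5):
        self.estimate = 0.0
        self.lr = lr

    def training_step(self, u, y):
        # This toy ignores the inputs u and moves its single parameter
        # towards the mean of the batch targets.
        target = sum(y) / len(y)
        error = self.estimate - target
        self.estimate -= self.lr * error
        return error ** 2


# Two toy batches of (input, target) pairs
batches = [([0.0], [1.0, 3.0]), ([0.0], [2.0, 2.0])]
model = MeanForecaster()
history = model.fit(batches, epochs=20)
```

A hierarchical or physics-informed model would then override only `training_step`, while logging, early stopping and saving could stay in the base class.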

Throughout the notebook, there are print statements to help trace potential errors when running on Habrok.

In [1]:
print("Starting script...")

from modelling import *
from modelling import GRU
from modelling import HGRU

import os
from pathlib import Path
import datetime
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data import ConcatDataset

Starting script...

Running __init__.py for data pipeline...
Modelling package initialized



Use GPU when available

In [2]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print("Device: ", device)

Device:  cpu


### **Set "global" variables**

In [21]:
HABROK = False                    # set to True if using HABROK; it will print
                                  # all stdout to a .txt file to log progress

BASE_DIR = Path.cwd()
MODEL_PATH = BASE_DIR / "results" / "models"
MINMAX_PATH = BASE_DIR.parent / "data" / "data_combined" / "contaminant_minmax.csv"

print("BASE_DIR: ", BASE_DIR)
print("MODEL_PATH: ", MODEL_PATH)
print("MINMAX_PATH: ", MINMAX_PATH)

torch.manual_seed(34)             # set seed for reproducibility

N_HOURS_U = 72                    # number of hours to use for input
N_HOURS_Y = 24                    # number of hours to predict
N_HOURS_STEP = 24                 # "sampling rate" in hours of the data; e.g. 24
                                  # means sample an I/O-pair every 24 hours

                                  # the contaminants and meteorological vars
CONTAMINANTS = ['NO2', 'O3', 'PM10', 'PM25']
COMPONENTS = ['NO2', 'O3', 'PM10', 'PM25', 'SQ', 'WD', 'Wvh', 'dewP', 'p', 'temp']

BASE_DIR:  /home/nick/bachelor-project/forecasting_smog_DL_GNN/src
MODEL_PATH:  /home/nick/bachelor-project/forecasting_smog_DL_GNN/src/results/models
MINMAX_PATH:  /home/nick/bachelor-project/forecasting_smog_DL_GNN/data/data_combined/contaminant_minmax.csv
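To make the roles of `N_HOURS_U`, `N_HOURS_Y` and `N_HOURS_STEP` concrete, here is a minimal sketch of how such settings could translate into input/output windows. It assumes that each output window starts directly after its input window; this has not been checked against `TimeSeriesDataset`, and `window_bounds` is a hypothetical helper.

```python
def window_bounds(n_hours_total, n_hours_u, n_hours_y, n_hours_step):
    """Return (input_start, input_end, output_end) hour indices per I/O pair.

    Assumption: the output window directly follows the input window.
    """
    pairs = []
    start = 0
    # Keep sliding while a full input + output window still fits in the data
    while start + n_hours_u + n_hours_y <= n_hours_total:
        input_end = start + n_hours_u
        pairs.append((start, input_end, input_end + n_hours_y))
        start += n_hours_step
    return pairs


# One week of hourly data with the settings used in this notebook
pairs = window_bounds(n_hours_total=7 * 24, n_hours_u=72, n_hours_y=24, n_hours_step=24)
for input_start, input_end, output_end in pairs:
    print(f"input hours [{input_start}, {input_end}) -> output hours [{input_end}, {output_end})")
```

Under these assumptions, one week of hourly data yields four pairs, each shifted by one day relative to the previous one.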


### **Load in data and create PyTorch *Datasets***

In [4]:
# Load in data and create PyTorch Datasets. To tune
# which exact .csv files get extracted, change the
# lists in the get_dataframes() definition

train_input_frames = get_dataframes('train', 'u')
train_output_frames = get_dataframes('train', 'y')

val_input_frames = get_dataframes('val', 'u')
val_output_frames = get_dataframes('val', 'y')

test_input_frames = get_dataframes('test', 'u')
test_output_frames = get_dataframes('test', 'y')

print("Successfully loaded data")

Successfully loaded data


In [5]:
train_dataset = TimeSeriesDataset(
    train_input_frames,  # list of input training dataframes
    train_output_frames, # list of output training dataframes
    5,                   # number of dataframes passed in each list; must
                         # equal both len(train_input_frames) and
                         # len(train_output_frames)
    N_HOURS_U,           # number of hours of input data
    N_HOURS_Y,           # number of hours of output data
    N_HOURS_STEP,        # number of hours between each input/output pair
)
val_dataset = TimeSeriesDataset(
    val_input_frames,    # etc.
    val_output_frames,
    3,
    N_HOURS_U,
    N_HOURS_Y,
    N_HOURS_STEP,
)
test_dataset = TimeSeriesDataset(
    test_input_frames,
    test_output_frames,
    3,
    N_HOURS_U,
    N_HOURS_Y,
    N_HOURS_STEP,
)

del train_input_frames, train_output_frames
del val_input_frames, val_output_frames
del test_input_frames, test_output_frames

### **Define hyperparameters**

In [6]:
# Here, all (hyper)parameters are defined. The hyperparameters are defined in
# a dictionary, which is then passed to the model and the training functions.
# The grid search is performed by generating all possible combinations of the
# hyperparameters defined in the hp_space dictionary, and then performing k-fold cross
# validation on each of these configurations. The best configuration is then returned.
# When the search is finished, comment out the hp_space dictionary and save the best found
# hyperparameters in the hp dictionary, and train the final model with these.

hp = {
    'n_hours_u' : N_HOURS_U,
    'n_hours_y' : N_HOURS_Y,

    'model_class' : HGRU,
    'input_units' : train_dataset.__n_features_in__(),
    'hidden_layers' : 4,
    'hidden_units' : 64,
    'branches' : 4,
    'output_units' : train_dataset.__n_features_out__(),

    'Optimizer' : torch.optim.Adam,
    'lr_shared' : 1e-3,
    'scheduler' : torch.optim.lr_scheduler.ReduceLROnPlateau,
    'scheduler_kwargs' : {'mode' : 'min',
                          'factor' : 0.1,
                          'patience' : 3,
                          'cooldown' : 8,
                          'verbose' : True},
    'w_decay' : 1e-7,
    'loss_fn' : torch.nn.MSELoss(),

    'epochs' : 5000,
    'early_stopper' : EarlyStopper,
    'patience' : 20,
    'batch_sz' : 16,
    'k_folds' : 5,
}                                   # The lr for the branched layer(s) is calculated
                                    # based on the "power ratio" between the branched
                                    # part of the network and the shared layer, which
                                    # is *assumed* to be proportional to n_hidden_layers
hp['lr_branch'] = hp['lr_shared'] * hp['hidden_layers']

hp_space = []                       # grid search space, put in the hyperparameters
                                    # to search over here
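The grid-search idea described in the comments above (all combinations of the search space, each evaluated with k-fold cross-validation) can be sketched as follows. This is not the package's `grid_search`: `expand_grid` and `k_fold_indices` are hypothetical helpers, and the sketch assumes the search space is a dictionary mapping each hyperparameter name to a list of candidate values.

```python
from itertools import product


def expand_grid(space):
    """Turn {name: [candidates]} into a list of {name: value} configurations."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


# Hypothetical search space; the values are for illustration only
example_space = {'hidden_units': [32, 64], 'lr_shared': [1e-3, 1e-4], 'batch_sz': [16]}
configs = expand_grid(example_space)
splits = list(k_fold_indices(n_samples=10, k=5))
print(len(configs), "configurations,", len(splits), "folds each")
```

Each configuration would then be trained once per fold, and the configuration with the lowest mean validation loss kept. Note that the number of training runs grows as the product of the list lengths times `k`.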

### **Start hyperparameter search/training**

In [7]:
print("starting training...")

current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
stdout_location = f'results/grid_search_exe_s/exe_of_HGRU_at_{current_time}.txt'
# train_dataset_full = ConcatDataset([train_dataset, val_dataset])
#                                     If HABROK, print to external file, else print to stdout
# with PrintManager(stdout_location, 'a', HABROK):
#     print(f"Grid search execution of HGRU at {current_time}\n")
#                                     # Train on the full training set
#     model, best_hp, val_loss = grid_search(hp, hp_space, train_dataset_full, True)
#                                     # Externally save the best model
#     torch.save(model.state_dict(), f"{MODEL_PATH}/model_HGRU.pth")

#     hp = update_dict(hp, best_hp)   # Update the hp dictionary with the best hyperparameters
#     print_dict_vertically(best_hp)

starting training...
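The `HABROK` flag decides whether output goes to a log file or to stdout. A minimal sketch of how such conditional redirection can be built with the standard library is shown below. It is not the implementation of `PrintManager`; the helper `redirect_if` and the file name `progress.txt` are hypothetical.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def redirect_if(path, mode, enabled):
    """Redirect stdout to `path` while the block runs, but only if `enabled`."""
    if not enabled:
        yield                       # leave stdout untouched
        return
    with open(path, mode, encoding='utf-8') as log_file:
        with contextlib.redirect_stdout(log_file):
            yield


log_path = os.path.join(tempfile.mkdtemp(), "progress.txt")

with redirect_if(log_path, 'a', True):
    print("written to the log file")

with redirect_if(log_path, 'a', False):
    print("written to stdout")

with open(log_path, encoding='utf-8') as f:
    logged = f.read()
```

Opening the file in append mode (`'a'`) lets several cells write to the same log in sequence, which matches how the cells below reuse `stdout_location`.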


Lay out model architecture with optimal hyperparameters

In [8]:
with PrintManager(stdout_location, 'a', HABROK):
    print("\nPrinting model:")
    model = HGRU(hp['n_hours_u'],
                 hp['n_hours_y'],
                 hp['input_units'],
                 hp['hidden_layers'],
                 hp['hidden_units'], 
                 hp['branches'],
                 hp['output_units'])
    print(model)


Printing model:
HGRU(
  (input_layer): GRU(10, 64, batch_first=True)
  (shared_layer): GRU(64, 64, batch_first=True)
  (branches): ModuleList(
    (0-3): 4 x Branch(
      (layers): ModuleList(
        (0): GRU(64, 16, batch_first=True)
        (1): GRU(16, 16, num_layers=3, batch_first=True)
        (2): Linear(in_features=16, out_features=1, bias=True)
      )
    )
  )
)


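Train the model on the training set, with the validation set used to monitor the loss (training on train + validation combined would require the `ConcatDataset` line that is commented out above)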

In [9]:
train_loader = DataLoader(train_dataset, batch_size = hp['batch_sz'], shuffle = True)
val_loader = DataLoader(val_dataset, batch_size = hp['batch_sz'], shuffle = False) 
                                            
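                                        # Train the final model on the training set while
                                        # monitoring the validation set, save the final model,
                                        # and save the losses for plotting (note: `test_losses`
                                        # holds the validation losses, printed as Lval)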
with PrintManager(stdout_location, 'a', HABROK):
    print("\nTraining on full training set...")
    model_final, train_losses, test_losses, shared_losses, branch_losses = \
        train_hierarchical(hp, train_loader, val_loader, True)
    torch.save(model_final.state_dict(), f'{MODEL_PATH}/model_HGRU.pth')

df_losses = pd.DataFrame({'L_train': train_losses, 'L_test': test_losses})
df_losses.to_csv(BASE_DIR / "results" / "final_losses" / f"losses_HGRU_at_{current_time}.csv",
                 sep = ';', decimal = '.', encoding = 'utf-8')


Training on full training set...
Epoch: 1 	Ltrain: 0.011305 	Lval: 0.010877
Epoch: 2 	Ltrain: 0.011735 	Lval: 0.012734
Epoch: 3 	Ltrain: 0.011082 	Lval: 0.011529
Epoch: 4 	Ltrain: 0.011007 	Lval: 0.010642
Epoch: 5 	Ltrain: 0.009112 	Lval: 0.007835
Epoch: 6 	Ltrain: 0.008229 	Lval: 0.007742
Epoch: 7 	Ltrain: 0.006938 	Lval: 0.007301
Epoch: 8 	Ltrain: 0.006323 	Lval: 0.005732
Epoch: 9 	Ltrain: 0.005777 	Lval: 0.006547
Epoch: 10 	Ltrain: 0.005304 	Lval: 0.005400
Epoch: 11 	Ltrain: 0.005326 	Lval: 0.004559
Epoch: 12 	Ltrain: 0.004814 	Lval: 0.004823
Epoch: 13 	Ltrain: 0.004650 	Lval: 0.005044
Epoch: 14 	Ltrain: 0.004256 	Lval: 0.004272
Epoch: 15 	Ltrain: 0.004208 	Lval: 0.004502
Epoch: 16 	Ltrain: 0.004401 	Lval: 0.004709
Epoch: 17 	Ltrain: 0.004201 	Lval: 0.003773
Epoch: 18 	Ltrain: 0.003911 	Lval: 0.003780
Epoch: 19 	Ltrain: 0.003865 	Lval: 0.003776
Epoch: 20 	Ltrain: 0.003925 	Lval: 0.004207
Epoch 00021: reducing learning rate of group 0 to 1.0000e-04.
Epoch 00021: reducing learning ra

#### **Testing the model**

In [12]:
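model_final = HGRU(hp['n_hours_u'], hp['n_hours_y'],  # same argument order as
                   hp['input_units'],                 # when the model was built
                   hp['hidden_layers'], hp['hidden_units'],
                   hp['branches'], hp['output_units'])
                                        # map_location lets a model saved on GPU load on CPU
model_final.load_state_dict(torch.load(f"{MODEL_PATH}/model_HGRU.pth", map_location=device))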
print(model_final)

HGRU(
  (input_layer): GRU(10, 64, batch_first=True)
  (shared_layer): GRU(64, 64, batch_first=True)
  (branches): ModuleList(
    (0-3): 4 x Branch(
      (layers): ModuleList(
        (0): GRU(64, 16, batch_first=True)
        (1): GRU(16, 16, num_layers=3, batch_first=True)
        (2): Linear(in_features=16, out_features=1, bias=True)
      )
    )
  )
)


In [13]:
test_loader = DataLoader(test_dataset, batch_size = hp['batch_sz'], shuffle = False) 
test_error = test_hierarchical(model_final, nn.MSELoss(), test_loader)

with PrintManager(stdout_location, 'a', HABROK):
    print()
    print("Testing MSE:", test_error)


Testing MSE: 0.002871926175430417


In [22]:
print(test_hierarchical(model_final, nn.MSELoss(), train_loader))
print(test_hierarchical(model_final, nn.MSELoss(), val_loader))
print(test_hierarchical(model_final, nn.MSELoss(), test_loader))

print("\nMSE Training set:")
print_dict_vertically(
    test_hierarchical_separately(model_final, nn.MSELoss(), train_loader, True, MINMAX_PATH)
)
print("\nMSE Validation set:")
print_dict_vertically(
    test_hierarchical_separately(model_final, nn.MSELoss(), val_loader, True, MINMAX_PATH)
)
print("\nMSE Test set:")
print_dict_vertically(
    test_hierarchical_separately(model_final, nn.MSELoss(), test_loader, True, MINMAX_PATH)
)

0.003586217119335765
0.0035552000238870582
0.002871926175430417

MSE Training set:
NO2 : 79.17088792382216
O3  : 92.93203316665277
PM10: 116.6365341558689
PM25: 28.721079989177426

MSE Validation set:
NO2 : 57.7382615407308
O3  : 107.65130297342937
PM10: 132.43575795491537
PM25: 37.195351918538414

MSE Test set:
NO2 : 49.507030169169106
O3  : 78.67367490132649
PM10: 97.43037923177083
PM25: 30.48535426457723


In [23]:
print("\nRMSE Training set:")
print_dict_vertically_root(
    test_hierarchical_separately(model_final, nn.MSELoss(), train_loader, True, MINMAX_PATH)
)
print("\nRMSE Validation set:")
print_dict_vertically_root(
    test_hierarchical_separately(model_final, nn.MSELoss(), val_loader, True, MINMAX_PATH)
)
print("\nRMSE Test set:")
print_dict_vertically_root(
    test_hierarchical_separately(model_final, nn.MSELoss(), test_loader, True, MINMAX_PATH)
)
np.sqrt(test_hierarchical(model_final, nn.MSELoss(), test_loader, True, MINMAX_PATH))


RMSE Training set:


NO2 : 8.897802441950919
O3  : 9.640126220066746
PM10: 10.799839547368762
PM25: 5.359205163937785

RMSE Validation set:
NO2 : 7.598569703617306
O3  : 10.375514588367624
PM10: 11.508073598779049
PM25: 6.098799219398718

RMSE Test set:
NO2 : 7.036123234364866
O3  : 8.869818200015516
PM10: 9.870682814870044
PM25: 5.521354386794713


8.00150654435589
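The relation between the MSE and RMSE numbers printed above can be checked by hand. The sketch below copies the four test-set MSE values from the earlier output and compares two ways of aggregating them. It only inspects the printed numbers and makes no claim about how `test_hierarchical` computes its result internally.

```python
import math

# Per-contaminant test-set MSE values, copied from the output above
test_mse = {
    'NO2': 49.507030169169106,
    'O3': 78.67367490132649,
    'PM10': 97.43037923177083,
    'PM25': 30.48535426457723,
}

# RMSE per contaminant is the square root of the corresponding MSE
test_rmse = {name: math.sqrt(value) for name, value in test_mse.items()}

# Two ways to summarise into a single number
rmse_of_mean_mse = math.sqrt(sum(test_mse.values()) / len(test_mse))
mean_of_rmse = sum(test_rmse.values()) / len(test_rmse)

print(test_rmse)
print(rmse_of_mean_mse)   # close to the 8.0015 printed above
print(mean_of_rmse)       # about 7.82, noticeably lower
```

The overall value printed by the notebook closely matches the square root of the mean MSE, not the mean of the four RMSEs. The distinction matters when comparing this number with results reported elsewhere.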

In [25]:
pair = 5
plot_pred_vs_gt(model_final, test_dataset, pair, 'NO2', N_HOURS_Y)
plot_pred_vs_gt(model_final, test_dataset, pair, 'O3', N_HOURS_Y)
plot_pred_vs_gt(model_final, test_dataset, pair, 'PM10', N_HOURS_Y)
plot_pred_vs_gt(model_final, test_dataset, pair, 'PM25', N_HOURS_Y)

AttributeError: 'list' object has no attribute 'clone'