# Custom Training Logic with Lightning Integration

In this example, we show how a user can define their own training logic and integrate it into the Lightning workflow.



## NeuroMANCER and Dependencies

### Install (Colab only)
Skip this step when running locally.

In [None]:
!pip install "neuromancer[examples] @ git+https://github.com/pnnl/neuromancer.git@master"
!pip install lightning 


### Import

You may need to install PyTorch Lightning first. If so, run:

```
pip install lightning
```

In [1]:
import torch
import torch.nn as nn
import numpy as np
import neuromancer.slim as slim
import matplotlib.pyplot as plt
import matplotlib.patheffects as patheffects
import casadi
import time
import lightning.pytorch as pl 


In [3]:
from neuromancer.trainer import Trainer, LitTrainer
from neuromancer.problem import Problem
from neuromancer.constraint import variable
from neuromancer.dataset import DictDataset
from neuromancer.loss import PenaltyLoss
from neuromancer.modules import blocks
from neuromancer.system import Node


# Problem formulation

In this example we solve a parametric constrained [Rosenbrock problem](https://en.wikipedia.org/wiki/Rosenbrock_function):

$$
\begin{align}
&\text{minimize } &&  (1-x)^2 + a(y-x^2)^2\\
&\text{subject to} && \left(\frac{p}{2}\right)^2 \le x^2 + y^2 \le p^2\\
& && x \ge y
\end{align}
$$

with parameters $p, a$ and decision variables $x, y$.
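
To make the formulation concrete, the objective and the three constraints can be evaluated for a single candidate point in plain Python. This is only a sketch for checking a solution by hand; the function names `rosenbrock_objective` and `constraint_violations` are hypothetical helpers and are not part of Neuromancer.

```python
def rosenbrock_objective(x, y, a):
    """Objective (1 - x)^2 + a * (y - x^2)^2."""
    return (1.0 - x) ** 2 + a * (y - x ** 2) ** 2


def constraint_violations(x, y, p):
    """Return the non-negative violation of each constraint (0.0 means satisfied)."""
    r2 = x ** 2 + y ** 2
    return {
        "inner_radius": max(0.0, (p / 2.0) ** 2 - r2),  # (p/2)^2 <= x^2 + y^2
        "outer_radius": max(0.0, r2 - p ** 2),          # x^2 + y^2 <= p^2
        "x_ge_y": max(0.0, y - x),                      # x >= y
    }


# The unconstrained Rosenbrock minimizer (x, y) = (1, 1), checked for a = 1, p = 2
obj_value = rosenbrock_objective(1.0, 1.0, a=1.0)
violations = constraint_violations(1.0, 1.0, p=2.0)
print(obj_value, violations)

# The same point with p = 1 lies outside the outer circle
violations_small_p = constraint_violations(1.0, 1.0, p=1.0)
print(violations_small_p)
```

For $p = 2$ the point $(1, 1)$ is feasible, while for $p = 1$ it violates the outer-radius constraint, which is why the optimal solution depends on the parameters.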


# Lightning Workflow

The workflow when using Lightning consists of three parts: 

1. Defining a `data_setup_function()` -- this function should return four values (train, dev, and test datasets, plus the batch size). The datasets should be named Neuromancer DictDatasets. 
2. Defining the Problem -- consisting of Nodes, System, and Loss. 
3. Instantiating the PyTorch Lightning-based trainer (the LitTrainer class).

For this notebook, we assume all operations are done on the CPU. 

### Lightning Dataset

We construct the dataset by sampling the parameter space.

In [4]:
data_seed = 408  # random seed used for simulated data
np.random.seed(data_seed)
torch.manual_seed(data_seed)
nsim = 5000  # number of datapoints: increase sample density for more robust results

# create dictionaries with sampled datapoints with uniform distribution
a_low, a_high, p_low, p_high = 0.2, 1.2, 0.5, 2.0

We define the **data_setup_function()** below. It randomly samples the parameters from uniform distributions: $0.5\le p\le2.0$;  $0.2\le a\le1.2$. It takes the number of samples and the sampling bounds as inputs and returns Neuromancer DictDatasets for the train, dev, and test data (or None for a split that is not used), as well as the batch size. The batch size is hardcoded to 64 in this case. 

It is important to define both training and dev/validation datasets. The training dataset is used for the training step; the dev dataset is used for model checkpointing (if desired).

In [5]:

def data_setup_function(nsim, a_low, a_high, p_low, p_high): 

    
    samples_train = {"a": torch.FloatTensor(nsim, 1).uniform_(a_low, a_high),
                    "p": torch.FloatTensor(nsim, 1).uniform_(p_low, p_high)}
    samples_dev = {"a": torch.FloatTensor(nsim, 1).uniform_(a_low, a_high),
                "p": torch.FloatTensor(nsim, 1).uniform_(p_low, p_high)}
    samples_test = {"a": torch.FloatTensor(nsim, 1).uniform_(a_low, a_high),
                "p": torch.FloatTensor(nsim, 1).uniform_(p_low, p_high)}
    # create named dictionary datasets
    train_data = DictDataset(samples_train, name='train')
    dev_data = DictDataset(samples_dev, name='dev')
    test_data = DictDataset(samples_test, name='test')

    batch_size = 64

    # Return the dict datasets in train, dev, test order, followed by batch_size 
    return train_data, dev_data, test_data, batch_size 
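
As a quick sanity check on the chosen sizes, the number of batches per epoch follows from `nsim` and `batch_size`. The sketch below assumes the final, partially filled batch is kept rather than dropped, which is consistent with the progress bars (`79/79`) and the inferred batch size of 8 reported in the training logs later in this notebook.

```python
import math

nsim = 5000        # number of samples per split, as set above
batch_size = 64    # hardcoded in data_setup_function

# number of batches per epoch when the last partial batch is kept
n_batches = math.ceil(nsim / batch_size)
# size of that last partial batch
last_batch_size = nsim - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)
```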



We now define the **Problem()**

## Primal Solution Map Architecture

A neural network mapping problem parameters onto primal decision variables:  
$$x = \pi(\theta)$$

In [6]:
# define neural architecture for the trainable solution map
func = blocks.MLP(insize=2, outsize=2,
                bias=True,
                linear_map=slim.maps['linear'],
                nonlin=nn.ReLU,
                hsizes=[80] * 4)
# wrap neural net into symbolic representation of the solution map via the Node class: sol_map(xi) -> x
sol_map = Node(func, ['a', 'p'], ['x'], name='map')
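
The size of this network can be estimated by hand. The sketch below counts weights and biases under the assumption that every layer of the MLP is a plain dense linear layer with a bias term; `mlp_param_count` is a hypothetical helper, not a Neuromancer function. The result agrees with the 19.8 K trainable parameters and 0.079 MB reported in the model summary of the training logs later in this notebook.

```python
def mlp_param_count(insize, hsizes, outsize, bias=True):
    """Count parameters of a fully connected network with the given layer sizes."""
    sizes = [insize] + list(hsizes) + [outsize]
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out          # weight matrix
        if bias:
            total += n_out             # bias vector
    return total


n_params = mlp_param_count(insize=2, hsizes=[80] * 4, outsize=2)
size_mb = n_params * 4 / 1e6           # 4 bytes per float32 parameter

print(n_params, round(size_mb, 3))
```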

## Objective and Constraints in NeuroMANCER

In [7]:
"""
variable is a basic symbolic abstraction in Neuromancer
   x = variable("variable_name")                      (instantiates new variable)  
variable construction supports:
   algebraic expressions:     x**2 + x**3 + 5     (instantiates new variable)  
   slicing:                   x[:, i]             (instantiates new variable)  
   pytorch callables:         torch.sin(x)        (instantiates new variable)  
   constraints definition:    x <= 1.0            (instantiates Constraint object) 
   objective definition:      x.minimize()        (instantiates Objective object) 
to visualize computational graph of the variable use x.show() method          
"""

# define decision variables
x1 = variable("x")[:, [0]]
x2 = variable("x")[:, [1]]
# problem parameters sampled in the dataset
p = variable('p')
a = variable('a')

# objective function
f = (1-x1)**2 + a*(x2-x1**2)**2
obj = f.minimize(weight=1.0, name='obj')

# constraints
Q_con = 100.  # constraint penalty weights
con_1 = Q_con*(x1 >= x2)
con_2 = Q_con*((p/2)**2 <= x1**2+x2**2)
con_3 = Q_con*(x1**2+x2**2 <= p**2)
con_1.name = 'c1'
con_2.name = 'c2'
con_3.name = 'c3'

In [11]:
# constrained optimization problem construction
objectives = [obj]
constraints = [con_1, con_2, con_3]
components = [sol_map]

# create penalty method loss function
loss = PenaltyLoss(objectives, constraints)
# construct constrained optimization problem
problem = Problem(components, loss)
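
The idea behind a penalty-method loss can be illustrated without any library code: the objective value is augmented by weighted constraint violations, so infeasible points become expensive. The sketch below is the textbook form for a single sample and is an assumption for illustration only; Neuromancer's `PenaltyLoss` aggregates over batches and its exact norm and averaging may differ. The helper `penalty_loss` is hypothetical.

```python
def penalty_loss(objective, constraint_values, weight=100.0):
    """Textbook penalty loss for constraints written as g(x) <= 0.

    constraint_values holds g(x) for each constraint; only positive
    entries (violations) contribute to the penalty.
    """
    penalty = sum(max(0.0, g) for g in constraint_values)
    return objective + weight * penalty


# feasible point: all g(x) <= 0, so the loss equals the objective
loss_feasible = penalty_loss(0.5, [-0.3, -1.0, 0.0])
# slightly infeasible point: one constraint violated by 0.02
loss_infeasible = penalty_loss(0.5, [-0.3, 0.02, -1.0])

print(loss_feasible, loss_infeasible)
```

With a weight of 100, matching `Q_con` above, even a small violation dominates the objective, which pushes the trained solution map toward feasible outputs.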

# Custom Training Logic
Training within the PyTorch Lightning framework is defined by a `training_step` function, which specifies how to go from a data batch to a loss. For example, the default `training_step` is shown below (extraneous details removed for simplicity). Here, we get the problem output for the given batch and return the loss associated with that output.

```
def training_step(self, batch):
    output = self.problem(batch)
    loss = output[self.train_metric]
    return loss
```
While rare, there may be instances where the user wants to define their own training logic. Potential cases include test-time data augmentation (e.g. operations on, or with respect to, the data rollout), other domain augmentations, or modifications to how the output and/or loss is handled. 

The user can pass in their own training step by supplying an equivalent function handle to the `custom_training_step` keyword of LitTrainer, for example: 

```
def custom_training_step(model, batch): 
    output = model.problem(batch)
    Q_con = 1
    if model.current_epoch > 1: 
        Q_con = 1/10000
    loss = Q_con*(output[model.train_metric])
    return loss
```

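The signature of this function should be `custom_training_step(model, batch)`, where `model` is the Lightning module that wraps the Neuromancer Problem. As used in the example, it exposes the problem as `model.problem`, the name of the training loss as `model.train_metric`, and the current epoch as `model.current_epoch`.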
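
Because the step function only relies on a few attributes of `model`, its logic can be checked in isolation before launching a training run. The sketch below uses a stand-in object instead of a real Lightning module; the stand-in, the `fake_problem` function, and the metric name `"train_loss"` are assumptions made for illustration.

```python
from types import SimpleNamespace


def custom_training_step(model, batch):
    output = model.problem(batch)
    Q_con = 1
    if model.current_epoch > 1:
        Q_con = 1 / 10000
    loss = Q_con * (output[model.train_metric])
    return loss


def fake_problem(batch):
    # hypothetical stand-in for Problem.forward: returns a fixed loss value
    return {"train_loss": 2.0}


# hypothetical stand-in for the Lightning module passed in by LitTrainer
stub_model = SimpleNamespace(problem=fake_problem,
                             train_metric="train_loss",
                             current_epoch=0)

loss_epoch0 = custom_training_step(stub_model, batch={})
stub_model.current_epoch = 2
loss_epoch2 = custom_training_step(stub_model, batch={})

print(loss_epoch0, loss_epoch2)
```

The loss is passed through unchanged during epochs 0 and 1 and is scaled by 1/10000 from epoch 2 onward.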

In [10]:
def custom_training_step(model, batch): 
    output = model.problem(batch)
    Q_con = 1
    if model.current_epoch > 1: 
        Q_con = 1/10000    
    loss = Q_con*(output[model.train_metric])
    return loss

lit_trainer = LitTrainer(epochs=100, accelerator='cpu', patience=3, custom_training_step=custom_training_step)
lit_trainer.fit(problem=problem, data_setup_function=data_setup_function, nsim=nsim, a_low=0.2, a_high=1.2, p_low=0.5, p_high=2.0)


GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Missing logger folder: /home/birm560/neuromancer/examples/lightning_integration_examples/other_examples/lightning_logs
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory ./ exists and is not empty.

  | Name    | Type    | Params
------------------------------------
0 | problem | Problem | 19.8 K
------------------------------------
19.8 K    Trainable params
0         Non-trainable params
19.8 K    Total params
0.079     Total estimated model params size (MB)


                                                                           

/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=159` in the `DataLoader` to improve performance.
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 64. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=159` in the `DataLoader` to improve performance.


Epoch 0: 100%|██████████| 79/79 [00:00<00:00, 121.77it/s, v_num=0, train_loss_step=0.903]

/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 8. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.


Epoch 0: 100%|██████████| 79/79 [00:01<00:00, 73.36it/s, v_num=0, train_loss_step=0.903, dev_loss=0.815, train_loss_epoch=5.690]

Epoch 0, global step 79: 'dev_loss' reached 0.81470 (best 0.81470), saving model to './epoch=0-step=79.ckpt' as top 1


Epoch 1:  51%|█████     | 40/79 [00:00<00:00, 121.57it/s, v_num=0, train_loss_step=0.714, dev_loss=0.815, train_loss_epoch=5.690]

/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...


Below is another example of a dummy `custom_training_step`. Here we add the loss of the previous batch to the "current" loss, accumulating it across steps. (Again, this is a dummy example and not necessarily a proper ML technique.) Any state variables, such as `past_loss`, can be defined by setting them as attributes of `model`.

In [14]:
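def custom_training_step(model, batch): 
    # (Re)initialize the running loss. Note that this runs on every step of epoch 0,
    # so the accumulation below only takes effect from epoch 1 onward.
    if model.current_epoch == 0: 
        model.past_loss = 0
    
    output = model.problem(batch)
    # add half of the previous step's (accumulated) loss to the current loss
    loss = (output[model.train_metric]) + 0.5*model.past_loss
    # store a plain float so that no computation graph is carried between steps
    model.past_loss = loss.item()
    return loss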

lit_trainer = LitTrainer(epochs=100, accelerator='cpu', patience=3, custom_training_step=custom_training_step)
lit_trainer.fit(problem=problem, data_setup_function=data_setup_function, nsim=nsim, a_low=0.2, a_high=1.2, p_low=0.5, p_high=2.0)


GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/setup.py:187: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory ./ exists and is not empty.

  | Name    | Type    | Params
------------------------------------
0 | problem | Problem | 19.8 K
------------------------------------
19.8 K    Trainable params
0         Non-trainable params
19.8 K    Total params
0.079     Total estimated model params size (MB)


                                                                            

/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=159` in the `DataLoader` to improve performance.
/home/birm560/miniconda3/envs/neuromancer3/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=159` in the `DataLoader` to improve performance.


Epoch 0: 100%|██████████| 79/79 [00:01<00:00, 75.73it/s, v_num=3, train_loss_step=0.0857, dev_loss=0.0834, train_loss_epoch=0.137]

Epoch 0, global step 79: 'dev_loss' reached 0.08340 (best 0.08340), saving model to './epoch=0-step=79-v3.ckpt' as top 1


Epoch 1: 100%|██████████| 79/79 [00:01<00:00, 74.35it/s, v_num=3, train_loss_step=0.213, dev_loss=0.0887, train_loss_epoch=0.282] 

Epoch 1, global step 158: 'dev_loss' was not in top 1


Epoch 2: 100%|██████████| 79/79 [00:01<00:00, 75.86it/s, v_num=3, train_loss_step=0.309, dev_loss=0.138, train_loss_epoch=0.284]  

Epoch 2, global step 237: 'dev_loss' was not in top 1


Epoch 3: 100%|██████████| 79/79 [00:01<00:00, 75.04it/s, v_num=3, train_loss_step=0.206, dev_loss=0.0908, train_loss_epoch=0.243]

Epoch 3, global step 316: 'dev_loss' was not in top 1


Epoch 3: 100%|██████████| 79/79 [00:01<00:00, 73.78it/s, v_num=3, train_loss_step=0.206, dev_loss=0.0908, train_loss_epoch=0.243]
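
The recurrence used by the second custom step above, `loss = raw + 0.5*past_loss`, can be traced by hand with plain floats. This is a sketch of the arithmetic only; the helper `accumulate` and the constant raw loss values are hypothetical. In the cell above, `past_loss` is reset on every step of epoch 0, so the trace corresponds to steps from epoch 1 onward.

```python
def accumulate(raw_losses, factor=0.5):
    """Apply loss_k = raw_k + factor * loss_{k-1}, starting from past loss 0."""
    past = 0.0
    reported = []
    for raw in raw_losses:
        loss = raw + factor * past
        past = loss              # mirrors model.past_loss = loss.item()
        reported.append(loss)
    return reported


trace = accumulate([1.0, 1.0, 1.0, 1.0])
print(trace)
```

For a constant raw loss $r$ and factor 0.5, the recurrence has the fixed point $L = r + 0.5L$, i.e. $L = 2r$, so the value returned by this dummy step settles at roughly twice the raw problem loss.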
