# Benchmarking

### Setup

The Mila Research Template leverages built-in PyTorch and Lightning functionality to make benchmarking accessible and flexible.

In [3]:
import os
import rootutils

home_dir = rootutils.find_root(search_from="profiling.ipynb", indicator=".git")
%cd $home_dir
print(os.getcwd())

/home/mila/c/cesar.valdez/idt/ResearchTemplate
/home/mila/c/cesar.valdez/idt/ResearchTemplate




## Finding bottlenecks: Dataloading vs Training

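One use of benchmarking is to find out whether a run is limited by data loading or by training. Running the `NoOp` algorithm, which performs no training, measures data loading alone; running the same configuration with the `example` algorithm adds the training step, so the two runs can be compared.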

In [4]:
!python project/main.py \
    algorithm=NoOp \
    trainer.max_epochs=1 \
    +trainer.limit_train_batches=0.01 \
    +trainer.limit_val_batches=0.01 \
    datamodule=imagenet

CONFIG
├── algorithm
│   └── _target_: hydra_zen.funcs.zen_processing
│       _zen_target: project.algorithms.no_op.NoOp
│       _zen_partial: true
│       _zen_wrappers: hydra_zen.third_party.pydantic.pydantic_parser
│
├── network
│   └── _target_: torchvision.models.resnet.resnet18
│       weights:

In [10]:
!HYDRA_FULL_ERROR=1 python project/main.py \
    algorithm=example \
    trainer.max_epochs=1 \
    +trainer.limit_train_batches=0.01 \
    +trainer.limit_val_batches=0.01 \
    datamodule=imagenet

CONFIG
├── algorithm
│   └── _target_: project.algorithms.example.ExampleAlgorithm
│       _partial_: true
│
├── network
│   └── _target_: torchvision.models.resnet.resnet18
│       weights: null
│       progress: true
│       num_classes:
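
The comparison between the two runs above can be reduced to a single ratio. The following is a minimal sketch in plain Python and is not part of the template: `batches_per_second`, `dataloading_share`, and `simulated_loader` are hypothetical helpers, and both the loader and the training step are simulated with `time.sleep` rather than real data or a real model.

```python
import time
from typing import Callable, Iterable


def batches_per_second(batches: Iterable, step: Callable[[object], None]) -> float:
    """Consume `batches`, calling `step` on each, and return batches per second."""
    start = time.perf_counter()
    count = 0
    for batch in batches:
        step(batch)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0:
        raise ValueError("need at least one batch and a measurable duration")
    return count / elapsed


def dataloading_share(data_only_rate: float, full_rate: float) -> float:
    """Ratio of full-run throughput to data-only throughput, capped at 1.0."""
    return min(1.0, full_rate / data_only_rate)


def simulated_loader(num_batches: int, load_seconds: float):
    """Stand-in for a dataloader: each batch takes `load_seconds` to produce."""
    for index in range(num_batches):
        time.sleep(load_seconds)
        yield index


# Data-only run: the step does nothing, like the NoOp algorithm.
data_only = batches_per_second(simulated_loader(20, 0.002), lambda batch: None)
# Full run: the step also performs (simulated) training work.
full = batches_per_second(simulated_loader(20, 0.002), lambda batch: time.sleep(0.006))

print(f"data only: {data_only:.0f} batches/s, full: {full:.0f} batches/s")
print(f"share: {dataloading_share(data_only, full):.2f}")
```

A share near 1.0 means the full run is barely slower than loading alone, which points to data loading as the bottleneck; a low share points to the training step. With a real dataloader that prefetches in worker processes, loading and training overlap, so the ratio is only a rough indicator.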

## Testing for throughput across GPUs

Using the Mila Research Template, it is possible to sweep over different parameters for testing purposes.  
For example, suppose we want to find out how different GPUs perform relative to each other.  

[Mila's official documentation](https://docs.mila.quebec/Information.html) lists the GPUs installed on the cluster. Running `savail` on the command line shows their current availability.  
Benchmarking them can show which GPU types suit which training workloads.

In [5]:
!savail

GPU               Avail / Total 
2g.20gb              31 / 48 
3g.40gb               9 / 48 
4g.40gb               7 / 24 
a100                  8 / 16 
a100l                 0 / 72 
a6000                 0 / 8 
rtx8000              11 / 400 
v100                  2 / 40 


We can observe the following prominent GPU classes: a100, a100l, a6000, rtx8000, v100.  
Next, we request different GPU types for separate training runs and compare their throughput.

In [None]:
# Add an example of a sweep over different kinds of GPUs,
# with the training throughput as the metric
# (e.g. callbacks/samples_per_second, or add a DeviceStatsMonitor callback).
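
The throughput metric mentioned in the cell above amounts to counting processed samples and dividing by elapsed wall-clock time. The sketch below shows that calculation in plain Python. `ThroughputMeter` is a hypothetical class, not something the template or Lightning provides. One plausible wiring, assumed here and not shown, is to call `start()` from a Lightning `Callback.on_train_start` hook and `update()` from `on_train_batch_end`.

```python
import time
from typing import Callable, Optional


class ThroughputMeter:
    """Tracks how many samples were processed per second of wall-clock time."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock  # injectable so the meter can be tested deterministically
        self._start: Optional[float] = None
        self._samples = 0

    def start(self) -> None:
        """Reset the sample counter and start timing."""
        self._start = self._clock()
        self._samples = 0

    def update(self, batch_size: int) -> None:
        """Record that one batch of `batch_size` samples has finished."""
        if self._start is None:
            raise RuntimeError("call start() before update()")
        self._samples += batch_size

    def samples_per_second(self) -> float:
        """Average throughput since start(); 0.0 if no time has elapsed."""
        if self._start is None:
            return 0.0
        elapsed = self._clock() - self._start
        return self._samples / elapsed if elapsed > 0 else 0.0
```

Averaging over many batches, rather than timing each batch individually, keeps the number meaningful even though GPU work is launched asynchronously.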

The practical takeaway from such a comparison: if a lower-capacity GPU is readily available, training on it may be more time- and resource-efficient than waiting for a higher-capacity GPU to become available.
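
This trade-off can be made concrete by adding the expected queue wait to the expected training time. The sketch below does that arithmetic; `hours_to_result` is a hypothetical helper, and every number in it is invented for illustration rather than measured on the cluster.

```python
def hours_to_result(
    queue_wait_hours: float, total_samples: float, samples_per_second: float
) -> float:
    """Total wall-clock hours: time waiting for the allocation plus time training."""
    training_hours = total_samples / samples_per_second / 3600
    return queue_wait_hours + training_hours


# All numbers below are invented for illustration, not measurements.
total_samples = 36_000_000
slow_gpu = hours_to_result(
    queue_wait_hours=0.5, total_samples=total_samples, samples_per_second=1_000
)
fast_gpu = hours_to_result(
    queue_wait_hours=8.0, total_samples=total_samples, samples_per_second=2_000
)

print(f"slower GPU, short queue: {slow_gpu:.1f} h")  # 10.5 h
print(f"faster GPU, long queue:  {fast_gpu:.1f} h")  # 13.0 h
```

With these made-up inputs, the GPU that is half as fast still finishes first because its allocation starts sooner.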


### Logging with Weights & Biases (wandb)

The Mila Research Template integrates wandb as a logger option.  
This makes it possible to track additional metrics and build visualizations from them.  
We will now create a wandb report comparing throughput between GPUs.


In [86]:
# Create a wandb report with the throughput comparison
# between the different GPU types,
# i.e. specify wandb as the logger and log the throughput.
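
As a minimal sketch of the logging step, and not the template's own logger integration, the helper below sends one throughput value to any object exposing a wandb-style `log(dict)` method, such as the run returned by `wandb.init()`. `log_throughput`, the metric key, and `RecordingRun` are hypothetical names. The GPU type itself would typically be stored in the run's config, for example through the `config` argument of `wandb.init`, so that runs can be grouped by it in a report.

```python
from typing import Any, Dict, List, Protocol


class SupportsLog(Protocol):
    """Anything with a wandb-style `log(dict)` method, e.g. a wandb run."""

    def log(self, data: Dict[str, Any]) -> None: ...


def log_throughput(run: SupportsLog, samples_per_second: float) -> Dict[str, Any]:
    """Log one throughput measurement and return the payload that was sent."""
    payload = {"throughput/samples_per_second": samples_per_second}
    run.log(payload)
    return payload


class RecordingRun:
    """Stand-in run so this sketch executes without a wandb account."""

    def __init__(self) -> None:
        self.logged: List[Dict[str, Any]] = []

    def log(self, data: Dict[str, Any]) -> None:
        self.logged.append(data)


# Hypothetical value; in practice it would come from the training run.
run = RecordingRun()
log_throughput(run, samples_per_second=512.0)
print(run.logged)
```

Passing the run in as an argument keeps the helper independent of how the run was created, so the same call works with a real wandb run or with a test double.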

We would like to maximize throughput for a given GPU choice.

In [74]:
# Find the best datamodule parameters to maximize the throughput
# (batches per second) without training (NoOp algorithm)
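
One simple way to run such a search is an exhaustive grid over the candidate settings. The sketch below assumes the parameters of interest are `batch_size` and `num_workers` and that some function can measure throughput for a given pair. `best_datamodule_params` and `fake_measure` are hypothetical, and `fake_measure` returns made-up numbers instead of launching a real data-only run. The sketch compares samples per second rather than batches per second, because batch counts are not comparable once the batch size varies.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def best_datamodule_params(
    measure: Callable[[int, int], float],
    batch_sizes: Iterable[int],
    num_workers_options: Iterable[int],
) -> Tuple[Dict[str, int], float]:
    """Try every (batch_size, num_workers) pair and return the fastest one."""
    best_params: Dict[str, int] = {}
    best_rate = float("-inf")
    for batch_size, num_workers in product(batch_sizes, num_workers_options):
        rate = measure(batch_size, num_workers)  # samples per second
        if rate > best_rate:
            best_params = {"batch_size": batch_size, "num_workers": num_workers}
            best_rate = rate
    if not best_params:
        raise ValueError("no parameter combinations were provided")
    return best_params, best_rate


def fake_measure(batch_size: int, num_workers: int) -> float:
    """Made-up throughput model; replace with a real data-only benchmark run."""
    return min(batch_size * 10.0, num_workers * 400.0)


params, rate = best_datamodule_params(fake_measure, [32, 64, 128], [1, 2, 4])
print(params, rate)
```

In practice `measure` would wrap a short run of the `NoOp` algorithm with the given settings, and each measurement should be repeated a few times, since a single run can be noisy on a shared cluster.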

In [75]:
# Measure the performance on different GPUs using the optimal datamodule
# parameters from before (keeping all other parameters the same)

We will now sweep over model hyper-parameters to maximize the utilization of our selected GPU.

In [76]:
# Using the results from before, do a simple sweep over model hyper-parameters
# to maximize the utilization of the selected GPU (chosen as a trade-off
# between performance and how hard it is to get an allocation). For example, if the
# RTX8000s are 20% slower than the A100s but 5x easier to get an allocation on, use those instead.

### Additional resources

[GPU Training (Basic) - LightningAI](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html)  
[DeviceStatsMonitor class - LightningAI](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.DeviceStatsMonitor.html)  
[PyTorch Profiler + W&B integration - Weights & Biases](https://wandb.ai/wandb/trace/reports/Using-the-PyTorch-Profiler-with-W-B--Vmlldzo5MDE3NjU)