# Profiling your code

The Mila Research Template leverages built-in PyTorch and Lightning functionality to make model profiling and benchmarking accessible and flexible.  
Make sure to read the Mila Docs page on profiling before going through this example:  
[PLACEHOLDER - Profiling](https://docs.mila.quebec/).

The research template profiling notebook extends the examples in the official documentation with additional tools: notably, native WandB integration to monitor performance, and Hydra multiruns to compare the GPUs available on the Mila cluster. The goal of this notebook is to introduce profiling, present tools that are useful for it, and provide general concepts and guidelines for optimizing your code within the Mila cluster ecosystem.


## Setup

In [3]:
import os
from pathlib import Path
# Set the working directory to the project root
notebook_path = Path().resolve()  
project_root = notebook_path.parent.parent
os.chdir(str(project_root))

## Introduction

As a deep learning researcher, training slow models when faster, optimized ones are possible can greatly reduce your research output. In addition, as a user of a shared cluster, using institutional resources efficiently benefits every user in the ecosystem. Given the wide variety of available resources and training schemes that achieve the same modeling objective, optimizing your code isn't necessarily a straightforward task.

While there are many costs involved in getting a model to train, some are more relevant than others when it comes to making your code more efficient. The very first step to optimizing your code is to set a performance baseline: observe these costs, identify the underperforming components, and put them in the context of the broader training scheme. Once a baseline is set, we can modify the code and compare its performance against the baseline to determine whether an optimization actually helps. A profiler can help us in this endeavor.

## What is a profiler and what is it good for?

A profiler is a tool that allows you to measure the time and memory consumption of the model’s operators. Specifically, the PyTorch profiler output provides clues about operations relevant to model training. Examples include the total amount of time spent on low-level mathematical operations on the GPU, and whether these are unexpectedly slow or take a disproportionate share of the time, which indicates they should be avoided or optimized. Identifying problematic operations can greatly help us validate or rethink our baseline model performance expectations.
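As a rough illustration of what such output looks like, the sketch below profiles a single forward pass with `torch.profiler`. The model, the input shape and the `forward_pass` label are stand-ins chosen for this example only; they are not part of the research template.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Stand-in model and input, chosen only for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
inputs = torch.randn(64, 512)

# Always profile CPU activity; add CUDA activity when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with record_function("forward_pass"):  # custom label that appears in the report
        with torch.no_grad():
            model(inputs)

# Aggregate the recorded events per operator and list the most expensive ones.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

Each row of the printed table is one operator (for example `aten::addmm`) with its time and memory columns, so operators that dominate the run are the first candidates to investigate.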


## Setting baseline model performance expectations 

In [None]:
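# Model size in memory (parameters + buffers)
# https://discuss.pytorch.org/t/finding-model-size/130275/2
import torch

# Stand-in model for illustration only; replace it with your own nn.Module.
model = torch.nn.Linear(1024, 1024)

# Bytes used by the learnable parameters
param_size = sum(p.nelement() * p.element_size() for p in model.parameters())
# Bytes used by the non-learnable buffers (e.g. BatchNorm running statistics)
buffer_size = sum(b.nelement() * b.element_size() for b in model.buffers())

size_all_mb = (param_size + buffer_size) / 1024**2
print(f"model size: {size_all_mb:.3f}MB")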


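Model Bandwidth Utilization (MBU) relates the memory traffic generated by reading the model's parameters to the peak memory bandwidth of the hardware:

$$
\text{MBU} =
\frac{\# \text{ Params} \cdot \text{bytes per param} \cdot \text{tokens per second}}{\text{Memory Bandwidth}}
$$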
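The formula above can be evaluated directly. The sketch below is a minimal worked example; the function name and all the numbers (parameter count, precision, token rate, peak bandwidth) are hypothetical and should be replaced by your own measurements and your GPU's specification.

```python
def model_bandwidth_utilization(
    num_params: float,
    bytes_per_param: float,
    tokens_per_second: float,
    memory_bandwidth_bytes_per_s: float,
) -> float:
    """Fraction of the peak memory bandwidth used to stream the parameters."""
    achieved_bytes_per_s = num_params * bytes_per_param * tokens_per_second
    return achieved_bytes_per_s / memory_bandwidth_bytes_per_s


# Hypothetical values: 7e9 parameters stored in 16 bits (2 bytes each),
# 50 tokens per second, and a peak memory bandwidth of 1.5e12 bytes per second.
mbu = model_bandwidth_utilization(7e9, 2, 50, 1.5e12)
print(f"MBU: {mbu:.1%}")  # prints "MBU: 46.7%"
```

A value far below 1 suggests that the memory system is not the limiting factor, or that the measured token rate leaves room for improvement.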


## Identifying potential bottleneck sources 

Finding a bottleneck is not necessarily straightforward or clear from the start. A sensible first step is to determine whether a potential slowdown originates from data loading or from model computation. One way to do this is to measure how long the main process stalls while fetching the next batch. 
If the stall is close to 0, then data loading outpaces compute, and compute is the bottleneck. 
If the stall is much greater than 0, then compute outpaces data loading, and data loading is the bottleneck. 
You might not care about the CPU usage of the main process and the data loader workers, so long as the GPU remains fully utilized. 
Nonetheless, a profiler may record it anyway. How do you find the relevant information? Here are a few ideas.
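One simple way to estimate the stall described above is to time the batch fetch separately from the training step. The sketch below uses only the standard library; `measure_stall`, `slow_loader` and the no-op step are hypothetical helpers written for this example, not functions from the template.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stall(batches: Iterable, step_fn: Callable) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in step_fn)."""
    wait_time, compute_time = 0.0, 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # blocks until the loader yields a batch
        except StopIteration:
            break
        fetched = time.perf_counter()
        step_fn(batch)
        done = time.perf_counter()
        wait_time += fetched - start
        compute_time += done - fetched
    return wait_time, compute_time


def slow_loader(n_batches: int, delay: float) -> Iterator[int]:
    """Hypothetical loader that needs `delay` seconds to produce each batch."""
    for i in range(n_batches):
        time.sleep(delay)
        yield i


# A slow loader combined with a no-op step: almost all time is spent waiting for data.
wait, compute = measure_stall(slow_loader(5, 0.01), step_fn=lambda batch: None)
print(f"data wait: {wait:.3f}s, compute: {compute:.3f}s")
```

When the step runs on a GPU, CUDA kernels are launched asynchronously, so `step_fn` should end with `torch.cuda.synchronize()`; otherwise compute time can be wrongly attributed to the next batch fetch.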


In [6]:
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    ++hydra=profiling_multirun \
    ++trainer.max_epochs=1 \
    ++trainer.limit_train_batches=30 \
    ++trainer.limit_val_batches=30


In [None]:
!python project/main.py \
    algorithm=example \
    datamodule=imagenet \
    ++logger=wandb \
    ++trainer.max_epochs=1 \
    ++trainer.limit_train_batches=30 \
    ++trainer.limit_val_batches=30

## Testing for throughput across GPUs

As the Mila Research Template uses Hydra as its configuration manager, it supports [Multi-runs](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) by default. This makes it possible to sweep over different parameters for profiling, for throughput testing, or for both. For example, suppose we want to find out how different GPUs perform relative to each other.  
[Mila's official documentation](https://docs.mila.quebec/Information.html) has a comprehensive rundown of the GPUs that are installed on the cluster. Typing `savail` on the command line, when logged into the cluster, shows their current availability. Testing their capacity can yield insights into their suitability for different training cases.

In [1]:
!savail

GPU               Avail / Total 
2g.20gb              12 / 48 
3g.40gb               0 / 48 
4g.40gb               0 / 24 
a100                  0 / 32 
a100l                 0 / 88 
a6000                 0 / 8 
rtx8000               8 / 408 
v100                  2 / 56 


As these jobs run on the cluster, they can be submitted to SLURM through Hydra's [Submitit launcher plugin](https://hydra.cc/docs/plugins/submitit_launcher/).

We can observe the following prominent GPU classes: a100, a100l, a6000, rtx8000 and v100, as well as MIG partitions with sizes 2g.20gb, 3g.40gb and 4g.40gb.  
We will now proceed to specify different GPUs over training runs and compare their throughput.

In [None]:
# TODO: add an example of a sweep over some parameters and over different
# kinds of GPUs, with the training throughput as the metric
# (callbacks/samples_per_second), or add a DeviceStatsMonitor callback.

## salloc --gres=gpu:a100:1 -c 6 --mem=32G -t 48:00:00 --partition=unkillable

When interpreting such a comparison, keep in mind that if a GPU with lower peak performance is readily available, training on it may be more time- and resource-efficient than waiting for a higher-capacity GPU to become available.
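This tradeoff can be made concrete with a little arithmetic: the time to a result is the time spent waiting in the queue plus the training time at the measured throughput. In the sketch below, the GPU names come from the `savail` listing, but the queue waits, throughputs and workload size are invented for illustration only.

```python
def hours_to_result(queue_wait_hours: float, total_samples: float, samples_per_second: float) -> float:
    """Queue wait plus training time, in hours."""
    return queue_wait_hours + total_samples / samples_per_second / 3600


total_samples = 1e8  # hypothetical size of the training workload

# Hypothetical measurements: (queue wait in hours, throughput in samples per second).
candidates = {
    "a100": (10.0, 1000.0),
    "rtx8000": (0.5, 800.0),  # 20% slower, but allocated much sooner
}

totals = {
    gpu: hours_to_result(wait, total_samples, throughput)
    for gpu, (wait, throughput) in candidates.items()
}
best = min(totals, key=totals.get)

for gpu, hours in totals.items():
    print(f"{gpu}: {hours:.1f} h")
print(f"fastest overall: {best}")
```

With these made-up numbers the slower GPU finishes first, because its shorter queue wait outweighs its lower throughput; with a larger workload or a shorter queue for the faster GPU, the conclusion flips.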


### Logging with Weights & Biases (wandb)

The Mila Research Template integrates wandb as a logger option.   
This makes it possible to track additional metrics and create accompanying visualizations.  
We will now create a wandb report comparing throughput between GPUs. 


In [None]:
# Create a wandb report with the throughput comparison
# between the different GPU types,
# i.e. specify wandb as the logger and log the throughput.
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    ++logger=wandb \
    ++trainer.max_epochs=1 \
    ++trainer.limit_train_batches=30 \
    ++trainer.limit_val_batches=30 \
    hydra=profiling_multirun

We would like to maximize throughput for the chosen GPU.

In [None]:
# TODO: find the best datamodule parameters to maximize the throughput
# (batches per second) without training (no_op algorithm).

In [None]:
# TODO: measure the performance on different GPUs using the optimal datamodule
# parameters from before (keeping all other parameters the same).

We will now sweep over model hyper-parameters to maximize the utilization of our selected GPU.

In [None]:
# TODO: using the results from before, do a simple sweep over model hyper-parameters
# to maximize the utilization of the selected GPU (which was selected as a tradeoff
# between performance and difficulty of getting an allocation). For example, if the
# RTX8000s are 20% slower than the A100s but 5x easier to get an allocation on, use those instead.

## Additional resources

[GPU Training (Basic) - LightningAI](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html)  
[DeviceStatsMonitor class - LightningAI](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.DeviceStatsMonitor.html)  
[PyTorch Profiler + W&B integration - Weights & Biases](https://wandb.ai/wandb/trace/reports/Using-the-PyTorch-Profiler-with-W-B--Vmlldzo5MDE3NjU)   
[Advanced profiling for model optimization - Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)