# Profiling your code

The Mila Research Template leverages built-in PyTorch and Lightning functionality to make model profiling and benchmarking accessible and flexible.  
Make sure to read the Mila Docs page on [profiling](https://docs.mila.quebec/) before going through this example.

The research template profiling notebook extends the examples in the official documentation with additional tools, notably native WandB integration to monitor performance and Hydra multiruns to compare the GPUs available on the Mila cluster. The goal of this notebook is to introduce profiling, present tools that are useful for it, and provide general concepts and guidelines for optimizing your code within the Mila cluster ecosystem.


### Setup

In [1]:
import os
from pathlib import Path
# Set the working directory to the project root
notebook_path = Path().resolve()  
project_root = notebook_path.parent.parent
os.chdir(str(project_root))

## Introduction

As a deep learning researcher, training slow models when faster, optimized ones are within reach can greatly reduce your research output. In addition, as a user of a shared cluster, using institutional resources efficiently benefits every user in the ecosystem. Given the wide variety of available resources and training schemes that achieve the same modeling objective, optimizing your code isn't necessarily a straightforward task.

While there are many costs involved in getting a model to train, some are more relevant than others when it comes to making your code more efficient. The very first step in optimizing your code is setting a performance baseline: observe those costs, identify underperforming components, and place them in the context of the broader training scheme. Once a baseline is set, we can modify the code, measure it again, and determine whether each change is actually an improvement.

## Instrumenting your code

Instrumenting your code to monitor metrics of interest helps set a cost baseline and reveals potential areas for improvement. Common metrics to watch include, but are not limited to:
 
- Training speed (samples/s)
- CPU/GPU utilization 
- RAM/VRAM utilization
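As a rough illustration of the first metric, training speed is simply the number of samples processed divided by the elapsed wall-clock time. The sketch below is framework-agnostic; `ThroughputMeter` and its methods are hypothetical names used for illustration and are not part of the template or of Lightning.

```python
import time


class ThroughputMeter:
    """Accumulates processed samples and reports samples per second.

    Hypothetical helper for illustration only. The clock is injectable so the
    computation can be checked without waiting on real time.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = None
        self._samples = 0

    def start(self):
        # Reset the counter and remember when measurement began.
        self._samples = 0
        self._start = self._clock()

    def update(self, batch_size):
        # Call once per training step with the number of samples in the batch.
        self._samples += batch_size

    def samples_per_second(self):
        if self._start is None:
            raise RuntimeError("call start() before reading the throughput")
        elapsed = self._clock() - self._start
        return self._samples / elapsed if elapsed > 0 else 0.0


# Simulated training loop: three steps of 32 samples each.
meter = ThroughputMeter()
meter.start()
for _ in range(3):
    time.sleep(0.01)  # stand-in for the real training step
    meter.update(batch_size=32)
print(f"{meter.samples_per_second():.1f} samples/s")
```

In a real run, the first few steps are usually excluded from the measurement, since warm-up effects make them unrepresentative.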

In the Mila Research Template, this can be done by passing a callback to the trainer. Supported configs are found within the project template at `configs/trainer/callbacks`. Here, we will use the default callback config, which implements early stopping and tracks the learning rate, device utilization and throughput, each through a dedicated callback instance.

In [2]:
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    trainer=profiling \
    trainer/callbacks=default

[09/12/24 11:00:24] INFO     Config file config.yaml was modified, regenerating the schema.    auto_schema.py:362
Creating schemas for Hydra config files...   0% 0/44

### Optional: log metrics on wandb

In addition to callbacks, the Mila Research Template integrates wandb as a logger config, which lets you track additional metrics through visualizations and dashboards.

In [7]:
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    trainer=profiling \
    trainer/logger=wandb \
    trainer.logger.wandb.name="WandB logging test"

[09/12/24 11:20:50] INFO     Config file experiment/cluster_sweep_example.yaml was modified, regenerating the schema.    auto_schema.py:362
Creating schemas for Hydra config files...   0% 0/44
[09/12/24 11:22:43] INFO     Updated the yaml schemas in the vscode settings file at /home/mila/c/cesar.valdez/idt/ResearchTemplate/.vscode/settings.json.    auto_schema.py:522
                    INFO     Instantiated the config at    hydra_utils.py

We can now visualize the results of our run at `wandb_url`

## Identifying potential bottleneck sources 

Bottlenecks are rarely obvious from the start. A sensible first step is to determine whether slowdowns originate from data loading or from model computation. Running the same loop with and without training and comparing the measured throughput tells us whether the main process stalls significantly while fetching the next batch. The difference between the two throughputs can be read as follows:

- If the difference between the data-loading-only throughput and the training throughput is close to 0, then model computation keeps up with data loading, and data loading is the bottleneck.
- If the difference is much greater than 0, then data loading outpaces model computation, and computation is the bottleneck.

To illustrate this, we will run two separate loops on ImageNet: the first does data loading without any training, and the second adds training.

In [9]:
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    trainer=profiling \
    trainer/logger=wandb \
    trainer.logger.wandb.name="Dataloading only"

[09/12/24 11:33:46] INFO     Config file experiment/cluster_sweep_example.yaml was modified, regenerating the schema.    auto_schema.py:362
Creating schemas for Hydra config files...   0% 0/44

In [10]:
!python project/main.py \
    algorithm=example \
    datamodule=imagenet \
    trainer=profiling \
    trainer/logger=wandb \
    trainer.logger.wandb.name="Dataloading + Training"

[09/12/24 11:35:07] INFO     Unable to properly create the schema for experiment/cluster_sweep_example.yaml last time. Trying again.    auto_schema.py:368
Creating schemas for Hydra config files...   0% 0/44

As the two runs above show, adding training changes throughput by roughly 100 samples/s. This would indicate that we have a computation bottleneck.
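The comparison can be sketched as a small helper. This is a minimal illustration, not part of the template: the function name, the 10% tolerance and the throughput figures are all assumptions chosen for the example. It relies on the observation that, if adding training barely lowers throughput, the loop is already limited by how fast batches arrive.

```python
def diagnose_bottleneck(data_only_sps, with_training_sps, tolerance=0.1):
    """Classify the bottleneck from two throughput measurements (samples/s).

    data_only_sps:     throughput when only loading data (no training)
    with_training_sps: throughput when loading data and training
    tolerance:         relative drop below which the two runs are treated
                       as equal (assumed value, tune for your setup)
    """
    if data_only_sps <= 0:
        raise ValueError("data_only_sps must be positive")
    relative_drop = (data_only_sps - with_training_sps) / data_only_sps
    if relative_drop <= tolerance:
        # Training adds almost no slowdown: batches cannot arrive any faster.
        return "data loading"
    # Training slows the loop noticeably: the model cannot keep up with the data.
    return "computation"


# Hypothetical measurements, not results from the runs above.
print(diagnose_bottleneck(data_only_sps=500.0, with_training_sps=400.0))
print(diagnose_bottleneck(data_only_sps=500.0, with_training_sps=490.0))
```

A relative threshold is used rather than an absolute one because a gap of 100 samples/s means something very different at 200 samples/s than at 5000 samples/s.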

## Comparing throughput: GPU vs CPU model training

Advances in Graphics Processing Units (GPUs) are widely credited with enabling the deep learning revolution, largely because they compute much faster than CPUs. Since we can run both GPU and CPU workloads, let's compare their throughput. In most workflows, the speedup provided by a GPU is dramatic. For a few select workloads, however, particularly those with few steps or light computation requirements, a CPU that is only 1.5-2x slower than a GPU may be worth considering: CPUs are a far less contested resource on the cluster and pose far fewer availability issues.

In [11]:
# Compare speed using CPU only vs the slowest GPU available, for a low number of steps
!python project/main.py \
    algorithm=example \
    datamodule=imagenet \
    resources=cpu \
    trainer=profiling \
    trainer/logger=wandb \
    trainer.logger.wandb.name="CPU" \
    trainer.logger.wandb.group="GPU vs CPU"

[09/12/24 12:18:52] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-12/12-18-52    submitit_launcher.py:120
                    INFO             #0 :

In [15]:
!HYDRA_FULL_ERROR=1 python project/main.py \
    algorithm=example \
    datamodule=imagenet \
    trainer=profiling \
    resources=one_gpu \
    trainer/logger=wandb \
    trainer.logger.wandb.name="GPU" \
    trainer.logger.wandb.group="GPU vs CPU"

Traceback (most recent call last):
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/project/main.py", line 177, in <module>
    main()
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/mila/c/cesar.valdez/idt/ResearchTemplate/.venv/lib/python3.10

[Mila's official documentation](https://docs.mila.quebec/Information.html) has a comprehensive rundown of the GPUs installed on the cluster. Typing `savail` on the command line while logged into the cluster shows their current availability. Testing their capacity can yield insights into their suitability for different training workloads.
As the Mila Research Template uses Hydra as its configuration manager, it supports [Multi-runs](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) by default. This makes it possible to sweep over different parameters for profiling, for throughput testing, or for both. For example, suppose we wanted to figure out how different GPUs perform relative to each other.  

In [8]:
!savail

GPU               Avail / Total 
2g.20gb               3 / 48 
3g.40gb               1 / 48 
4g.40gb               1 / 24 
a100                  0 / 32 
a100l                 0 / 88 
a6000                 0 / 8 
rtx8000              13 / 360 
v100                  3 / 56 


We can observe the following prominent GPU classes:

- NVIDIA Tensor Core GPUs: A100, A100L, V100 (previous gen)
- NVIDIA RTX GPUs: A6000, RTX8000
- Multi-Instance GPU (MIG) partitions: 2g.20gb, 3g.40gb, 4g.40gb
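To decide which GPU types are worth including in a comparison, the `savail` output shown above can be parsed programmatically. The following is a minimal sketch that assumes the output keeps the two-column `Avail / Total` layout shown above; `parse_savail` and `available_gpus` are hypothetical helper names.

```python
import re

# Output copied from the `savail` cell above.
SAVAIL_OUTPUT = """\
GPU               Avail / Total
2g.20gb               3 / 48
3g.40gb               1 / 48
4g.40gb               1 / 24
a100                  0 / 32
a100l                 0 / 88
a6000                 0 / 8
rtx8000              13 / 360
v100                  3 / 56
"""

# A data row looks like "<name>  <avail> / <total>".
ROW = re.compile(r"^(\S+)\s+(\d+)\s*/\s*(\d+)\s*$")


def parse_savail(text):
    """Return {gpu_name: (available, total)} for every data row."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # the header row contains no counts and is skipped
            name, avail, total = match.groups()
            table[name] = (int(avail), int(total))
    return table


def available_gpus(table):
    """List GPU types with at least one free device, most available first."""
    free = [(name, avail) for name, (avail, _) in table.items() if avail > 0]
    return [name for name, _ in sorted(free, key=lambda item: -item[1])]


print(available_gpus(parse_savail(SAVAIL_OUTPUT)))
# → ['rtx8000', '2g.20gb', 'v100', '3g.40gb', '4g.40gb']
```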

We will now proceed to specify different GPUs over training runs and compare their throughput.

In [None]:
# What performance do you get with each type of GPU?
# Based on the VRAM requirements of the job (step 1), try every GPU type
# on the cluster that can accommodate this kind of job.

# TODO: add an example of a sweep over some parameters and over different
# kinds of GPUs, with the training throughput as the metric
# (callbacks/samples_per_second), or add a DeviceStatsMonitor.

# Example allocation:
# salloc --gres=gpu:a100:1 -c 6 --mem=32G -t 48:00:00 --partition=unkillable

!HYDRA_FULL_ERROR=1 python project/main.py \
    algorithm=example \
    datamodule=imagenet \
    trainer=profiling \
    resources=one_gpu \
    trainer/logger=wandb

In practice, if a GPU with lower maximum capacity is readily available, training on it may be more time- and resource-effective than waiting for a higher-capacity GPU to become available.
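This tradeoff can be made concrete with a back-of-the-envelope estimate: the time to a finished run is the time spent waiting for an allocation plus the time spent training. The sketch below is illustrative only; the function name, queue waits and throughput figures are hypothetical and should be replaced with your own measurements.

```python
def time_to_result_hours(queue_wait_hours, total_samples, samples_per_second):
    """Estimate hours until a run completes: queue wait plus training time."""
    if samples_per_second <= 0:
        raise ValueError("samples_per_second must be positive")
    training_hours = total_samples / samples_per_second / 3600
    return queue_wait_hours + training_hours


# Hypothetical scenario: 12.8M samples to process in total.
TOTAL_SAMPLES = 12_800_000

# A faster GPU that is hard to get vs. one that is 20% slower but readily available.
fast_gpu = time_to_result_hours(6.0, TOTAL_SAMPLES, samples_per_second=1000.0)
slow_gpu = time_to_result_hours(0.5, TOTAL_SAMPLES, samples_per_second=800.0)

print(f"faster GPU, long queue : {fast_gpu:.2f} h")
print(f"slower GPU, short queue: {slow_gpu:.2f} h")
```

With these made-up numbers the slower GPU finishes first, because the shorter queue wait outweighs the longer training time. For long runs the balance can shift the other way.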


## GPU utilization

How well are we using the GPU? Once we've selected the target GPU, we measure its utilization.

- If utilization is high (above 80%), we can either stop here or keep optimizing a bit further.
- If utilization is low, we can use the PyTorch profiler (or any other tool) to figure out what the bottleneck is.

It may also be worth looking at submitit's `array_parallelism` option.

We would like to maximize our throughput given our choice of GPU.
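One way to check utilization outside of the trainer callbacks is to poll `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the `PATH` of the compute node; the helper names are hypothetical, and the 80% threshold comes from the rule of thumb above. The query function is defined but not called here, so the example also runs on a machine without a GPU.

```python
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse utilization percentages, one integer per line.

    Non-numeric entries (for example "[N/A]", which some devices report)
    and blank lines are skipped.
    """
    values = []
    for line in csv_text.splitlines():
        field = line.strip()
        if field.isdigit():
            values.append(int(field))
    return values


def query_gpu_utilization():
    """Ask nvidia-smi for the current utilization of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


def is_well_utilized(readings, threshold=80.0):
    """Return True if the mean of the readings exceeds the threshold (percent)."""
    if not readings:
        raise ValueError("no utilization readings available")
    return sum(readings) / len(readings) > threshold


# Hypothetical readings collected over a few training steps.
print(is_well_utilized([87, 92, 90]))
print(is_well_utilized([35, 40, 38]))
```

A single reading is noisy, so it is better to average several samples taken while training is in a steady state.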

In [None]:
# Measure the performance on different GPUs using the optimal datamodule
# params from before (and keeping other parameters the same)

We will now sweep over model hyper-parameters to maximize the utilization of our selected GPU.

In [None]:
# Using the results from before, do a simple sweep over model hyper-parameters
# to maximize the utilization of the selected GPU (which was selected as a tradeoff
# between performance and difficulty to get an allocation). For example, if the
# RTX8000s are 20% slower than A100s but 5x easier to get an allocation on, use those instead.

## What is a profiler and what is it good for?

The process above was a bit contrived. With a profiler, we can zero in on specific parts of the training process.  
A profiler is a tool that measures the time and memory consumption of the model's operators. Specifically, the PyTorch profiler output provides clues about operations relevant to model training. Examples include the total amount of time spent on low-level mathematical operations on the GPU, and whether these are unexpectedly slow or take a disproportionate amount of time, which indicates they should be avoided or optimized. Identifying problematic operations can greatly help us validate or rethink our baseline model performance expectations.

[Multiple](https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/) [profilers](https://github.com/plasma-umass/scalene) [exist](https://docs.python.org/3/library/profile.html). For the purposes of this example we'll use the default [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). 

In [18]:
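from torch.profiler import ProfilerActivity, profile

# Record CPU and CUDA activity, including tensor shapes, memory usage and stack traces.
profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
profiler.start()
# The code to profile goes here. Nothing runs between start() and stop() in this
# cell, so the table only reports the final device synchronization.
profiler.stop()
print(profiler.key_averages().table(sort_by="cpu_time_total", row_limit=10))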


-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    cudaDeviceSynchronize       100.00%      14.444us       100.00%      14.444us      14.444us           0 b           0 b             1  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.444us
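For CPU-side code such as preprocessing, the standard-library profiler linked above follows the same pattern: enable it, run the workload, then rank functions by time spent. The sketch below uses `cProfile` and `pstats`; the `normalize` and `load_batches` functions are toy stand-ins invented for this example, not template code.

```python
import cProfile
import io
import pstats


def normalize(batch):
    """Toy CPU-bound preprocessing step (hypothetical stand-in for a transform)."""
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


def load_batches(num_batches, batch_size):
    """Simulate a data-loading loop that preprocesses every batch."""
    return [normalize(list(range(batch_size))) for _ in range(num_batches)]


# Profile only the workload between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
batches = load_batches(num_batches=50, batch_size=1_000)
profiler.disable()

# Rank functions by cumulative time and keep the top five entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

Functions that dominate the cumulative-time column are the first candidates for optimization or for moving into data-loader workers.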



### Additional resources

[GPU Training (Basic) - LightningAI](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html)  
[DeviceStatsMonitor class - LightningAI](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.DeviceStatsMonitor.html)  
[PyTorch Profiler + W&B integration - Weights & Biases](https://wandb.ai/wandb/trace/reports/Using-the-PyTorch-Profiler-with-W-B--Vmlldzo5MDE3NjU)   
[Advanced profiling for model optimization - Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)