# Profiling your code

The Mila Research Template builds on functionality that ships with PyTorch and Lightning to make model profiling and benchmarking accessible and flexible.  
Make sure to read the [Mila Docs page on profiling](https://docs.mila.quebec/) before going through this example.

The Research Template profiling notebook extends the examples in the official documentation with additional tools, notably native WandB integration to monitor performance and Hydra multiruns to compare the GPUs available on the Mila cluster. The goal of this notebook is to introduce profiling, present tools that are useful for it, and provide general concepts and guidelines for optimizing your code within the Mila cluster ecosystem.


### Setup

In [1]:
import os
from pathlib import Path
# Set the working directory to the project root
notebook_path = Path().resolve()  
project_root = notebook_path.parent.parent
os.chdir(str(project_root))

## Introduction

As a deep learning researcher, training slow models when faster, optimized ones are within reach can greatly impact your research output. In addition, as a user of a shared cluster, using institutional resources efficiently benefits every user in the ecosystem. Given the wide variety of available resources and training schemes that achieve the same modeling objective, optimizing your code isn't necessarily a straightforward task.

While there are many costs involved in getting a model to train, some matter more than others when it comes to making your code more efficient. The very first step is to set a performance baseline: observe these costs and identify underperforming components in the code, while placing them in the context of the broader training scheme. Once a baseline is set, we can modify our code and compare its performance against the baseline to determine whether each optimization actually helps.

## Instrumenting your code

Instrumenting your code to monitor metrics of interest helps establish a cost baseline and reveals potential areas for improvement. Common metrics to watch include, but are not limited to:
 
- Training speed (samples/s)
- CPU/GPU utilization 
- RAM/VRAM utilization
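To make the first metric concrete, here is a minimal sketch of how training speed in samples/s can be measured by hand. It is independent of the template; the `ThroughputMeter` class and the `time.sleep` call standing in for a training step are illustrative assumptions, not part of the project.

```python
import time


class ThroughputMeter:
    """Accumulates processed samples and elapsed time to report samples/s."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the meter can be tested
        self._start = None
        self.samples = 0

    def start(self):
        self._start = self._clock()
        self.samples = 0

    def update(self, batch_size):
        # Call once per training step with the number of samples in the batch.
        self.samples += batch_size

    def samples_per_second(self):
        elapsed = self._clock() - self._start
        return self.samples / elapsed if elapsed > 0 else 0.0


meter = ThroughputMeter()
meter.start()
for _ in range(5):
    time.sleep(0.01)  # hypothetical stand-in for one training step
    meter.update(batch_size=32)
print(f"{meter.samples_per_second():.1f} samples/s")
```

In practice a callback does this bookkeeping for you; the sketch only shows what the reported number means.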

In the Mila Research Template, these metrics are tracked by passing callbacks to the trainer. Supported configs are found within the project template at `configs/trainer/callbacks`. Here, we will use the default callback config, which implements early stopping and tracks the learning rate, device utilization and throughput, each through a dedicated callback instance.

In [15]:
#%%capture
!python project/main.py \
    algorithm=no_op \
    datamodule=imagenet \
    trainer=profiling \
    trainer/callbacks=default

Creating schemas for Hydra config files...   0% 0/46
[09/16/24 16:09:10] INFO     Unable to properly create the schema for experiment/cluster_sweep_example.yaml last time. Trying again. (auto_schema.py:368)

### Logging metrics on WandB

In addition to callbacks, the Mila Research Template integrates WandB as a logger config, which enables tracking additional metrics through visualizations and dashboards. Given WandB's flexibility and widespread adoption as a logger, we'll use it for the remainder of this tutorial.

In [18]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    trainer/logger=wandb \
    trainer.logger.wandb.name="WandB logging test"

Creating schemas for Hydra config files...   0% 0/45
[09/16/24 16:31:23] INFO     Unable to properly create the schema for experiment/cluster_sweep_example.yaml last time. Trying again. (auto_schema.py:368)

We can now visualize the results of our run at `wandb_url`

## Identifying potential bottleneck sources 

Finding bottlenecks in your code is rarely straightforward from the start. A sensible first step is to determine whether potential slowdowns originate from data loading or from model computation. Running the loop with and without the training step and comparing the resulting throughput can tell us whether the main process stalls significantly while fetching the next batch. The difference in throughput between the two runs tells us the following about our model:

- If the throughput difference between the data-loading-only run and the training run is close to 0, then adding computation does not slow the loop down: model computation keeps up with data loading, and data loading is the bottleneck.
- If the data-loading-only run is much faster than the training run (a difference much greater than 0), then data loading outpaces model computation, and computation is the bottleneck.
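The reasoning can be sketched in a few lines: if the run with training is about as fast as the data-loading-only run, the loader sets the pace; if it is clearly slower, computation does. The function name, the 10% tolerance and the throughput figures below are illustrative assumptions, not measurements from this notebook.

```python
def diagnose_bottleneck(data_only_sps, with_training_sps, tolerance=0.1):
    """Compare samples/s of a data-loading-only run and a full training run.

    `tolerance` is the relative drop (assumed 10% here) below which the two
    runs are treated as equally fast.
    """
    if data_only_sps <= 0:
        raise ValueError("data_only_sps must be positive")
    relative_drop = (data_only_sps - with_training_sps) / data_only_sps
    if relative_drop <= tolerance:
        # Training keeps up with the loader, so the loader sets the pace.
        return "data loading"
    # Adding training slowed the run noticeably, so compute sets the pace.
    return "computation"


# Hypothetical throughputs, in samples/s
print(diagnose_bottleneck(data_only_sps=500.0, with_training_sps=490.0))
print(diagnose_bottleneck(data_only_sps=500.0, with_training_sps=400.0))
```

The right tolerance depends on how noisy your throughput measurements are, so treat the threshold as a starting point.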

To put this into practice, we will run two separate loops on ImageNet: the first does data loading without any training, and the second adds training.

In [21]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    trainer.logger.wandb.name="Dataloading only" \
    datamodule.num_workers=1

[09/16/24 16:44:50] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/16-44-50 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [24]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    datamodule.num_workers=4

[09/16/24 17:02:15] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/17-02-14 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [None]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    datamodule.num_workers=8

In [6]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    datamodule.num_workers=16

In [None]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    datamodule.num_workers=32

## Training models with GPUs

Advances in Graphics Processing Units (GPUs) are widely credited with enabling the deep learning revolution, chiefly through faster computation relative to CPUs. Since we can run both GPU and CPU workloads, let's compare their throughput. In most workflows, the speedup provided by a GPU is dramatic. For a few select workloads, however, particularly those with few steps or light computation requirements, a CPU that is only 1.5-2x slower than a GPU may be worth considering: CPUs are a far less contended resource on the cluster and pose far fewer availability issues.
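As a rough illustration of the 1.5-2x rule of thumb, the slowdown can be computed directly from the two measured throughputs. The function name and the throughput values are hypothetical; only the 2x cutoff comes from the text above.

```python
def cpu_is_worth_considering(gpu_sps, cpu_sps, max_slowdown=2.0):
    """Return (slowdown, verdict) for a CPU run relative to a GPU run.

    Both throughputs are in samples/s; `max_slowdown` is the largest
    acceptable ratio between GPU and CPU throughput.
    """
    if cpu_sps <= 0:
        raise ValueError("cpu_sps must be positive")
    slowdown = gpu_sps / cpu_sps
    return slowdown, slowdown <= max_slowdown


# Hypothetical measurements for a light and a heavy workload
for label, gpu_sps, cpu_sps in [("light", 300.0, 200.0), ("heavy", 3000.0, 150.0)]:
    slowdown, ok = cpu_is_worth_considering(gpu_sps, cpu_sps)
    print(f"{label}: CPU is {slowdown:.1f}x slower, worth considering: {ok}")
```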

In [None]:
%%capture
!python project/main.py \
    experiment=profiling_cpu \
    algorithm=example \
    trainer.logger.wandb.name="Dataloading + Training"

As seen on the dashboard, adding training to our run changes throughput by roughly 100 samples/s (this figure still needs to be checked against the dashboard). A difference of this size would indicate a computation bottleneck.

[Mila's official documentation](https://docs.mila.quebec/Information.html) has a comprehensive rundown of the GPUs installed on the cluster. Typing `savail` on the command line while logged into the cluster shows their current availability. Testing their capacity can yield insights into their suitability for different training workloads.  
As the Mila Research Template uses Hydra as its configuration manager, it supports [multi-runs](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) by default. This makes it possible to sweep over different parameters for profiling, throughput testing, or both. For example, suppose we want to find out how different GPUs perform relative to each other.

In [8]:
!savail

GPU               Avail / Total 
2g.20gb               6 / 48 
3g.40gb               3 / 48 
4g.40gb               1 / 24 
a100                  0 / 32 
a100l                 0 / 88 
a6000                 0 / 8 
rtx8000              14 / 408 
v100                  0 / 56 


We can observe the following prominent GPU classes:

- NVIDIA Tensor Core GPUs: A100, A100L, V100 (previous gen)
- NVIDIA RTX GPUs: A6000, RTX8000
- Multi-Instance GPU (MIG) partitions: 2g.20gb, 3g.40gb, 4g.40gb
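To pick candidate GPUs programmatically, the `savail` table can be parsed into a dictionary. This is a sketch that assumes the column layout shown in the output above (`name  avail / total`); the `parse_savail` helper is hypothetical and not part of the cluster tooling.

```python
SAVAIL_OUTPUT = """\
GPU               Avail / Total
2g.20gb               6 / 48
3g.40gb               3 / 48
4g.40gb               1 / 24
a100                  0 / 32
a100l                 0 / 88
a6000                 0 / 8
rtx8000              14 / 408
v100                  0 / 56
"""


def parse_savail(text):
    """Return {gpu_name: (available, total)} from `savail`-style output."""
    gpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: <name> <avail> / <total>; the header is skipped
        # because its second and fourth fields are not numbers.
        if (
            len(parts) == 4
            and parts[2] == "/"
            and parts[1].isdigit()
            and parts[3].isdigit()
        ):
            gpus[parts[0]] = (int(parts[1]), int(parts[3]))
    return gpus


availability = parse_savail(SAVAIL_OUTPUT)
free_now = [name for name, (avail, _) in availability.items() if avail > 0]
print(free_now)
```

On a live cluster the text would come from running `savail` rather than from a hard-coded string.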

We will now proceed to specify different GPUs over training runs and compare their throughput.

In [None]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    trainer.logger.wandb.name="RTX8000"

In [18]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:a100:1 \
    trainer.logger.wandb.name="A100"

[09/16/24 13:33:03] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/13-33-03 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [None]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:v100:1 \
    trainer.logger.wandb.name="V100"

The takeaway: if a lower-capacity GPU is readily available, training on it may be more time- and resource-efficient than waiting for a higher-capacity GPU to become available.
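The per-GPU runs above differ only in the GPU type and the run name, so the commands can be generated rather than copied. The sketch below rebuilds the same overrides used in the cells above; the `profiling_command` helper is an illustrative assumption, and the resulting lists could be passed to `subprocess.run` on the cluster.

```python
import shlex


def profiling_command(gpu, run_name=None):
    """Build the argument list for one single-GPU profiling run."""
    name = run_name or gpu.upper()
    return [
        "python",
        "project/main.py",
        "experiment=profiling_gpu",
        f"hydra.launcher.gres=gpu:{gpu}:1",
        f"trainer.logger.wandb.name={name}",
    ]


commands = [profiling_command(gpu) for gpu in ["rtx8000", "a100", "v100"]]
for command in commands:
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(command))
```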


## GPU utilization

How well are we using the GPU? Once we've done a few preliminary runs on the candidate GPUs, we can measure and optimize GPU utilization. We generally aim for high utilization (above roughly 80%).  
If utilization is lower than that, we can use the PyTorch profiler (or similar tools) to figure out where the bottleneck lies, and tune our parameters to increase it.
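A small helper can apply the 80% guideline to utilization readings collected during a run, for example values exported from the dashboard. The readings below are hypothetical and the `summarize_utilization` name is an assumption made for this sketch.

```python
from statistics import mean


def summarize_utilization(samples, threshold=80.0):
    """Return (mean utilization in %, whether it falls below `threshold`)."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    average = mean(samples)
    return average, average < threshold


# Hypothetical GPU utilization readings (in %) for two batch sizes
readings = {1: [12.0, 15.0, 9.0], 128: [91.0, 88.0, 94.0]}
for batch_size, samples in readings.items():
    average, is_low = summarize_utilization(samples)
    verdict = "profile further" if is_low else "looks healthy"
    print(f"batch_size={batch_size}: {average:.0f}% utilization, {verdict}")
```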

In [11]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    datamodule.batch_size=1

[09/16/24 11:50:49] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/11-50-48 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [12]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    datamodule.batch_size=8

[09/16/24 12:06:22] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/12-06-21 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [13]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    datamodule.batch_size=32

[09/16/24 12:20:27] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/12-20-26 (submitit_launcher.py:120)
                    INFO     #0 : ...

In [14]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    datamodule.batch_size=128

[09/16/24 12:35:51] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/12-35-50 (submitit_launcher.py:120)
                    INFO     #0 : ...

## Additional optimization

Once a GPU and a reasonable batch size have been chosen, more can be done to speed up a model's computation.

In [16]:
%%capture
!python project/main.py \
    experiment=profiling_gpu \
    datamodule.batch_size=32 \
    datamodule.num_workers=4  # value taken from the tests above (to be verified)

[09/16/24 13:05:32] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/13-05-32 (submitit_launcher.py:120)
                    INFO     #0 : ...

## What is a profiler and what is it good for?

The process above, while straightforward, was somewhat contrived. Would a bird's-eye view of our model's performance help when trying to optimize its parameters? It certainly wouldn't hurt. Enter the profiler.  
A profiler is a tool that allows you to measure the time and memory consumption of the model’s operators. Specifically, the PyTorch profiler output provides clues about operations relevant to model training. Examples include the total time spent on low-level mathematical operations on the GPU, and whether these are unexpectedly slow or take a disproportionate share of the time, indicating that they should be avoided or optimized. Identifying problematic operations can greatly help us validate or rethink our baseline model performance expectations.
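To see what a profiler reports before turning to model operators, here is a minimal sketch using Python's standard-library `cProfile` on a deliberately naive function. The workload is a hypothetical stand-in; it only illustrates how a profile attributes time to individual calls.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive loop so that it dominates the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    return slow_sum_of_squares(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and cumulative time, which is the same kind of per-operation breakdown a deep learning profiler gives for model operators.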

[Multiple](https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/) [profilers](https://github.com/plasma-umass/scalene) [exist](https://docs.python.org/3/library/profile.html). For the purposes of this example we'll use the default [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). 

In [18]:
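from torch.profiler import ProfilerActivity, profile

# Record CPU and CUDA activity, tensor shapes, memory usage and call stacks.
profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
profiler.start()
# The code to profile (e.g. a few training steps) goes here. Nothing runs in
# this cell, so the table below only shows the final device synchronization.
profiler.stop()
print(profiler.key_averages().table(sort_by="cpu_time_total", row_limit=10))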


-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    cudaDeviceSynchronize       100.00%      14.444us       100.00%      14.444us      14.444us           0 b           0 b             1  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.444us
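The table above lists only `cudaDeviceSynchronize` because nothing ran between `profiler.start()` and `profiler.stop()`. As a sketch of what a populated profile looks like, the example below wraps a forward and backward pass of a small, arbitrary model in the profiler's context manager. It records CPU activity only so that it runs without a GPU; the model, the input sizes and the `forward_backward` label are assumptions made for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# A small, arbitrary model and batch used purely as a workload.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
inputs = torch.randn(64, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    # record_function adds a named region to the profile.
    with record_function("forward_backward"):
        loss = model(inputs).sum()
        loss.backward()

table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

Adding `ProfilerActivity.CUDA` to `activities` on a GPU node should also report the time spent in CUDA kernels.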



### Additional resources

[GPU Training (Basic) - LightningAI](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html)  
[DeviceStatsMonitor class - LightningAI](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.DeviceStatsMonitor.html)  
[PyTorch Profiler + W&B integration - Weights & Biases](https://wandb.ai/wandb/trace/reports/Using-the-PyTorch-Profiler-with-W-B--Vmlldzo5MDE3NjU)   
[Advanced profiling for model optimization - Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)