# Profiling your code

The Mila Research Template leverages built-in PyTorch and Lightning functionality to make model profiling and benchmarking accessible and flexible.  
Make sure to read the Mila Docs page on [PLACEHOLDER - profiling](https://docs.mila.quebec/) before going through this example. 

The Research Template's profiling notebook extends the examples in the official documentation with additional tools, notably native WandB integration for monitoring performance and Hydra multiruns for comparing the GPUs available on the Mila cluster. The goal of this notebook is to introduce profiling, present tools that are useful for it, and provide general concepts and guidelines for optimizing your code within the Mila cluster ecosystem.


### Setup

In [1]:
import os
from pathlib import Path
# Set the working directory to the project root
notebook_path = Path().resolve()  
project_root = notebook_path.parent.parent
os.chdir(str(project_root))

## Introduction

For a deep learning researcher, training slow models when faster, optimized alternatives are within reach can greatly limit research output. In addition, on a shared cluster, using institutional resources efficiently benefits every user in the ecosystem. Given the wide variety of available resources and training schemes that achieve the same modeling objective, optimizing your code isn't necessarily a straightforward task.

While there are many costs involved in getting a model to train, some matter more than others when it comes to making your code more efficient. The very first step is setting a performance baseline: observing these costs and identifying underperforming components in the code, while keeping them in the context of the broader training scheme. Once a baseline is set, we can modify our code and compare its performance against that baseline to determine whether each optimization is actually an improvement.

## Instrumenting your code

Setting up instrumentation within your code to monitor metrics of interest helps establish a cost baseline and reveal potential areas for improvement. Common metrics to watch include, but are not limited to:
 
- Training speed (samples/s)
- CPU/GPU utilization 
- RAM/VRAM utilization

In the Mila Research Template, this can be done by passing a callback to the trainer. Supported configs are found within the project template at `configs/trainer/callbacks`. Throughout this tutorial, we will use the default callback config, which implements early stopping and tracks the learning rate, device utilization and throughput, each through a dedicated callback instance.
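To make the throughput metric concrete, here is a minimal sketch of how a samples/s figure can be computed using only the standard library. `ThroughputMeter` is a hypothetical helper written for this illustration; it is not the template's callback, and the sleep call merely stands in for a training step.

```python
import time


class ThroughputMeter:
    """Hypothetical helper: accumulates sample counts and reports samples/s."""

    def __init__(self):
        self.num_samples = 0
        self.elapsed = 0.0
        self._start = None

    def start_batch(self):
        # Record the wall-clock time at which the batch begins.
        self._start = time.perf_counter()

    def end_batch(self, batch_size):
        # Add the batch duration and its sample count to the running totals.
        if self._start is None:
            raise RuntimeError("start_batch() must be called before end_batch()")
        self.elapsed += time.perf_counter() - self._start
        self.num_samples += batch_size
        self._start = None

    @property
    def samples_per_second(self):
        return self.num_samples / self.elapsed if self.elapsed > 0 else 0.0


meter = ThroughputMeter()
for _ in range(5):
    meter.start_batch()
    time.sleep(0.01)  # stand-in for one training step
    meter.end_batch(batch_size=32)

print(f"{meter.samples_per_second:.1f} samples/s")
```

A callback-based implementation would do the same bookkeeping from the trainer's batch start and batch end hooks.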

### Running a baseline and logging metrics on WandB

In addition to callbacks, the Mila Research Template integrates WandB as a logger, which enables tracking additional metrics through visualizations and dashboards. Given the flexibility and widespread adoption of WandB, we'll use it for the remainder of this tutorial; the logged runs can be viewed at `wandb_url` as supporting information for the experiments contained herein. We will now establish a baseline to profile, diagnose and optimize throughout this example.

In [None]:
%%capture
!python project/main.py \
    experiment=profiling \
    trainer.logger.wandb.name="1 RTX8000 GPU 1 CPU Training - ResNet-18 - ImageNet" \
    trainer.logger.wandb.tags=["Training baseline"] 

## Identifying potential bottleneck sources 

Finding bottlenecks in your code is not necessarily straightforward from the start. A sensible first step is to determine whether potential slowdowns originate from data loading or from model computation. Running the same configuration once with data loading only (no training) and once with training, then contrasting the measured throughput, tells us whether the main process stalls significantly when fetching the next batch. Since the training run can never be faster than the data-loading-only run, the difference between the two tells us the following:

- If the difference between data loading throughput and training throughput is close to 0, then training proceeds only as fast as batches can be loaded, and data loading is the bottleneck. 
- If data loading throughput is much greater than training throughput, then the data loading procedure outpaces model computation, and computation is the bottleneck. 
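The comparison can be sketched as a small function. A training run can only consume batches as fast as the loader produces them, so a training throughput close to the data-loading-only throughput points at the loader, while a large gap points at the model. The function name and the 10% `tolerance` are assumptions chosen for illustration, and the numbers are not cluster measurements.

```python
def diagnose_bottleneck(dataloading_throughput, training_throughput, tolerance=0.1):
    """Compare a data-loading-only run against a full training run.

    Both throughputs are in samples/s. `tolerance` is an assumed threshold:
    the relative gap below which the two runs count as "about the same".
    """
    if dataloading_throughput <= 0 or training_throughput <= 0:
        raise ValueError("throughputs must be positive")
    # Relative gap between what the loader can deliver and what training achieves.
    gap = (dataloading_throughput - training_throughput) / dataloading_throughput
    if gap <= tolerance:
        # Training runs about as fast as batches can be produced.
        return "data loading"
    # The loader could feed the model faster than the model consumes batches.
    return "computation"


# Illustrative numbers only.
print(diagnose_bottleneck(dataloading_throughput=1000.0, training_throughput=950.0))
print(diagnose_bottleneck(dataloading_throughput=4000.0, training_throughput=950.0))
```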

We will now run a series of experiments to identify potential bottlenecks, varying the number of workers involved in data loading and the number of CPUs assigned per task.

In [6]:
%%capture
!python project/main.py \
    experiment=profiling \
    algorithm=no_op \
    resources=cpu \
    trainer.logger.wandb.tags=["1 CPU Dataloading"] \
    hydra.launcher.cpus_per_task=1 \
    datamodule.num_workers=1,4,8,16,32

In [7]:
%%capture
!python project/main.py \
    experiment=profiling \
    algorithm=no_op \
    resources=cpu \
    trainer.logger.wandb.tags=["2 CPU Dataloading"] \
    hydra.launcher.cpus_per_task=2 \
    datamodule.num_workers=1,4,8,16,32

In [8]:
%%capture
!python project/main.py \
    experiment=profiling \
    algorithm=no_op \
    resources=cpu \
    trainer.logger.wandb.tags=["3 CPU Dataloading"] \
    hydra.launcher.cpus_per_task=3 \
    datamodule.num_workers=1,4,8,16,32

In [9]:
%%capture
!python project/main.py \
    experiment=profiling \
    algorithm=no_op \
    resources=cpu \
    trainer.logger.wandb.tags=["4 CPU Dataloading"] \
    hydra.launcher.cpus_per_task=4 \
    datamodule.num_workers=1,4,8,16,32

Once we've determined the number of workers and CPUs that maximizes dataloading throughput, we can train a model like our baseline with these parameters, then compare throughput to determine whether there was a sizeable increase.
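One way to pick the configuration from the sweep results is to take the cheapest one whose throughput is within a small margin of the best. The sketch below assumes the throughputs have already been read off the WandB dashboard into a dictionary; `pick_config`, the 5% `slack` and all numbers are made up for illustration.

```python
def pick_config(results, slack=0.05):
    """Return the cheapest (cpus, workers) pair within `slack` of the best throughput."""
    best = max(results.values())
    good_enough = [cfg for cfg, thr in results.items() if thr >= (1 - slack) * best]
    # Tuples compare element-wise: prefer fewer CPUs first, then fewer workers.
    return min(good_enough)


# Made-up throughputs (samples/s) keyed by (cpus_per_task, num_workers).
sweep = {
    (1, 1): 310.0,
    (1, 4): 420.0,
    (2, 4): 790.0,
    (2, 8): 805.0,
    (4, 8): 1490.0,
    (4, 16): 1510.0,
    (4, 32): 1500.0,
}
print(pick_config(sweep))  # (4, 8)
```

Preferring the smallest adequate configuration keeps the resource request modest, which matters on a shared cluster.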

In [None]:
#%%capture ----- RUN WITH OPTIMAL PARAMETERS ONCE DETERMINED, LOCALLY -----
!python project/main.py \
    experiment=profiling \
    trainer.logger.wandb.name="1 RTX8000 GPU 1 CPU Training - ResNet-18 - ImageNet" \
    trainer.logger.wandb.tags=["Optimized"] 

## The advantages of training models with GPUs

Advances in Graphics Processing Units (GPUs) are widely credited with enabling the deep learning revolution, particularly through faster computation relative to CPUs. Given that we can run both GPU and CPU workloads, let's compare their throughput. In most workflows, the speedup provided by a GPU is dramatic. For a few select workloads, however, particularly those with few steps or light computation requirements, a CPU may be worth considering if it is only 1.5-2x slower than a GPU: CPUs are a far less contested resource on the cluster and pose far fewer availability issues.  
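The trade-off between raw speed and availability can be put into numbers with a rough time-to-result estimate: the time spent waiting in the queue plus the time spent computing. The function below is a sketch under that simple assumption, and every figure in it is made up for illustration.

```python
def time_to_result(num_samples, throughput, queue_wait_s=0.0):
    """Estimated wall-clock seconds: time spent queued plus time spent computing."""
    return queue_wait_s + num_samples / throughput


# Made-up numbers for a light workload: the CPU is 2x slower but starts immediately.
gpu = time_to_result(num_samples=600_000, throughput=2000.0, queue_wait_s=900.0)
cpu = time_to_result(num_samples=600_000, throughput=1000.0, queue_wait_s=0.0)
print(f"GPU: {gpu:.0f} s, CPU: {cpu:.0f} s")  # GPU: 1200 s, CPU: 600 s
```

For long training runs the compute term dominates and the GPU wins easily; the CPU only comes out ahead when the job is short relative to the expected wait.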

In this section, we'll train a model analogous to our ImageNet baseline entirely on the CPU. We will also train a smaller fully connected network on MNIST, a smaller dataset than ImageNet, once with and once without a GPU, to compare and contrast the differences in throughput.  

In [5]:
%%capture
# NOTE: set the optimized number of CPUs and workers before running.
!python project/main.py \
    experiment=profiling \
    resources=cpu \
    trainer.logger.wandb.name="1 CPU Training - ResNet-18 - ImageNet" \
    trainer.logger.wandb.tags=["CPU Training"] 

In [10]:
%%capture
# NOTE: add hydra.launcher.cpus_per_task and datamodule.num_workers overrides
# with the optimized values before running.
!python project/main.py \
    network=fcnet \
    datamodule=mnist \
    experiment=profiling \
    hydra.launcher.gres='gpu:rtx8000:1' \
    trainer.logger.wandb.name="1 RTX8000 GPU 1 CPU Training - FcNet - MNIST" \
    'trainer.logger.wandb.tags=["GPU Training","MNIST"]'

[09/19/24 15:49:07] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-19/15-49-06    submitit_launcher.py:120
INFO     #0 : network=fcnet

In [11]:
%%capture
# NOTE: set the optimized number of CPUs and workers before running.
!python project/main.py \
    network=fcnet \
    datamodule=mnist \
    experiment=profiling \
    resources=cpu \
    trainer.logger.wandb.name="1 CPU Training - FcNet - MNIST" \
    'trainer.logger.wandb.tags=["CPU Training","MNIST"]'

## Throughput across GPU types

The results above make a solid case for using GPUs for model training. GPUs themselves also vary in throughput; some are more powerful than others. [Mila's official documentation](https://docs.mila.quebec/Information.html) has a comprehensive rundown of the GPUs installed on the cluster. Typing `savail` on the command line when logged into the cluster shows their current availability. Testing their capacity can yield insights into their suitability for different training workloads. Let's see what's available on the Mila cluster.

In [8]:
!savail

GPU               Avail / Total 
2g.20gb               6 / 48 
3g.40gb               3 / 48 
4g.40gb               1 / 24 
a100                  0 / 32 
a100l                 0 / 88 
a6000                 0 / 8 
rtx8000              14 / 408 
v100                  0 / 56 


We can observe the following prominent GPU classes:

- NVIDIA Tensor Core GPUs: A100, A100L, V100 (previous gen)
- NVIDIA RTX GPUs: A6000, RTX8000
- Multi-Instance GPU (MIG) partitions: 2g.20gb, 3g.40gb, 4g.40gb  
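When scripting experiments, it can be handy to turn this table into a dictionary, for instance to skip GPU types that have no free devices. The sketch below parses the output shown above; it assumes the `savail` output format stays as displayed here, and `parse_savail` is a hypothetical helper rather than part of any cluster tooling.

```python
def parse_savail(text):
    """Map each GPU name to (available, total) from `savail`-style output."""
    gpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: <name> <avail> / <total>; the header row is skipped
        # because its second and fourth fields are not numbers.
        if len(parts) == 4 and parts[2] == "/" and parts[1].isdigit() and parts[3].isdigit():
            gpus[parts[0]] = (int(parts[1]), int(parts[3]))
    return gpus


output = """\
GPU               Avail / Total
2g.20gb               6 / 48
3g.40gb               3 / 48
4g.40gb               1 / 24
a100                  0 / 32
a100l                 0 / 88
a6000                 0 / 8
rtx8000              14 / 408
v100                  0 / 56
"""
available = {name: avail for name, (avail, _) in parse_savail(output).items() if avail > 0}
print(available)  # {'2g.20gb': 6, '3g.40gb': 3, '4g.40gb': 1, 'rtx8000': 14}
```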

As the Mila Research Template uses Hydra as its configuration manager, it supports [Multi-runs](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) by default. This makes it possible to request particular GPU resources for a given run, to sweep over different parameters for profiling or throughput testing, or both.  
For example, suppose we wanted to find out how different GPUs perform relative to each other.  
We can do this by specifying a different GPU for each training run and comparing their throughput. 

In [18]:
%%capture
# NOTE: add datamodule.num_workers=<optimal value determined above> before running.
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:a100:1 \
    trainer.logger.wandb.name="A100 GPU X CPU X Num_workers - ResNet-18 - ImageNet" \
    'trainer.logger.wandb.tags=["GPU Training","ImageNet"]'

[09/16/24 13:33:03] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/13-33-03    submitit_launcher.py:120
INFO     #0 :

In [None]:
%%capture
# NOTE: add datamodule.num_workers=<optimal value determined above> before running.
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:v100:1 \
    trainer.logger.wandb.name="V100 GPU X CPU X Num_workers - ResNet-18 - ImageNet" \
    'trainer.logger.wandb.tags=["GPU Training","ImageNet"]'

## Maximizing GPU efficiency and utilization

While there is a clear difference in throughput between GPU types, if a GPU with lower maximum capacity is readily available, training on it may be more time- and resource-effective than waiting for a higher-capacity GPU to become available. Optimizing for a lower-capacity GPU may be sufficient for your use case. How well is a given GPU being utilized? Once we've done a few preliminary runs with the candidate GPU configurations, GPU utilization can be measured and optimized.  
We generally aim for high GPU utilization (>80%). If it's low (<80%), we can use the PyTorch profiler (or similar tools) to figure out where the bottleneck lies, and tune our parameters further to increase utilization.  
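As a small illustration of the 80% rule of thumb, the sketch below summarizes a series of utilization samples against that target. The samples are made up, standing in for values read off a device-utilization chart, and `utilization_report` is a hypothetical helper.

```python
from statistics import mean


def utilization_report(samples, target=80.0):
    """Summarize GPU utilization samples (in percent) against a target."""
    avg = mean(samples)
    # Fraction of samples that fall below the target.
    below = sum(1 for s in samples if s < target) / len(samples)
    status = "ok" if avg >= target else "investigate with a profiler"
    return f"mean={avg:.1f}% below_target={below:.0%} -> {status}"


# Made-up samples: mostly idle with one busy spike.
print(utilization_report([35.0, 42.0, 90.0, 38.0, 41.0]))
```

Looking at the fraction of samples below target as well as the mean helps distinguish a steadily underused GPU from one that alternates between busy and stalled, the latter often being a sign of a data loading stall.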

In [11]:
##%%capture ------ PLACEHOLDER: OPTIMAL PARAMETERS FROM SECTION 3 REQUIRED BEFORE RUNNING ------
!python project/main.py \
    experiment=profiling_gpu \
    hydra.launcher.gres=gpu:rtx8000:1 \
    datamodule.batch_size=1,8,32,64,128

[09/16/24 11:50:49] INFO     Submitit 'slurm' sweep output dir : logs/default/multiruns/2024-09-16/11-50-48    submitit_launcher.py:120
INFO     #0 :

## What is a profiler and what is it good for?

The process above, while straightforward, was a bit contrived. Would a bird's-eye view of our model's performance help when trying to optimize its parameters? It certainly wouldn't hurt. Enter the profiler.  
A profiler is a tool that measures the time and memory consumption of the model's operators. Specifically, the PyTorch profiler output provides clues about operations relevant to model training. Examples include the total time spent on low-level mathematical operations on the GPU, and whether these are unexpectedly slow or take a disproportionate amount of time, indicating they should be avoided or optimized. Identifying problematic operations can greatly help us validate or rethink our baseline performance expectations.

[Multiple](https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/) [profilers](https://github.com/plasma-umass/scalene) [exist](https://docs.python.org/3/library/profile.html). For the purposes of this example we'll use the default [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). 

In [18]:
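from torch.profiler import ProfilerActivity, profile

# Record both CPU and CUDA activity, along with tensor shapes, memory usage and stack traces.
profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
profiler.start()
# The code to profile goes here. Nothing runs between start() and stop() in this
# minimal example, so the table only shows the profiler's own synchronization call.
profiler.stop()
# Show the ten most expensive operations by total CPU time.
print(profiler.key_averages().table(sort_by="cpu_time_total", row_limit=10))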


-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    cudaDeviceSynchronize       100.00%      14.444us       100.00%      14.444us      14.444us           0 b           0 b             1  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.444us
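To see more informative output, the profiler needs some work to observe. The sketch below wraps a single forward and backward pass of a toy model; the model, tensor sizes and the `training_step` label are arbitrary choices made for this illustration, and only CPU activity is recorded so that it also runs on a machine without a GPU.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function

# Toy model and batch; sizes are arbitrary and chosen only to keep the example fast.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(64, 128)
targets = torch.randint(0, 10, (64,))
loss_fn = nn.CrossEntropyLoss()

# Profile CPU activity only, recording input shapes for each operator.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("training_step"):  # custom label that appears in the table
        loss = loss_fn(model(inputs), targets)
        loss.backward()

# Show the ten most expensive operations by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

On a GPU node, adding `ProfilerActivity.CUDA` to `activities` and moving the model and tensors to the device would also report time spent in CUDA kernels.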



### Additional resources

[GPU Training (Basic) - LightningAI](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html)  
[DeviceStatsMonitor class - LightningAI](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.DeviceStatsMonitor.html)  
[PyTorch Profiler + W&B integration - Weights & Biases](https://wandb.ai/wandb/trace/reports/Using-the-PyTorch-Profiler-with-W-B--Vmlldzo5MDE3NjU)   
[Advanced profiling for model optimization - Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)