# Profiling your code

### Prerequisites

Make sure to read the following sections of the documentation before going through this example:

- [PyTorch setup](../../frameworks/pytorch_setup/index.rst)
- [Checkpointing](../checkpointing/index.rst)
- [Multi-GPU training](../../distributed/multi_gpu/index.rst)

Figuring out if or where your code may be performing slower than it needs to can be complicated.
In the present minimal example, we'll go through a basic profiling procedure that'll tackle the following:

- Diagnosing whether training or dataloading is the bottleneck in your code
- Using the PyTorch profiler to find additional bottlenecks
- Potential avenues for further optimization with `torch.compile`, additional workers, multiple GPUs and related optimizations.

### Diagnosing a bottleneck: is it dataloading or training?

A simple way to tell if your bottleneck is coming from your dataloading procedure is to run the main script, ``main.py``, with and without training.  
The rationale: if an epoch run without training reaches a throughput similar to the one you obtain while training, then training is running at least as fast as dataloading, which makes dataloading the comparatively slow step.  
Take a minute to make sure this makes sense, then observe the two runs below.  
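
To make the idea concrete, here is a minimal sketch of such a comparison. It is not the actual `main.py`: `fake_loader`, `fake_train_step` and `run_epoch` are hypothetical names, and loading and training are simulated with `time.sleep`. The `skip_training` argument mirrors the `--skip-training` flag.

```python
import time


def fake_loader(n_batches, batch_size, load_time):
    """Hypothetical stand-in for a dataloader: each batch takes `load_time` seconds."""
    for _ in range(n_batches):
        time.sleep(load_time)
        yield list(range(batch_size))


def fake_train_step(batch, step_time):
    """Hypothetical stand-in for a forward/backward/optimizer step."""
    time.sleep(step_time)


def run_epoch(loader, skip_training, step_time=0.0):
    """Iterate once over `loader` and return the throughput in samples/s."""
    n_samples = 0
    start = time.perf_counter()
    for batch in loader:
        if not skip_training:
            fake_train_step(batch, step_time)
        n_samples += len(batch)
    return n_samples / (time.perf_counter() - start)


# Slow loading (20 ms per batch) and a fast training step (2 ms per batch).
load_only = run_epoch(fake_loader(5, 4, 0.02), skip_training=True)
with_training = run_epoch(fake_loader(5, 4, 0.02), skip_training=False, step_time=0.002)
print(f"loading only : {load_only:.1f} samples/s")
print(f"with training: {with_training:.1f} samples/s")
```

Because the simulated training step is much cheaper than loading a batch, the two throughputs end up close to each other, which is the signature of a dataloading bottleneck. Note that this toy loop runs loading and training back to back; a real dataloader with worker processes can overlap the two.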

In [9]:
!python main.py --n-samples=20 --epochs=1 --skip-training

[08/05/24 13:25:45] INFO: Setting up ImageNet
Train epoch 0: 100%|████████████████████| 1.00/1.00 [00:01<00:00, 1.20s/Samples]
[08/05/24 13:25:52] INFO: epoch 0:
samples/s: 14.8144, 
updates/s: 0.0000, 
val_loss: 50.1568, 
val_accuracy: 0.00%


In [10]:
!python main.py --n-samples=20 --epochs=1 

[08/05/24 13:25:58] INFO: Setting up ImageNet
Train epoch 0: 100%|█| 1.00/1.00 [00:01<00:00, 1.39s/Samples, accuracy=0, loss=7
[08/05/24 13:26:05] INFO: epoch 0:
samples/s: 12.8945, 
updates/s: 0.7164, 
val_loss: 17.2102, 
val_accuracy: 0.00%


Comparing the throughput of the previous two cells (about 14.8 samples/s without training versus 12.9 samples/s with it), we can conclude that dataloading is the bottleneck in our code. With all other parameters being equal, training seems to go at least as fast as dataloading, suggesting that our training loop could take advantage of a faster dataloading procedure.  
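
As a rough back-of-the-envelope check, the two reported throughputs can be turned into per-sample times. The sketch below assumes that loading and training run back to back so that their times simply add up, which may not hold exactly for this script, and a 20-sample run is noisy, so treat the result as an order of magnitude only.

```python
# Throughputs reported by the two runs above (samples/s).
loading_only = 14.8144
with_training = 12.8945

# Time spent per sample in each run (seconds).
t_loading = 1 / loading_only
t_total = 1 / with_training

# Assumption: loading and training are sequential, so t_total = t_loading + t_training.
t_training = t_total - t_loading
loading_share = t_loading / t_total

print(f"dataloading  : {t_loading * 1e3:.1f} ms/sample")
print(f"training adds: {t_training * 1e3:.1f} ms/sample")
print(f"dataloading share of the epoch: {loading_share:.0%}")
```

Under that assumption, dataloading accounts for roughly 87% of the time spent per sample, which is why it is the first thing to optimize.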

Are there any other bottlenecks present? Can we further optimize our code?  
Let's take a more in-depth look with the PyTorch profiler.

### Using the PyTorch profiler

The previous comparison was performed manually and was rather straightforward, since we already had a notion of where to look. In practice, bottlenecks might not be as easy to identify. Having a broader view of the time spent in each of the model's operators can be very helpful in this pursuit. Luckily for us, PyTorch provides a way to do this through its [official profiler](https://pytorch.org/tutorials/beginner/profiler.html).

In this section, we'll use the PyTorch profiler to identify additional potential bottlenecks in our code.
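
For reference, the profiler can also be driven directly from Python. What `--pytorch-profiling` does inside `main.py` is not shown here, so the following is only a minimal sketch of `torch.profiler` on a hypothetical tiny model and random batch, restricted to CPU activity.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Hypothetical tiny model and batch, only here to give the profiler something to record.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
inputs = torch.randn(32, 128)

# Record CPU activity for everything executed inside the context manager.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("forward_pass"):  # custom label attached to this region
        model(inputs)

# Aggregate events by operator name and show the ten with the largest self CPU time.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

The printed table has the same kind of columns (`Name`, `Self CPU %`, `Self CPU`, ...) as the output of the cells below. On a GPU node, `ProfilerActivity.CUDA` can be added to `activities` to record kernel times as well.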

In [21]:
## Basic profiler setup
!python main.py --n-samples=20 --epochs=1 --skip-training --pytorch-profiling

[08/05/24 14:41:48] INFO: Setting up ImageNet
Train epoch 0:   0%|                           | 0.00/1.00 [00:00<?, ?Samples/s]STAGE:2024-08-05 14:41:53 1916965:1916965 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Train epoch 0: 100%|████████████████████| 1.00/1.00 [00:01<00:00, 1.34s/Samples]
[08/05/24 14:41:55] INFO: epoch 0:
samples/s: 13.3756, 
updates/s: 0.0000, 
val_loss: 32.6367, 
val_accuracy: 0.00%
STAGE:2024-08-05 14:41:55 1916965:1916965 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-05 14:41:55 1916965:1916965 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     C

A-ha! [Component]'s utilization seems off. Let's introduce a quick fix.

In [2]:
## Fix to last bottleneck, e.g. increase workers and see throughput go up
!python main.py --n-samples=20 --epochs=1  --num-workers=8 --skip-training  --pytorch-profiling

[08/05/24 14:49:03] INFO: Setting up ImageNet
Train epoch 0:   0%|                           | 0.00/1.00 [00:00<?, ?Samples/s]STAGE:2024-08-05 14:49:08 1939506:1939506 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Train epoch 0: 100%|████████████████████| 1.00/1.00 [00:00<00:00, 1.05Samples/s]
[08/05/24 14:49:10] INFO: epoch 0:
samples/s: 18.4885, 
updates/s: 0.0000, 
val_loss: 42.7367, 
val_accuracy: 0.00%
STAGE:2024-08-05 14:49:10 1939506:1939506 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-05 14:49:10 1939506:1939506 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     C

See? With 8 workers, throughput went from about 13.4 to about 18.5 samples/s, and the profiler outputs now differ noticeably. Can we do any better?
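
The `--num-workers` flag presumably ends up as the `num_workers` argument of a `torch.utils.data.DataLoader`; that mapping is an assumption, since `main.py` is not shown here. The sketch below uses hypothetical helpers, `build_loader` and `samples_per_second`, and random tensors instead of ImageNet, to show how dataloading throughput could be measured for a given number of workers.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(num_workers, n_samples=256, batch_size=32):
    """Build a DataLoader over random tensors (hypothetical stand-in for ImageNet)."""
    images = torch.randn(n_samples, 3, 32, 32)
    labels = torch.randint(0, 10, (n_samples,))
    return DataLoader(
        TensorDataset(images, labels),
        batch_size=batch_size,
        num_workers=num_workers,  # 0 means batches are loaded in the main process
    )


def samples_per_second(loader):
    """Iterate once over the loader, without training, and return samples/s."""
    n_samples = 0
    start = time.perf_counter()
    for images, _labels in loader:
        n_samples += images.shape[0]
    return n_samples / (time.perf_counter() - start)


if __name__ == "__main__":
    print(f"num_workers=0: {samples_per_second(build_loader(0)):.0f} samples/s")
```

Calling `build_loader(8)` instead gives the equivalent of `--num-workers=8`. With small in-memory tensors like these, extra workers may not help or may even be slower because of process start-up costs; the gain shows up when loading each sample is expensive, for example when decoding and augmenting images.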

### WIP

 Show how the output of the profiler changes once this last bottleneck is fixed. Give hints as to how to keep identifying the next bottleneck, and potential avenues for further optimization (for example using something like torch.compile, or more workers, multiple GPUs, etc.)


In [None]:
## More code changes, potential avenues for improvement.

In [None]:
## Throughput with training
Take a look at https://docs.mila.quebec/examples/good_practices/launch_many_jobs/index.html

!srun --pty --gpus=1 --cpus-per-task=8 --mem=16G job.sh --epochs=1 --n-samples=20

### Additional resources


[PyTorch Recipes: PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)  
[PyTorch profiler with TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html)  
[PyTorch End-To-End profiling](https://www.kaggle.com/code/wkaisertexas/pytorch-end-to-end-profiling)