# Profiling Megatron-LM training
---

## Learning Objectives

The goal of this lab is to profile Megatron-LM GPT training runs under varying training configurations in order to verify GPU performance across multi-GPU or multi-node workloads.


**Motivation**: Why should we care about profiling?

The estimated time-to-compute that we worked through in `Lab1-2_EstimateComputeDaysNeeded.ipynb` assumes that the training run achieves good GPU performance across multi-GPU or multi-node jobs. A poor training configuration can result in low or inconsistent GPU utilization, which in turn can prolong the training run.

In this notebook, we will cover the following: 

    1. Intro to NVIDIA profiling toolchain
    2. Run profiling to record training runs - naive vs. improved runs
  
A challenge will be presented at the end of this notebook: you are tasked with beating the profile of the improved run.

The knowledge gained from `Lab1-2_EstimateComputeDaysNeeded.ipynb` and the profiling lecture presentations will help you formulate a training configuration strategy that produces a winning profile.

Note: TAs and the NVIDIA profiling expert will be available while you work through this notebook. Do reach out to them if you have questions.

---

1. Intro to NVIDIA profiling toolchain :

<center><img src="./Megatron-LM/pics/NVprofilingToolchain.JPG" width="800"/></center>

Note: An NVIDIA profiling expert will give an introduction to NVIDIA profiling in the lecture presentation.

The Profiling Workflow :

Profiling is an iterative process. We record the profiling run, then visualize and analyze the profile in order to find areas for improvement to act upon.

<center><img src="./Megatron-LM/pics/profiling_workflow.JPG" width="700"/></center>



To properly analyze a profile obtained from a real training run, we first need to understand how Megatron-LM launches the training job.

            ------------ Open two terminals as illustrated below ------------------------
<center><img src="./Megatron-LM/pics/Alt_callout2terminals.JPG" width="600"/></center>


To monitor a profiling run live, watch the [profiling video](https://youtu.be/bnN8ZohiZSI) below. It demonstrates how to open and arrange two windows within JupyterLab. The left window launches the profiling training run and prints the Megatron-LM training launch procedure. The right window runs nvidia-smi to monitor GPU performance live. The video also shows how to open the profile saved from the training run and visualize it with the Nsight UI.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('bnN8ZohiZSI', width=600, height=1000)
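
The nvidia-smi monitoring shown in the right-hand window can also be sampled from Python. The following is a minimal sketch, not part of the lab material: the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration, and the sample text is invented so that the parser can be tried on a machine without a GPU.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY = "index,utilization.gpu,memory.used"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Invented sample in the format nvidia-smi prints (one line per GPU).
sample = "0, 35, 10240\n1, 32, 10112\n"
print(parse_gpu_stats(sample))
```

Calling `read_gpu_stats()` in a loop while the training cell runs gives a rough, text-only view of the utilization that the Nsight timeline shows in detail.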

Reference documents : 

[How to install Nsight](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-4-1)

[Nsight User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)

<center><img src="./Megatron-LM/pics/multigpu_naive_run.jpg" width="1000"/></center>


Install the nvtx library. Note that NVTX tags have already been added to this repo for your convenience.

In [None]:
!pip install nvtx
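
For readers who have not used NVTX before, the sketch below shows the general idea of tagging code regions so that they appear as named ranges on the Nsight timeline. It is an illustration only and not the tagging used in this repo: `nvtx_range` and `train_step` are hypothetical names, and the fallback to a no-op context manager is an added convenience so the cell still runs if `nvtx` is not installed.

```python
from contextlib import nullcontext

try:
    import nvtx  # provided by `pip install nvtx`
except ImportError:
    nvtx = None  # fall back to no-op ranges so the sketch still runs


def nvtx_range(name, color="blue"):
    """Return a context manager marking a named NVTX range (no-op without nvtx)."""
    if nvtx is None:
        return nullcontext()
    return nvtx.annotate(message=name, color=color)


def train_step(batch):
    # Hypothetical stand-in for one training iteration.
    with nvtx_range("forward", color="green"):
        loss = sum(batch) / len(batch)
    with nvtx_range("backward", color="red"):
        grad = [2 * x for x in batch]
    return loss, grad


print(train_step([1.0, 2.0, 3.0]))
```

When a script annotated this way is run under nsys with NVTX tracing enabled, each `with` block shows up as a labeled range, which makes it easier to tell setup, training, and checkpointing apart in the profile.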

For profiling purposes, we clean the following folders after each profiling run to ensure that training always starts from scratch.

In [None]:
!rm -fr ../sv_ckpt/*
!rm -fr ../dataset/EN/*.npy
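
The same cleanup can be done from Python, which is handy if you want to script repeated profiling runs. This is a rough equivalent of the `rm -fr` commands above under the assumption that a glob pattern is enough to select what to delete; the helper name `clean` is made up for this sketch, and the demonstration runs in a temporary directory rather than on the lab folders.

```python
import glob
import os
import shutil
import tempfile


def clean(pattern):
    """Delete every file or directory matching a glob pattern; return what was removed."""
    removed = []
    for path in glob.glob(pattern):
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)   # directories, e.g. per-iteration checkpoint folders
        else:
            os.remove(path)       # plain files and symlinks
        removed.append(path)
    return sorted(removed)


# Demonstration in a scratch directory so no lab data is touched.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("a.npy", "b.npy", "keep.txt"):
        open(os.path.join(scratch, name), "w").close()
    removed = clean(os.path.join(scratch, "*.npy"))
    print(len(removed), sorted(os.listdir(scratch)))
```

In the notebook, the corresponding calls would be `clean("../sv_ckpt/*")` and `clean("../dataset/EN/*.npy")`.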

After the lecture with the NVIDIA profiling champion, we are now ready to profile our very first Megatron-LM training job.

We start by profiling a naive run with a default configuration.

Note: the following were obtained from previous labs :

CHECKPOINT_PATH='../sv_ckpt/' ## path to save the checkpoint of the training run

DATA_PATH='../dataset/EN/NVblog_text_document' ## obtained from `Lab1-1` and `Lab1-5`

VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json' ## obtained from `Lab1-4`

MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt' ## obtained from `Lab1-4`

PROFILE_OUTPUT_PATH='../profiles/naive/nsys_naive' ## path to save the profiles of this training run



To invoke a profiling session, prefix the normal Megatron-LM training launch script with the nsys command and its options:

<center><img src="./Megatron-LM/pics/evoke_nsys_profiling.JPG" width="1000"/></center>
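
The prefixing pattern can be sketched in Python as follows. This is an illustration under assumptions, not a copy of `profile_naive_run.sh`: the helper `build_nsys_command` is hypothetical, the launch command is a placeholder, and the script in this repo may pass different or additional nsys options. The three options used here (`--trace`, `--output`, `--force-overwrite`) are standard `nsys profile` options.

```python
import shlex


def build_nsys_command(output_path, launch_cmd, trace=("cuda", "nvtx")):
    """Prefix a training launch command with an `nsys profile` invocation."""
    nsys = [
        "nsys", "profile",
        f"--trace={','.join(trace)}",  # which APIs to trace
        f"--output={output_path}",     # report path; nsys appends the extension
        "--force-overwrite=true",      # replace a report left by an earlier run
    ]
    return nsys + list(launch_cmd)


# Placeholder launch command; the real one lives in profile_naive_run.sh.
launch = ["python", "pretrain_gpt.py", "--save", "../sv_ckpt/"]
cmd = build_nsys_command("../profiles/naive/nsys_naive", launch)
print(shlex.join(cmd))
```

Keeping the nsys prefix separate from the launch command makes it easy to run the same training configuration with and without profiling.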


To examine the bash script for the naive profiling run, click [open profile_naive_run.sh](./Megatron-LM/profile_naive_run.sh).

The following code block launches the naive profiling training run.

In [None]:
!bash ./Megatron-LM/profile_naive_run.sh


Below is an example of the output from a successful profiling run:

        [after training is done] datetime: 2021-09-15 10:17:46 
        ------------------------------------------------------------------------------------------------------------------
         validation loss at the end of training for val data | lm loss value: 8.895156E+00 | lm loss PPL: 7.296543E+03 | 
        ------------------------------------------------------------------------------------------------------------------
        saving checkpoint at iteration      12 to ../sv_ckpt/
          successfully saved checkpoint at iteration      12 to ../sv_ckpt/
        *****************************************
        Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
        *****************************************
        Processing events...
        Capturing symbol files...
        Saving temporary "/tmp/nsys-report-4642-8c23-394b-8c2e.qdstrm" file to disk...
        Creating final output files...

        Processing [==============================================================100%]
        Saved report file to "/tmp/nsys-report-4642-8c23-394b-8c2e.qdrep"
        Report file moved to "/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/../profiles/naive/nsys_naive.qdrep" 
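
The last line of the sample output tells you where the report ended up. If you capture the cell output as text, a small helper can pull that path out for you. The sketch below assumes the message format shown in the sample output above; `find_report_path` is a hypothetical name and the path in the demo log is shortened for illustration.

```python
import re

# Matches the final nsys message, e.g.: Report file moved to "/path/to/report.qdrep"
REPORT_RE = re.compile(r'Report file moved to "([^"]+)"')


def find_report_path(log_text):
    """Return the final report path printed by nsys, or None if it is absent."""
    match = REPORT_RE.search(log_text)
    return match.group(1) if match else None


# Shortened, illustrative log text in the same format as the sample output.
sample_log = (
    'Saved report file to "/tmp/nsys-report-demo.qdrep"\n'
    'Report file moved to "/tmp/demo/nsys_naive.qdrep"\n'
)
print(find_report_path(sample_log))
```

A `None` result is a quick hint that the run ended before nsys wrote its report.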
        
 


---

To visualize a profile, open it in the Nsight UI. Below is an example of the naive run profile visualized with the Nsight UI:


Observe that during the training phase, GPU utilization is very low (the light-blue bars).
<center><img src="./Megatron-LM/pics/GPUs_naive_run.JPG" width="1000"/></center>

Below is a ReRun cell for experimenting with different training configurations in order to obtain different training profiles.

Before each re-run, clear the checkpoint directory by running the code block below.
<a id="Rerun_Cell"></a>

In [None]:
!rm -fr ../sv_ckpt/*

To view or modify profile_2nd_run.sh, click [open profile_2nd_run.sh](./Megatron-LM/profile_2nd_run.sh).

After viewing or modifying it, run the cell below to obtain a new profile.

In [None]:
!bash ./Megatron-LM/profile_2nd_run.sh

Below is an example of the output from a successful profiling run:

        > finished creating GPT datasets ...
        [after dataloaders are built] datetime: 2021-09-16 19:19:01 
        done with setup ...
        time (ms) | model-and-optimizer-setup: 772.93 | train/valid/test-data-iterators-setup: 1032.39
        training ...
        [after training is done] datetime: 2021-09-16 19:19:01 
        ------------------------------------------------------------------------------------------------------------------
         validation loss at the end of training for val data | lm loss value: 1.126569E+01 | lm loss PPL: 7.809596E+04 | 
        ------------------------------------------------------------------------------------------------------------------
        Processing events...
        Capturing symbol files...
        Saving temporary "/tmp/nsys-report-3aa1-f1a6-09c2-c853.qdstrm" file to disk...
        Creating final output files...

        Processing [==============================================================100%]
        Saved report file to "/tmp/nsys-report-3aa1-f1a6-09c2-c853.qdrep"
        Report file moved to "/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/../profiles/2ndrun/nsys_improved.qdrep"



---

The improved profiling run output file visualized with Nsight UI:

Observe that during the training phase, GPU utilization is higher and more consistent (as shown in the light-blue bars).

<center><img src="./Megatron-LM/pics/2ndrun.JPG" width="1000"/></center>


<a id="TheChallenge"></a>

----------------

## **The Challenge** - get the best-looking profile


Constraints: 

        - Use the given number of available GPUs (2 x A100 40GB GPUs)
        - Only modify the parameters in the **modifiable section**
        - Avoid OOM (out-of-memory) errors
        - The training run must finish and the checkpoint must be saved successfully
Task: 
      Given the above constraints, achieve a good-looking profile.
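
Before comparing profiles, it can help to confirm that a run met the completion constraints. The sketch below checks captured log text for the markers that appear in the sample outputs earlier in this notebook. The helper name `check_run` is hypothetical, and the string `CUDA out of memory` is assumed to be the message PyTorch prints on an OOM error; adjust it if your log shows something different.

```python
def check_run(log_text):
    """Check a training log against the challenge's completion constraints."""
    return {
        # Marker printed by Megatron-LM when the training loop ends.
        "training_finished": "[after training is done]" in log_text,
        # Marker printed after the checkpoint has been written.
        "checkpoint_saved": "successfully saved checkpoint" in log_text,
        # Assumed OOM message; absence suggests the run stayed within GPU memory.
        "no_oom": "CUDA out of memory" not in log_text,
    }


# Illustrative log text assembled from lines like those in the sample outputs.
good_log = (
    "[after training is done] datetime: 2021-09-15 10:17:46\n"
    "  successfully saved checkpoint at iteration      12 to ../sv_ckpt/\n"
)
result = check_run(good_log)
print(result, all(result.values()))
```

A run only counts toward the challenge when every entry is `True`; the profile itself still needs to be inspected in the Nsight UI.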
      
The winning profile visualized on Nsight UI should look similar to the following : 

Observe that GPU utilization is consistently above 90% (as shown in the **light-blue** bars) throughout the **training** phase (as shown in the **dark-blue** bar).
      
<center><img src="./Megatron-LM/pics/GoodLookingProfile.JPG" width="1000"/></center>

Jump back to modify the [profiling bash script](./Megatron-LM/profile_2nd_run.sh) and rerun 
<a href="./Lab1-6_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell">GO to ReRun Cell</a> 



--- 
## Links and Resources
Don't forget to check out additional resources such as [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/index.html), [NVTX Tutorial](https://developer.nvidia.com/blog/nvidia-tools-extension-api-nvtx-annotation-tool-for-profiling-code-in-python-and-c-c/) and [Nsight Systems](https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/).


-----
## <p style="text-align:center;border:3px; padding: 1em"> <a href=../Start_Here.ipynb>HOME</a></p>

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 