
# 5 Monitor GPT training performance with varying config
---

## Learning Objectives
- **The goal of this lab is to monitor the performance of your training runs under different GPT training configurations.**
    - Motivation: why should we care?

      A bad configuration results in very low or inconsistent GPU utilization, which slows down training and makes every experiment take longer. It is a lose-lose situation on all sides.
      ![see example](./Megatron-LM/pics/naive_run.JPG)

    - Intro to profiling
    - Run profiling scripts
    - Example: naive run vs. improved run
        - starts with multiple GPUs
    - Exercise: beat the record!

With a well-chosen configuration, it is possible to reach more than 90% overall GPU utilization on all GPUs, with high Tensor Core activity sustained throughout the **training** phase.


----------------------------------------------------------
### Intro to profiling

#### NVIDIA Profiling ToolChain
<center><img src="./Megatron-LM/pics/NVprofilingToolchain.JPG" width="800"/></center>


----------------------------------------------------------
### The Profiling Workflow

<center><img src="./Megatron-LM/pics/profiling_workflow.JPG" width="700"/></center>



----------------------------------------------------------
### Understanding Megatron training launches

Call out a terminal and run `watch -n 1 nvidia-smi` to monitor the GPUs during training.
<center><img src="./Megatron-LM/pics/Alt_callout2terminals.JPG" width="600"/></center>
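The `watch -n 1 nvidia-smi` view can also be sampled from Python when you want to log utilization instead of just watching it. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_csv` and `query_gpus` and the sample numbers are made up for illustration.

```python
import subprocess

# Fields requested from nvidia-smi via its --query-gpu option.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem_used, mem_total = (f.strip() for f in line.split(","))
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return rows


def query_gpus():
    """Take one snapshot of all visible GPUs (needs nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Offline example with made-up numbers, so the parser can be tried without a GPU.
sample = "0, 97, 30512, 40536\n1, 12, 30490, 40536\n"
snapshot = parse_gpu_csv(sample)
for gpu in snapshot:
    print(f"GPU {gpu['index']}: {gpu['util_pct']}% util, "
          f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB")
```

On a machine with GPUs, calling `query_gpus()` once per second in a loop gives roughly the same information as the `watch` command, in a form you can store and compare between runs.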


Launch profiling sessions to record profiles, then visualize them in Nsight Systems (please use Nsight Systems version >= 2021.3.1).
<center><img src="./Megatron-LM/pics/multigpu_naive_run.jpg" width="1000"/></center>



---
### Install nvtx for annotation

In [1]:
!pip install nvtx

Defaulting to user installation because normal site-packages is not writeable
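Once `nvtx` is installed, code regions can be wrapped in named ranges that appear on the NVTX row of the Nsight Systems timeline. The sketch below only illustrates the idea: `forward_backward` and `train` are hypothetical stand-ins, not the Megatron-LM training loop, and the `ImportError` fallback is there only so the example runs where `nvtx` is missing.

```python
import contextlib

try:
    import nvtx  # installed above with `pip install nvtx`
    annotate = nvtx.annotate
except ImportError:
    # Fallback so the sketch still runs without nvtx: a no-op range.
    def annotate(message=None, color=None):
        return contextlib.nullcontext()


def forward_backward(batch):
    """Hypothetical stand-in for the compute of one training step."""
    return sum(x * x for x in batch)


def train(num_iters=3):
    losses = []
    for it in range(num_iters):
        # Each iteration becomes a named range; nested ranges show up stacked.
        with annotate(message=f"iteration_{it}", color="green"):
            with annotate(message="forward_backward", color="blue"):
                losses.append(forward_backward([1.0, 2.0, 3.0]))
    return losses


losses = train()
print(losses)
```

The ranges are only recorded when the script runs under a profiler that traces NVTX; in a plain Python run they cost very little and produce no output.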


---
### Let's first verify that training works properly
Modify the configuration in the script to match the number of GPUs available to you.

The training output should look similar to the following:


In [9]:
!bash ./Megatron-LM/verify_GPT3_Svenska.sh

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
using world size: 4, data-parallel-size: 4, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 4
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attent

In [11]:
!ls ./Megatron-LM/sv_ckpt/

---
## Make sure the checkpoint directory from the previous run is empty
Otherwise the model will not train, because the saved checkpoint has already reached the specified `--train-samples` / `--train-iters`.

In [10]:
!rm -fr ./Megatron-LM/sv_ckpt/*
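If you prefer to do this cleanup from Python, for example to see what was removed, a small helper along the following lines could be used. This is a sketch: `clear_checkpoint_dir` is a hypothetical helper, and the file names in the demonstration are made up and live in a temporary directory rather than in `./Megatron-LM/sv_ckpt/`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_checkpoint_dir(ckpt_dir):
    """Delete everything inside ckpt_dir but keep the directory itself.

    Returns the names of the removed entries. Unlike the shell glob `*`,
    this also removes hidden entries.
    """
    ckpt_dir = Path(ckpt_dir)
    removed = []
    if not ckpt_dir.is_dir():
        return removed
    for entry in sorted(ckpt_dir.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


# Demonstration on a throwaway directory with a made-up checkpoint layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "iter_0000012").mkdir()
    (demo / "iter_0000012" / "model.pt").write_text("dummy")
    (demo / "latest_checkpointed_iteration.txt").write_text("12")
    removed = clear_checkpoint_dir(demo)
    left_over = list(demo.iterdir())

print("removed:", removed)
```

In the lab itself you would call `clear_checkpoint_dir("./Megatron-LM/sv_ckpt/")` before each re-run.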

----------------------------------------------------------
### My very first profiling session - naive run

Let's launch a naive training run.

A successful profiling session should produce output similar to the following:

        ------------------------------------------------------------------------------------------------------------------
          successfully saved checkpoint at iteration      12 to ./Megatron-LM/sv_ckpt/
        *****************************************
        Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune 
        the variable for optimal performance in your application as needed. 
        *****************************************
        Processing events...
        Capturing symbol files...
        Saving temporary "/tmp/nsys-report-84a0-cf36-0eed-f814.qdstrm" file to disk...
        Creating final output files...

        Processing [==============================================================100%]
        Saved report file to "/tmp/nsys-report-84a0-cf36-0eed-f814.qdrep"
        Report file moved to "/home/zcharpy/profiles/DLprof/naive/nsys_naive.qdrep"

             
              
 

In [4]:
!bash ./Megatron-LM/profile_naive_run.sh

Collecting data...
Initializing NVTX monkey patches
Done with NVTX monkey patching
Initializing NVTX monkey patches
Done with NVTX monkey patching
Initializing NVTX monkey patches
Done with NVTX monkey patching
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ..............

--------------------------------------------------
Visualizing the profile in Nsight Systems should look similar to the following:

![multigpus naive run](./Megatron-LM/pics/multigpu_naive_run.jpg)
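The contents of `profile_naive_run.sh` are not shown in this notebook. As a rough idea of what such a script typically wraps, the sketch below assembles an `nsys profile` command line in Python. The training entry point and its arguments are hypothetical placeholders, and the chosen trace set (`cuda`, `nvtx`) is an assumption rather than what the lab script necessarily uses.

```python
import shlex


def build_nsys_command(output_path, train_command, traces=("cuda", "nvtx")):
    """Assemble an `nsys profile` command line as a list of arguments."""
    return [
        "nsys", "profile",
        "--trace=" + ",".join(traces),  # which APIs to record
        "--output", output_path,         # name of the report file
        "--force-overwrite", "true",     # replace an existing report
        *train_command,
    ]


# Hypothetical training entry point and arguments, for illustration only.
train_cmd = ["python", "pretrain_gpt.py", "--train-samples", "100"]
cmd = build_nsys_command("nsys_naive", train_cmd)

# Print the command instead of running it, so this cell is safe anywhere.
print(shlex.join(cmd))
```

Building the command as a list keeps the profiler options and the training command clearly separated, which makes it easier to swap in a different configuration for the next run.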

---
## Below is a ReRun cell for experimenting with training configurations
Before each re-run, make sure you clear the checkpoint directory using the cell below.
<a id="Rerun_Cell"></a>
A successful profiling session should look like the following:

                training ...
                time (ms) | model-and-optimizer-setup: 3900.44 | train/valid/test-data-iterators-setup: 3056.78
                [after training is done] datetime: 2021-08-27 01:51:24 
                ------------------------------------------------------------------------------------------------------------------
                 validation loss at the end of training for val data | lm loss value: 1.099207E+01 | lm loss PPL: 5.940106E+04 | 
                ------------------------------------------------------------------------------------------------------------------
                Processing events...
                Capturing symbol files...
                Saving temporary "/tmp/nsys-report-7b95-50de-7e4d-bd7e.qdstrm" file to disk...
                Creating final output files...

                Processing [==============================================================100%]
                Saved report file to "/tmp/nsys-report-7b95-50de-7e4d-bd7e.qdrep"
                Report file moved to "/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved.qdrep"

In [7]:
!rm -fr ./Megatron-LM/sv_ckpt/*

In [6]:
!bash ./Megatron-LM/profile_2nd_run.sh

Collecting data...
Initializing NVTX monkey patches
Done with NVTX monkey patching
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Done with NVTX monkey patching
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
using world size: 8, data-parallel-size: 1, tensor-model-parallel size: 8, pipeline-model-parallel size: 1 
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ..............

--------------------------------------------------
Visualizing the profile in Nsight Systems should look similar to the following:

![multigpus 2nd run](./Megatron-LM/pics/2ndrun.JPG)
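The `using world size: ...` lines in the logs of the runs above show how the GPUs are split between data parallelism and model parallelism: the data-parallel size is the world size divided by the product of the tensor and pipeline model-parallel sizes. A minimal sketch of that arithmetic, with a hypothetical helper name, reproduces the three log lines from this notebook:

```python
def data_parallel_size(world_size, tensor_mp, pipeline_mp):
    """Number of model replicas, given the model-parallel layout."""
    model_parallel = tensor_mp * pipeline_mp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor_mp * pipeline_mp = {model_parallel}"
        )
    return world_size // model_parallel


# Values taken from the log lines printed by the runs in this notebook.
verify_dp = data_parallel_size(world_size=4, tensor_mp=1, pipeline_mp=1)
naive_dp = data_parallel_size(world_size=8, tensor_mp=1, pipeline_mp=1)
second_dp = data_parallel_size(world_size=8, tensor_mp=8, pipeline_mp=1)

print(verify_dp, naive_dp, second_dp)
```

The naive run keeps a full copy of the model on each of the 8 GPUs, while the second run splits a single copy of the model across all 8 GPUs with tensor parallelism.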

<a id="TheChallenge"></a>

----------------

## **Challenge** - the best profile
- Prerequisites:
    - use the number of GPUs currently given to you
    - do NOT change the following parameter: `--train-samples 100`
    - you cannot go OOM (out of memory)
    - you must sustain >80% GPU utilization in the **training** phase
    - the training run must finish and the checkpoint must be saved successfully
- Task: given the above constraints, get GPU utilization during training as high as possible.
- Pass: sustain >80% GPU utilization (across all GPUs) in the **training** phase!
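One way to make the pass criterion concrete is to collect utilization snapshots during the training phase and check them against the threshold. The sketch below assumes that "sustain" means every GPU stays above 80% in every snapshot, which is one possible reading of the rule; the function name and the sample data are made up.

```python
def passes_challenge(samples, threshold=80):
    """Return True if every GPU is above `threshold` percent in every snapshot.

    `samples` is a list of snapshots taken during the training phase;
    each snapshot is a list with one utilization percentage per GPU.
    """
    if not samples:
        return False
    return all(util > threshold for snap in samples for util in snap)


def min_util_per_gpu(samples):
    """Lowest utilization seen on each GPU, to spot the GPU that falls behind."""
    return [min(column) for column in zip(*samples)]


# Made-up snapshots for a 4-GPU run: three time points, four GPUs each.
good_run = [[92, 95, 91, 94], [90, 93, 89, 96], [88, 91, 90, 92]]
bad_run = [[92, 95, 91, 94], [35, 93, 89, 96], [88, 91, 90, 92]]

print(passes_challenge(good_run), min_util_per_gpu(good_run))
print(passes_challenge(bad_run), min_util_per_gpu(bad_run))
```

Only snapshots from the training phase should be included; startup, data loading and checkpoint saving naturally show lower utilization.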
 


Task: modify the [profiling bash script](./Megatron-LM/profile_2nd_run.sh) and rerun.
<a href="./Day2-5_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell">Jump to ReRun Cell</a>
Monitor the training runs until you reach >80% overall GPU utilization in the **training** phase. The parameters to tune are:

```
    TENSOR_MP_SIZE=
    PIPELINE_MP_SIZE=

    #GPT Config 
    LAYERS= 
    HIDDEN_SIZE=
    ATTN_HEADS=
    MICRO_BZ=
    GB_BZ=
    SEQ_LEN=
    MAX_POS_EM=
``` 
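Before launching a run, it can save time to check that the chosen values fit together. The sketch below encodes divisibility rules that Megatron-style training generally expects; treat them as assumptions and rely on the error messages of the actual script as the final word. The field names mirror the variables above, and the example values are hypothetical, not a recommended configuration.

```python
from dataclasses import dataclass, replace


@dataclass
class GPTConfig:
    world_size: int        # number of GPUs given to you
    tensor_mp_size: int    # TENSOR_MP_SIZE
    pipeline_mp_size: int  # PIPELINE_MP_SIZE
    layers: int            # LAYERS
    hidden_size: int       # HIDDEN_SIZE
    attn_heads: int        # ATTN_HEADS
    micro_bz: int          # MICRO_BZ
    gb_bz: int             # GB_BZ
    seq_len: int           # SEQ_LEN
    max_pos_em: int        # MAX_POS_EM


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    mp = cfg.tensor_mp_size * cfg.pipeline_mp_size
    if cfg.world_size % mp != 0:
        # Without a valid layout the data-parallel size is undefined, so stop here.
        return ["world size is not divisible by TENSOR_MP_SIZE * PIPELINE_MP_SIZE"]
    dp = cfg.world_size // mp
    if cfg.hidden_size % cfg.attn_heads != 0:
        problems.append("HIDDEN_SIZE is not divisible by ATTN_HEADS")
    if cfg.attn_heads % cfg.tensor_mp_size != 0:
        problems.append("ATTN_HEADS is not divisible by TENSOR_MP_SIZE")
    if cfg.layers % cfg.pipeline_mp_size != 0:
        problems.append("LAYERS is not divisible by PIPELINE_MP_SIZE")
    if cfg.gb_bz % (cfg.micro_bz * dp) != 0:
        problems.append("GB_BZ is not divisible by MICRO_BZ * data-parallel size")
    if cfg.seq_len > cfg.max_pos_em:
        problems.append("SEQ_LEN is larger than MAX_POS_EM")
    return problems


# Hypothetical example values for an 8-GPU node.
good = GPTConfig(world_size=8, tensor_mp_size=2, pipeline_mp_size=1,
                 layers=24, hidden_size=1024, attn_heads=16,
                 micro_bz=4, gb_bz=64, seq_len=512, max_pos_em=512)
bad = replace(good, tensor_mp_size=8, attn_heads=12)

print(check_config(good))
print(check_config(bad))
```

A configuration that passes these checks can still run out of memory or underuse the GPUs, so the profile remains the real test.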


---
## Congratulations, you are done for the day!
## Back to [start menu](../Start_Here.ipynb)

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 