# 5 Monitor GPT training performance with varying config
---

## **Challenge** - Go big or go home!
- Prerequisites:
    - Use the number of GPUs you have currently been given.
    - Do NOT change the parameter **--train-samples 100**.
    - The run must not go out of memory (OOM).
    - You must sustain >60% GPU utilization during the **training** phase.
    - The training run must finish and the checkpoint must be saved successfully.


- Task: given the above constraints, train as BIG a GPT model as possible.


- Winning criterion: the biggest model that satisfies all of the prerequisites above wins.

    <a href="./Day3-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell">Jump to ReRun Cell</a> 
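One way to check the >60% utilization prerequisite is to sample `nvidia-smi` from a second terminal while the training phase is running. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names (`parse_utilization`, `query_utilization`, `mean_utilization`, `meets_threshold`, `watch`) are hypothetical and not part of the lab material, and averaging the samples per GPU is only one possible reading of "sustain".

```python
import subprocess
import time
from typing import List


def parse_utilization(smi_output: str) -> List[int]:
    """Parse one integer utilization percentage per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_utilization() -> List[int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)


def mean_utilization(samples: List[List[int]]) -> List[float]:
    """Average the samples per GPU (each sample holds one value per GPU)."""
    return [sum(per_gpu) / len(per_gpu) for per_gpu in zip(*samples)]


def meets_threshold(samples: List[List[int]], threshold: float = 60.0) -> bool:
    """True if every GPU averages strictly above the threshold (assumed reading of '>60%')."""
    return all(avg > threshold for avg in mean_utilization(samples))


def watch(samples: int = 30, interval_s: float = 1.0) -> List[List[int]]:
    """Collect `samples` readings, one every `interval_s` seconds."""
    collected = []
    for _ in range(samples):
        collected.append(query_utilization())
        time.sleep(interval_s)
    return collected
```

Calling `meets_threshold(watch())` once the log shows that training has started gives a quick pass/fail answer; sampling during setup or checkpoint saving would drag the average down.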

```
#### the following params are allowed to change
WORLD_SIZE=8 # <--- remember to set this to the number of GPUs you actually have in your system
GPUS_PER_NODE=8 # <--- remember to set this to the number of GPUs you actually have in your system

TENSOR_MP_SIZE=8
PIPELINE_MP_SIZE=1
LAYERS=32
HIDDEN_SZ=2048
NUM_ATTN_HEADS=32
MICRO_BZ=
GLOBAL_BZ=
SEQ_LEN=
MAX_POS_EM=
#### ---------------------------####
```
For your reference:
<center><img src="./Megatron-LM/pics/GPT3_all.png" width="700"/></center>
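Before launching a run, it can save time to sanity-check a candidate set of values. The sketch below encodes the divisibility rules that Megatron-LM is generally understood to enforce (the authoritative checks live in Megatron-LM's own argument parsing, which may differ between versions); `check_config` is a hypothetical helper, not part of the lab material. It also derives the data-parallel size and the number of gradient-accumulation steps implied by the batch sizes.

```python
def check_config(world_size, tensor_mp, pipeline_mp, layers, hidden, heads,
                 micro_bz, global_bz, seq_len, max_pos_em):
    """Return derived sizes plus a list of human-readable problems (empty if none found)."""
    problems = []
    model_parallel = tensor_mp * pipeline_mp
    if world_size % model_parallel:
        problems.append("WORLD_SIZE must be divisible by TENSOR_MP_SIZE * PIPELINE_MP_SIZE")
    # GPUs left over after model parallelism are used for data parallelism.
    data_parallel = max(world_size // model_parallel, 1)
    if layers % pipeline_mp:
        problems.append("LAYERS must be divisible by PIPELINE_MP_SIZE")
    if hidden % heads:
        problems.append("HIDDEN_SZ must be divisible by NUM_ATTN_HEADS")
    if heads % tensor_mp:
        problems.append("NUM_ATTN_HEADS must be divisible by TENSOR_MP_SIZE")
    if global_bz % (micro_bz * data_parallel):
        problems.append("GLOBAL_BZ must be divisible by MICRO_BZ * data-parallel size")
    if seq_len > max_pos_em:
        problems.append("SEQ_LEN must not exceed MAX_POS_EM")
    return {
        "data_parallel": data_parallel,
        "grad_accum_steps": global_bz // (micro_bz * data_parallel),
        "problems": problems,
    }


# Values taken from the script further down in this notebook.
result = check_config(world_size=8, tensor_mp=8, pipeline_mp=1, layers=64,
                      hidden=2048, heads=32, micro_bz=64, global_bz=512,
                      seq_len=512, max_pos_em=512)
print(result)
```

Passing these checks says nothing about memory: a configuration can be valid and still go OOM, so the actual run remains the real test.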

<a id="Rerun_Cell"></a>

In [30]:
!rm -fr ./Megatron-LM/sv_ckpt/* 

In [29]:
%%writefile ./Megatron-LM/profile_SVGPT_BIG.sh
# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 #<-- currently we are using a single node with multiple GPUs
NODE_RANK=0

### modify this section so that each variable points to your own path
CHECKPOINT_PATH='./Megatron-LM/sv_ckpt/'
DATA_PATH='../dataset/SV/webnyheter2013_text_document'
VOCAB_FILE='../dataset/SV/32k/vocab.json'
MERGE_FILE='../dataset/SV/32k/merges.txt'
PROFILE_OUTPUT_PATH='/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved' # modify this to your own profile path

#### [TODO]--------------- params in the following block are allowed to change -----------#### 
WORLD_SIZE=8 # <--- remember to set this to the number of GPUs you actually have in your system
GPUS_PER_NODE=8 # <--- remember to set this to the number of GPUs you actually have in your system

TENSOR_MP_SIZE=8
PIPELINE_MP_SIZE=1
LAYERS=64
HIDDEN_SZ=2048
NUM_ATTN_HEADS=32
MICRO_BZ=64
GLOBAL_BZ=512
SEQ_LEN=512
MAX_POS_EM=512
#### -------------------- end of block ------------------------#### 

export OMP_NUM_THREADS=1
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

## for nsys run
#nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    ./Megatron-LM/Dlprof_pretrain_gpt.py \
       --tensor-model-parallel-size $TENSOR_MP_SIZE \
       --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
       --num-layers $LAYERS \
       --hidden-size $HIDDEN_SZ \
       --num-attention-heads $NUM_ATTN_HEADS \
       --micro-batch-size $MICRO_BZ \
       --global-batch-size $GLOBAL_BZ \
       --seq-length $SEQ_LEN \
       --max-position-embeddings $MAX_POS_EM \
       --train-samples 100 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path 1. $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 10 \
       --save-interval 100 \
       --eval-interval 200 \
       --eval-iters 10 \
       --fp16

Overwriting ./Megatron-LM/profile_SVGPT_BIG.sh


---
## Check how big your model is
I got 1 billion parameters :) What about you?

In [26]:
!bash params_cnt.sh $LAYERS $HIDDEN_SZ $NUM_ATTN_HEADS $SEQ_LEN

3
3289513984
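The contents of `params_cnt.sh` are not shown here, so the following is only a sketch of one common closed-form estimate for GPT-style models, P = 12·l·h² + 13·l·h + (V + s)·h, where l is the number of layers, h the hidden size, s the sequence length and V the vocabulary size. The vocabulary size of 32000 is an assumption inferred from the `32k` directory name in `VOCAB_FILE`; `gpt_param_count` is a hypothetical helper. With the values from the script above, this estimate reproduces the number printed by the cell, although that does not prove the script uses the same formula.

```python
def gpt_param_count(layers: int, hidden: int, seq_len: int, vocab_size: int = 32000) -> int:
    """Estimate the parameter count of a GPT-style transformer.

    12*l*h^2   : attention and MLP weight matrices in every layer
    13*l*h     : biases and layer-norm parameters in every layer
    (V + s)*h  : token embeddings plus learned position embeddings
    vocab_size defaults to 32000, which is an assumption (see the text above).
    """
    return 12 * layers * hidden ** 2 + 13 * layers * hidden + (vocab_size + seq_len) * hidden


# LAYERS=64, HIDDEN_SZ=2048, SEQ_LEN=512 from the script above.
total = gpt_param_count(layers=64, hidden=2048, seq_len=512)
print(total, "parameters, about", round(total / 1e9, 2), "billion")
```

Note that the number of attention heads does not appear in this estimate: it changes how the hidden dimension is split, not how many weights there are. Growing `HIDDEN_SZ` increases the count quadratically, while growing `LAYERS` increases it linearly.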


---
#### You should see something similar to the following

            training ...
            time (ms) | model-and-optimizer-setup: 4013.85 | train/valid/test-data-iterators-setup: 2773.74
            [after training is done] datetime: 2021-08-27 06:24:46 
            ------------------------------------------------------------------------------------------------------------------
             validation loss at the end of training for val data | lm loss value: 1.124495E+01 | lm loss PPL: 7.649290E+04 | 
            ------------------------------------------------------------------------------------------------------------------
            Processing events...
            Capturing symbol files...
            Saving temporary "/tmp/nsys-report-96a7-0101-ea4b-0ee5.qdstrm" file to disk...
            Creating final output files...

            Processing [==============================================================100%]
            Saved report file to "/tmp/nsys-report-96a7-0101-ea4b-0ee5.qdrep"
            Report file moved to "/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved.qdrep"

In [31]:
!bash ./Megatron-LM/profile_SVGPT_BIG.sh

Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Initializing NVTX monkey patches
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
Done with NVTX monkey patching
using world size: 8, data-parallel-size: 1, tensor-model-parallel size: 8, pipeline-model-parallel size: 1 
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume .................................
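Since a successfully saved checkpoint is one of the prerequisites, it is worth confirming it on disk after the run. The sketch below assumes the checkpoint layout that Megatron-LM releases of this period are generally known to use (a `latest_checkpointed_iteration.txt` tracker file, an `iter_XXXXXXX` directory, and one `mp_rank_*` subdirectory per model-parallel rank containing `model_optim_rng.pt`); this layout is an assumption and may differ in other versions. `checkpoint_summary` is a hypothetical helper.

```python
import os


def checkpoint_summary(ckpt_dir: str, expected_ranks: int) -> dict:
    """Inspect a checkpoint directory and report whether it looks complete."""
    tracker = os.path.join(ckpt_dir, "latest_checkpointed_iteration.txt")
    if not os.path.isfile(tracker):
        return {"ok": False, "iteration": None, "reason": "tracker file missing"}
    with open(tracker) as f:
        tag = f.read().strip()
    # The tracker holds either an iteration number or the literal word "release".
    sub_dir = "release" if tag == "release" else "iter_{:07d}".format(int(tag))
    iter_dir = os.path.join(ckpt_dir, sub_dir)
    if not os.path.isdir(iter_dir):
        return {"ok": False, "iteration": tag, "reason": "missing " + sub_dir}
    ranks = sorted(d for d in os.listdir(iter_dir) if d.startswith("mp_rank_"))
    incomplete = [r for r in ranks
                  if not os.path.isfile(os.path.join(iter_dir, r, "model_optim_rng.pt"))]
    if len(ranks) != expected_ranks:
        reason = "found {} rank dirs, expected {}".format(len(ranks), expected_ranks)
        return {"ok": False, "iteration": tag, "reason": reason}
    if incomplete:
        return {"ok": False, "iteration": tag, "reason": "no model file in " + incomplete[0]}
    return {"ok": True, "iteration": tag, "reason": ""}
```

For the script in this notebook, `checkpoint_summary("./Megatron-LM/sv_ckpt/", expected_ranks=8)` would be the matching call, since `TENSOR_MP_SIZE * PIPELINE_MP_SIZE` is 8.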

## Remember to copy and paste your output to Slack or Zoom
## Congratulations on completing the mission!


-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 