## Scale up model size
---
In previous notebooks, we downloaded and extracted our own Swedish raw text; practiced filtering, cleaning, and deduplicating the raw text; trained our own GPTBPETokenizer and fitted it to the raw Swedish text; and converted the raw text to mmap format with a custom sentence splitter.

Now that we have learned the components needed to customize the Megatron-LM workflow for a specific language (in this case, Swedish), the next step is to train the Megatron-LM GPT model on the Swedish data.

However, the compute resources you have, i.e., the number of GPUs available for the training job, put an upper limit on how big a model you can train.

Let's test this out with a challenge.

## **Challenge** - Go big or go home!

- Constraints:
    - 2 x A100 40 GB GPUs are allocated for this challenge.
    - Only the parameters in the **modifiable blocks** may be changed.
    - Avoid out-of-memory (OOM) errors!
    - The training run must finish and the checkpoint must be saved successfully.


- Task: given the above constraints, train as BIG a GPT model as possible.

- Winning criterion: the biggest model that satisfies the above constraints wins.

Note 1: Post the parameters you changed in the **modifiable blocks** on the Slack channels for verification.

Note 2: We purposely turned off nsys profiling in this challenge, because profiling with nsys introduces a small overhead that reduces the maximum achievable model size.

To go directly to the code cell and modify the training configuration, click here: <a href="./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL">Jump to Code Cell and Modify Training Config</a>

---
# Hint:
### Open a terminal and run **nvidia-smi** to monitor the GPUs' utilization and power consumption
### Remember to fill up the GPU memory
![call out a terminal ](./Megatron-LM/pics/Alt_callout2terminals.JPG)
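
If you prefer to check GPU usage from a notebook cell rather than a separate terminal, the same information can be read from `nvidia-smi` in CSV form. The sketch below is only an illustration: the helper names (`parse_gpu_csv`, `query_gpus`, `format_report`) are hypothetical and not part of this lab, and it assumes `nvidia-smi` is on the `PATH` of the machine running the notebook.

```python
import subprocess

# Standard nvidia-smi --query-gpu properties requested for each GPU.
QUERY_FIELDS = "index,memory.used,memory.total,utilization.gpu,power.draw"


def _to_float(value):
    """Convert a CSV field to float; return None for values such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        index, mem_used, mem_total, util, power = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "mem_used_mib": _to_float(mem_used),
            "mem_total_mib": _to_float(mem_total),
            "util_pct": _to_float(util),
            "power_w": _to_float(power),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def format_report(gpus):
    """Return one summary line per GPU, including the fraction of memory in use."""
    lines = []
    for gpu in gpus:
        used, total = gpu["mem_used_mib"], gpu["mem_total_mib"]
        filled = f"{100.0 * used / total:.1f}%" if used is not None and total else "n/a"
        lines.append(
            f"GPU {gpu['index']}: {used} / {total} MiB ({filled} full), "
            f"util {gpu['util_pct']}%, power {gpu['power_w']} W"
        )
    return lines
```

Calling `print("\n".join(format_report(query_gpus())))` while the training job is running shows how close each GPU is to its memory limit.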

Modify and rerun the code blocks below to obtain an even bigger GPT model.


<a id="MODIFY_CELL"></a>
<a href="./Lab2-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell">Jump to ReRun Cell</a> 


Always clean the checkpoint folder to ensure training starts from scratch.

In [1]:
!rm -fr ../sv_ckpt/* 

In [2]:
%%writefile ./Megatron-LM/SV_GPT_goingBIG.sh
# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 #<-- currently we are using 1 node multigpus
NODE_RANK=0
WORLD_SIZE=2 
GPUS_PER_NODE=2  


CHECKPOINT_PATH='../sv_ckpt/'
DATA_PATH='../dataset/SV/webnyheter2013_56kvocab_text_document'
VOCAB_FILE='../dataset/SV/56k/vocab.json'
MERGE_FILE='../dataset/SV/56k/merges.txt'
PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path

#### [TODO]--------------- Begin of modifiable block -----------#### 

TENSOR_MP_SIZE=<FILL_IN>
PIPELINE_MP_SIZE=<FILL_IN>
LAYERS=<FILL_IN>
HIDDEN_SZ=<FILL_IN>
NUM_ATTN_HEADS=<FILL_IN>
MICRO_BZ=<FILL_IN>
GLOBAL_BZ=<FILL_IN>
SEQ_LEN=<FILL_IN>
MAX_POS_EM=<FILL_IN>

#### -------------------- end of modifiable blocks ------------------------#### 

##################  DO NOT modify anything below this line ##################
export OMP_NUM_THREADS=1
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

## We turn off nsys profiling decoration to avoid the small overhead
#nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    ./Megatron-LM/Dlprof_pretrain_gpt.py \
       --tensor-model-parallel-size ${TENSOR_MP_SIZE} \
       --pipeline-model-parallel-size ${PIPELINE_MP_SIZE} \
       --num-layers ${LAYERS} \
       --hidden-size ${HIDDEN_SZ} \
       --num-attention-heads ${NUM_ATTN_HEADS} \
       --micro-batch-size ${MICRO_BZ} \
       --global-batch-size ${GLOBAL_BZ} \
       --seq-length ${SEQ_LEN} \
       --max-position-embeddings ${MAX_POS_EM} \
       --train-samples 100 \
       --save ${CHECKPOINT_PATH} \
       --load ${CHECKPOINT_PATH} \
       --data-path 1. ${DATA_PATH} \
       --vocab-file ${VOCAB_FILE} \
       --merge-file ${MERGE_FILE} \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 10 \
       --save-interval 100 \
       --eval-interval 200 \
       --eval-iters 10 \
       --fp16

Overwriting ./Megatron-LM/profile_SVGPT_BIG.sh
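
Before launching a long run, it can save time to sanity-check the values chosen for the modifiable block. The sketch below collects divisibility rules that Megatron-LM is generally understood to require (the exact assertions vary between versions, so treat this as an assumption rather than the library's own validation). The function name `check_config` and the numbers in the usage note are hypothetical, not a recommended configuration.

```python
def check_config(world_size, tensor_mp, pipeline_mp, layers, hidden,
                 heads, micro_bz, global_bz, seq_len, max_pos_em):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # Tensor and pipeline parallelism together must evenly divide the GPU count.
    model_parallel = tensor_mp * pipeline_mp
    data_parallel = None
    if world_size % model_parallel != 0:
        problems.append(
            f"WORLD_SIZE={world_size} is not divisible by "
            f"TENSOR_MP_SIZE*PIPELINE_MP_SIZE={model_parallel}"
        )
    else:
        data_parallel = world_size // model_parallel

    # Layers are split evenly across pipeline stages.
    if layers % pipeline_mp != 0:
        problems.append(f"LAYERS={layers} is not divisible by PIPELINE_MP_SIZE={pipeline_mp}")

    # Each attention head gets an equal slice of the hidden dimension.
    if hidden % heads != 0:
        problems.append(f"HIDDEN_SZ={hidden} is not divisible by NUM_ATTN_HEADS={heads}")

    # Attention heads are split evenly across tensor-parallel ranks.
    if heads % tensor_mp != 0:
        problems.append(f"NUM_ATTN_HEADS={heads} is not divisible by TENSOR_MP_SIZE={tensor_mp}")

    # Learned position embeddings must cover the full sequence length.
    if seq_len > max_pos_em:
        problems.append(f"SEQ_LEN={seq_len} exceeds MAX_POS_EM={max_pos_em}")

    # The global batch is made of micro batches on every data-parallel replica.
    if data_parallel is not None and global_bz % (micro_bz * data_parallel) != 0:
        problems.append(
            f"GLOBAL_BZ={global_bz} is not divisible by "
            f"MICRO_BZ*data_parallel_size={micro_bz * data_parallel}"
        )
    return problems
```

For example, `check_config(world_size=2, tensor_mp=2, pipeline_mp=1, layers=24, hidden=1024, heads=16, micro_bz=1, global_bz=8, seq_len=512, max_pos_em=512)` returns an empty list, while setting both `tensor_mp` and `pipeline_mp` to 2 on two GPUs reports a `WORLD_SIZE` problem. Passing these checks says nothing about whether the model fits in GPU memory.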


Check how big your model is by modifying the parameters in [params_cnt.sh](./params_cnt.sh) and running it.

I got about 6.7 billion :) What about you?

In [None]:
!bash params_cnt.sh 

Below is an example of the expected output:
    
        6
        6675628032
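
The contents of `params_cnt.sh` are not shown here, so the following is only a sketch of how such a count can be computed. It uses the parameter-count formula from the "Efficient Large-Scale Language Model Training on GPU Clusters" paper listed under Additional Resources, $P = 12lh^2\left(1 + \frac{13}{12h} + \frac{V+s}{12lh}\right)$, where $l$ is the number of layers, $h$ the hidden size, $V$ the vocabulary size and $s$ the sequence length. The function name and the example values (32 layers, hidden size 4096, a 56,000-token vocabulary, sequence length 512) are assumptions chosen for illustration.

```python
def gpt_param_count(num_layers, hidden_size, vocab_size, seq_len):
    """Approximate GPT parameter count: 12*l*h^2 + 13*l*h + (V + s)*h."""
    # Attention and MLP weights, plus biases and layer norms, in every layer.
    transformer = 12 * num_layers * hidden_size ** 2 + 13 * num_layers * hidden_size
    # Token embeddings (V x h) and learned position embeddings (s x h).
    embeddings = (vocab_size + seq_len) * hidden_size
    return transformer + embeddings


# Assumed example values, not necessarily those used in params_cnt.sh.
total = gpt_param_count(num_layers=32, hidden_size=4096, vocab_size=56000, seq_len=512)
print(total // 10 ** 9)  # whole billions
print(total)
```

With these assumed values the formula gives 6675628032, which matches the example output above. Note that tensor and pipeline parallel sizes and batch sizes do not change the parameter count; only the layer count, hidden size, vocabulary size and sequence length enter the formula.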


Re-run the cell below to get an even bigger GPT model.

Remember to modify the [params count](./params_cnt.sh) script to check how big your model is.

To go back and modify SV_GPT_goingBIG.sh, click here to 
<a href="./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> 
<a id="Rerun_Cell"></a>

In [None]:
!bash ./Megatron-LM/SV_GPT_goingBIG.sh

Below is an example of the expected output:

        > elapsed time for building blendable dataset indices: 0.00 (sec)
        > finished creating GPT datasets ...
        [after dataloaders are built] datetime: 2021-09-15 11:55:58 
        done with setup ...
        training ...
        time (ms) | model-and-optimizer-setup: 929.42 | train/valid/test-data-iterators-setup: 1004.53
        [after training is done] datetime: 2021-09-15 11:55:58 
        ------------------------------------------------------------------------------------------------------------------
         validation loss at the end of training for val data | lm loss value: 1.171452E+01 | lm loss PPL: 1.223352E+05 | 
        ------------------------------------------------------------------------------------------------------------------
        Evaluating iter 10/10
        -------------------------------------------------------------------------------------------------------------------
         validation loss at the end of training for test data | lm loss value: 1.171400E+01 | lm loss PPL: 1.222719E+05 | 
        -------------------------------------------------------------------------------------------------------------------
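
The `lm loss PPL` column in the log is the perplexity, which is the exponential of the `lm loss value`. A minimal sketch to verify this against the numbers printed above (the function name `perplexity` is hypothetical):

```python
import math


def perplexity(lm_loss):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(lm_loss)


# Loss values copied from the example log above.
val_ppl = perplexity(11.71452)
test_ppl = perplexity(11.71400)
print(f"validation PPL: {val_ppl:.6E}")
print(f"test PPL: {test_ppl:.6E}")
```

The printed values should agree with the logged perplexities up to the rounding of the logged loss. A loss this high is expected here, since the run trains on only 100 samples.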

--- 

## Additional Resources

Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf

Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf

---

## Congratulations on completing the mission!


-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 