In [1]:
# Check if you can see 2 H100 GPUs here
!nvidia-smi

Fri Feb 28 22:57:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:BE:00.0 Off |                    0 |
| N/A   33C    P0            104W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00

In [2]:
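# Find the first free port at or above `init`
import socket

def find_freeport(init):
    port = init
    while True:
        # Use a fresh socket per attempt; the `with` block always closes it
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("localhost", port))
                return port
            except OSError:
                # Port is already in use; try the next one
                port += 1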

## Part B-0: Single-GPU training

The following cell trains a GPT-3-like model with 2 layers and a batch size of 4 on a single GPU. A reference config file for the single-GPU run is provided (megatron_configs/single_gpu.yaml). Running the cell in the Jupyter notebook trains the model for 1 epoch and writes the profiling trace to the prof_log directory.

In [3]:
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/single_gpu.yaml" src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/single_gpu" 2>/dev/null

Initializing Megatron-LM
using world size: 1, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... 

## Part B-1: Analyzing Multi-GPU training

Next, we will try to train the model using multiple GPUs. We use [Megatron-LM](https://huggingface.co/docs/accelerate/usage_guides/megatron_lm) for generating distributed training runs.

Write new configuration files and store them in the megatron_configs directory (megatron_configs/*.yaml). <br>
**For each run, change the arguments --config_file and --logdir appropriately**

<ins>You need to change the following parameters. Refer to the Megatron-LM link above to understand how the three parameters are used.</ins> <br>

1.	megatron_lm_pp_degree (PP) <br>
2.	megatron_lm_tp_degree (TP) <br>
3.	megatron_lm_recompute_activations (AR) <br>

<ins>Generate config files for the following parallelism strategies</ins>: <br>

(a)	Single GPU. (Already Provided) <br>
(b)	Tensor Parallelism (TP=2) on 2 GPUs. (Already Provided) <br>
(c)	Pipeline Parallelism (PP=2) on 2 GPUs. <br>
(d)	Data Parallelism (DP=2) on 2 GPUs. <br>
(e)	Tensor Parallelism (TP=2) + activation recomputation <br>
(f)	Pipeline Parallelism (PP=2) + activation recomputation <br>
(g)	Data Parallelism (DP=2) + activation recomputation <br>

The degree of data parallelism is not explicitly specified but is automatically inferred as follows: <br>
        DP = num_processes / (PP * TP)
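
As a quick sanity check before writing the YAML files, the formula above can be applied to all seven strategies. The sketch below is illustrative only: the `Strategy` class and the run names (which mirror the `--logdir` names used in this notebook) are hypothetical helpers, the three keys are the parameters listed above, and setting recomputation to `False` for the non-recomputation runs is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One parallelism strategy (hypothetical helper, not part of the lab code)."""
    name: str
    tp: int
    pp: int
    recompute: bool
    num_processes: int = 2

    @property
    def dp(self) -> int:
        # DP = num_processes / (PP * TP); the division must be exact
        model_parallel = self.tp * self.pp
        if self.num_processes % model_parallel != 0:
            raise ValueError(f"{self.name}: {self.num_processes} processes "
                             f"not divisible by PP*TP={model_parallel}")
        return self.num_processes // model_parallel

    def megatron_overrides(self) -> dict:
        # The three parameters that differ between the config files
        return {
            "megatron_lm_tp_degree": self.tp,
            "megatron_lm_pp_degree": self.pp,
            "megatron_lm_recompute_activations": self.recompute,
        }


STRATEGIES = [
    Strategy("single_gpu", tp=1, pp=1, recompute=False, num_processes=1),  # (a)
    Strategy("tensor_parallel", tp=2, pp=1, recompute=False),              # (b)
    Strategy("pipeline_parallel", tp=1, pp=2, recompute=False),            # (c)
    Strategy("data_parallel", tp=1, pp=1, recompute=False),                # (d)
    Strategy("tensor_parallel_w_activ_recomp", tp=2, pp=1, recompute=True),    # (e)
    Strategy("pipeline_parallel_w_activ_recomp", tp=1, pp=2, recompute=True),  # (f)
    Strategy("data_parallel_w_activ_recomp", tp=1, pp=1, recompute=True),      # (g)
]

for s in STRATEGIES:
    print(f"{s.name:36s} DP={s.dp}  {s.megatron_overrides()}")
```

The inferred DP value for each run should match the `data-parallel size` printed at the top of the corresponding Megatron-LM log.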

In [4]:
## Tensor Parallel with 2 GPUs
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/tensor_parallel.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/tensor_parallel"

  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
Initializing Megatron-LM
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .........

In [5]:
## Data Parallel with 2 GPUs
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/data_parallel.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/data_parallel"

  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 2, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. Fal

In [6]:
## Pipeline Parallel with 2 GPUs
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/pipeline_parallel.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/pipeline_parallel"

  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
Initializing Megatron-LM
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 2 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .........

In [7]:
## Tensor Parallel with 2 GPUs and activation recomputation
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/tensor_parallel_w_activ_recomp.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/tensor_parallel_w_activ_recomp"

  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
  def forward(
  def forward(
  def backward(ctx, grad_output):
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. Fal

In [8]:
## Data Parallel with 2 GPUs and activation recomputation
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/data_parallel_w_activ_recomp.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/data_parallel_w_activ_recomp"

  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 2, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. Fal

In [9]:
## Pipeline Parallel with 2 GPUs and activation recomputation
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/pipeline_parallel_w_activ_recomp.yaml"  src/prof.py --model_name "model_configs/gpt3_27_2_layer.json" --total_batch_size "4" --logdir "prof_log/pipeline_parallel_w_activ_recomp"

  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
  def forward(
  def backward(ctx, grad_output):
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 2 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. Fal

Use [TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html) to analyze various characteristics of different runs. Fill in the provided **Excel sheet** for all **seven configurations**. Refer to the lab instructions for the details to be filled in the Excel sheet.
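
Before opening TensorBoard, it may help to confirm that every run actually wrote a trace. The following is a minimal sketch, assuming the log root is `~/lab3B/prof_log` (the path used by the TensorBoard cell) and that each run wrote to the `--logdir` name used in the cells above; `missing_runs` is a hypothetical helper, not part of the lab code.

```python
from pathlib import Path

# Run directories expected under prof_log, taken from the --logdir arguments above
EXPECTED_RUNS = [
    "single_gpu",
    "tensor_parallel",
    "pipeline_parallel",
    "data_parallel",
    "tensor_parallel_w_activ_recomp",
    "pipeline_parallel_w_activ_recomp",
    "data_parallel_w_activ_recomp",
]


def missing_runs(logdir_root):
    """Return the expected runs whose directory is absent or empty."""
    root = Path(logdir_root).expanduser()
    return [
        name for name in EXPECTED_RUNS
        if not (root / name).is_dir() or not any((root / name).iterdir())
    ]


print("Runs without profiling output:", missing_runs("~/lab3B/prof_log"))
```

An empty list means all seven configurations have something for TensorBoard to load.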

To open the TensorBoard session on port X, follow the steps below:

(a) Press Ctrl+Shift+P (or Cmd+Shift+P on Mac) to open the command palette 
    
(b) Type ">open port in browser" 
    
(c) Select port X
    
If TensorBoard shows only stale log data, restart it by re-running the TensorBoard cell so that it opens on a different port.

In [None]:
freeport = find_freeport(6006)
print("Tensorboard will be open on port:", freeport) 
!tensorboard --logdir ~/lab3B/prof_log/ --port {freeport} --bind_all --reload_multifile True 2>/dev/null

Tensorboard will be open on port: 6006
^C


## Part B-2: Training a large model

Train the largest possible GPT-like model with a batch size of 4 on 2 H100 GPUs, by modifying the following parameters:  

1. **Model Size**:  
   - Edit the `"model_configs/gpt3_27.json"` file.  
   - You may only modify the number of layers (`"n_layer": 24`).  Set the number of layers to a multiple of 24.

2. **Distributed Training Configuration**:  
   - Choose any of the six 2-GPU configurations, (b) through (g), from Part B-1.  

If the training cell executes successfully, it will report the model's size. If the model is too large, training fails with an out-of-memory condition, which surfaces as a CUDA or NCCL error.  
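
A rough back-of-the-envelope estimate can narrow down which layer counts are worth trying. The sketch below is not the lab's method: it counts only weights, gradients, and Adam moments in fp32 (16 bytes per parameter, matching the `torch.float32` runs above) and ignores activations and framework overhead, so it is a lower bound. The hidden size, vocabulary size, and context length are hypothetical placeholders; read the real values from `model_configs/gpt3_27.json`.

```python
GPU_MEMORY_MIB = 81559  # per-GPU capacity reported by nvidia-smi at the top of the notebook
BYTES_PER_PARAM = 16    # fp32 weights (4) + fp32 gradients (4) + two Adam moments (8)


def transformer_params(n_layer, d_model, vocab_size, n_positions):
    """Approximate parameter count of a GPT-style decoder with linear biases."""
    # Per layer: attention (4*d^2 + 4*d) + MLP (8*d^2 + 5*d) + two LayerNorms (4*d)
    per_layer = 12 * d_model**2 + 13 * d_model
    embeddings = (vocab_size + n_positions) * d_model
    final_layernorm = 2 * d_model
    return n_layer * per_layer + embeddings + final_layernorm


def training_state_mib(n_params):
    """Memory for weights, gradients, and optimizer state, in MiB."""
    return n_params * BYTES_PER_PARAM / 2**20


# Hypothetical shape values, NOT read from model_configs/gpt3_27.json
D_MODEL, VOCAB, N_POS = 2560, 50257, 2048

for n_layer in (24, 48, 72, 96):
    params = transformer_params(n_layer, D_MODEL, VOCAB, N_POS)
    replicated = training_state_mib(params)  # DP: every GPU holds a full copy
    split = replicated / 2                   # TP or PP over 2 GPUs, assuming an even split
    print(f"n_layer={n_layer:3d}  params={params / 1e9:5.2f}B  "
          f"replicated={replicated:8.0f} MiB  split={split:8.0f} MiB  "
          f"(capacity {GPU_MEMORY_MIB} MiB per GPU)")
```

Because data parallelism replicates the full training state on each GPU while tensor and pipeline parallelism divide it, the estimate suggests why the model-parallel configurations can fit more layers; activation recomputation then trades extra compute for the activation memory this sketch leaves out.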

In [25]:
## Try varying the number of layers and various parallelism strategies.
!accelerate launch --main_process_port {find_freeport(25900)} --config_file "megatron_configs/pipeline_parallel_w_activ_recomp.yaml"  src/train.py --model_name "model_configs/gpt3_27.json" --total_batch_size "4" 2> /dev/null

Initializing Megatron-LM
using world size: 2, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 2 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... 