In [1]:
# Setup and Implementation Part 1
import torch
import sys

print("PyTorch Installation")
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs Available: {num_gpus}")
else:
    print("No GPUs available!")
# Everything runs in a Kaggle notebook, which comes with PyTorch preinstalled and
# two NVIDIA T4 GPUs available for free, so there are no complex configuration
# steps, setup steps, or hardware specifications to document.

PyTorch Installation
PyTorch Version: 2.6.0+cu124
CUDA Available: True
Number of GPUs Available: 2

In [2]:
# Setup and Implementation Part 2
# distributed_training.py = code that implements a transformer model with pipeline parallelism support 
!cat ../input/temp65/distributed_training.py

# Importing all the required libraries/dependencies
import torch
import torch.nn as nn
from dataclasses import dataclass
import time
import os
import argparse
import torch.distributed as dist
from torch.distributed.pipelining import pipeline, SplitPoint, ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B

# global variables taken from the tutorial code
global rank, device, pp_group, stage_index, num_stages, start_time, end_time

# init_distributed method is the exact same as the tutorial code
def init_distributed():
   global rank, device, pp_group, stage_index, num_stages

   rank = int(os.environ["LOCAL_RANK"])
   world_size = int(os.environ["WORLD_SIZE"])
   device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
   dist.init_process_group()

   pp_group = dist.new_group()
   stage_index = rank
   num_stages = world_size

# ModelArgs class is the exact same as the tutorial code with n_processes(number of processes) attribute added to it
@datacla

In [3]:
# Experimental Evaluation Part 1
# Results are reported only for 2 processes: only 2 GPUs are available, and a CPU-only run hangs and never finishes.
# GPipe Scheduling Results

In [4]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 4 --schedule GPipe --processes 2

W1021 02:11:32.477000 79 torch/distributed/run.py:792] 
W1021 02:11:32.477000 79 torch/distributed/run.py:792] *****************************************
W1021 02:11:32.477000 79 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:11:32.477000 79 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3813
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2788
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0630
Total Training Throughput: 15281.787
End To End Training Time: 3.141
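
Copying the two summary numbers out of each run by hand is error-prone. As a minimal sketch, assuming the script always prints the `Total Training Throughput:` and `End To End Training Time:` lines in the format shown above, the values could be pulled out of captured output with regular expressions. `parse_metrics` is a hypothetical helper and is not part of `distributed_training.py`.

```python
import re

# Patterns for the two summary lines printed at the end of a run.
_THROUGHPUT = re.compile(r"Total Training Throughput:\s*([0-9.]+)")
_TIME = re.compile(r"End To End Training Time:\s*([0-9.]+)")


def parse_metrics(output: str) -> dict:
    """Extract throughput and end-to-end time from one run's stdout (hypothetical helper)."""
    throughput = _THROUGHPUT.search(output)
    elapsed = _TIME.search(output)
    if throughput is None or elapsed is None:
        raise ValueError("summary lines not found in output")
    return {
        "training_throughput": float(throughput.group(1)),
        "total_training_time": float(elapsed.group(1)),
    }


# Tail of the output of the first GPipe run, copied from above.
sample = """Rank: 1, Avg Loss: 9.0630
Total Training Throughput: 15281.787
End To End Training Time: 3.141
"""
metrics = parse_metrics(sample)
print(metrics)
```

The same function could be applied to the text captured from each `torchrun` cell, which would avoid retyping the numbers later.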


In [5]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 8 --schedule GPipe --processes 2

W1021 02:11:46.172000 117 torch/distributed/run.py:792] 
W1021 02:11:46.172000 117 torch/distributed/run.py:792] *****************************************
W1021 02:11:46.172000 117 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:11:46.172000 117 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3771
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2850
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0615
Total Training Throughput: 17518.957
End To End Training Time: 2.740


In [6]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 12 --schedule GPipe --processes 2

W1021 02:11:56.910000 155 torch/distributed/run.py:792] 
W1021 02:11:56.910000 155 torch/distributed/run.py:792] *****************************************
W1021 02:11:56.910000 155 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:11:56.910000 155 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3792
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2898
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0686
Total Training Throughput: 17651.923
End To End Training Time: 2.719


In [7]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 4 --schedule GPipe --processes 2

W1021 02:12:07.679000 193 torch/distributed/run.py:792] 
W1021 02:12:07.679000 193 torch/distributed/run.py:792] *****************************************
W1021 02:12:07.679000 193 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:12:07.679000 193 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3797
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3120
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0902
Total Training Throughput: 12171.319
End To End Training Time: 3.944


In [8]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 8 --schedule GPipe --processes 2

W1021 02:12:20.497000 231 torch/distributed/run.py:792] 
W1021 02:12:20.497000 231 torch/distributed/run.py:792] *****************************************
W1021 02:12:20.497000 231 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:12:20.497000 231 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3679
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3177
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0854
Total Training Throughput: 11738.217
End To End Training Time: 4.089


In [9]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 12 --schedule GPipe --processes 2

W1021 02:12:33.929000 269 torch/distributed/run.py:792] 
W1021 02:12:33.929000 269 torch/distributed/run.py:792] *****************************************
W1021 02:12:33.929000 269 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:12:33.929000 269 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3757
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2920
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0721
Total Training Throughput: 11286.389
End To End Training Time: 4.253


In [10]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 4 --schedule GPipe --processes 2

W1021 02:12:47.585000 307 torch/distributed/run.py:792] 
W1021 02:12:47.585000 307 torch/distributed/run.py:792] *****************************************
W1021 02:12:47.585000 307 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:12:47.585000 307 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3838
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3220
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0916
Total Training Throughput: 9089.485
End To End Training Time: 5.281


In [11]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 8 --schedule GPipe --processes 2

W1021 02:13:03.411000 345 torch/distributed/run.py:792] 
W1021 02:13:03.411000 345 torch/distributed/run.py:792] *****************************************
W1021 02:13:03.411000 345 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:13:03.411000 345 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3737
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3002
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0838
Total Training Throughput: 8906.597
End To End Training Time: 5.389


In [12]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 12 --schedule GPipe --processes 2

W1021 02:13:19.360000 383 torch/distributed/run.py:792] 
W1021 02:13:19.360000 383 torch/distributed/run.py:792] *****************************************
W1021 02:13:19.360000 383 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:13:19.360000 383 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3715
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3142
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0820
Total Training Throughput: 8216.436
End To End Training Time: 5.842


In [13]:
# Experimental Evaluation Part 1
# 1F1B Scheduling Results

In [14]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 4 --schedule 1F1B --processes 2

W1021 02:13:35.809000 421 torch/distributed/run.py:792] 
W1021 02:13:35.809000 421 torch/distributed/run.py:792] *****************************************
W1021 02:13:35.809000 421 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:13:35.809000 421 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3778
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2965
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0689
Total Training Throughput: 17685.948
End To End Training Time: 2.714


In [15]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 8 --schedule 1F1B --processes 2

W1021 02:13:46.552000 459 torch/distributed/run.py:792] 
W1021 02:13:46.552000 459 torch/distributed/run.py:792] *****************************************
W1021 02:13:46.552000 459 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:13:46.552000 459 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3763
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2950
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0704
Total Training Throughput: 17387.224
End To End Training Time: 2.761


In [16]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 12 --schedule 1F1B --processes 2

W1021 02:13:57.300000 497 torch/distributed/run.py:792] 
W1021 02:13:57.300000 497 torch/distributed/run.py:792] *****************************************
W1021 02:13:57.300000 497 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:13:57.300000 497 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3659
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2858
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0689
Total Training Throughput: 16730.582
End To End Training Time: 2.869


In [17]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 4 --schedule 1F1B --processes 2

W1021 02:14:08.005000 535 torch/distributed/run.py:792] 
W1021 02:14:08.005000 535 torch/distributed/run.py:792] *****************************************
W1021 02:14:08.005000 535 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:14:08.005000 535 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3806
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3105
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0844
Total Training Throughput: 11778.643
End To End Training Time: 4.075


In [18]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 8 --schedule 1F1B --processes 2

W1021 02:14:21.318000 573 torch/distributed/run.py:792] 
W1021 02:14:21.318000 573 torch/distributed/run.py:792] *****************************************
W1021 02:14:21.318000 573 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:14:21.318000 573 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3787
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3107
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0862
Total Training Throughput: 11584.665
End To End Training Time: 4.143


In [19]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 12 --schedule 1F1B --processes 2

W1021 02:14:34.750000 611 torch/distributed/run.py:792] 
W1021 02:14:34.750000 611 torch/distributed/run.py:792] *****************************************
W1021 02:14:34.750000 611 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:14:34.750000 611 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3776
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3059
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0772
Total Training Throughput: 10936.940
End To End Training Time: 4.389


In [20]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 4 --schedule 1F1B --processes 2

W1021 02:14:48.334000 649 torch/distributed/run.py:792] 
W1021 02:14:48.334000 649 torch/distributed/run.py:792] *****************************************
W1021 02:14:48.334000 649 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:14:48.334000 649 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3918
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3003
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0843
Total Training Throughput: 8666.590
End To End Training Time: 5.539


In [21]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 8 --schedule 1F1B --processes 2

W1021 02:15:04.724000 687 torch/distributed/run.py:792] 
W1021 02:15:04.724000 687 torch/distributed/run.py:792] *****************************************
W1021 02:15:04.724000 687 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:15:04.724000 687 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3752
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3200
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0951
Total Training Throughput: 8664.951
End To End Training Time: 5.540


In [22]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 12 --schedule 1F1B --processes 2

W1021 02:15:21.045000 725 torch/distributed/run.py:792] 
W1021 02:15:21.045000 725 torch/distributed/run.py:792] *****************************************
W1021 02:15:21.045000 725 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:15:21.045000 725 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3766
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3104
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0871
Total Training Throughput: 8056.898
End To End Training Time: 5.958


In [23]:
# Experimental Evaluation Part 1
# Interleaved1F1B Scheduling Results
# num_chunks will always be passed in as half of layers(total number of TransformerDecoder layers) to evenly distribute the workload across both GPUs
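
The rule stated above (`num_chunks` equal to half of `layers`) can be applied mechanically when building the commands. The sketch below generates the nine Interleaved1F1B invocations for this sweep. `build_command` is a hypothetical helper introduced only for illustration; the flags and the script path are the ones already used in this notebook's cells.

```python
from itertools import product

# Script path as used in this notebook's cells.
SCRIPT = "../input/temp65/distributed_training.py"


def build_command(layers: int, heads: int, processes: int = 2) -> str:
    """Build one Interleaved1F1B torchrun command (hypothetical helper)."""
    num_chunks = layers // 2  # half of the decoder layers, per the rule above
    return (
        f"torchrun --nnodes 1 --nproc_per_node {processes} {SCRIPT} "
        f"--layers {layers} --attention_heads {heads} "
        f"--schedule Interleaved1F1B --processes {processes} "
        f"--num_chunks {num_chunks}"
    )


# Sweep over the same layer and head counts as the other schedules.
commands = [build_command(layers, heads)
            for layers, heads in product([4, 8, 12], [4, 8, 12])]

for command in commands:
    print(command)
```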


In [24]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 4 --schedule Interleaved1F1B --processes 2 --num_chunks 2

W1021 02:15:37.580000 763 torch/distributed/run.py:792] 
W1021 02:15:37.580000 763 torch/distributed/run.py:792] *****************************************
W1021 02:15:37.580000 763 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:15:37.580000 763 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3784
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2736
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0630
Total Training Throughput: 17179.105
End To End Training Time: 2.794


In [25]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 8 --schedule Interleaved1F1B --processes 2 --num_chunks 2

W1021 02:15:48.382000 801 torch/distributed/run.py:792] 
W1021 02:15:48.382000 801 torch/distributed/run.py:792] *****************************************
W1021 02:15:48.382000 801 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:15:48.382000 801 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3793
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2828
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0629
Total Training Throughput: 17085.271
End To End Training Time: 2.809


In [26]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 4 --attention_heads 12 --schedule Interleaved1F1B --processes 2 --num_chunks 2

W1021 02:15:59.038000 839 torch/distributed/run.py:792] 
W1021 02:15:59.038000 839 torch/distributed/run.py:792] *****************************************
W1021 02:15:59.038000 839 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:15:59.038000 839 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3802
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2806
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0609
Total Training Throughput: 16215.554
End To End Training Time: 2.960


In [27]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 4 --schedule Interleaved1F1B --processes 2 --num_chunks 4

W1021 02:16:09.800000 877 torch/distributed/run.py:792] 
W1021 02:16:09.800000 877 torch/distributed/run.py:792] *****************************************
W1021 02:16:09.800000 877 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:16:09.800000 877 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3748
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3045
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0773
Total Training Throughput: 11887.872
End To End Training Time: 4.038


In [28]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 8 --schedule Interleaved1F1B --processes 2 --num_chunks 4

W1021 02:16:23.560000 915 torch/distributed/run.py:792] 
W1021 02:16:23.560000 915 torch/distributed/run.py:792] *****************************************
W1021 02:16:23.560000 915 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:16:23.560000 915 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3784
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2984
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0698
Total Training Throughput: 11762.432
End To End Training Time: 4.081


In [29]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 8 --attention_heads 12 --schedule Interleaved1F1B --processes 2 --num_chunks 4

W1021 02:16:37.292000 953 torch/distributed/run.py:792] 
W1021 02:16:37.292000 953 torch/distributed/run.py:792] *****************************************
W1021 02:16:37.292000 953 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:16:37.292000 953 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3700
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3097
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0819
Total Training Throughput: 11063.757
End To End Training Time: 4.338


In [30]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 4 --schedule Interleaved1F1B --processes 2 --num_chunks 6

W1021 02:16:51.023000 991 torch/distributed/run.py:792] 
W1021 02:16:51.023000 991 torch/distributed/run.py:792] *****************************************
W1021 02:16:51.023000 991 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:16:51.023000 991 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3700
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.2848
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0668
Total Training Throughput: 8868.424
End To End Training Time: 5.412


In [31]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 8 --schedule Interleaved1F1B --processes 2 --num_chunks 6

W1021 02:17:07.136000 1029 torch/distributed/run.py:792] 
W1021 02:17:07.136000 1029 torch/distributed/run.py:792] *****************************************
W1021 02:17:07.136000 1029 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:17:07.136000 1029 torch/distributed/run.py:792] *****************************************
Rank: 1, Starting Epoch: 1
Rank: 0, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3795
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3237
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0831
Total Training Throughput: 8705.337
End To End Training Time: 5.514


In [32]:
!torchrun --nnodes 1 --nproc_per_node 2 ../input/temp65/distributed_training.py --layers 12 --attention_heads 12 --schedule Interleaved1F1B --processes 2 --num_chunks 6

W1021 02:17:23.794000 1067 torch/distributed/run.py:792] 
W1021 02:17:23.794000 1067 torch/distributed/run.py:792] *****************************************
W1021 02:17:23.794000 1067 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1021 02:17:23.794000 1067 torch/distributed/run.py:792] *****************************************
Rank: 0, Starting Epoch: 1
Rank: 1, Starting Epoch: 1
Rank: 1, Avg Loss: 9.3761
Rank: 1, Starting Epoch: 2
Rank: 0, Starting Epoch: 2
Rank: 0, Starting Epoch: 3
Rank: 1, Avg Loss: 9.3210
Rank: 1, Starting Epoch: 3
Rank: 1, Avg Loss: 9.0901
Total Training Throughput: 8293.773
End To End Training Time: 5.787
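
The throughput figures printed by the runs above can be compared across schedules. The following is a minimal sketch under two stated assumptions: the GPipe run with the same layer and head counts serves as the baseline, and scaling efficiency is taken as speedup divided by the number of processes, expressed as a percentage. Because that baseline is itself a 2-GPU run, the numbers compare schedules against each other and do not measure scaling from a single device. `relative_metrics` is a hypothetical helper, and only the 4-layer, 4-head configuration is shown.

```python
N_PROCESSES = 2

# (n_layers, n_attention_heads, schedule) -> throughput, copied from the runs above.
throughput = {
    (4, 4, "GPipe"): 15281.787,
    (4, 4, "1F1B"): 17685.948,
    (4, 4, "Interleaved1F1B"): 17179.105,
}


def relative_metrics(measurements, n_processes=N_PROCESSES, baseline="GPipe"):
    """Compute speedup over the baseline schedule and the derived efficiency."""
    rows = []
    for (layers, heads, schedule), value in measurements.items():
        # Baseline is the same model configuration run with the baseline schedule.
        base = measurements[(layers, heads, baseline)]
        speedup = value / base
        rows.append({
            "n_layers": layers,
            "n_attention_heads": heads,
            "schedule": schedule,
            "training_throughput_speedup": speedup,
            "scaling_efficiency": speedup / n_processes * 100,
        })
    return rows


rows = relative_metrics(throughput)
for row in rows:
    print(row)
```

Under these definitions every baseline row has a speedup of exactly 1.0 and an efficiency of 50%, so only the non-baseline rows carry information.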


In [34]:
# Experimental Evaluation Part 1
# Experimental Evaluation Part 2

import pandas as pd

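# training_throughput_speedup is each schedule's throughput divided by the GPipe
# throughput for the same (n_layers, n_attention_heads) configuration, so every
# GPipe row has a speedup of 1.0.
# scaling_efficiency is speedup / n_processes, expressed as a percentage.
column_names = ['n_layers', 'n_attention_heads', 'n_processes', 'schedule', 'training_throughput', 'total_training_time', 'training_throughput_speedup', 'scaling_efficiency']
df = pd.DataFrame(columns=column_names)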
df.loc[len(df)] = [4, 4, 2, "GPipe", 15281.787, 3.141, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [4, 8, 2, "GPipe", 17518.957, 2.740, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [4, 12, 2, "GPipe", 17651.923, 2.719, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [8, 4, 2, "GPipe", 12171.319, 3.944, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [8, 8, 2, "GPipe", 11738.217, 4.089, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [8, 12, 2, "GPipe", 11286.389, 4.253, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [12, 4, 2, "GPipe", 9089.485, 5.281, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [12, 8, 2, "GPipe", 8906.597, 5.389, 1.0, (1.0/2) * 100]
df.loc[len(df)] = [12, 12, 2, "GPipe", 8216.436, 5.842, 1.0, (1.0/2) * 100]

df.loc[len(df)] = [4, 4, 2, "1F1B", 17685.948, 2.714, 17685.948 / 15281.787, ((17685.948 / 15281.787) / 2) * 100]
df.loc[len(df)] = [4, 8, 2, "1F1B", 17387.224, 2.761, 17387.224 / 17518.957, ((17387.224 / 17518.957) / 2) * 100]
df.loc[len(df)] = [4, 12, 2, "1F1B", 16730.582, 2.869, 16730.582 / 17651.923, ((16730.582 / 17651.923) / 2) * 100]
df.loc[len(df)] = [8, 4, 2, "1F1B", 11778.643, 4.075, 11778.643 / 12171.319, ((11778.643 / 12171.319) / 2) * 100]
df.loc[len(df)] = [8, 8, 2, "1F1B", 11584.665, 4.143, 11584.665 / 11738.217, ((11584.665 / 11738.217) / 2) * 100]
df.loc[len(df)] = [8, 12, 2, "1F1B", 10936.940, 4.389, 10936.940 / 11286.389, ((10936.940 / 11286.389) / 2) * 100]
df.loc[len(df)] = [12, 4, 2, "1F1B", 8666.590, 5.539, 8666.590 / 9089.485, ((8666.590 / 9089.485) / 2) * 100]
df.loc[len(df)] = [12, 8, 2, "1F1B", 8664.951, 5.540, 8664.951 / 8906.597, ((8664.951 / 8906.597) / 2) * 100]
df.loc[len(df)] = [12, 12, 2, "1F1B", 8056.898, 5.958, 8056.898 / 8216.436, ((8056.898 / 8216.436) / 2) * 100]

df.loc[len(df)] = [4, 4, 2, "Interleaved1F1B", 17179.105, 2.794, 17179.105 / 15281.787, ((17179.105 / 15281.787) / 2) * 100]
df.loc[len(df)] = [4, 8, 2, "Interleaved1F1B", 17085.271, 2.809, 17085.271 / 17518.957, ((17085.271 / 17518.957) / 2) * 100]
df.loc[len(df)] = [4, 12, 2, "Interleaved1F1B", 16215.554, 2.960, 16215.554 / 17651.923, ((16215.554 / 17651.923) / 2) * 100]
df.loc[len(df)] = [8, 4, 2, "Interleaved1F1B", 11887.872, 4.038, 11887.872 / 12171.319, ((11887.872 / 12171.319) / 2) * 100]
df.loc[len(df)] = [8, 8, 2, "Interleaved1F1B", 11762.432, 4.081, 11762.432 / 11738.217, ((11762.432 / 11738.217) / 2) * 100]
df.loc[len(df)] = [8, 12, 2, "Interleaved1F1B", 11063.757, 4.338, 11063.757 / 11286.389, ((11063.757 / 11286.389) / 2) * 100]
df.loc[len(df)] = [12, 4, 2, "Interleaved1F1B", 8868.424, 5.412, 8868.424 / 9089.485, ((8868.424 / 9089.485) / 2) * 100]
df.loc[len(df)] = [12, 8, 2, "Interleaved1F1B", 8705.337, 5.514, 8705.337 / 8906.597, ((8705.337 / 8906.597) / 2) * 100]
df.loc[len(df)] = [12, 12, 2, "Interleaved1F1B", 8293.773, 5.787, 8293.773 / 8216.436, ((8293.773 / 8216.436) / 2) * 100]
df

index,n_layers,n_attention_heads,n_processes,schedule,training_throughput,total_training_time,training_throughput_speedup,scaling_efficiency
0,4,4,2,GPipe,15281.787,3.141,1.0,50.0
1,4,8,2,GPipe,17518.957,2.74,1.0,50.0
2,4,12,2,GPipe,17651.923,2.719,1.0,50.0
3,8,4,2,GPipe,12171.319,3.944,1.0,50.0
4,8,8,2,GPipe,11738.217,4.089,1.0,50.0
5,8,12,2,GPipe,11286.389,4.253,1.0,50.0
6,12,4,2,GPipe,9089.485,5.281,1.0,50.0
7,12,8,2,GPipe,8906.597,5.389,1.0,50.0
8,12,12,2,GPipe,8216.436,5.842,1.0,50.0
9,4,4,2,1F1B,17685.948,2.714,1.157322,57.866099
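
The speedup and scaling-efficiency columns in the cell above are typed in as hand-written ratios. As a minimal sketch, assuming the same definitions that cell uses (speedup = throughput divided by the GPipe throughput for the same layer/head configuration, scaling efficiency = speedup / n_processes * 100), the two columns could instead be derived from the raw throughputs. The helper name `add_relative_metrics` is hypothetical and not part of the notebook. Only three rows are shown, with throughputs copied from the table.

```python
# Sketch only: derive speedup and scaling efficiency from raw throughputs.
# add_relative_metrics is a hypothetical helper, not part of the notebook.

def add_relative_metrics(rows, baseline_schedule="GPipe"):
    """Return copies of `rows` with the two derived columns added.

    Each row is a dict with n_layers, n_attention_heads, n_processes,
    schedule and training_throughput.
    """
    # Baseline throughput for each (n_layers, n_attention_heads) configuration.
    baseline = {
        (r["n_layers"], r["n_attention_heads"]): r["training_throughput"]
        for r in rows
        if r["schedule"] == baseline_schedule
    }
    result = []
    for r in rows:
        key = (r["n_layers"], r["n_attention_heads"])
        speedup = r["training_throughput"] / baseline[key]
        result.append({
            **r,
            "training_throughput_speedup": speedup,
            "scaling_efficiency": speedup / r["n_processes"] * 100,
        })
    return result


# Throughputs for the 4-layer, 4-head configuration, copied from the table.
rows = [
    {"n_layers": 4, "n_attention_heads": 4, "n_processes": 2,
     "schedule": "GPipe", "training_throughput": 15281.787},
    {"n_layers": 4, "n_attention_heads": 4, "n_processes": 2,
     "schedule": "1F1B", "training_throughput": 17685.948},
    {"n_layers": 4, "n_attention_heads": 4, "n_processes": 2,
     "schedule": "Interleaved1F1B", "training_throughput": 17179.105},
]

for r in add_relative_metrics(rows):
    print(r["schedule"],
          round(r["training_throughput_speedup"], 6),
          round(r["scaling_efficiency"], 6))
```

Deriving the ratios this way means each throughput is entered once, so a mistyped numerator or denominator cannot slip into a single row.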


In [35]:
# Experimental Evaluation Part 2

# Speedup relative to GPipe for the same configuration (printed as a table)
selected_columns = df[['n_layers', 'n_attention_heads', 'n_processes', 'schedule', 'training_throughput_speedup']]
print(selected_columns)
print()

# Scaling efficiency, (speedup / n_processes) * 100 (printed as a table)
selected_columns = df[['n_layers', 'n_attention_heads', 'n_processes', 'schedule', 'scaling_efficiency']]
print(selected_columns)
print()



    n_layers  n_attention_heads  n_processes         schedule  \
0          4                  4            2            GPipe   
1          4                  8            2            GPipe   
2          4                 12            2            GPipe   
3          8                  4            2            GPipe   
4          8                  8            2            GPipe   
5          8                 12            2            GPipe   
6         12                  4            2            GPipe   
7         12                  8            2            GPipe   
8         12                 12            2            GPipe   
9          4                  4            2             1F1B   
10         4                  8            2             1F1B   
11         4                 12            2             1F1B   
12         8                  4            2             1F1B   
13         8                  8            2             1F1B   
14         8             
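
The long-format printouts above repeat the configuration columns on every row, which makes the schedules hard to compare side by side. A minimal sketch, assuming the goal is one row per (n_layers, n_attention_heads) configuration and one column per schedule, is shown here. `pivot_by_schedule` is a hypothetical helper, and only the two 4-layer configurations with 4 and 8 heads are included to keep it short. The throughputs are copied from the results table.

```python
# Sketch only: reshape long-format results so each schedule becomes a column.
# pivot_by_schedule is a hypothetical helper, not part of the notebook.

def pivot_by_schedule(records):
    """Group (n_layers, n_heads, schedule, value) records by configuration."""
    table = {}
    for n_layers, n_heads, schedule, value in records:
        table.setdefault((n_layers, n_heads), {})[schedule] = value
    return table


# (n_layers, n_attention_heads, schedule, training_throughput)
records = [
    (4, 4, "GPipe", 15281.787),
    (4, 4, "1F1B", 17685.948),
    (4, 4, "Interleaved1F1B", 17179.105),
    (4, 8, "GPipe", 17518.957),
    (4, 8, "1F1B", 17387.224),
    (4, 8, "Interleaved1F1B", 17085.271),
]

schedules = ["GPipe", "1F1B", "Interleaved1F1B"]
wide = pivot_by_schedule(records)

print(f"{'layers':>6} {'heads':>5} " + " ".join(f"{s:>16}" for s in schedules))
for (n_layers, n_heads), by_schedule in sorted(wide.items()):
    # Speedup of each schedule relative to GPipe for this configuration.
    cells = " ".join(
        f"{by_schedule[s] / by_schedule['GPipe']:>16.3f}" for s in schedules
    )
    print(f"{n_layers:>6} {n_heads:>5} {cells}")
```

With the notebook's DataFrame, `df.pivot(index=['n_layers', 'n_attention_heads'], columns='schedule', values='training_throughput_speedup')` should give a similar layout.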

In [37]:
# Experimental Evaluation Part 3

"""
As the results above show, GPipe outperformed the 1F1B and Interleaved 1F1B schedules in most configurations (every run used 2 processes). 1F1B beat GPipe in only one configuration: 4 layers with 4 attention heads. Interleaved 1F1B beat GPipe in three: 4 layers with 4 attention heads, 8 layers with 8 attention heads, and 12 layers with 12 attention heads, and in the last two the margin was under 1%.

The lack of a clear improvement from 1F1B and Interleaved 1F1B over GPipe was probably due to how small the provided data was. Since the input dataset (1 batch) was small, there was probably not much work to parallelize, so the extra communication overhead of the 1F1B and Interleaved 1F1B schedules may have outweighed their benefits.

How the two schedules compared to GPipe also depended on model size. At 12 layers, with the number of layers held constant, adding attention heads steadily improved throughput relative to GPipe (1F1B: 0.953, 0.973, 0.981; Interleaved 1F1B: 0.976, 0.977, 1.009 for 4, 8, and 12 heads). At 8 layers the relative throughput oscillated with no consistent trend (1F1B: 0.968, 0.987, 0.969; Interleaved 1F1B: 0.977, 1.002, 0.980). At 4 layers it fell as heads were added (1F1B: 1.157, 0.992, 0.948; Interleaved 1F1B: 1.124, 0.975, 0.919), starting from the 4-head configuration, where both schedules had their largest lead over GPipe.
"""
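
The win counts quoted in the analysis above can be checked directly against the measured throughputs. The following is a minimal sketch, assuming "performed better" means strictly higher training throughput than GPipe for the same layer/head configuration. `configs_beating_baseline` is a hypothetical helper. The numbers are copied from the results table.

```python
# Sketch only: list the configurations where a schedule out-performs GPipe.
# configs_beating_baseline is a hypothetical helper, not part of the notebook.

# {schedule: {(n_layers, n_attention_heads): training_throughput}}
throughput = {
    "GPipe": {
        (4, 4): 15281.787, (4, 8): 17518.957, (4, 12): 17651.923,
        (8, 4): 12171.319, (8, 8): 11738.217, (8, 12): 11286.389,
        (12, 4): 9089.485, (12, 8): 8906.597, (12, 12): 8216.436,
    },
    "1F1B": {
        (4, 4): 17685.948, (4, 8): 17387.224, (4, 12): 16730.582,
        (8, 4): 11778.643, (8, 8): 11584.665, (8, 12): 10936.940,
        (12, 4): 8666.590, (12, 8): 8664.951, (12, 12): 8056.898,
    },
    "Interleaved1F1B": {
        (4, 4): 17179.105, (4, 8): 17085.271, (4, 12): 16215.554,
        (8, 4): 11887.872, (8, 8): 11762.432, (8, 12): 11063.757,
        (12, 4): 8868.424, (12, 8): 8705.337, (12, 12): 8293.773,
    },
}


def configs_beating_baseline(throughput, schedule, baseline="GPipe"):
    """Return the (n_layers, n_heads) configs where `schedule` is faster."""
    return sorted(
        cfg
        for cfg, value in throughput[schedule].items()
        if value > throughput[baseline][cfg]
    )


for schedule in ("1F1B", "Interleaved1F1B"):
    wins = configs_beating_baseline(throughput, schedule)
    total = len(throughput[schedule])
    print(f"{schedule}: beats GPipe in {len(wins)} of {total} configs: {wins}")
```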
