# Model Parallelism
---
This notebook explains the concept of model parallelism, its merits and demerits, and then demonstrates its implementation.

Model parallelism is widely used in distributed training. In the previous notebook, we explained how to use data parallelism to train a neural network on multiple GPUs: the same model is replicated on every GPU, and each GPU consumes a different partition of the input data. Although this can significantly accelerate training, it does not work when the model is too large to fit into a single GPU. This notebook shows how to solve that problem with model parallelism, which, in contrast to `DataParallel`, splits a single model across different GPUs rather than replicating the entire model on each GPU. To be concrete, say a model `m` contains 10 layers: with `DataParallel`, each GPU holds a replica of all 10 layers, whereas with model parallelism on two GPUs, each GPU could host five layers.

The high-level idea of model parallelism is to place different parts of a model on different devices and to implement the forward pass so that intermediate outputs move across devices. Because only part of the model operates on any individual device, a set of devices can collectively serve a larger model.
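
As a toy sketch of the 10-layer example above, the snippet below contrasts the two placements. The layers are plain strings and the GPU names are just labels, so the helper functions `data_parallel_placement` and `model_parallel_placement` are hypothetical illustrations rather than any library API.

```python
# Toy illustration only: "layers" are strings and "GPUs" are labels.
layers = [f"layer{i}" for i in range(1, 11)]  # the 10-layer model m
gpus = ["gpu0", "gpu1"]


def data_parallel_placement(layers, gpus):
    """Every GPU holds a full replica of all layers."""
    return {gpu: list(layers) for gpu in gpus}


def model_parallel_placement(layers, gpus):
    """Each GPU holds one contiguous slice of the layers."""
    per_gpu = -(-len(layers) // len(gpus))  # ceiling division
    return {
        gpu: layers[i * per_gpu:(i + 1) * per_gpu]
        for i, gpu in enumerate(gpus)
    }


dp = data_parallel_placement(layers, gpus)
mp = model_parallel_placement(layers, gpus)

for gpu in gpus:
    print(f"{gpu}: data parallel holds {len(dp[gpu])} layers, "
          f"model parallel holds {len(mp[gpu])} layers")
```

With two GPUs, each data parallel replica holds all 10 layers, while each model parallel partition holds five.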

First, we focus on system efficiency by discussing vanilla model parallelism, that is, an implementation without any customization. Vanilla model parallelism uses GPU compute very inefficiently. To illustrate this, let's look at a simple DNN model with three layers (Layer 1, Layer 2, and Layer 3) and assume we use three GPUs. As shown below, each GPU holds only one layer of the original model.

<center><img src="images/model-partition.png" width="250px" height="250px" alt-text="workflow"/></center>
<center>Model partition on three GPUs</center></br>

The figure above shows that GPU1 holds Layer 1 of the model. Similarly, GPU2 holds Layer 2 and GPU3 holds Layer 3. We now discuss forward propagation and backward propagation step by step.

##### Forward Propagation

Given a batch of input training data, we will first conduct forward propagation on the model. Since we partition the model onto GPUs 1-3, the forward propagation happens in the following order:
- GPU1 will first conduct forward propagation of Layer 1 on input data.
- Then, GPU2 conducts forward propagation of Layer 2 on the output of Layer 1.
- Finally, GPU3 conducts forward propagation of Layer 3 on the output of Layer 2.

For simplicity, let's denote the forward propagation on GPU1, GPU2, and GPU3 as `F1`, `F2`, and `F3`, respectively.

##### Backward Propagation

After Layer 3's forward propagation completes, backward propagation starts as follows:
- First, Layer 3 on GPU3 conducts backward propagation to generate its local gradients and passes its gradient output to Layer 2 on GPU2.
- GPU2 uses this gradient output (together with the activations saved during the forward propagation) to generate Layer 2's local gradients and passes its gradient output to GPU1.
- Finally, Layer 1 on GPU1 conducts backward propagation and generates its local gradients.

We denote the backward propagation on GPU1, GPU2, and GPU3 as `B1`, `B2`, and `B3`, respectively. The generated gradients are used to update the model parameters.
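
The forward and backward steps above can be traced with a minimal pure-Python sketch. It assumes each "layer" is a single scalar weight (`y = w * x`) and a squared-error loss; the `Stage` class and all numeric values are hypothetical and stand in for the layer held by each GPU.

```python
class Stage:
    """One scalar 'layer' y = w * x, standing in for the layer on one GPU."""

    def __init__(self, name, w):
        self.name = name
        self.w = w
        self.saved_input = None  # activation kept for the backward pass
        self.grad_w = None       # local gradient of the loss w.r.t. w

    def forward(self, x):
        self.saved_input = x
        return self.w * x

    def backward(self, grad_out):
        # Local gradient uses the activation saved during forward.
        self.grad_w = grad_out * self.saved_input
        # Gradient output handed to the previous stage.
        return grad_out * self.w


stages = [Stage("GPU1", 2.0), Stage("GPU2", 3.0), Stage("GPU3", 0.5)]
order = []  # records the execution order of the time slots

x, target = 1.0, 2.0

# Forward propagation: F1 -> F2 -> F3
out = x
for i, stage in enumerate(stages, start=1):
    out = stage.forward(out)
    order.append(f"F{i}")

# Loss = 0.5 * (out - target)^2, so d(loss)/d(out) = out - target
grad = out - target

# Backward propagation: B3 -> B2 -> B1
for i in range(len(stages), 0, -1):
    grad = stages[i - 1].backward(grad)
    order.append(f"B{i}")

# Each stage updates its own parameter with its local gradient (plain SGD).
lr = 0.1
for stage in stages:
    stage.w -= lr * stage.grad_w

print(" -> ".join(order))
```

Note that each stage can only run once its neighbor has produced the value it depends on, which is the source of the idle time analyzed next.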

##### GPU Idle Time Analysis
During forward propagation, the execution order is `F1->F2->F3`; during backward propagation, the execution order is `B3->B2->B1`. We can work out each GPU's working and idle time slots as follows:
- GPU1 works in the F1 and B1 time slots but is idle in the F2, F3, B3, and B2 time slots.
- GPU2 works in the F2 and B2 time slots but is idle in the F1, F3, B3, and B1 time slots.
- GPU3 works in the F3 and B3 time slots but is idle in the F1, F2, B2, and B1 time slots.
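
The list above can be reproduced with a short calculation. The sketch assumes, for simplicity, that every forward and backward step occupies exactly one time slot of equal length; real layers rarely take equal time.

```python
# Execution order of vanilla model parallelism, one time slot per step
# (equal slot lengths are an assumption made for illustration).
schedule = ["F1", "F2", "F3", "B3", "B2", "B1"]
num_gpus = 3

busy = {}
idle = {}
utilization = {}
for gpu in range(1, num_gpus + 1):
    # A slot named F<g> or B<g> keeps GPU g busy; all other slots leave it idle.
    busy[gpu] = [slot for slot in schedule if slot.endswith(str(gpu))]
    idle[gpu] = [slot for slot in schedule if not slot.endswith(str(gpu))]
    utilization[gpu] = len(busy[gpu]) / len(schedule)

for gpu in range(1, num_gpus + 1):
    print(f"GPU{gpu}: busy in {busy[gpu]}, idle in {idle[gpu]}, "
          f"utilization {utilization[gpu]:.0%}")
```

Under this assumption, every GPU is busy in only two of the six time slots.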

  <center><img src="images/gpu-idle-time.png" width="350px" height="230px" alt-text="workflow"/></center>
<center>GPU idle time during model parallelism </center></br>

The main reason for this inefficiency is that GPUs holding different model partitions need to wait for each other. After executing F1, GPU1 must wait for B2 to complete before it can run B1. GPU2's F2 must wait for GPU1's F1 to complete, and GPU2's B2 must wait for GPU3's B3 to complete. GPU3's F3 must wait for GPU2's F2 to complete. This sequential layer dependency is the main cause of system inefficiency. A widely adopted approach to improving system efficiency in model parallel training is `pipeline parallelism`, which is introduced in the next section.

Pipeline parallelism comes at a cost: it introduces more frequent GPU communication. For example, if a batch is split into three inputs, GPU1 needs to send its outputs to GPU2 three times (once for each input) during forward propagation. More frequent, smaller data transfers incur a high communication overhead, mainly because small data chunks may not fully saturate the link bandwidth.
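
To see why many small transfers can cost more than one large transfer, consider a simple cost model in which every transfer pays a fixed latency plus a size-dependent term. This model and all numbers below (latency, bandwidth, activation size) are hypothetical and chosen only for illustration; they are not measurements of any real interconnect.

```python
# Simplified cost model: time = fixed latency + size / bandwidth.
# All values are hypothetical, chosen only to illustrate the effect.
LATENCY_US = 10.0                # fixed per-transfer overhead, microseconds
BANDWIDTH_BYTES_PER_US = 10_000  # link bandwidth, bytes per microsecond


def transfer_time_us(num_bytes):
    """Time to send one message of num_bytes under the simplified model."""
    return LATENCY_US + num_bytes / BANDWIDTH_BYTES_PER_US


activation_bytes = 3_000_000  # output of GPU1 for the whole batch
num_chunks = 3                # one transfer per input in the batch

single = transfer_time_us(activation_bytes)
chunked = num_chunks * transfer_time_us(activation_bytes // num_chunks)

print(f"one large transfer:    {single:.0f} us")
print(f"three small transfers: {chunked:.0f} us")
```

The payload is identical in both cases, but the fixed per-transfer cost is paid three times in the chunked case, so the effective bandwidth drops.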



### Pipeline Input
Pipeline parallelism breaks each batch of training input into smaller micro-batches and pipelines the computation over these micro-batches. Suppose each training batch contains three input items (input 1, input 2, and input 3), and we feed this batch into the model. Let F11, F12, and F13 denote GPU1's forward propagation on inputs 1, 2, and 3. Similarly, F21, F22, and F23 denote GPU2's forward propagation on inputs 1, 2, and 3, and F31, F32, and F33 denote GPU3's. The data pipelining in the forward propagation proceeds as follows:
- GPU1 first calculates F11 based on input 1 and then passes the layer output of F11 to GPU2.
- GPU2 starts working on F21 while GPU1 works on F12.
- After GPU2 finishes F21, it passes the layer output of F21 to GPU3, while GPU1 passes its output of F12 to GPU2.
- After GPU3 receives GPU2's F21 output, GPU3 starts working on F31, which can happen simultaneously with GPU2's processing of F22.

In the backward propagation, GPU3 first calculates the gradients based on inputs 1, 2, and 3 sequentially. Then, GPU2 starts calculating its local gradients based on inputs 1, 2, and 3. Finally, GPU1 starts calculating the gradients based on inputs 1, 2, and 3. This process reduces the GPU idle time.
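
The forward pipelining described above can be sketched as a small schedule generator. It covers only the forward pass and assumes every step takes one equal-length time slot and that transfers are free; the function name `forward_pipeline_schedule` is hypothetical.

```python
def forward_pipeline_schedule(num_gpus, num_micro_batches):
    """Map each time slot to the forward steps that run in it.

    'Fgm' means GPU g's forward propagation on input m. GPU g can start
    input m only after GPU g-1 has finished input m and GPU g has finished
    input m-1, which gives slot (g - 1) + (m - 1) under equal step times.
    """
    schedule = {}
    for g in range(1, num_gpus + 1):
        for m in range(1, num_micro_batches + 1):
            slot = (g - 1) + (m - 1)
            schedule.setdefault(slot, []).append(f"F{g}{m}")
    return schedule


sched = forward_pipeline_schedule(num_gpus=3, num_micro_batches=3)
for slot in sorted(sched):
    print(f"slot {slot}: {sorted(sched[slot])}")

pipelined_slots = len(sched)
sequential_slots = 3 * 3  # every step run one after another
print(f"{pipelined_slots} slots pipelined vs {sequential_slots} sequential")
```

In this sketch, F21 overlaps with F12, and F31 overlaps with F22 and F13, matching the steps listed above; GPU3 is still idle during the first two slots.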


**Advantages of Pipeline Parallelism**
- Reduces overall training time
- Reduces GPU idle time while waiting for the predecessor or successor's GPU output

**Disadvantages of Pipeline Parallelism**
- The CPU needs to send more instructions to GPUs. For example, if we break one input batch into N micro-batches for pipeline parallelism, the CPU needs to send N-1 times more instructions to each GPU.
- Although pipeline parallelism reduces GPU idle time, GPU idle time remains. In our illustration, GPU3 still needs to wait for F11 and F21 to finish.

Another method for improving system efficiency in model parallel training is called `intra-layer parallelism`.

#### Layer Split

Intra-layer model parallelism improves model parallel training efficiency by splitting the work inside a single layer across GPUs. Each layer's neurons can be stored as a matrix, and a common operation during NLP model training is matrix multiplication. We can therefore split a layer's matrix so that its parts execute in parallel. Let's illustrate this with Layer 1 of a model, which takes the training data as input and performs forward propagation to produce an output for the next layer.

 <center><img src="images/inter-layer-matrix.png" width="400px" height="400px" alt-text="workflow"/></center>


In the figure above, each column represents a neuron, and each entry within a column is one of that neuron's weights. Let's assume an input batch of size 4, i.e., an input data matrix of shape 4 x 4, where each row is a single input data item. For NLP, you can regard each row as an embedded sentence. The forward propagation can then be regarded as a matrix multiplication between the input batch and Layer 1's weights, which is defined as: `y = X * A`

Here, `y` is Layer 1's output to the next layer, `X` is the input batch, and `A` is Layer 1's weight matrix. The intra-layer split works as follows:

- We split the Layer 1 matrix along its columns, for example into two halves. `A` can then be written as `[A_01, A_23]`, as shown in the figure below. Each half holds only two neurons of the original Layer 1.
- With Layer 1 split column-wise into two halves, we pass input `X` to both halves and calculate `y` as: `[y_01, y_23] = [X * A_01, X * A_23]`
- Then, Layer 1 passes `[y_01, y_23]` as the output to Layer 2.
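
The column-wise split can be checked numerically with a small pure-Python sketch. The matrix values are arbitrary integers chosen for illustration, and `matmul` is a helper written here; no GPU is involved.

```python
def matmul(X, A):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)]
            for row in X]


# Input batch X (4 items x 4 features) and Layer 1 weights A (4 x 4 neurons).
X = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 2, 0, 1],
     [1, 3, 1, 0]]
A = [[1, 2, 0, 1],
     [0, 1, 3, 2],
     [2, 0, 1, 1],
     [1, 1, 2, 0]]

# Split A along its columns: neurons 0-1 go to one half, neurons 2-3 to the other.
A_01 = [row[:2] for row in A]
A_23 = [row[2:] for row in A]

# Each half computes its part of the output independently from the same X.
y_01 = matmul(X, A_01)
y_23 = matmul(X, A_23)

# Concatenating the two partial outputs column-wise recovers y = X * A.
y_split = [left + right for left, right in zip(y_01, y_23)]
y_full = matmul(X, A)

print(y_split == y_full)
```

Because each half needs the full input `X` but only its own columns of `A`, the two products can run on different GPUs without exchanging data until the outputs are combined.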

<center><img src="images/matrix-split.png" width="250px" height="250px" alt-text="workflow"/></center>

Intra-layer model parallelism is a good way to split giant NLP models because it allows partitioning within a layer without introducing significant communication overhead during forward and backward propagation. For one split, it may introduce only one All-Reduce operation in either forward or backward propagation, which is acceptable. Intra-layer model parallelism applies primarily to NLP models; for convolutional neural network (CNN) or reinforcement learning (RL) models, there may be cases where it does not work.
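
The text does not spell out where the All-Reduce arises. One situation where a sum across GPUs is needed, shown here as an assumption rather than as the scheme used by this notebook, is a row-wise split of the weight matrix: each GPU then produces a partial result of full shape, and the partial results must be added element-wise. The helper `all_reduce_sum` below only simulates that collective in plain Python.

```python
def matmul(X, A):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)]
            for row in X]


def all_reduce_sum(partials):
    """Simulated All-Reduce: every participant receives the element-wise sum."""
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [[row[:] for row in total] for _ in partials]


X = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 2, 0, 1],
     [1, 3, 1, 0]]
A = [[1, 2, 0, 1],
     [0, 1, 3, 2],
     [2, 0, 1, 1],
     [1, 1, 2, 0]]

# Row-wise split of A; X is split along its columns to match.
A_top, A_bottom = A[:2], A[2:]
X_left = [row[:2] for row in X]
X_right = [row[2:] for row in X]

# Each simulated GPU computes a full-shape partial product.
partial_gpu0 = matmul(X_left, A_top)
partial_gpu1 = matmul(X_right, A_bottom)

# One All-Reduce (sum) gives every GPU the complete result X * A.
results = all_reduce_sum([partial_gpu0, partial_gpu1])

print(results[0] == matmul(X, A))
```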

### Model Parallelism Implementation


#### Model parallel training overview

For ease of understanding, we will use a simple DNN model. The code snippet is given below: 

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers are placed on GPU0.
        self.seq1 = nn.Sequential(
            nn.Conv2d(1, 32, 3, 1),
            nn.Dropout2d(0.5),
            nn.Conv2d(32, 64, 3, 1),
            nn.Dropout2d(0.75)).to('cuda:0')
        # Linear layers are placed on GPU2.
        self.seq2 = nn.Sequential(
            nn.Linear(9216, 128),
            nn.Linear(128, 20),
            nn.Linear(20, 10)).to('cuda:2')

    def forward(self, x):
        x = self.seq1(x.to('cuda:0'))
        # Move the activations to GPU1 first so that pooling and
        # flattening actually execute on GPU1.
        x = F.max_pool2d(x.to('cuda:1'), 2)
        x = torch.flatten(x, 1)
        x = self.seq2(x.to('cuda:2'))
        output = F.log_softmax(x, dim=1)
        return output
```
In the code snippet, `seq1` holds the convolutional layers placed on `GPU0`, and `seq2` holds the linear layers placed on `GPU2`. In the `forward()` function, the `F.max_pool2d()` and `torch.flatten()` operations are placed on `GPU1`. The input data is first moved to GPU0 with `x.to('cuda:0')`. PyTorch automatically derives the corresponding order of backward propagation, so we do not need to specify the inverse order of layer dependency for the backward pass.

<center><img src="images/model-parallelism-training-DNN.png" width="250px" height="250px" alt-text="workflow"/></center>
<center> Model parallelism training for DNN model</center>

Let's take a look at the communication among GPUs. It is important to specify which GPU takes the input data and which GPU generates the final prediction results.
```python
def train(args):
    ...

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    ...
    for epoch in range(args.epochs):
        print(f"Epoch {epoch}")
        for idx, (data, target) in enumerate(trainloader):
            # The first model partition lives on GPU0, so the input goes there.
            data = data.to('cuda:0')
            optimizer.zero_grad()
            output = model(data)
            # The output lives on the last GPU; move the target to the same device.
            target = target.to(output.device)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            ...
    return model

```
In the training function, we define the loss function as `nn.CrossEntropyLoss()` and the training optimizer as `optimizer = torch.optim.SGD(model.parameters(), lr = 1e-3)`. Within the training loop, we move the data to GPU0 with `data = data.to('cuda:0')` and move the target to the GPU that generates the prediction with `target = target.to(output.device)`. Here, `output.device` reports GPU2 as the output device, so the target is passed to GPU2.


Let's run our [model parallelism code](../source_code/mp/main.py) on three GPUs by executing the command in the cell below:

`!cd ../source_code && srun -p gpu -N 1 --gres=gpu:3 torchrun --nnodes=1 --nproc-per-node=3 mp/main.py`

 
**Likely output on DGX A100:**
```text

W0213 12:26:44.999000 828535 site-packages/torch/distributed/run.py:792]
W0213 12:26:44.999000 828535 site-packages/torch/distributed/run.py:792] *****************************************
W0213 12:26:44.999000 828535 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 12:26:44.999000 828535 site-packages/torch/distributed/run.py:792] *****************************************
Epoch 0
Epoch 0
batch 0 training :: loss 2.3271937370300293
batch 1 training :: loss 2.3261027336120605
...
batch 468 training :: loss 0.3286305367946625
Training Done!
batch 0 training :: loss 0.45364904403686523
batch 1 training :: loss 0.5076391100883484
batch 2 training :: loss 0.4924725592136383
...
Test Accuracy 0.8819166666666667
Test Accuracy 0.8837333333333334
Test Accuracy 0.8858333333333334
Test Accuracy 0.8872
Test Done!
```

Having gone through this notebook, you have seen the strengths and weaknesses of model parallelism and approaches to improve it. Let's proceed to the next notebook to learn more about the concept that improves on model parallelism, called `Pipeline parallelism`. Please click the [Next Link](pipeline-parallelism.ipynb).

---
## Licensing

Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.