# Pipeline Parallelism
---

The goal of this notebook is to introduce the ideas behind pipeline parallelism, including its synchronous and asynchronous variants, and how it is implemented.

When a large deep neural network does not fit in the local memory of a single device, model parallelism is employed: parts of the model are placed on different devices, so that each device holds the parameters of the layers assigned to it. `Pipeline parallelism (PP)` was introduced by the authors of [PipeDream](https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf) and [GPipe](https://fid3024.github.io/papers/2019%20-%20GPipe:%20Efficient%20Training%20of%20Giant%20Neural%20Networks%20using%20Pipeline%20Parallelism.pdf) as an effective scaling technique that remedies the drawbacks of naive model parallelism. According to [DeepSpeed.ai](https://www.deepspeed.ai/tutorials/pipeline/), pipeline parallelism improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel. It allows the execution of a model to be partitioned such that multiple micro-batches can execute different parts of the model code concurrently. Pipeline parallelism can be an effective technique for large-scale training, bandwidth-limited clusters, and large-model inference. Let's dive into the two pipeline parallelism approaches: the `Synchronous Pipeline` and the `Asynchronous Pipeline`.


### Synchronous Pipeline

[GPipe](https://arxiv.org/abs/1811.06965) introduced a synchronous pipeline that allows scaling arbitrary deep neural network architectures beyond the memory limitations of a single device by partitioning the model across different devices and supporting re-materialization on every device (an optimization that saves memory by recomputing forward activations during the backward pass instead of keeping them stored). Each model is specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is placed on a separate device. Based on this, a pipeline parallelism algorithm with batch splitting is applied. It starts by splitting a mini-batch of training examples into smaller micro-batches, then pipelines the execution of each set of micro-batches over the cells. The authors applied synchronous mini-batch gradient descent, i.e., gradients are accumulated across all micro-batches in a mini-batch and applied once at the end of the mini-batch.

<center><img src="images/Gpipe-arc.png" width="550px" height="550px" alt-text="workflow"/></center>
<center><p><b>(a)</b> An example neural network with sequential layers is partitioned across four accelerators.
Fk is the composite forward propagation of the k-th cell. Bk is the back-propagation that depends on Bk+1 from the upper layer and Fk. <b>(b)</b> The naive model parallelism strategy leads to severe under-utilization due to the sequential dependency of the network. <b>(c)</b> Pipeline parallelism divides the input mini-batch into smaller micro-batches, enabling different accelerators to work on different micro-batches simultaneously. Gradients are applied synchronously at the end. <i> source: <a href="https://arxiv.org/abs/1811.06965"> arXiv:1811.06965 [cs.CV]</a> </i> </p>  
</center></br>

You can also refer to the input pipeline section in the [previous notebook](model-parallelism.ipynb) for a similar illustration.
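The synchronous update described above can be illustrated with a small sketch. It assumes a toy one-parameter model `y_hat = w * x` with a summed squared-error loss; the function names and data are made up for illustration and are not part of GPipe. It shows that accumulating gradients over micro-batches gives the same gradient as processing the whole mini-batch at once.

```python
# Toy model: y_hat = w * x, loss = sum of squared errors over the examples.
def grad_sum_squared_error(w, xs, ys):
    """d/dw of sum_i (w * x_i - y_i)^2 over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def split_into_microbatches(items, n_microbatches):
    """Split a list into n contiguous chunks of equal size."""
    size = len(items) // n_microbatches
    return [items[i * size:(i + 1) * size] for i in range(n_microbatches)]


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Gradient computed on the whole mini-batch at once.
full_grad = grad_sum_squared_error(w, xs, ys)

# Gradient accumulated micro-batch by micro-batch, as in a synchronous pipeline.
accumulated_grad = 0.0
for mb_x, mb_y in zip(split_into_microbatches(xs, 4), split_into_microbatches(ys, 4)):
    accumulated_grad += grad_sum_squared_error(w, mb_x, mb_y)

# A single weight update per mini-batch, after all micro-batches have contributed.
learning_rate = 0.001
w_new = w - learning_rate * accumulated_grad
```

With a mean-reduced loss instead of a summed one, each micro-batch gradient would need to be scaled by its share of the mini-batch to keep the equivalence.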

Three key attributes of the GPipe method:
- Efficiency: GPipe achieves almost linear speedup with the number of devices using a novel batch-splitting pipelining algorithm.
- Flexibility: It supports any deep network represented as a sequence of layers. 
- Reliability: It utilizes synchronous gradient descent and guarantees consistent training regardless of the number of partitions. Due to its reliability, GPipe is widely used in the industry and supports GPU and TPU clusters.
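To get a feel for the efficiency claim, the sketch below builds an idealized fill-drain timeline and measures the fraction of idle slots (the "bubble"). It assumes every stage takes one time unit per micro-batch and shows only the forward phase; the function names are illustrative. Under these assumptions the idle fraction equals (K - 1) / (M + K - 1) for K stages and M micro-batches, which is the bubble overhead reported in the GPipe paper.

```python
def forward_timeline(num_stages, num_microbatches):
    """One row per stage: the micro-batch index handled at each step, or None when idle."""
    total_steps = num_microbatches + num_stages - 1
    timeline = []
    for stage in range(num_stages):
        row = []
        for t in range(total_steps):
            mb = t - stage  # micro-batch `mb` reaches this stage at step mb + stage
            row.append(mb if 0 <= mb < num_microbatches else None)
        timeline.append(row)
    return timeline


def bubble_fraction(num_stages, num_microbatches):
    """Fraction of (stage, time step) slots in which a stage sits idle."""
    slots = [slot for row in forward_timeline(num_stages, num_microbatches) for slot in row]
    return slots.count(None) / len(slots)


few = bubble_fraction(num_stages=4, num_microbatches=4)    # 3 / 7
many = bubble_fraction(num_stages=4, num_microbatches=16)  # 3 / 19
```

Increasing the number of micro-batches shrinks the bubble, which is why the speedup approaches linear only when the mini-batch is split finely enough.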


### Asynchronous Pipeline

[PipeDream](https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf) proposed a form of `pipeline parallelism (PP)` that combines intra-batch parallelism (*a single iteration of training that is split across available workers*) with inter-batch parallelism (*overlapping the forward and backward passes of different minibatches*). The approach divides the model among the available workers, assigning each worker a group of consecutive layers, and then overlaps the computation and communication of different inputs in a pipelined fashion. It injects multiple minibatches into the pipeline one after another. On completing the forward pass for a minibatch, each stage asynchronously sends its output to the next stage while starting to process another minibatch. The last stage begins the backward pass on a minibatch immediately after its forward pass ends. On completing its backward pass, each stage asynchronously sends the gradient to the previous stage while starting computation for the next minibatch. The figure below illustrates the process.

<center><img src="images/pipedream.png" width="550px" height="550px" alt-text="workflow"/></center>
<center>An example PipeDream pipeline with four workers, showing the startup and steady states. In this example, the backward pass takes twice as long as the forward pass. <br/><i> <a href="https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf"> View source </a></i></center>

This form of pipeline parallelism can outperform intra-batch parallelism methods because pipelining communicates less data and overlaps computation with communication. The communication is peer-to-peer, as opposed to all-to-all.

**Work scheduling and Effective Learning issues**

PipeDream involves a bi-directional pipeline (forward and backward passes); each worker in the system needs to determine whether to perform its stage’s:
-   forward pass for a minibatch, pushing the output activations to downstream workers, or
-   backward pass for a different minibatch, pushing the gradient to upstream workers

To resolve the work scheduling issue, a method was introduced to estimate the optimal number of minibatches admitted per input stage (NOAM). Furthermore, it applies a `one-forward-one-backward (1F1B)` schedule mechanism and a `one-forward-one-backward-round-robin (1F1B-RR)` static policy. Please find more [details here](https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf).

The effective learning issue arises because, in a naive pipeline, each stage’s forward pass for a minibatch is performed using one version of the parameters, while its backward pass is performed using a different version. To avoid this mismatch between weight versions, PipeDream introduced a `weight-stashing` technique. The technique maintains multiple versions of the weights, one for each active minibatch. At the end of the forward pass, the weights used for that minibatch are stored, and the same version is used to compute the weight update and upstream weight gradient in the minibatch’s backward pass.
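The 1F1B ordering mentioned above can be sketched for a single stage as follows. This is a simplified illustration that assumes a straight, non-replicated pipeline and models only the order of operations, not their timing or the 1F1B-RR policy; the function names are hypothetical. Each stage first runs a few warm-up forward passes and then alternates one backward pass with one forward pass.

```python
def one_f_one_b_order(stage, num_stages, num_minibatches):
    """Order of forward ("F") and backward ("B") passes on one stage (0-indexed)."""
    # Earlier stages admit more minibatches before their first backward pass.
    warmup = min(num_stages - stage, num_minibatches)
    order = [("F", mb) for mb in range(warmup)]
    next_forward, next_backward = warmup, 0
    while next_backward < num_minibatches:
        order.append(("B", next_backward))
        next_backward += 1
        if next_forward < num_minibatches:
            order.append(("F", next_forward))
            next_forward += 1
    return order


def max_in_flight(order):
    """Largest number of minibatches that have run forward but not yet backward."""
    in_flight = peak = 0
    for kind, _ in order:
        in_flight += 1 if kind == "F" else -1
        peak = max(peak, in_flight)
    return peak


first_stage = one_f_one_b_order(stage=0, num_stages=4, num_minibatches=6)
last_stage = one_f_one_b_order(stage=3, num_stages=4, num_minibatches=3)
```

In this sketch the first of four stages has up to four minibatches in flight and the last stage only one, which gives a rough idea of how many weight versions each stage would have to keep when weight stashing is used.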

<center><img src="images/pipedream-stashing.png" width="550px" height="550px" alt-text="workflow"/></center>
<center> Weight stashing as minibatch 5 flows across stages.<br/> Arrows point to weight versions used for forward and backward passes for minibatch 5 at the first and third stages <br/><i> <a href="https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf"> View source </a></i></center>
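A minimal sketch of the weight-stashing idea follows. It assumes a toy stage with a single scalar weight computing `output = weight * input`; the class and method names are hypothetical and this is not PipeDream's implementation. The stage records the weight version each minibatch saw in its forward pass and reuses that version in the matching backward pass, even if the latest weights have changed in between.

```python
class StashingStage:
    """Toy pipeline stage with one scalar weight: output = weight * input."""

    def __init__(self, weight, learning_rate):
        self.weight = weight  # latest weight version
        self.learning_rate = learning_rate
        self.stash = {}  # minibatch id -> (weight used in forward, input)

    def forward(self, mb_id, x):
        # Use the latest weights and remember which version this minibatch saw.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id, grad_output):
        # Retrieve the version used in the forward pass of the same minibatch.
        stashed_weight, x = self.stash.pop(mb_id)
        grad_weight = grad_output * x              # d(output)/d(weight) = x
        grad_input = grad_output * stashed_weight  # d(output)/d(input) = stashed weight
        self.weight -= self.learning_rate * grad_weight  # update the latest version
        return grad_input


stage = StashingStage(weight=1.0, learning_rate=0.5)
stage.forward(0, x=2.0)  # minibatch 0 sees weight 1.0
stage.forward(1, x=3.0)  # minibatch 1 also sees weight 1.0
g0 = stage.backward(0, grad_output=1.0)  # latest weight becomes 0.0
g1 = stage.backward(1, grad_output=1.0)  # still uses the stashed 1.0, not 0.0
```

Without the stash, the gradient sent upstream for minibatch 1 would be computed with the already-updated weight 0.0, which no longer matches the forward pass that produced its activations.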

### Implementation Approach

Pipeline parallelism is an effective technique for scaling, but it is often challenging to implement because it requires partitioning the execution of a model in addition to its weights. This usually means modifying the model code and writing runtime orchestration code that is difficult to get right. [PiPPy](https://github.com/pytorch/PiPPy) aims to provide a toolkit that automates the process to allow high-productivity scaling of models. PiPPy consists of two parts: a `compiler` and a `runtime`. The `compiler` takes your model code, splits it up, and transforms it into a `Pipe`, a wrapper describing the model at each pipeline stage and the data-flow relationship between stages. The `runtime` executes the pipeline stages in parallel, handling micro-batch splitting, scheduling, communication, gradient propagation, and so on. PiPPy supports several pipeline scheduling paradigms, including `fill-drain (GPipe)`, `1F1B`, and `interleaved 1F1B`. PiPPy has been migrated into PyTorch as the subpackage `torch.distributed.pipelining`. Please find detailed documentation [here](https://pytorch.org/docs/main/distributed.pipelining.html).

#### torch.distributed.pipelining

The pipelining package provides a toolkit that automates pipeline parallelism implementation on general models. It consists of two parts: a `splitting frontend` and a `distributed runtime`. The splitting frontend splits a model into partitions and captures the data-flow relationship. The distributed runtime executes the pipeline stages on different devices in parallel. It handles `micro-batch splitting`, `scheduling`, `communication`, `gradient propagation`, and so on.
There are two major steps involved in implementing pipeline parallelism with `torch.distributed.pipelining`: building a `PipelineStage` and running it with a `PipelineSchedule`.

**1. Build PipelineStage**:

The `PipelineStage` allocates communication buffers and creates `send/recv` ops to communicate with its peers. It also manages intermediate buffers, for example forward outputs that have not been consumed yet, and runs the backward pass for the stage model. Building a stage involves splitting the model, which can be done either manually or automatically.

- **Manual Splitting**: To directly construct a `PipelineStage`, the user must provide a single `nn.Module` instance that owns the relevant `nn.Parameters` and `nn.Buffers`. The module should also define a `forward()` function that executes the operations for that stage. A Transformer class example showing how to build an easily partitionable model is presented below. To configure the model for each stage, the entire model is initialized first, the layers that are not needed for that stage are removed, and a `PipelineStage` that wraps the remaining model is created. The code snippets below exemplify this process.
```python
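# Source: https://pytorch.org/docs/stable/distributed.pipelining.html
# Abridged sketch: `ModelArgs`, `TransformerBlock`, `self.norm`, and
# `self.freqs_cis` belong to the full model definition and are elided here.
import torch
import torch.nn as nn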

class Transformer(nn.Module):
    def __init__(self, model_args: ModelArgs):
        super().__init__()

        self.tok_embeddings = nn.Embedding(...)

        # Using a ModuleDict lets us delete layers without affecting names,
        # ensuring checkpoints will correctly save and load.
        self.layers = torch.nn.ModuleDict()
        for layer_id in range(model_args.n_layers):
            self.layers[str(layer_id)] = TransformerBlock(...)

        self.output = nn.Linear(...)

    def forward(self, tokens: torch.Tensor):
        # Handling layers being 'None' at runtime enables easy pipeline splitting
        h = self.tok_embeddings(tokens) if self.tok_embeddings else tokens

        for layer in self.layers.values():
            h = layer(h, self.freqs_cis)

        h = self.norm(h) if self.norm else h
        output = self.output(h).float() if self.output else h
        return output

```


The PipelineStage takes an argument `input_args`, which includes the runtime input to the stage that represents one micro-batch of input data. This goes into the forward function of the stage module to determine the input and output shapes required for communication.

```python

from torch.distributed.pipelining import PipelineStage

# `model_args`, `stage_index`, `num_stages`, `device`, and
# `example_input_microbatch` are assumed to be defined by the surrounding script.
with torch.device("meta"):
    assert num_stages == 2, "This is a simple 2-stage example"

    # Construct the entire model, then delete the parts not needed for this stage.
    # In practice, a helper function can divide the layers across stages automatically.
    model = Transformer(model_args)

    if stage_index == 0:
        # Prepare the first-stage model
        del model.layers["1"]
        model.norm = None
        model.output = None

    elif stage_index == 1:
        # Prepare the second-stage model
        model.tok_embeddings = None
        del model.layers["0"]

    stage = PipelineStage(
        model,
        stage_index,
        num_stages,
        device,
        input_args=example_input_microbatch,
    )

```


- **Automatic Splitting**

When a model is difficult to split manually, the `pipeline` API can partition it automatically. Consider the example model below.

```python

import torch

# `Layer` and `LMHead` are assumed to be nn.Module subclasses defined elsewhere.
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.emb = torch.nn.Embedding(10, 3)
        self.layers = torch.nn.ModuleList(
            Layer() for _ in range(2)
        )
        self.lm = LMHead()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.emb(x)
        for layer in self.layers:
            x = layer(x)
        x = self.lm(x)
        return x
```
The `pipeline` API splits the model according to a `split_spec`, where `SplitPoint.BEGINNING` adds a split point before the execution of a given submodule in the forward function, and `SplitPoint.END` adds a split point after it.

```python

from torch.distributed.pipelining import pipeline, SplitPoint

# Instantiate the model defined above
mod = Model()

# An example micro-batch input
x = torch.LongTensor([1, 2, 4, 5])

pipe = pipeline(
    module=mod,
    mb_args=(x,),
    split_spec={
        "layers.1": SplitPoint.BEGINNING,
    }
)
```
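The effect of a split point can be illustrated with a toy model of the semantics. The sketch below is not the library's implementation: `ToySplitPoint` and `toy_split` are hypothetical stand-ins that only group an ordered list of submodule names into consecutive stages, without tracing any real `nn.Module`.

```python
from enum import Enum


class ToySplitPoint(Enum):
    """Hypothetical stand-in for SplitPoint, used only in this sketch."""
    BEGINNING = 1
    END = 2


def toy_split(submodules, split_spec):
    """Group an ordered list of submodule names into consecutive stages."""
    stages, current = [], []
    for name in submodules:
        # A BEGINNING split closes the current stage before this submodule runs.
        if split_spec.get(name) is ToySplitPoint.BEGINNING and current:
            stages.append(current)
            current = []
        current.append(name)
        # An END split closes the current stage right after this submodule runs.
        if split_spec.get(name) is ToySplitPoint.END:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages


names = ["emb", "layers.0", "layers.1", "lm"]
by_beginning = toy_split(names, {"layers.1": ToySplitPoint.BEGINNING})
by_end = toy_split(names, {"layers.0": ToySplitPoint.END})
```

Both specs produce the same two stages here, `["emb", "layers.0"]` and `["layers.1", "lm"]`, which mirrors how a single split point in the example above yields a two-stage pipeline.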



**2. PipelineSchedule for execution**

In this step, the `PipelineStage` is attached to a pipeline schedule (`PipelineSchedule`), and the schedule is run with input data. A GPipe example is given below. The example code must be launched once per worker, for instance with `torchrun --nproc_per_node=2 example.py`.

```python
from torch.distributed.pipelining import ScheduleGPipe

# Create a schedule
schedule = ScheduleGPipe(stage, n_microbatches)

# Input data (whole batch)
x = torch.randn(batch_size, in_dim, device=device)

# Run the pipeline with input `x`
# `x` will be divided into microbatches automatically
if rank == 0:
    schedule.step(x)
else:
    output = schedule.step()

```

#### Transformer Example of PP Using Manual Model Splitting

We will use a simplified version of a transformer decoder model. The model architecture has multiple transformer decoder layers, and we use it to demonstrate the manual model-splitting approach. You can find the detailed [code here](../source_code/pipeline_test.py). For further explanation, please visit the [source here](https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html).

Let's run our code on two GPUs by executing the command in the cell below.

In [None]:
!cd ../source_code && srun -p gpu -N 1 --gres=gpu:2 torchrun --nnodes=1 --nproc-per-node=2 pipeline_test.py

**Likely output:**

```python

W0217 09:17:02.519000 1751757 site-packages/torch/distributed/run.py:792]
W0217 09:17:02.519000 1751757 site-packages/torch/distributed/run.py:792] *****************************************
W0217 09:17:02.519000 1751757 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0217 09:17:02.519000 1751757 site-packages/torch/distributed/run.py:792] *****************************************

losses: [tensor(9.3694, device='cuda:1', grad_fn=<NllLossBackward0>), tensor(9.3886, device='cuda:1', grad_fn=<NllLossBackward0>), tensor(9.3766, device='cuda:1', grad_fn=<NllLossBackward0>), tensor(9.3725, device='cuda:1', grad_fn=<NllLossBackward0>)]

```

#### GPT Model Example of PP using Automatic Model Splitting

This example code uses a GPT-2 model to exemplify the automatic model-splitting approach. In the code, we initialize the GPT-2 model configuration and create an instance of the model. We then generate an example micro-batch input, set the pipeline split spec, and create a pipeline representation.

```python
...
    # Create pipeline representation
    pipe = pipeline(
        gpt2,
        mb_args=(),
        mb_kwargs=mb_inputs,
        split_spec=split_spec,
    )
...
```
Later in the code, we create a runtime stage and attach it to a schedule.

```python 
...
    # Attach to a schedule
    schedule = ScheduleGPipe(stage, args.chunks)

``` 
For more details, please visit the [source here](https://github.com/pytorch/PiPPy/blob/main/examples/huggingface/pippy_gpt2.py).  

Let's run our [pippy_gpt2.py](../source_code/hf_pp/pippy_gpt2.py) code with four GPUs by executing the command in the cell below.

In [None]:
!cd ../source_code && srun -p gpu -N 1 --gres=gpu:4 torchrun --nproc-per-node 4 hf_pp/pippy_gpt2.py

**Likely output:**

```python

W0217 08:47:36.732000 1724792 site-packages/torch/distributed/run.py:792]
...
decoders_per_rank = 3
[Rank 2] Using device: cuda:2
[Rank 0] Using device: cuda:0
decoders_per_rank = 3
Pipeline stage 3 21M params
Pipeline stage 1 21M params
GPT2Config {
  "_attn_implementation_autoset": true,
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.48.3",
  "use_cache": true,
  "vocab_size": 50257
}

GPT-2 total number of params = 124M
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)
decoders_per_rank = 3
Pipeline stage 2 21M params
Pipeline stage 0 60M params
Rank 3 completes
Rank 2 completes
Rank 0 completes
Rank 1 completes

```
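The per-stage parameter counts in the log can be reproduced with some arithmetic on the printed architecture. The sketch below assumes that the embeddings (`wte`, `wpe`) land on the first stage, that `ln_f` and `score` land on the last stage, and that the 12 blocks are divided evenly across the four ranks; these placement assumptions are inferred from the log rather than taken from the script.

```python
n_embd, n_layer, vocab_size, n_positions, n_labels = 768, 12, 50257, 1024, 2
n_stages = 4

# Parameters of one GPT2Block, read off the printed module shapes.
layer_norm = 2 * n_embd  # weight + bias
attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)  # c_attn + c_proj
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)    # c_fc + c_proj
block = 2 * layer_norm + attention + mlp

embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
head = layer_norm + n_embd * n_labels                    # ln_f + score (no bias)

# Assumed placement: equal blocks per stage, embeddings first, head last.
blocks_per_stage = n_layer // n_stages
stage_params = [blocks_per_stage * block] * n_stages
stage_params[0] += embeddings
stage_params[-1] += head

total = sum(stage_params)
print([p // 10**6 for p in stage_params], total // 10**6)
# prints: [60, 21, 21, 21] 124
```

Truncated to millions, these values agree with the `60M`, `21M`, and `124M` figures in the output. The first stage is almost three times larger than the others because it also holds the token embedding table.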


Let's proceed to the next notebook and learn another concept that improves on model parallelism: `Tensor parallelism`. Please click the [Next Link](tensor-parallelism.ipynb).

---
## References

- https://medium.com/nerd-for-tech/an-overview-of-pipeline-parallelism-and-its-research-progress-7934e5e6d5b8

## Licensing

Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.