# Pipeline Parallelism

In this section we'll go over the basic ideas of model parallelization by placing different layers of the model on different devices.  We'll also briefly cover an example using the under-development and subject-to-change [`torch.distributed.pipelining`](https://docs.pytorch.org/docs/stable/distributed.pipelining.html) API.

## Model Parallelism by Layers

The basic idea of layer-based parallelism is to tackle models larger than what a single GPU can fit by splitting the model into units of layers, and propagating the forward and backward passes through those units:

![Diagram of model parallelism by layer, where a model is broken into three units distributed over three GPUs. A forward pass sweeps through in one direction, followed by a backward pass in the other](images/pipeline.png)

This approach has much to commend it.  In particular, the communication-to-computation ratio can be quite favourable.  The only data that needs to be communicated is the activations (forward pass) or gradients (backward pass) at the boundaries of each unit, while computation is needed throughout the whole volume of the unit.  This makes it very well suited to parallelism across nodes; the bandwidth and latency requirements for communication can be quite modest.

![Diagram of layer-based parallelism with and without microbatches.  Without breaking batches up the GPUs are idle much of the time.  By breaking SGD-style minibatches further up into microbatches, the GPU utilization is much improved](images/pipeline-microbatches.png)

However, using our normal schedule of forward/backward passes through the GPUs would leave the GPUs idle most of the time.  For instance, if we were using 3 GPUs this way, each GPU would only be working 1/3 of the time, spending the other 2/3 waiting for other units on other GPUs to complete.   This gets worse with more GPUs!

A very common approach in parallel computing when latency becomes an issue - here while waiting for other units on other GPUs to complete - is [pipelining](https://en.wikipedia.org/wiki/Pipeline_(computing)), scheduling multiple computations to be in flight simultaneously, so that there is work to do during what would otherwise be idle cycles.

In this context, we break our SGD-style minibatches into multiple microbatches, processing each of these sub-batches one at a time.  This fills in many of the "bubbles" of idle time, as shown in the figure above.  Gradients are accumulated across the microbatches, and the update is performed only once all of them have completed, so that the step behaves as if the model had seen a single minibatch.
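
To make the gradient accumulation concrete, here is a minimal pure-Python sketch. The one-parameter linear model, the toy data, and the helper `grad_mse` are all illustrative assumptions rather than anything from the example code. It suggests that, for equal-sized microbatches and a mean-style loss, averaging the per-microbatch gradients gives the same result as the gradient over the whole minibatch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w (hypothetical helper)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy minibatch of 8 samples (made-up values)
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# Gradient computed on the whole minibatch at once
full_grad = grad_mse(w, xs, ys)

# Gradient accumulated over equal-sized microbatches
num_microbatches = 4
mb_size = len(xs) // num_microbatches
accumulated = 0.0
for i in range(num_microbatches):
    lo, hi = i * mb_size, (i + 1) * mb_size
    # each microbatch contributes its mean gradient, weighted by 1/num_microbatches
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_microbatches

print(full_grad, accumulated)  # the two values agree up to floating-point rounding
```

If the microbatches had different sizes, the simple 1/`num_microbatches` weighting would no longer match the minibatch gradient exactly.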

As always when choosing batch sizes, there's a tradeoff.  More and smaller microbatches (increasing the number of chunks per minibatch) provide finer-grained parallelism and so reduce idle time, but microbatches that are too small may not take full advantage of the GPU.  Finding the right size is generally a matter of experimentation.
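
As a rough guide to this tradeoff, the sketch below estimates how busy each GPU is under an idealized GPipe-style schedule. It assumes every stage takes the same time per microbatch and that communication is free, neither of which holds exactly in practice; the function name `pipeline_utilization` is made up for illustration.

```python
def pipeline_utilization(num_stages, num_microbatches):
    """Fraction of time each GPU is busy in an idealized GPipe-style schedule.

    Assumes equal-cost stages and zero communication time.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    # The pipeline needs (num_stages - 1) extra time slots to fill and drain.
    total_slots = num_microbatches + num_stages - 1
    return num_microbatches / total_slots


for m in (1, 2, 4, 8, 16):
    busy = pipeline_utilization(3, m)
    print(f"3 stages, {m:2d} microbatches: busy {busy:.2f} of the time")
```

With 3 stages and a single microbatch this gives 1/3, matching the figure quoted earlier. The estimate says nothing about per-microbatch GPU efficiency, which is why very small microbatches can still lose out overall.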

## A Simple Example

To see how this works in code, let's look at a (slightly modified) example from the `torch.distributed.pipelining` docs:

* [Before](code/pipelining-example-singlegpu.py)
* [After](code/pipelining-example-multigpu.py)

This takes a simple transformer model with toy data, splits it across two GPUs using one of `pipelining`'s two APIs (the less manual one), and runs a couple of training epochs.  You can run the single-GPU version here:

In [None]:
!python3 ./code/pipelining-example-singlegpu.py

And the pipelining example:

In [None]:
!./code/run_w_torchrun 2 ./code/pipelining-example-multigpu.py


There are a few things to notice here!

First, notice how _invasive_ these changes are, compared to DDP.  Splitting a model up across GPUs changes the local representation of the model itself and changes the training loop.  The partition of the model can be more or less manual, depending on the type of model parallelism and the APIs provided by the particular framework.  But doing it effectively will require some understanding of how the model works, and will almost certainly prescribe the number of GPUs to be used.  With DDP we could run with 1, 2, 3, 4 GPUs, or more if we had them; this example hardcodes 2 GPUs.

For this example, the code looked like:

```python
    model = Transformer(vocab_size=vocab_size)

    #...

    # example data (microbatch-sized) to prime the pipeline, to create the job graph
    example_input_microbatch = x.chunk(num_microbatches)[0]

    # manually split the graph
    split_spec = {"layers.4": SplitPoint.BEGINNING}
    pipe = pipeline(model, mb_args=(example_input_microbatch,), split_spec=split_spec)
    stage = pipe.build_stage(rank, device, dist.group.WORLD)
```

So we chose where to split the model (in the `split_spec` dictionary), and with it the number of splits.  We also had to create entirely new object types: one representing the pipeline that runs data through the model as a whole (`pipe`), and one representing the stage of the model that runs locally (`stage`).
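
One way to think about choosing split points is to divide the layers evenly among the stages. The sketch below is a hypothetical helper (`even_split_points` is not part of `torch.distributed.pipelining`), and the layer count of 8 is an assumption about the toy model. It only produces the layer names; in the real code each name would be mapped to `SplitPoint.BEGINNING` in the `split_spec` dictionary.

```python
def even_split_points(num_layers, num_stages):
    """Names of the layers that begin stages 1 .. num_stages-1.

    Hypothetical helper: assumes the model's blocks are addressable as
    "layers.<index>", as in the split_spec key used in the example.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    # Stage i starts at layer floor(i * num_layers / num_stages)
    return [f"layers.{(i * num_layers) // num_stages}" for i in range(1, num_stages)]


print(even_split_points(8, 2))  # assumed 8-layer model, 2 stages
print(even_split_points(8, 4))  # same model, 4 stages
```

An even split by layer count is only a starting point; layers with very different costs (embeddings, output projections) may call for an uneven partition.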

Note too that we had to create some example input data (one microbatch's worth) to 'prime' the pipeline; because many models compute shapes implicitly, creating the graph of data flow through all the stages requires an appropriately-sized microbatch of data to flow through.   (Relatedly, we can only handle microbatches of a fixed size; we need to set `drop_last=True` in the data loader, or manually pad the data set, to ensure there are no partial microbatches at the end of the data.)
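
The arithmetic behind the fixed-size requirement can be sketched as follows. The helper `batch_plan` and the numbers used are made up for illustration; it simply works out how many full batches survive when partial batches are dropped, and how large each microbatch is.

```python
def batch_plan(num_samples, batch_size, num_microbatches):
    """Work out batch counts when only full, evenly divisible batches are kept.

    Hypothetical helper mirroring the effect of dropping the last partial batch.
    """
    if batch_size % num_microbatches != 0:
        raise ValueError("batch_size must be divisible by num_microbatches")
    full_batches = num_samples // batch_size
    return {
        "full_batches": full_batches,
        "dropped_samples": num_samples - full_batches * batch_size,
        "microbatch_size": batch_size // num_microbatches,
    }


# Made-up sizes: 1000 samples, batches of 64, 4 microbatches per batch
print(batch_plan(1000, 64, 4))
```

Padding the data set instead of dropping samples would keep all of the data, at the cost of handling the padded entries in the loss.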

The fact that the model is split up by layer between processes means the training loop changes too:

```python
    # only move data to the device if it's used on that device
    # (e.g. inputs on rank 0, outputs on rank 1)
    if rank == 0:
        x = x.to(device)
    elif rank == 1:
        y = y.to(device)
    
    #...

    # add a ScheduleGPipe scheduler
    schedule = ScheduleGPipe(stage, n_microbatches=num_microbatches, loss_fn=tokenwise_loss_fn)
    # optimizer only applies to stage.submod parameters
    optimizer = optim.SGD(stage.submod.parameters(), lr=lr, momentum=momentum)
    
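    for epoch in range(2):
        optimizer.zero_grad()
        if rank == 0:
            # first stage: feed the inputs into the pipeline
            schedule.step(x)
        elif rank == 1:
            # last stage: supply the targets and collect per-microbatch losses
            losses = []
            output = schedule.step(target=y, losses=losses)
            # losses is a list of tensors, so stack it before averaging
            print(f"epoch: {epoch} losses: {torch.stack(losses).mean()}")

        optimizer.step()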
```       

Note first that different ranks execute slightly different steps inside the training loop.   Only rank 0 gets the inputs, and passes them into the step!  And only the last rank (here `rank == 1`; more generally we could use `rank == world_size - 1`) computes the final outputs, so that is the only rank that uses the targets and computes the losses.  (Note that we've passed the loss function to the scheduler.)

Also notice that there is no explicit back-propagation step; that's part of the pipeline and the scheduler.
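
The rank-dependent logic can be generalized beyond two stages along these lines. This is a sketch under stated assumptions: `RecordingSchedule` is a hypothetical stand-in that only records calls (it is not a real pipelining schedule), `run_pipeline_step` is a made-up helper, and it is assumed that intermediate ranks call `step()` with no arguments and that there are at least two stages.

```python
class RecordingSchedule:
    """Hypothetical stand-in for a pipelining schedule; it only records calls."""

    def __init__(self):
        self.calls = []

    def step(self, *args, target=None, losses=None):
        self.calls.append((args, target))
        if losses is not None:
            losses.append(0.0)  # placeholder value, not a real loss


def run_pipeline_step(schedule, rank, world_size, x=None, y=None):
    """Call schedule.step() with the arguments appropriate to this rank."""
    if world_size < 2:
        raise ValueError("this sketch assumes at least two pipeline stages")
    if rank == 0:
        schedule.step(x)  # first stage supplies the inputs
        return None
    if rank == world_size - 1:
        losses = []
        schedule.step(target=y, losses=losses)  # last stage supplies targets
        return losses
    schedule.step()  # assumed: intermediate stages pass no arguments
    return None


for rank in range(3):
    sched = RecordingSchedule()
    result = run_pipeline_step(sched, rank, 3, x="inputs", y="targets")
    print(rank, sched.calls, result)
```

Keeping the branching in one helper keeps the body of the training loop identical on every rank.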

So there are somewhat more steps involved than with DDP:

1. Prepare a distributed runtime, as in the [previous section](4_Distributed_data_parallel.ipynb): get the rank, world size, etc. from environment variables, call `dist.init_process_group()`, and call `dist.destroy_process_group()` at the end.
2. Import `torch.distributed` and the relevant helpers (`pipeline`, `SplitPoint`, `ScheduleGPipe`) from `torch.distributed.pipelining`.
3. Turn the monolithic model into a pipeline, here run with a GPipe schedule (other schedules are available):
    * Create example microbatch data (`x.chunk(num_microbatches)[0]`) so that `pipeline` can trace the graph.
    * Define a `split_spec` that marks where the model should be cut (e.g. `{"layers.4": SplitPoint.BEGINNING}`).
    * Create the pipeline.
    * On every rank, build the local stage with `stage = pipe.build_stage(rank, device, dist.group.WORLD)`.
4. Move only the necessary tensors to each GPU.
5. Use the schedule to do the forward/backward passes by calling `schedule.step()`.  The first rank calls `schedule.step(x)` to supply the inputs; the last rank calls `schedule.step(target=y, losses=losses)` to supply the labels and gather the losses.
6. Scope the optimizer correctly: it should update only the parameters of the local stage, _e.g._ `optim.SGD(stage.submod.parameters(), …)`, instead of the whole model.
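
For step 1, reading the rank information might look something like the sketch below. It assumes the processes are launched with `torchrun`, which sets the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables; the helper name `read_dist_env` and the single-process fallback defaults are assumptions for illustration.

```python
import os


def read_dist_env(env=None):
    """Return (rank, world_size, local_rank) from torchrun-style variables.

    Hypothetical helper; the defaults describe a single-process run.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank


rank, world_size, local_rank = read_dist_env()
print(f"rank {rank} of {world_size} (local rank {local_rank})")
```

Since this example hardcodes two stages, it would be reasonable to check that `world_size == 2` before calling `dist.init_process_group()`.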