[doc] pipeline doc typos/improvements #659

Merged · 3 commits · Mar 14, 2021
(This view shows the changes from the first commit only.)
14 changes: 7 additions & 7 deletions docs/_tutorials/pipeline.md
````diff
@@ -132,7 +132,7 @@ net = PipelineModule(layers=net.to_layers(), num_stages=2)
 ```

 **Note:**
-the `lamda` in the middle of `layers` above is not a `torch.nn.Module`
+the `lambda` in the middle of `layers` above is not a `torch.nn.Module`
 type. Any object that implements `__call__()` can be a layer in a
 `PipelineModule`: this allows for convenient data transformations in the
 pipeline.
````
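As a concrete illustration of the note in this hunk, here is a minimal sketch of a pipeline with a plain callable between modules. The layer shapes and the `flatten` lambda are hypothetical, not the tutorial's exact code, and it assumes the usual DeepSpeed distributed initialization has already run:

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Hypothetical layer sizes, for illustration only.
layers = [
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    # A plain callable, not an nn.Module: PipelineModule accepts it because
    # it implements __call__(), so it can transform data between stages.
    lambda x: x.flatten(start_dim=1),
    nn.Linear(16 * 32 * 32, 10),
]
net = PipelineModule(layers=layers, num_stages=2)
```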
````diff
@@ -165,7 +165,7 @@ These modifications can be accomplished with a short subclass:
 class TransformerBlockPipe(TransformerBlock):
     def forward(self, inputs):
         hidden, mask = inputs
-        outputs = super().forward(hidden, mask)
+        output = super().forward(hidden, mask)
         return (output, mask)
 stack = [ TransformerBlockPipe() for _ in range(num_layers) ]
 ```
````
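To see the pattern above in context: pipeline stages communicate tuples of tensors, so the subclass unpacks `(hidden, mask)`, calls the base class's two-argument `forward()`, and repacks the tuple. A self-contained sketch, where `TransformerBlock` is a stand-in base class rather than a DeepSpeed API:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Stand-in for the tutorial's block; a real one would apply masked attention."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden, mask):
        # A real block would use the mask; this placeholder just projects.
        return self.proj(hidden)

class TransformerBlockPipe(TransformerBlock):
    """Pipeline stages exchange tuples, so unpack and repack around forward()."""
    def forward(self, inputs):
        hidden, mask = inputs
        output = super().forward(hidden, mask)
        return (output, mask)

num_layers = 4
stack = [TransformerBlockPipe() for _ in range(num_layers)]
```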
````diff
@@ -269,17 +269,17 @@ by DeepSpeed:
 * `partition_method="uniform"` balances the number of layers per stage.

 ### Memory-Efficient Model Construction
-Building a `Sequential` and providing it `PipelineModule` is a convenient way
+Building a `Sequential` container and providing it to a `PipelineModule` is a convenient way
 of specifying a pipeline parallel model. However, this approach encounters
-scalability issues for massive models. Starting from a `Sequential` allocates
-the model in CPU memory redundantly by every worker. A machine with 16 GPUs
+scalability issues for massive models. In this approach each worker replicates
+the whole model in CPU memory. For example, a machine with 16 GPUs
 must have as much local CPU memory as 16 times the model size.

 DeepSpeed provides a `LayerSpec` class that delays the construction of
 modules until the model layers have been partitioned across workers. Then,
-the modules are built on the GPU that owns the layer.
+each GPU allocates only the modules assigned to it.
````

@stas00 (Contributor, Author) commented on Jan 11, 2021:

Is this supposed to communicate that when `LayerSpec` is used, not only does each GPU allocate only the layers it's assigned to, but also that no CPU allocation happens at all?

Or should I have rewritten it to say:

> Then each worker will allocate only the layers it's assigned to. So, continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the `Sequential` example.

I think the two paragraphs have GPU and CPU mixed up.

A contributor replied:

That's correct: `LayerSpec` does not allocate modules until the partitioning is complete. At that point, each rank allocates only the layers assigned to it, and they are then moved to the GPU. This lets us build distributed models that wouldn't fit in CPU memory.

@stas00 replied:

So is it better to leave the original modification from this PR, or to replace it with the more illustrative:

> Then each worker will allocate only the layers it's assigned to. So, continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the `Sequential` example.

@stas00 replied:

OK, I went ahead with the longer version. Please double-check that it's still correct.


The same hunk continues below the review thread:

````diff
-Here's an example of the abbreviated AlexNet model, but expressed only
+Here is an example of the abbreviated AlexNet model, but expressed only
 with `LayerSpec`s. Note that the syntax is almost unchanged: `nn.ReLU(inplace=True)`
 simply becomes `LayerSpec(nn.ReLU, inplace=True)`.
 ```python
````
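The Python block that follows this paragraph is collapsed in the diff view. For orientation, here is a minimal sketch of what `LayerSpec`-based construction looks like; the layer shapes are illustrative and not the tutorial's exact abbreviated AlexNet:

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class AlexNetPipe(PipelineModule):
    def __init__(self, num_classes=10, **kwargs):
        specs = [
            # Each LayerSpec records a class and its arguments; the module is
            # instantiated later, only on the rank that owns the layer.
            LayerSpec(nn.Conv2d, 3, 64, kernel_size=11, stride=4, padding=5),
            LayerSpec(nn.ReLU, inplace=True),
            LayerSpec(nn.MaxPool2d, kernel_size=2, stride=2),
            lambda x: x.flatten(start_dim=1),
            # 64 * 4 * 4 is a hypothetical flattened size for illustration.
            LayerSpec(nn.Linear, 64 * 4 * 4, num_classes),
        ]
        super().__init__(layers=specs,
                         loss_fn=nn.CrossEntropyLoss(),
                         **kwargs)

# Usage, e.g.: net = AlexNetPipe(num_classes=10, num_stages=2)
```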