[Discussion] About 3D Parallelism #131

Closed
feifeibear opened this issue Jan 7, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@feifeibear
Contributor

feifeibear commented Jan 7, 2022

I read the paper Maximizing Parallelism in Distributed Training for Huge Neural Networks. The idea is elegant and makes sense to me. However, I wonder about the compatibility of this method with gradient checkpointing (I mentioned it in #117; we call it GC afterwards).

Using 3D parallelism, we have to conduct an all-gather on the activations across (N/P^2) processors (a partial collective communication), where N is the number of GPUs for the 3D linear layer. Such a partial collective has to be done at least three times: in the forward pass, in the backward pass, and when recomputing activations during the backward pass with GC. Therefore, it introduces more communication overhead than model parallelism that does not split activations. Did you consider this overhead in the experiment section of the paper?
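A minimal sketch of where the third collective comes from, assuming a PyTorch-style checkpoint wrapper (the group, shapes, and plain `dist.all_gather` here are placeholders for illustration, not the actual 3D Linear code):

```python
import torch
import torch.distributed as dist
from torch.utils.checkpoint import checkpoint

def block_forward(x_shard, weight, group):
    # partial all-gather of the activation shard across one sub-group of the 3D mesh
    world = dist.get_world_size(group)
    shards = [torch.empty_like(x_shard) for _ in range(world)]
    dist.all_gather(shards, x_shard, group=group)  # plain collective; real code would use
    x_full = torch.cat(shards, dim=-1)             # an autograd-aware all-gather instead
    return x_full @ weight

# Without GC: one all-gather in forward plus its counterpart in backward.
# With GC, `checkpoint` replays block_forward during backward, so the same
# partial all-gather is issued a third time:
# out = checkpoint(block_forward, x_shard, weight, group, use_reentrant=False)
```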

Also, activation tensors are small. If we partition an activation tensor into N pieces and send/recv at the granularity of a single piece, won't the bandwidth utilization be extremely low? This is different from communication on parameters: we can pack the parameter tensors of several layers and send/recv them in a larger volume to better utilize the network bandwidth, but activations arrive one after another, so they cannot be treated the same way as parameter tensors.
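To make the bandwidth concern concrete, here is a toy latency-bandwidth estimate (all numbers are made-up assumptions, not measurements from the paper or from Colossal-AI):

```python
alpha = 5e-6                        # assumed per-message latency: 5 us
B = 100e9 / 8                       # assumed 100 Gb/s link, in bytes/s
activation_bytes = 4 * 512 * 1024   # e.g. fp32 activation, 512 tokens, hidden size 1024

def effective_bandwidth(total_bytes, pieces):
    piece = total_bytes / pieces
    return piece / (alpha + piece / B)   # bytes/s actually achieved per message

print(effective_bandwidth(activation_bytes, 1) / B)    # ~0.97 of peak: one large message
print(effective_bandwidth(activation_bytes, 64) / B)   # ~0.34 of peak: 64 small pieces
```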

PS: a small typo in the arXiv paper. Page 5, 1st line, Bij = [lnp : lnp + np + 1]

@kurisusnowdeng
Member

Hi @feifeibear. Thank you for sharing the idea! In our opinion, this is basically a trade-off between memory cost and communication cost.

The current design of the 3D Linear layer applies an all-gather on the input matrix A and a reduce-scatter on the output matrix C in the forward pass (an all-gather on the gradients of C and a reduce-scatter on the gradients of A in the backward pass), so that each activation can be 1/N of its original size.

An alternative design is to use an all-reduce on C in the forward pass, as well as on the gradients of A in the backward pass, but then the activations are 1/N^(2/3) of their original size.
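A rough sketch of the two designs using torch.distributed primitives, collapsed onto a single process group so that only the order of the collectives around the local matmul is shown (the block indexing over the P x P x P mesh of the actual 3D Linear layer is omitted, and the matmuls are placeholders):

```python
import torch
import torch.distributed as dist

def design_1_forward(a_shard, w, group):
    """All-gather the input shard, do the local matmul, then reduce-scatter the
    output, so each rank keeps only about 1/N of the activation."""
    world = dist.get_world_size(group)
    gathered = [torch.empty_like(a_shard) for _ in range(world)]
    dist.all_gather(gathered, a_shard, group=group)
    c_full = torch.cat(gathered, dim=0) @ w          # placeholder for the local block matmul
    out_chunks = list(c_full.chunk(world, dim=0))
    c_shard = torch.empty_like(out_chunks[0])
    dist.reduce_scatter(c_shard, out_chunks, group=group)
    return c_shard   # backward mirrors this: all-gather grad C, reduce-scatter grad A

def design_2_forward(a_local, w_local, group):
    """Compute a partial product and all-reduce it; every rank then holds a
    larger (1/N^(2/3)) activation, but no gather of the input is needed."""
    c_partial = a_local @ w_local                    # placeholder for the local block matmul
    dist.all_reduce(c_partial, group=group)
    return c_partial
```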

Considering activation checkpointing, since the forward pass is recomputed, the first design applies 2 * all-gather + reduce-scatter on A and 2 * reduce-scatter + all-gather on C in total, while the second design applies 3 * all-reduce. Since a ring all-reduce has a cost similar to all-gather + reduce-scatter, the total communication costs of the two designs should be similar.
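A quick check of that last claim with the standard ring-algorithm cost model (M is the message size in bytes, p the group size):

```python
def allgather_bytes(M, p):
    return (p - 1) / p * M

def reducescatter_bytes(M, p):
    return (p - 1) / p * M

def allreduce_bytes(M, p):
    return 2 * (p - 1) / p * M   # ring all-reduce = reduce-scatter phase + all-gather phase

M, p = 1 << 20, 8
assert allreduce_bytes(M, p) == allgather_bytes(M, p) + reducescatter_bytes(M, p)
```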

However, we are indeed concerned that small tensors decrease bandwidth utilization, and they are hard to fuse. To find the optimal performance, we are testing as many models and networking environments as possible, and will let the results tell.

@feifeibear
Contributor Author

I agree that 3D parallelism can shrink the peak activation footprint on one GPU at the cost of more communication. The method definitely works in some special cases. Maybe a simple search method could be derived to figure out which parts of the DNN are suitable for 3D parallelism under the constraint of a limited memory budget.

@kurisusnowdeng
Member

> I agree that 3D parallelism can shrink the peak activation footprint on one GPU at the cost of more communication. The method definitely works in some special cases. Maybe a simple search method could be derived to figure out which parts of the DNN are suitable for 3D parallelism under the constraint of a limited memory budget.

This could be a good idea. For example, self-attention blocks usually consume more activation memory than MLP (FFN) blocks.
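A toy version of such a search, purely as an illustration (not an existing Colossal-AI feature; block names and footprints below are made up): greedily switch the blocks with the largest activation footprint to 3D parallelism until the estimated peak fits the memory budget.

```python
def choose_3d_blocks(blocks, budget_bytes, n_gpus):
    """blocks: list of (name, activation_bytes) pairs; returns the set of block
    names that should use 3D parallelism so peak activations fit the budget."""
    chosen = set()
    peak = sum(act for _, act in blocks)        # start with every activation kept whole
    for name, act in sorted(blocks, key=lambda b: b[1], reverse=True):
        if peak <= budget_bytes:
            break
        peak -= act - act / n_gpus              # 3D parallelism shrinks this activation to 1/N
        chosen.add(name)
    return chosen

# e.g. self-attention blocks tend to be picked first, since they usually hold
# more activation memory than MLP/FFN blocks:
# choose_3d_blocks([("attn_0", 4e9), ("ffn_0", 2e9)], budget_bytes=3e9, n_gpus=8)
```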

@github-actions
Contributor

This issue is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale label Jan 26, 2022
@feifeibear
Contributor Author

@1SAA The communication profiling results may support some of my assumptions in this discussion.

@binmakeswell binmakeswell added enhancement New feature or request and removed stale labels Apr 13, 2022
@binmakeswell
Member

We have updated a lot. This issue was closed due to inactivity. Thanks.
