
Add hvd.grouped_allgather and hvd.grouped_reducescatter #3594

Merged

Conversation

@maxhgerlach (Collaborator) commented Jul 6, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This adds two new multi-tensor functions to Horovod: hvd.grouped_allgather() and hvd.grouped_reducescatter(). They are closely modeled on the existing hvd.grouped_allreduce(), introduced in PR #2453, and give users the same extra control over tensor fusion for their Allgather and Reducescatter ops.
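
For illustration, a minimal usage sketch in TensorFlow, assuming the new grouped functions mirror the call style of hvd.grouped_allreduce (a list of tensors in, a list of tensors out); the tensor names and shapes here are only examples:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Three tensors that should be fused into a single collective operation
# instead of being gathered/reduced one by one.
tensors = [tf.random.uniform([4, 8]) for _ in range(3)]

# Gather every tensor in the group along dimension 0 across all workers;
# one output tensor is returned per input tensor.
gathered = hvd.grouped_allgather(tensors)

# Reduce every tensor in the group across all workers and scatter the
# result, so each worker keeps only its own shard along dimension 0.
scattered = hvd.grouped_reducescatter(tensors)
```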

To implement this functionality I've added a new attribute, output_index, to TensorTableEntry. It allows an AllgatherOp or ReducescatterOp that has been enqueued as part of a group to allocate its output tensor properly (unlike for Allreduce, this cannot be done beforehand because the output size is only known after cross-process coordination). I've also added warning log messages in case AllocateOutput ever fails, because TensorFlow would otherwise crash with only a default "Unknown error." exception message.
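
To make the sizing issue concrete: Allgather concatenates inputs along the first dimension, and different ranks may contribute differently sized tensors, so no single process can compute the output shape locally. A hypothetical two-process sketch (shapes chosen only for illustration):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Each rank contributes a tensor whose first dimension depends on its rank:
# rank 0 sends shape (1, 3), rank 1 sends shape (2, 3), and so on.
local = tf.ones([hvd.rank() + 1, 3])

# The gathered result is concatenated along dimension 0. With two processes
# its shape is (1 + 2, 3) = (3, 3), which no rank can know before the first
# dimensions have been exchanged during coordination -- hence the output
# buffer can only be allocated after that step.
gathered = hvd.grouped_allgather([local])[0]
```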

Supported frameworks:

  • TensorFlow
  • PyTorch
  • MXNet

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Jul 7, 2022

Unit Test Results

1 011 files (−36) · 1 011 suites (−36) · 10h 32m 14s ⏱️ (−14m 22s)
814 tests (+27) · 764 ✔️ (+27) · 50 💤 (±0) · 0 (±0)
20 527 runs (+268) · 14 577 ✔️ (+250) · 5 950 💤 (+18) · 0 (±0)

Results for commit 9bd74dd. ± Comparison against base commit 757883b.

♻️ This comment has been updated with latest results.

github-actions bot commented Jul 7, 2022

Unit Test Results (with flaky tests)

1 131 files (−54) · 1 131 suites (−54) · 11h 2m 1s ⏱️ (−13m 25s)
814 tests (+27) · 764 ✔️ (+27) · 50 💤 (±0) · 0 (±0)
23 119 runs (+272) · 16 039 ✔️ (+240) · 7 080 💤 (+32) · 0 (±0)

Results for commit 9bd74dd. ± Comparison against base commit 757883b.

♻️ This comment has been updated with latest results.

@maxhgerlach maxhgerlach marked this pull request as ready for review July 7, 2022 10:20
@maxhgerlach maxhgerlach marked this pull request as draft July 21, 2022 14:17
@maxhgerlach maxhgerlach changed the title Add hvd.grouped_allgather and hvd.grouped_reducescatter (TensorFlow) Add hvd.grouped_allgather and hvd.grouped_reducescatter Jul 21, 2022
@maxhgerlach maxhgerlach force-pushed the pr_grouped_allgather_reducescatter branch from 16b472e to ece4fd2 Compare July 24, 2022 09:33
@maxhgerlach (Collaborator, Author) commented:

Rebased to master to pick up test fixes and #3590.

@maxhgerlach maxhgerlach marked this pull request as ready for review July 25, 2022 08:06
@maxhgerlach (Collaborator, Author) commented:

I believe this is ready to be reviewed now. For once, all the test suites have passed.

For PyTorch and MXNet I have extended the respective OpContext classes so they can refer to all output tensors of a group, but ultimately each op only needs to allocate memory for one specific output index (to perform an Allgather or a Reducescatter). In contrast, the TFOpContext already knew about all outputs of the group op. Maybe the design could be cleaned up a bit here to reduce confusion with how the two outputs of Alltoall (tensor and splits) are handled, but that potential issue hasn't seemed pressing to me so far.

@maxhgerlach maxhgerlach force-pushed the pr_grouped_allgather_reducescatter branch from c1823c4 to 9bd74dd Compare August 1, 2022 12:28
@romerojosh (Collaborator) left a comment:

Excellent work @maxhgerlach! Very well implemented, and I have no comments. Also, thanks for adding and cleaning up the documentation.

@maxhgerlach (Collaborator, Author) commented:

Thanks for taking the time to review, @romerojosh! Fortunately, your careful earlier work was easy to transfer, so this was quite straightforward.

@maxhgerlach maxhgerlach merged commit 8f450ab into horovod:master Aug 2, 2022
leewyang pushed a commit to leewyang/horovod that referenced this pull request Aug 5, 2022
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Lee Yang <leey@nvidia.com>