
Add hvd.grouped_allgather and hvd.grouped_reducescatter #3594

Merged

Conversation

@maxhgerlach (Collaborator) commented Jul 6, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This adds two new multi-tensor functions to Horovod: hvd.grouped_allgather() and hvd.grouped_reducescatter(). They are closely modeled on the existing hvd.grouped_allreduce(), introduced in PR #2453, and give users the same extra control over tensor fusion for their Allgather and Reducescatter ops.
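
For illustration, a minimal usage sketch in TensorFlow, assuming the new grouped functions mirror the call style of hvd.grouped_allreduce (a list of tensors in, a list of tensors out); the tensor names and shapes here are only examples:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Three tensors that should be fused into a single collective operation
# instead of being gathered/reduced one by one.
tensors = [tf.random.uniform([4, 8]) for _ in range(3)]

# Gather every tensor in the group along dimension 0 across all workers;
# one output tensor is returned per input tensor.
gathered = hvd.grouped_allgather(tensors)

# Reduce every tensor in the group across all workers and scatter the
# result, so each worker keeps only its own shard along dimension 0.
scattered = hvd.grouped_reducescatter(tensors)
```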

To implement this functionality I've added a new attribute, output_index, to TensorTableEntry. It allows an AllgatherOp or ReducescatterOp that has been enqueued as part of a group to allocate its output tensor properly (unlike for Allreduce, this cannot be done beforehand because the output size is only known after cross-process coordination). I've also added warning log messages in case AllocateOutput ever fails, because TensorFlow would otherwise crash with only a default "Unknown error." exception message.
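
To make the sizing issue concrete: Allgather concatenates inputs along the first dimension, and different ranks may contribute differently sized tensors, so no single process can compute the output shape locally. A hypothetical two-process sketch (shapes chosen only for illustration):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Each rank contributes a tensor whose first dimension depends on its rank:
# rank 0 sends shape (1, 3), rank 1 sends shape (2, 3), and so on.
local = tf.ones([hvd.rank() + 1, 3])

# The gathered result is concatenated along dimension 0. With two processes
# its shape is (1 + 2, 3) = (3, 3), which no rank can know before the first
# dimensions have been exchanged during coordination -- hence the output
# buffer can only be allocated after that step.
gathered = hvd.grouped_allgather([local])[0]
```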

Supported frameworks:

  • TensorFlow
  • PyTorch
  • MXNet

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Jul 7, 2022

Unit Test Results

1 011 files (−36) · 1 011 suites (−36) · 10h 32m 14s ⏱️ (−14m 22s)
814 tests (+27) · 764 ✔️ (+27) · 50 💤 (±0) · 0 (±0)
20 527 runs (+268) · 14 577 ✔️ (+250) · 5 950 💤 (+18) · 0 (±0)

Results for commit 9bd74dd. ± Comparison against base commit 757883b.

♻️ This comment has been updated with latest results.

github-actions bot commented Jul 7, 2022

Unit Test Results (with flaky tests)

1 131 files (−54) · 1 131 suites (−54) · 11h 2m 1s ⏱️ (−13m 25s)
814 tests (+27) · 764 ✔️ (+27) · 50 💤 (±0) · 0 (±0)
23 119 runs (+272) · 16 039 ✔️ (+240) · 7 080 💤 (+32) · 0 (±0)

Results for commit 9bd74dd. ± Comparison against base commit 757883b.

♻️ This comment has been updated with latest results.

@maxhgerlach maxhgerlach marked this pull request as ready for review July 7, 2022 10:20
@maxhgerlach maxhgerlach marked this pull request as draft July 21, 2022 14:17
@maxhgerlach maxhgerlach changed the title Add hvd.grouped_allgather and hvd.grouped_reducescatter (TensorFlow) Add hvd.grouped_allgather and hvd.grouped_reducescatter Jul 21, 2022
@maxhgerlach maxhgerlach force-pushed the pr_grouped_allgather_reducescatter branch from 16b472e to ece4fd2 Compare July 24, 2022 09:33
@maxhgerlach (Collaborator, Author) commented:

Rebased to master to pick up test fixes and #3590.

@maxhgerlach maxhgerlach marked this pull request as ready for review July 25, 2022 08:06
@maxhgerlach (Collaborator, Author) commented:

I believe this is ready to be reviewed now. For once, all the test suites have passed.

For PyTorch and MXNet I have extended the respective OpContext classes so they can refer to all output tensors of a group, but ultimately each op only needs to allocate memory for one specific output index (to perform an Allgather or a Reducescatter). In contrast, the TFOpContext already knew about all outputs of the group op. Maybe the design could be cleaned up a bit here to reduce confusion with how the two outputs of Alltoall (tensor and splits) are handled, but that potential issue hasn't seemed pressing to me so far.

@maxhgerlach maxhgerlach force-pushed the pr_grouped_allgather_reducescatter branch from c1823c4 to 9bd74dd Compare August 1, 2022 12:28
@romerojosh (Collaborator) left a comment:

Excellent work @maxhgerlach! Very well implemented, and I have no comments. Also, thanks for adding and cleaning up the documentation.

@maxhgerlach (Collaborator, Author) commented:

Thanks for taking the time to review, @romerojosh! Fortunately, your careful earlier work was easy to transfer, so this was quite straightforward.

@maxhgerlach maxhgerlach merged commit 8f450ab into horovod:master Aug 2, 2022
leewyang pushed a commit to leewyang/horovod that referenced this pull request Aug 5, 2022
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Lee Yang <leey@nvidia.com>