Add hvd.grouped_allgather and hvd.grouped_reducescatter #3594
Conversation
Unit Test Results (with flaky tests): 1 131 files (−54), 1 131 suites (−54), 11h 2m 1s ⏱️ (−13m 25s). Results for commit 9bd74dd. ± Comparison against base commit 757883b. ♻️ This comment has been updated with the latest results.
Force-pushed from 16b472e to ece4fd2 (compare).
Rebased to master to pick up test fixes and #3590.
I believe this is ready to be reviewed now. For once, all the test suites have passed. For PyTorch and MXNet I have extended the respective …
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Force-pushed from c1823c4 to 9bd74dd (compare).
Excellent work @maxhgerlach! Very well implemented and I have no comments. Also, thanks for adding and cleaning up the documentation.
Thanks for taking the time to review, @romerojosh! Fortunately, your careful earlier work was easy to transfer, so this was quite straightforward.
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Lee Yang <leey@nvidia.com>
Checklist before submitting
Description
This adds two new multi-tensor functions to Horovod: `hvd.grouped_allgather()` and `hvd.grouped_reducescatter()`. They are very similar to the existing `hvd.grouped_allreduce()`, introduced with PR #2453, giving users the equivalent extra control over tensor fusion for their `Allgather` and `Reducescatter` ops.

To implement this functionality I've added a new attribute `output_index` to `TensorTableEntry`. This allows an `AllgatherOp` or `ReducescatterOp` that has been enqueued as part of a group to allocate its output tensor properly (in contrast to `Allreduce`, this cannot be done beforehand because the resulting output size is only known after cross-process coordination). I've also added warning log messages in case `AllocateOutput` ever fails, because TensorFlow would otherwise crash with just a default "Unknown error." exception message in such a case.

Supported frameworks:
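As a rough illustration of the semantics described above (a single-process NumPy simulation, not the actual Horovod implementation or its exact signatures), the sketch below treats each rank's group of tensors as a list. It also shows why `Reducescatter`/`Allgather` output sizes depend on every participant, which is what motivates allocating outputs only after cross-process coordination:

```python
import numpy as np

def grouped_allgather(rank_inputs):
    # Simulated semantics of hvd.grouped_allgather: for each tensor
    # position in the group, concatenate every rank's tensor along
    # axis 0. Output sizes depend on all ranks' inputs, so in the real
    # op they are only known after coordination.
    num_tensors = len(rank_inputs[0])
    return [np.concatenate([r[i] for r in rank_inputs], axis=0)
            for i in range(num_tensors)]

def grouped_reducescatter(rank_inputs):
    # Simulated semantics of hvd.grouped_reducescatter: for each tensor
    # position, sum across ranks, then split the result along axis 0 so
    # that rank r keeps the r-th slice.
    num_ranks = len(rank_inputs)
    num_tensors = len(rank_inputs[0])
    outputs = []
    for i in range(num_tensors):
        reduced = np.sum([r[i] for r in rank_inputs], axis=0)
        outputs.append(np.split(reduced, num_ranks, axis=0))
    # outputs[i][r] is rank r's share of the i-th reduced tensor.
    return outputs

# Two simulated ranks, each contributing a group of two tensors.
rank0 = [np.ones((2, 3)), np.arange(4.0)]
rank1 = [np.ones((2, 3)) * 2, np.arange(4.0)]

gathered = grouped_allgather([rank0, rank1])     # shapes (4, 3) and (8,)
scattered = grouped_reducescatter([rank0, rank1])
```

In real Horovod usage, each rank would instead pass its own list of framework tensors to a single grouped call, letting the tensor-fusion logic treat the whole group as one unit rather than fusing tensors opportunistically.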
Review process to land