Add Reducescatter operator #1496
Conversation
Many (but not all) of the "Run PyTests" runs are timing out when hitting the Reducescatter unit tests. I am trying to understand what's causing this.
Force-pushed from 968f211 to 2ec57c7.
The two problematic Join tests are the ones that test "not implemented" failures for allgather and broadcast. I think it's OK to disable them for this PR; I'll work on a proper fix separately.
Looks like there may be a breaking change in TF head. The build image steps in the unit tests are failing now:
@jessebenson just triggered a rebuild. TensorFlow's last nightly had a bug in it that they've since rolled back, so it should be working now.
Force-pushed from d277dd8 to 6ac8aa1.
Hey @jessebenson, there was another breaking change by TensorFlow that required a fix in #1515. Can you rebase again?
@tgaddair - will do. I don't currently have ReduceScatter for MLSL, NCCL, or GLOO. Those will take a bit longer. MPI and GLOO ReduceScatter allow different receive counts per rank, while MLSL and NCCL require all ranks to have the same receive count. I was planning to do a partial Reduce to the last rank (if the tensor doesn't divide evenly) to solve this, similar in principle to how Horovod currently does hierarchical APIs. GLOO has ReduceScatter, but they don't expose an API to call it. Compare: …
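A rough sketch of the receive-count scheme described above (illustrative only, not code from this PR): each rank gets an equal chunk, and the remainder goes to the last rank via a separate partial Reduce, so equal-count backends like MLSL and NCCL can still handle the even portion.

```python
# Illustrative sketch (not code from this PR) of the receive-count scheme
# described above: equal chunks per rank, remainder reduced to the last rank.
def reducescatter_recvcounts(num_elements, world_size):
    base = num_elements // world_size
    counts = [base] * world_size
    # Backends that require equal counts (MLSL, NCCL) can reduce-scatter the
    # evenly divisible portion, then Reduce the leftover to the last rank.
    counts[-1] += num_elements % world_size
    return counts

print(reducescatter_recvcounts(10, 4))  # [2, 2, 2, 4]
```

MPI and GLOO could consume such per-rank counts directly, since their ReduceScatter variants allow different receive counts per rank.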
Force-pushed from 6ac8aa1 to f28960c.
Force-pushed from 9f02651 to e14ed57.
I think this pull request is in a good state to review. The future work would be Reducescatter implementations for NCCL, GLOO, and the new Intel CCL. MLSL is being removed. However, GLOO doesn't have a proper public API for Reducescatter, and CCL doesn't have Reducescatter at all, so that leaves NCCL.
Thanks @jessebenson, and apologies for the delay. Looks good, just one question regarding API alignment before we land.
horovod/tensorflow/__init__.py
```diff
@@ -118,6 +118,37 @@ def allreduce(tensor, average=None, device_dense='', device_sparse='',
     return new_tensor

+def reducescatter(tensor, average=True, device_dense='', compression=Compression.none):
```
Given that the Adasum PR deprecated the `average` param in favor of `op`, I'm wondering if we should do the same here. Might we want to support other reduction ops (min, max, product) in the future? My understanding is that NCCL ReduceScatter supports other reductions: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/api/types.html#c.ncclRedOp_t
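For context, a hypothetical sketch of what the `op`-based signature could look like, mirroring the post-Adasum `allreduce` API; the exact signature here is an assumption, not the final code:

```python
# Hypothetical op-based signature (an assumption, mirroring how allreduce
# replaced average= with op= after the Adasum PR; not the final API).
from horovod.tensorflow import Average, Compression

def reducescatter(tensor, op=Average, device_dense='',
                  compression=Compression.none):
    """Reduce-scatter `tensor` across ranks using the reduction `op`."""
    ...
```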
Yes, this seems perfectly reasonable.
Hey @jessebenson, any updates on the API consolidation?
The …
I would just support the new …
Force-pushed from 2cc1ee5 to ac2178b.
It became a bit more involved to update all of mxnet/keras/tensorflow/pytorch. Hopefully I was thorough and careful enough. I also added a unit test to …
```diff
@@ -201,7 +201,7 @@ def _broadcast_grad(op, grad):
     return grad_reduced

-def _reducescatter(tensor, name=None):
+def _reducescatter(tensor, name=None, op=Sum):
```
Looks like everywhere else the default is `Average`, but here it's `Sum`. Is there a reason, or should we make it `Average` here as well?
So, `Average` needs to be handled in the higher-level `reducescatter()` in `__init__.py` (which is where the division happens). I was trying to match the behavior of Allreduce, where `allreduce()` handles `Average` and passes `Sum` to `_allreduce()`.

- `_allreduce()` and `_reducescatter()` defaults are `Sum`
- `allreduce()` and `reducescatter()` defaults are `Average`
Let me make a small tweak to `reducescatter()`, though. Since the underlying `_reducescatter()` only supports `Sum` at the moment, I never actually pass the reduce op through to it. However, I should pass something to it, since any errors will get caught in `EnqueueTensorReduceScatter` anyway, so it's "safe" to pass the op. That way future people aren't confused why they pass `Max` and it doesn't show up in the C++ API.
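Putting the last two comments together, a condensed sketch of the layering (an assumption, not the actual PR code): the public function implements `Average` as a post-divide and forwards the requested op so unsupported reductions fail in the C++ layer.

```python
# Condensed sketch of the layering described above (an assumption, not the
# actual PR code). Names like tf, size(), Sum, Average, and _reducescatter
# assume the context of horovod/tensorflow/__init__.py; compression is
# omitted for brevity.
def reducescatter(tensor, op=Average, device_dense=''):
    with tf.device(device_dense):
        # Forward the requested op: Average is implemented as Sum + divide
        # here, while unsupported ops (e.g. Max) surface errors from the
        # C++ EnqueueTensorReduceScatter path instead of silently vanishing.
        reduce_op = Sum if op == Average else op
        summed = _reducescatter(tensor, op=reduce_op)
        return summed / size() if op == Average else summed
```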
Looks good, just one small question about the default param before we land.
Force-pushed from a7ce676 to e326919.
Some tests are failing now, likely a bad interaction between my changes and a recent change from master. I'll have to investigate.
After #1594 you need to set tensor sizes and data type for requests, like here: https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L581. Also there is the ongoing #1604.
@kit1980 - thanks, that is useful to know. What are these tensor sizes used for? Should they correspond to the fusion buffer size, the response size, or something else? For Allreduce/Adasum, the input/output tensors are the same size.
@jessebenson, currently the sizes are used in this way only for AllReduce and AdaSum. The tensor sizes are used in https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L642 and https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L651. The sizes set in the response in this case mean the number of elements in the tensor. Not sure how this will work when input/output sizes are different. Basically, the recent change was that …
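To make the ambiguity concrete (a toy example, not Horovod controller code): for Allreduce the input and output element counts match, but for Reducescatter each rank's output is only a slice of the input, using the same remainder-to-last-rank split sketched earlier.

```python
# Toy example (not Horovod controller code): input vs. output element counts
# per rank. For allreduce they match; for reducescatter they differ, so a
# "tensor size" recorded in a response could mean either one.
def output_elements(input_elements, world_size, rank):
    base, rem = divmod(input_elements, world_size)
    # Last rank absorbs the remainder, per the scheme in this PR.
    return base + (rem if rank == world_size - 1 else 0)

for rank in range(4):
    print(rank, output_elements(10, 4, rank))  # ranks 0-2 get 2, rank 3 gets 4
```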
Force-pushed from 14b233e to dbc7a88.
Force-pushed from dbc7a88 to fd87f52.
Any updates on this PR?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@legatoo, @ducviet00, and anybody else who's interested: I've revived this PR in #3299 and any feedback would be appreciated.
Add Reducescatter operator. From the NVIDIA documentation: …

Updated `concepts.rst` documentation to describe the operator.
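For readers unfamiliar with the operator, a minimal NumPy illustration of reduce-scatter semantics with a sum reduction (equivalent to a full reduce followed by a scatter; illustrative only, not Horovod code):

```python
import numpy as np

# Minimal illustration of reduce-scatter (sum) semantics: equivalent to a
# full reduce across ranks followed by scattering equal chunks of the result.
world_size = 4
rank_tensors = [np.arange(8, dtype=np.float32) + r for r in range(world_size)]

reduced = np.sum(rank_tensors, axis=0)   # reduce step: elementwise sum
chunks = np.split(reduced, world_size)   # scatter step: one chunk per rank
for rank, chunk in enumerate(chunks):
    print(f"rank {rank} receives {chunk}")
```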