
Add Reducescatter operator #1496

Closed
jessebenson wants to merge 5 commits from the reducescatter branch

Conversation

jessebenson
Contributor

@jessebenson jessebenson commented Nov 4, 2019

Add Reducescatter operator. From the NVIDIA documentation:
[Diagram: ReduceScatter reduces values across ranks and scatters equal-sized blocks of the result, one block per rank.]

  1. Implement Reducescatter operator on CPU and GPU (with MPI+CUDA)
  2. Support tensor fusion with Reducescatter
  3. Expose Reducescatter in Python through pytorch, tensorflow, keras, mxnet (a usage sketch follows this list)
  4. Add unit tests for pytorch/tensorflow (similar set as covered by allreduce)
  5. Update concepts.rst documentation to describe the operator
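
A minimal usage sketch of the Python API this PR exposes, shown here for PyTorch (the hvd.reducescatter name follows this thread; exact defaults and argument names may differ in the final version):

import torch
import horovod.torch as hvd

hvd.init()

# Each rank contributes a tensor of the same shape; the first dimension is
# split into equal blocks, so every rank gets back roughly 1/size of the
# reduced result.
tensor = torch.ones(4 * hvd.size(), 2) * hvd.rank()
reduced_block = hvd.reducescatter(tensor)

print("rank %d received a block of shape %s" % (hvd.rank(), tuple(reduced_block.shape)))

Run under MPI (e.g. horovodrun -np 4 python script.py); each rank prints a (4, 2) block.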

@jessebenson
Contributor Author

jessebenson commented Nov 5, 2019

Many (but not all) of the "Run PyTests" runs are timing out when hitting the Reducescatter unit tests. I am trying to understand what's causing this.
Update 1: the Reducescatter unit tests only fail if the 'Join' unit tests run first. Join is not enabled in PyTorch v1, which is why some runs pass. Investigating ...
Update 2: disabled the two 'Join' unit tests causing the issue for now (discussed with the author). It is not related to Reducescatter - if you force the Join unit tests to run first (e.g. by prefixing them with 'aaa'), the same issue appears.

@kit1980
Contributor

kit1980 commented Nov 5, 2019

The two problematic Join tests are the ones that test "not implemented" failures for allgather and broadcast. I think it's OK to disable them for this PR; I'll work on a proper fix separately.

@jessebenson
Contributor Author

Looks like there may be a breaking change in TF head. The build-image steps in the unit tests are failing now:

ModuleNotFoundError: No module named 'tensorflow_core.keras'

@tgaddair
Collaborator

tgaddair commented Nov 6, 2019

@jessebenson just triggered a rebuild. TensorFlow's last nightly had a bug that they've since rolled back, so it should be working now.

@tgaddair
Collaborator

Hey @jessebenson, there was another breaking change by TensorFlow that required a fix in #1515. Can you rebase again?

@jessebenson
Contributor Author

@tgaddair - will do.

I don't currently have ReduceScatter for MLSL, NCCL, or GLOO; those will take a bit longer. MPI and GLOO ReduceScatter allow different receive counts per rank, while MLSL and NCCL require all ranks to have the same receive count. I was planning to do a partial Reduce to the last rank (if the tensor doesn't divide evenly) to solve this, similar in principle to how Horovod currently does hierarchical APIs (see the sketch after the links below).

GLOO has ReduceScatter, but doesn't expose a public API to call it - compare:
https://github.com/facebookincubator/gloo/blob/master/gloo/reduce_scatter.h
https://github.com/facebookincubator/gloo/blob/master/gloo/allgather.h#L71
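
An illustrative sketch (not Horovod code) of the receive-count scheme described above: split the first dimension evenly across ranks and give the remainder to the last rank, which would receive those extra rows via a separate partial Reduce.

def receive_counts(first_dim, world_size):
    # Equal blocks for every rank; any leftover rows go to the last rank.
    base = first_dim // world_size
    counts = [base] * world_size
    counts[-1] += first_dim % world_size
    return counts

# e.g. a tensor whose first dimension is 10, reduce-scattered across 4 ranks:
assert receive_counts(10, 4) == [2, 2, 2, 4]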

@jessebenson
Contributor Author

I think this pull request is in a good state to review.

Future work would be Reducescatter implementations for NCCL, GLOO, and the new Intel CCL (MLSL is being removed). However, GLOO doesn't have a proper public API for Reducescatter and CCL doesn't have Reducescatter at all - so that leaves NCCL.


@tgaddair tgaddair left a comment


Thanks @jessebenson, and apologies for the delay. Looks good, just one question regarding API alignment before we land.

@@ -118,6 +118,37 @@ def allreduce(tensor, average=None, device_dense='', device_sparse='',
     return new_tensor
 
 
+def reducescatter(tensor, average=True, device_dense='', compression=Compression.none):
Collaborator


Given that the Adasum PR deprecated the average param in favor of op, I'm wondering if we should do the same here. Might we want to support other reduction ops (min, max, product) in the future?

My understanding is that NCCL reduceScatter supports other reductions: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/api/types.html#c.ncclRedOp_t
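
A hedged sketch of the suggested change: drop the boolean average flag in favor of an op parameter, mirroring what the Adasum PR did for allreduce (the import path and default value below are assumptions for illustration, not the final signature):

from horovod.tensorflow import Average, Compression

def reducescatter(tensor, op=Average, device_dense='',
                  compression=Compression.none):
    """Reduce `tensor` across ranks with `op`, then scatter equal-sized
    blocks of the result, one block per rank."""
    ...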

Contributor Author


Yes this seems perfectly reasonable.

horovod/tensorflow/mpi_ops.cc (outdated review comment, resolved)
@tgaddair
Collaborator

Hey @jessebenson, any updates on the API consolidation?

@jessebenson
Contributor Author

Allreduce sort of supports both average and op now, with some legacy logic to decide between them. I'm wondering - should I support just the op parameter?

@tgaddair
Collaborator

I would just support the new op param since this is new functionality. Thanks for checking!

@jessebenson jessebenson force-pushed the reducescatter branch 2 times, most recently from 2cc1ee5 to ac2178b on December 17, 2019 19:35
@jessebenson
Contributor Author

It became a bit more involved to update all of mxnet/keras/tensorflow/pytorch. Hopefully I was thorough and careful enough. I also added a unit test to test_torch.py to verify that passing op=hvd.Adasum gives an error.
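
A minimal sketch of the kind of check described above (the test name and exact exception type are assumptions; the real test lives in test_torch.py):

import pytest
import torch
import horovod.torch as hvd

def test_reducescatter_rejects_adasum():
    hvd.init()
    tensor = torch.ones(4 * hvd.size())
    # Adasum is not a supported reduction for reducescatter, so this should fail.
    with pytest.raises(Exception):
        hvd.reducescatter(tensor, op=hvd.Adasum)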

@@ -201,7 +201,7 @@ def _broadcast_grad(op, grad):
     return grad_reduced
 
 
-def _reducescatter(tensor, name=None):
+def _reducescatter(tensor, name=None, op=Sum):
Collaborator


Looks like everywhere else the default is Average but here it's Sum. Is there a reason, or should we make it Average here as well?

Contributor Author


So, Average needs to be handled in the higher-level reducescatter() in __init__.py (which is where the division happens). I was trying to match the behavior of Allreduce, where allreduce() handles Average and passes Sum to _allreduce().

_allreduce() and _reducescatter() default to Sum; allreduce() and reducescatter() default to Average.
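
An illustrative, self-contained sketch of that split: the public reducescatter() handles Average by dividing the summed result by the number of ranks, while the low-level _reducescatter() only ever performs a Sum. The names mimic the Horovod API, but the bodies are stand-ins rather than Horovod code:

SUM, AVERAGE = "Sum", "Average"

def _reducescatter(tensor, op=SUM):
    # Stand-in for the C++-backed op: pretend two identical ranks contributed.
    return [2.0 * x for x in tensor]

def size():
    return 2  # stand-in for hvd.size()

def reducescatter(tensor, op=AVERAGE):
    summed = _reducescatter(tensor, op=SUM)  # the backend always sums
    return [x / size() for x in summed] if op == AVERAGE else summed

assert reducescatter([1.0, 2.0]) == [1.0, 2.0]            # Average divides by size
assert reducescatter([1.0, 2.0], op=SUM) == [2.0, 4.0]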

Contributor Author


Let me make a small tweak to reducescatter() though. Since the underlying _reducescatter() only supports Sum at the moment, I never actually pass the reduce op through to it. However, I should pass it through anyway: any errors will get caught in EnqueueTensorReduceScatter, so it's "safe" to pass the op, and future readers won't be confused about why the op they pass (e.g. Max) never shows up in the C++ API.


@tgaddair tgaddair left a comment


Looks good, just one small question about default param before we land.

@jessebenson
Contributor Author

Some tests are failing now - likely a bad interaction with my changes and a recent change from master. I'll have to investigate.

@kit1980
Contributor

kit1980 commented Dec 21, 2019

> Some tests are failing now - likely a bad interaction with my changes and a recent change from master. I'll have to investigate.

After #1594 you need to set tensor sizes and data type for requests, like here: https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L581

Also, there is the ongoing #1604.

@jessebenson
Contributor Author

jessebenson commented Dec 22, 2019

@kit1980 - thanks, that is useful to know. What are these tensor sizes used for? Should they correspond to the fusion buffer size, the response size, or something else?

For Allreduce/Adasum, the input/output tensors are the same size.
For AllGather, the input (per rank) is size ~T and output is ~T*N.
For ReduceScatter, the input (per rank) is size T and output is ~T/N.
For Broadcast, it doesn't seem to set the tensor sizes (is it not needed?)

@kit1980
Contributor

kit1980 commented Dec 22, 2019

@jessebenson, currently the sizes are used in this way only for AllReduce and AdaSum.

The tensor sizes are used in https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L642 and https://github.com/horovod/horovod/blob/master/horovod/common/controller.cc#L651

The sizes set in the response in this case mean the number of elements in the tensor. I'm not sure how this will work when the input and output sizes are different.

Basically, the recent change was that FuseResponses used to call tensor_queue_.GetTensorSizeAndType(response.tensor_names()[0], tensor_size, dtype); now tensor_size and dtype come from the response directly (with the caveat that tensor_sizes in the response are element counts, so you need to multiply by the element size to get bytes).
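
A small sketch of the size bookkeeping described above, assuming tensor_sizes in the response are element counts rather than bytes (the helper name is hypothetical, not Horovod code):

def fused_response_bytes(tensor_sizes, element_size):
    # tensor_sizes are element counts per fused tensor; multiply by the
    # element size (e.g. 4 for float32) to get the byte total.
    return sum(num_elements * element_size for num_elements in tensor_sizes)

# e.g. three float32 tensors of 1024, 2048, and 512 elements:
assert fused_response_bytes([1024, 2048, 512], 4) == (1024 + 2048 + 512) * 4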

jessebenson and others added 3 commits January 30, 2020 20:26 (signed off by Jesse Benson)
@legatoo

legatoo commented Aug 18, 2020

Any updates on this PR?

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 6, 2020
@stale stale bot closed this Nov 13, 2020
@maxhgerlach
Collaborator

@legatoo, @ducviet00, and anybody else who's interested: I've revived this PR in #3299 and any feedback would be appreciated.
