Add support for additional reduction operations for allreduce (min, max, product). #3660

romerojosh · 2022-08-20T15:32:54Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

Currently, Horovod's allreduce supports only sum and average reductions. This PR adds support for three more reduction operations for allreduce: min, max, and product.
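As a rough illustration of what the additional reductions compute, the sketch below simulates an allreduce over per-rank lists in plain Python. It does not call Horovod. `simulate_allreduce` and the `OPS` table are hypothetical names used only for this example, and the per-rank values are made up.

```python
from functools import reduce
import operator

# Hypothetical table of elementwise reductions. The keys mirror the ops
# discussed in this PR; this is not Horovod's actual API.
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "product": operator.mul,
}

def simulate_allreduce(per_rank_tensors, op="sum"):
    """Reduce equally sized per-rank lists elementwise.

    In a real allreduce every rank receives this same result.
    """
    if op == "average":
        total = simulate_allreduce(per_rank_tensors, "sum")
        return [x / len(per_rank_tensors) for x in total]
    combine = OPS[op]
    # zip(*...) groups the i-th element from every rank together.
    return [reduce(combine, column) for column in zip(*per_rank_tensors)]

# Three simulated ranks, each holding a 3-element "tensor" (made-up data).
ranks = [[1, 5, 2], [4, 0, 3], [2, 2, 2]]
results = {op: simulate_allreduce(ranks, op)
           for op in ("sum", "min", "max", "product")}
```

With these inputs, min and max pick the smallest and largest value at each position across ranks, and product multiplies them. Sum and average behave as before.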

This PR includes changes from #3646, so it should not be merged until that one lands.

github-actions · 2022-08-20T17:04:45Z

Unit Test Results

1 087 files +38   1 087 suites +38   11h 3m 0s ⏱️ +6m 8s
828 tests +15   770 ✔️ +15   58 💤 ±0   0 ❌ ±0
22 010 runs +1 418   15 422 ✔️ +886   6 588 💤 +532   0 ❌ ±0

Results for commit 1129880. ± Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

github-actions · 2022-08-20T17:04:58Z

Unit Test Results (with flaky tests)

1 201 files -3   1 201 suites -3   11h 32m 32s ⏱️ -10m 6s
828 tests +15   770 ✔️ +16   58 💤 ±0   0 ❌ -1
24 554 runs +967   16 932 ✔️ +575   7 622 💤 +393   0 ❌ -1

Results for commit 1129880. ± Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

maxhgerlach

I like this PR a lot, nice work @romerojosh! Not only does it add useful functionality, but multiple code paths are also cleaner now thanks to effective use of the ReduceOp abstraction.

I've left some minor comments.

horovod/common/operations.cc

horovod/common/wire/message_generated.h

horovod/tensorflow/__init__.py

maxhgerlach · 2022-09-14T14:45:49Z

horovod/mxnet/mpi_ops.cc

@@ -490,7 +490,7 @@ void DoHorovodOperationCudaOnCPU(void*, void* on_complete_ptr, void* param) {
    enqueue_result = EnqueueTensorAllreduces(
        hvd_contexts, hvd_cpu_buffers, hvd_cpu_buffers, ready_event_lists,
        ops_param->op_names, device, callbacks,
-        (average) ? ReduceOp::AVERAGE : ReduceOp::SUM, prescale_factor,
+        reduce_op, prescale_factor,


Great that the extra layers of conversion between the average and reduce_op arguments in MXNet could be removed now!
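To make the point about the removed conversion layer concrete, here is a minimal sketch in Python rather than the C++ of the diff. The `ReduceOp` enum below is a hypothetical mirror of the ops named in this PR, and `op_from_average_flag` is an illustrative stand-in for the old ternary, not code from Horovod.

```python
from enum import Enum

class ReduceOp(Enum):
    # Hypothetical Python mirror of the ops named in this PR.
    AVERAGE = "average"
    SUM = "sum"
    MIN = "min"
    MAX = "max"
    PRODUCT = "product"

def op_from_average_flag(average):
    """Old-style mapping: a boolean can only ever select AVERAGE or SUM."""
    return ReduceOp.AVERAGE if average else ReduceOp.SUM

# Every op the boolean flag can express.
reachable = {op_from_average_flag(flag) for flag in (True, False)}

# Ops that only become expressible once reduce_op is passed through directly.
unreachable = set(ReduceOp) - reachable
```

Here `unreachable` contains MIN, MAX, and PRODUCT, which is why passing `reduce_op` straight through is needed for the new operations and not merely a cleanup.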

test/parallel/test_tensorflow.py

test/parallel/base_test_mxnet.py

test/parallel/test_torch.py

maxhgerlach

LGTM!

romerojosh force-pushed the add_allreduce_ops branch from 42edd02 to 65d51c9 Compare August 23, 2022 17:39

romerojosh added 6 commits September 12, 2022 10:57

Add support for additional reduction operations for allreduce (min, max, product).

f27ec7e

Signed-off-by: Josh Romero <joshr@nvidia.com>

Fix compilation errors.

17d304d

Signed-off-by: Josh Romero <joshr@nvidia.com>

Handle reduce_op in FuseResponses.

14ae76f

Signed-off-by: Josh Romero <joshr@nvidia.com>

Fix gloo header.

09b989c

Signed-off-by: Josh Romero <joshr@nvidia.com>

Try slice instead of ellipsis in MXNet tests.

30dff5e

Signed-off-by: Josh Romero <joshr@nvidia.com>

Update more asserts in MXNet mpi_ops.py.

9c26bf8

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh force-pushed the add_allreduce_ops branch from 65d51c9 to 9c26bf8 Compare September 12, 2022 17:57

romerojosh marked this pull request as ready for review September 13, 2022 01:22

romerojosh requested review from tgaddair and maxhgerlach September 13, 2022 01:23

maxhgerlach reviewed Sep 14, 2022

Addressing comments.

1129880

Signed-off-by: Josh Romero <joshr@nvidia.com>

maxhgerlach approved these changes Sep 14, 2022

romerojosh merged commit 427b633 into horovod:master Sep 19, 2022

maxhgerlach mentioned this pull request Sep 20, 2022

Reducescatter: Add support for additional reduction ops: min, max, product #3709

Open

maxhgerlach mentioned this pull request Oct 4, 2022

AllReduceMin/Max needed #1425

Closed
