Add support for additional reduction operations for allreduce (min, max, product). #3660

Merged
merged 7 commits into from Sep 19, 2022

romerojosh
Collaborator

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Currently, Horovod supports allreduce only with sum or average reductions. This PR adds support for additional reduction operations (min, max, product) in allreduce.

This PR includes the changes from #3646, so it should not be merged until that one lands.
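
To make the semantics of the new operations concrete, here is a minimal pure-Python sketch of what each reduction computes across ranks. It does not call Horovod at all: `simulate_allreduce` and the `REDUCE_OPS` table are hypothetical names introduced only for illustration, and each inner list stands in for the tensor held by one rank.

```python
from functools import reduce
import operator

# Hypothetical table of binary reductions; these names are illustrative,
# not Horovod's API.
REDUCE_OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "product": operator.mul,
}


def simulate_allreduce(per_rank_tensors, op="sum"):
    """Reduce equally sized per-rank lists elementwise.

    In a real allreduce every rank receives this same result.
    """
    if op == "average":
        total = simulate_allreduce(per_rank_tensors, "sum")
        return [x / len(per_rank_tensors) for x in total]
    combine = REDUCE_OPS[op]
    # zip(*...) groups the i-th element from every rank together.
    return [reduce(combine, column) for column in zip(*per_rank_tensors)]


# Three simulated ranks, each holding a three-element tensor.
ranks = [[1.0, 5.0, 2.0], [4.0, 3.0, 2.0], [2.0, 8.0, 0.5]]
for name in ("sum", "average", "min", "max", "product"):
    print(name, simulate_allreduce(ranks, name))
```

Unlike average, the min, max, and product reductions cannot be derived from a sum followed by a local scaling step, which is why they need their own reduction operations.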

github-actions bot commented Aug 20, 2022

Unit Test Results

Files: 1 087 (+38), suites: 1 087 (+38), duration: 11h 3m 0s (+6m 8s)
Tests: 828 (+15): 770 passed (+15), 58 skipped (±0), 0 failed (±0)
Runs: 22 010 (+1 418): 15 422 passed (+886), 6 588 skipped (+532), 0 failed (±0)

Results for commit 1129880. ± Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

github-actions bot commented Aug 20, 2022

Unit Test Results (with flaky tests)

Files: 1 201 (-3), suites: 1 201 (-3), duration: 11h 32m 32s (-10m 6s)
Tests: 828 (+15): 770 passed (+16), 58 skipped (±0), 0 failed (-1)
Runs: 24 554 (+967): 16 932 passed (+575), 7 622 skipped (+393), 0 failed (-1)

Results for commit 1129880. ± Comparison against base commit 25ed803.

♻️ This comment has been updated with latest results.

…ax, product).

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Josh Romero <joshr@nvidia.com>
Collaborator

@maxhgerlach maxhgerlach left a comment

I like this PR a lot, nice work @romerojosh! Not only does it add useful functionality, but multiple code paths are also cleaner now thanks to effective use of the ReduceOp abstraction.

I've left some minor comments.

horovod/common/operations.cc (resolved)
horovod/common/wire/message_generated.h (outdated, resolved)
horovod/tensorflow/__init__.py (resolved)
@@ -490,7 +490,7 @@ void DoHorovodOperationCudaOnCPU(void*, void* on_complete_ptr, void* param) {
   enqueue_result = EnqueueTensorAllreduces(
       hvd_contexts, hvd_cpu_buffers, hvd_cpu_buffers, ready_event_lists,
       ops_param->op_names, device, callbacks,
-      (average) ? ReduceOp::AVERAGE : ReduceOp::SUM, prescale_factor,
+      reduce_op, prescale_factor,

Great that the extra layers of conversion between the average and reduce_op arguments in MXNet could be removed now!
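
The simplification mentioned here can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the code from this PR: `resolve_reduce_op` is a hypothetical helper, the `ReduceOp` class below is a stand-in for the real enum, only `AVERAGE` and `SUM` appear in the diff above (the other member names are assumed), and the default chosen when neither argument is given is an assumption of the sketch.

```python
from enum import Enum


class ReduceOp(Enum):
    # Illustrative stand-in for the ReduceOp abstraction named in the review.
    AVERAGE = "average"
    SUM = "sum"
    # Assumed member names for the operations this PR adds.
    MIN = "min"
    MAX = "max"
    PRODUCT = "product"


def resolve_reduce_op(op=None, average=None):
    """Hypothetical helper: fold a legacy `average` flag into one ReduceOp."""
    if op is not None and average is not None:
        raise ValueError("pass either op or average, not both")
    if op is not None:
        return op
    if average is None:
        return ReduceOp.AVERAGE  # assumed default for this sketch
    # A boolean can only ever select between these two operations.
    return ReduceOp.AVERAGE if average else ReduceOp.SUM


print(resolve_reduce_op(average=False))
print(resolve_reduce_op(op=ReduceOp.MIN))
```

Because a boolean can express only two of the five operations, doing this conversion once at the API boundary and passing the op through unchanged avoids converting back and forth in deeper layers.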

test/parallel/test_tensorflow.py (outdated, resolved)
test/parallel/test_tensorflow.py (outdated, resolved)
test/parallel/test_tensorflow.py (outdated, resolved)
test/parallel/test_tensorflow.py (outdated, resolved)
test/parallel/base_test_mxnet.py (resolved)
test/parallel/test_torch.py (resolved)
Signed-off-by: Josh Romero <joshr@nvidia.com>
Collaborator

@maxhgerlach maxhgerlach left a comment

LGTM!

2 participants