
Add Reducescatter op (NCCL, MPI, Gloo) #3299

Merged · 43 commits · Mar 8, 2022

Conversation

maxhgerlach (Collaborator) commented on Dec 2, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR is an update and extension of @jessebenson's PR #1496 from late 2019 / early 2020.

It adds a new collective operation to Horovod, hvd.reducescatter(), as illustrated by this figure from NVIDIA's documentation:
[Figure: ReduceScatter operation, from NVIDIA's NCCL documentation]

Since a gradient allreduce can be understood as a reducescatter of the gradients followed by an allgather, having both intermediate operations available provides Horovod users with extra flexibility in their Python training code. For instance, they might experiment with distributing their optimizer states over the worker instances to achieve one of the ZeRO schemes demonstrated in Microsoft's "DeepSpeed".
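
To make the decomposition concrete, here is a minimal sketch using the PyTorch bindings added in this PR (op=hvd.Sum is passed explicitly, since default reductions may differ between the ops):

```python
# Sketch: a summing allreduce decomposed into reducescatter + allgather.
import torch
import horovod.torch as hvd

hvd.init()
grad = torch.arange(8, dtype=torch.float32)

# One fused collective ...
summed = hvd.allreduce(grad, op=hvd.Sum)

# ... is equivalent to a reducescatter (each rank keeps one reduced shard
# along dimension 0) followed by an allgather that reassembles the shards.
shard = hvd.reducescatter(grad, op=hvd.Sum)
reassembled = hvd.allgather(shard)

assert torch.allclose(summed, reassembled)
```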

Included implementations:

  • MPI and MPI GPU: mostly taken from Jesse's commits
  • NCCL: Because there is no "ncclReduceScatterV()", this required some extra attention for the case where a tensor cannot be divided evenly over the Horovod processes. In that case the first tensor_shape.dim_size(0) % process_set_size ranks receive a slightly larger portion of the reduced tensor than the later ranks (one extra slice each; see the sketch after this list). I've implemented this via an ncclGroup of ncclReduce() calls, similar to how Horovod's NCCLAllgather simulates "ncclAllGatherV" with a group of ncclBroadcast() calls.
  • Gloo: There is no high-level reducescatter API comparable to gloo::allreduce(), gloo::allgatherv(), etc., but the algorithm gloo::ReduceScatterHalvingDoubling can be called directly. Edit: Sadly, this one does not work on macOS, where we need to use the libuv transport rather than tcp.
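
The uneven case follows a simple slicing rule; here is a minimal sketch of it (the function name is illustrative, not part of Horovod's code base):

```python
# Sketch of the uneven slicing rule described in the NCCL bullet above.
def reducescatter_slice_sizes(dim0, process_set_size):
    base, remainder = divmod(dim0, process_set_size)
    # The first dim0 % process_set_size ranks receive one extra slice each.
    return [base + 1 if rank < remainder else base
            for rank in range(process_set_size)]

# 10 rows over 4 processes: ranks 0 and 1 get 3 rows, ranks 2 and 3 get 2.
assert reducescatter_slice_sizes(10, 4) == [3, 3, 2, 2]
```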

oneCCL is not supported yet.

Bindings and tests are provided for TensorFlow, PyTorch, and MXNet.
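
As a usage illustration, a minimal sketch with the TensorFlow bindings (assuming the op argument mirrors the allreduce bindings; defaults may differ):

```python
# Sketch: reducescatter via the TensorFlow bindings; run under horovodrun.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# The first dimension need not be divisible by the number of ranks: with
# 4 ranks, ranks 0 and 1 receive a (3, 4) shard, ranks 2 and 3 a (2, 4) shard.
tensor = tf.fill([10, 4], float(hvd.rank()))
shard = hvd.reducescatter(tensor, op=hvd.Sum)
```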

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

jessebenson and others added 25 commits January 30, 2020 20:26

  • …atter() API.
  • Merge resolving conflicts in docs/gpus.rst, horovod/common (controller, message, operations, ops, tensor_queue, wire), the Keras, MXNet, TensorFlow, and PyTorch bindings, setup.py, test/test_tensorflow.py, and test/test_torch.py
  • …tion from PR horovod#1632
  • …t handling)
  • …n) and fix MPI check
  • Limitation: for now this only works with tensors that can be partitioned evenly over the number of Horovod processes.
  • …y over the Horovod processes, add tests
  • process sets; average reductions; GPU tests don't depend on MPI
  • Also silence some compiler warnings.
  • …ss sets and add a test
github-actions bot commented on Dec 2, 2021

Unit Test Results

802 files ±0 · 802 suites ±0 · 9h 51m 7s ⏱️ +26m 37s
756 tests +34 · 713 ✔️ +34 · 43 💤 ±0 · 0 ±0
18 665 runs +1 292 · 13 382 ✔️ +972 · 5 283 💤 +320 · 0 ±0

Results for commit 6e38c83. ± Comparison against base commit e02bdca.

♻️ This comment has been updated with latest results.

github-actions bot commented on Dec 2, 2021

Unit Test Results (with flaky tests)

886 files ±0 · 886 suites ±0 · 10h 13m 52s ⏱️ +28m 40s
756 tests +34 · 713 ✔️ +34 · 43 💤 ±0 · 0 ±0
20 861 runs +1 496 · 14 714 ✔️ +1 068 · 6 147 💤 +428 · 0 ±0

Results for commit 6e38c83. ± Comparison against base commit e02bdca.

♻️ This comment has been updated with latest results.

maxhgerlach added commits:

  • Lacking support in old versions of PyTorch.
  • …llowing horovod#3300, horovod#3313
maxhgerlach (Collaborator, Author) commented:

Refreshed this to master.
romerojosh (Collaborator) left a comment:

@maxhgerlach Thanks for the great work! This PR is looking very nice, and I especially like that you implemented the ReduceScatterV-like support with NCCL. Just left a couple of minor comments for this first pass.

Review threads (resolved): horovod/common/ops/nccl_operations.cc, horovod/mxnet/mpi_ops.py
maxhgerlach added commits:

  • … argument
  • …a multiple of the size of the process set
romerojosh (Collaborator) left a comment:

Thanks a lot for making the suggested changes, @maxhgerlach, and excellent work on this PR. LGTM!

maxhgerlach (Collaborator, Author) commented:
Cheers, I'm stoked to get this one in!

I've merged master once more to make sure no framework update from #3426 breaks any tests. Will land once the CI passes.
