Add Reducescatter op (NCCL, MPI, Gloo) #3299
Conversation
…atter() API. Signed-off-by: Jesse Benson (AI) <jesseb@microsoft.com>
# Conflicts:
#	docs/gpus.rst
#	horovod/_keras/__init__.py
#	horovod/common/controller.cc
#	horovod/common/message.h
#	horovod/common/operations.cc
#	horovod/common/operations.h
#	horovod/common/ops/collective_operations.cc
#	horovod/common/ops/collective_operations.h
#	horovod/common/ops/gpu_operations.h
#	horovod/common/ops/mpi_gpu_operations.cc
#	horovod/common/ops/mpi_operations.cc
#	horovod/common/ops/operation_manager.cc
#	horovod/common/ops/operation_manager.h
#	horovod/common/tensor_queue.cc
#	horovod/common/wire/message.fbs
#	horovod/common/wire/message_generated.h
#	horovod/keras/__init__.py
#	horovod/mxnet/__init__.py
#	horovod/mxnet/mpi_ops.cc
#	horovod/mxnet/mpi_ops.h
#	horovod/mxnet/mpi_ops.py
#	horovod/tensorflow/__init__.py
#	horovod/tensorflow/keras/__init__.py
#	horovod/tensorflow/mpi_ops.cc
#	horovod/tensorflow/mpi_ops.py
#	horovod/torch/__init__.py
#	horovod/torch/interface.h
#	horovod/torch/interface_cuda.h
#	horovod/torch/mpi_ops.cc
#	horovod/torch/mpi_ops.h
#	horovod/torch/mpi_ops.py
#	horovod/torch/mpi_ops_v2.cc
#	setup.py
#	test/test_tensorflow.py
#	test/test_torch.py
…tion from PR horovod#1632 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…t handling) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…n) and fix MPI check Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Limitation: For now this only works with tensors that can be partitioned evenly over the number of Horovod processes. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…y over the Horovod processes, add tests Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
- process sets
- average reductions
- GPU tests don't depend on MPI
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Also silence some compiler warnings. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…ss sets and add a test Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Unit Test Results (with flaky tests): 886 files ±0, 886 suites ±0, 10h 13m 52s ⏱️ +28m 40s. Results for commit 6e38c83. Comparison against base commit e02bdca.
Lacking support in old versions of PyTorch. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…llowing horovod#3300, horovod#3313 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Force-pushed from a74c56f to b85e60d.
Refreshed this to master.
@maxhgerlach Thanks for the great work! This PR is looking very nice, and I especially like that you implemented the ReduceScatterV-like support with NCCL. Just left a couple of minor comments for this first pass.
… argument Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…a multiple of the size of the process set Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Thanks a lot for making the suggested changes @maxhgerlach and excellent work on this PR. LGTM!
Cheers, I'm stoked to get this one in! I've merged master once more to make sure no framework update from #3426 breaks any tests. Will land once the CI passes.
Checklist before submitting
Description
This PR is an update and extension of @jessebenson's PR #1496 from late 2019 / early 2020.
It adds a new collective operation to Horovod, `hvd.reducescatter()`, as illustrated by this figure from NVIDIA's documentation: [figure: NCCL ReduceScatter diagram]

Since a gradient allreduce can be understood as a reducescatter of the gradients followed by an allgather, having both intermediate operations available provides Horovod users with extra flexibility in their Python training code. For instance, they might experiment with distributing their optimizer states over the worker instances to achieve one of the ZeRO schemes demonstrated in Microsoft's "DeepSpeed".
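The allreduce decomposition mentioned above can be sketched in plain NumPy (a reference-semantics illustration, not Horovod's actual API; the helper names are hypothetical):

```python
import numpy as np

# Sketch of collective semantics on a single machine: an allreduce over
# `world_size` ranks is equivalent to a reducescatter (each rank keeps the
# reduction of one slice) followed by an allgather (ranks exchange slices).

def reducescatter(tensors):
    """Rank i receives the elementwise sum of slice i across all ranks."""
    world_size = len(tensors)
    slices = [np.array_split(t, world_size) for t in tensors]
    return [sum(slices[r][i] for r in range(world_size))
            for i in range(world_size)]

def allgather(slices):
    """Every rank receives the concatenation of all ranks' slices."""
    return np.concatenate(slices)

# Four simulated ranks, each holding a gradient tensor of 8 elements.
world = [np.arange(8, dtype=np.float64) * (r + 1) for r in range(4)]
scattered = reducescatter(world)   # rank i now holds only reduced slice i
full = allgather(scattered)        # identical full result on every rank
assert np.allclose(full, sum(world))  # matches a direct allreduce (sum)
```

This is the identity that ZeRO-style optimizer-state sharding exploits: between the reducescatter and the allgather, each rank only needs the optimizer state for its own slice.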
Included implementations:

- NCCL: The first `tensor_shape.dim_size(0) % process_set_size` ranks receive a slightly larger portion of the reduced tensor than the later ranks (one extra slice each). I've implemented this case via an `ncclGroup` of `ncclReduce()` calls, similarly to how Horovod's `NCCLAllgather` simulates "ncclAllGatherV" with a group of `ncclBroadcast()` calls.
- Gloo: For `reducescatter` there somehow is no high-level API like `gloo::allreduce()`, `gloo::allgatherv`, etc. However, there is an algorithm `gloo::ReduceScatterHalvingDoubling` which can be called directly. Edit: Sadly, this one does not work on macOS, where we need to use the `libuv` transport, rather than `tcp`.
- oneCCL is not supported yet.
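The uneven-split rule for the "ReduceScatterV"-style case can be illustrated with a small helper (hypothetical code for illustration, not Horovod source):

```python
# Per-rank row counts when dim 0 of the tensor is not a multiple of the
# process-set size: each rank gets floor(rows / size) rows, and the first
# (rows % size) ranks get one extra row each, so earlier ranks receive a
# slightly larger slice of the reduced tensor.

def reducescatterv_row_counts(num_rows: int, process_set_size: int) -> list:
    base, extra = divmod(num_rows, process_set_size)
    return [base + 1 if rank < extra else base
            for rank in range(process_set_size)]

counts = reducescatterv_row_counts(10, 4)
assert counts == [3, 3, 2, 2]   # ranks 0 and 1 get one extra row
assert sum(counts) == 10        # every row is assigned exactly once
```

In the even case the helper degenerates to equal slices, which is the only case plain `ncclReduceScatter` supports; the grouped `ncclReduce()` calls cover the remainder.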
Bindings and tests are provided for TensorFlow, PyTorch, and MXNet.
Review process to land