Add Reducescatter op (NCCL, MPI, Gloo) #3299
Conversation
…atter() API. Signed-off-by: Jesse Benson (AI) <jesseb@microsoft.com>
# Conflicts:
#	docs/gpus.rst
#	horovod/_keras/__init__.py
#	horovod/common/controller.cc
#	horovod/common/message.h
#	horovod/common/operations.cc
#	horovod/common/operations.h
#	horovod/common/ops/collective_operations.cc
#	horovod/common/ops/collective_operations.h
#	horovod/common/ops/gpu_operations.h
#	horovod/common/ops/mpi_gpu_operations.cc
#	horovod/common/ops/mpi_operations.cc
#	horovod/common/ops/operation_manager.cc
#	horovod/common/ops/operation_manager.h
#	horovod/common/tensor_queue.cc
#	horovod/common/wire/message.fbs
#	horovod/common/wire/message_generated.h
#	horovod/keras/__init__.py
#	horovod/mxnet/__init__.py
#	horovod/mxnet/mpi_ops.cc
#	horovod/mxnet/mpi_ops.h
#	horovod/mxnet/mpi_ops.py
#	horovod/tensorflow/__init__.py
#	horovod/tensorflow/keras/__init__.py
#	horovod/tensorflow/mpi_ops.cc
#	horovod/tensorflow/mpi_ops.py
#	horovod/torch/__init__.py
#	horovod/torch/interface.h
#	horovod/torch/interface_cuda.h
#	horovod/torch/mpi_ops.cc
#	horovod/torch/mpi_ops.h
#	horovod/torch/mpi_ops.py
#	horovod/torch/mpi_ops_v2.cc
#	setup.py
#	test/test_tensorflow.py
#	test/test_torch.py
…tion from PR horovod#1632 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…t handling) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…n) and fix MPI check Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Limitation: For now this only works with tensors that can be partitioned evenly over the number of Horovod processes. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…y over the Horovod processes, add tests Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
- process sets
- average reductions
- GPU tests don't depend on MPI
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Also silence some compiler warnings. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…ss sets and add a test Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Unit Test Results (with flaky tests): 886 files ±0, 886 suites ±0, 10h 13m 52s ⏱️ +28m 40s. Results for commit 6e38c83. Comparison against base commit e02bdca.
Lacking support in old versions of PyTorch. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…llowing horovod#3300, horovod#3313 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Force-pushed from a74c56f to b85e60d.
Refreshed this to master.
@maxhgerlach Thanks for the great work! This PR is looking very nice, and I especially like that you implemented the ReduceScatterV-like support with NCCL. Just left a couple of minor comments for this first pass.
… argument Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…a multiple of the size of the process set Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Thanks a lot for making the suggested changes @maxhgerlach and excellent work on this PR. LGTM!
Cheers, I'm stoked to get this one in! I've merged master once more to make sure no framework update from #3426 breaks any tests. Will land once the CI passes.
Checklist before submitting
Description
This PR is an update and extension of @jessebenson's PR #1496 from late 2019 / early 2020.
It adds a new collective operation to Horovod, `hvd.reducescatter()`, as illustrated by this figure from NVIDIA's documentation: [figure: NCCL ReduceScatter diagram]

Since a gradient allreduce can be understood as a reducescatter of the gradients followed by an allgather, having both intermediate operations available provides Horovod users with extra flexibility in their Python training code. For instance, they might experiment with distributing their optimizer states over the worker instances to achieve one of the ZeRO schemes demonstrated in Microsoft's "DeepSpeed".
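The allreduce decomposition mentioned above can be sketched in plain NumPy (a reference-semantics illustration, not Horovod's actual API; the helper names are hypothetical):

```python
import numpy as np

# Sketch of collective semantics on a single machine: an allreduce over
# `world_size` ranks is equivalent to a reducescatter (each rank keeps the
# reduction of one slice) followed by an allgather (ranks exchange slices).

def reducescatter(tensors):
    """Rank i receives the elementwise sum of slice i across all ranks."""
    world_size = len(tensors)
    slices = [np.array_split(t, world_size) for t in tensors]
    return [sum(slices[r][i] for r in range(world_size))
            for i in range(world_size)]

def allgather(slices):
    """Every rank receives the concatenation of all ranks' slices."""
    return np.concatenate(slices)

# Four simulated ranks, each holding a gradient tensor of 8 elements.
world = [np.arange(8, dtype=np.float64) * (r + 1) for r in range(4)]
scattered = reducescatter(world)   # rank i now holds only reduced slice i
full = allgather(scattered)        # identical full result on every rank
assert np.allclose(full, sum(world))  # matches a direct allreduce (sum)
```

This is the identity that ZeRO-style optimizer-state sharding exploits: between the reducescatter and the allgather, each rank only needs the optimizer state for its own slice.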
Included implementations:

- NCCL: The first `tensor_shape.dim_size(0) % process_set_size` ranks receive a slightly larger portion of the reduced tensor than the later ranks (one extra slice each). I've implemented this case via an `ncclGroup` of `ncclReduce()` calls, similarly to how Horovod's `NCCLAllgather` simulates "ncclAllGatherV" with a group of `ncclBroadcast()` calls.
- Gloo: For `reducescatter` there somehow is no high-level API like `gloo::allreduce()`, `gloo::allgatherv`, etc. However, there is an algorithm `gloo::ReduceScatterHalvingDoubling` which can be called directly. Edit: Sadly, this one does not work on macOS, where we need to use the `libuv` transport, rather than `tcp`.
- oneCCL is not supported yet.
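The uneven-split rule for the "ReduceScatterV"-style case can be illustrated with a small helper (hypothetical code for illustration, not Horovod source):

```python
# Per-rank row counts when dim 0 of the tensor is not a multiple of the
# process-set size: each rank gets floor(rows / size) rows, and the first
# (rows % size) ranks get one extra row each, so earlier ranks receive a
# slightly larger slice of the reduced tensor.

def reducescatterv_row_counts(num_rows: int, process_set_size: int) -> list:
    base, extra = divmod(num_rows, process_set_size)
    return [base + 1 if rank < extra else base
            for rank in range(process_set_size)]

counts = reducescatterv_row_counts(10, 4)
assert counts == [3, 3, 2, 2]   # ranks 0 and 1 get one extra row
assert sum(counts) == 10        # every row is assigned exactly once
```

In the even case the helper degenerates to equal slices, which is the only case plain `ncclReduceScatter` supports; the grouped `ncclReduce()` calls cover the remainder.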
Bindings and tests are provided for TensorFlow, PyTorch, and MXNet.
Review process to land