Implement 2D torus allreduce using NCCL #3608
Conversation
Force-pushed from acadc18 to c431f03
Hi @romerojosh @maxhgerlach, could you possibly help take a look at this? Thank you so much in advance!
Unit Test Results (with flaky tests): 1 239 files +61, 1 239 suites +61, 13h 2m 45s ⏱️ +26m 12s. Results for commit 7d40f82. ± Comparison against base commit ed32fdc. This comment has been updated with the latest results.
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Hi @yundai424,
Clicking on "View raw logs" gave me this link: https://pipelines.actions.githubusercontent.com/serviceHosts/f4558236-f66f-4dfe-a2d7-34c1fd73cdf3/_apis/pipelines/1/runs/12519/signedlogcontent/37?urlExpires=2022-07-22T14%3A15%3A21.6410597Z&urlSigningMethod=HMACV1&urlSignature=jMzk2dL0akA8NUBBGGC053lIa%2BuheBe83mMFI551F88%3D
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Hi @maxhgerlach, thank you for taking a close look! I missed a closing bracket during the most recent update. Thanks for catching this!
@yundai424 Thanks a lot for the contribution! The implementation looks good to me, but I did leave a few small comments for you in this initial pass.
throw std::logic_error("Communicator type " +
                       std::to_string(communicator_type_) +
                       " is not supported in NCCL mode.");
nccl_rank = process_set.controller->GetCrossRank();
Can you modify this to check for else if (communicator_type_ == Communicator::CROSS) explicitly, to make this clearer?
Thank you so much for the review! I've updated the diff with the suggestions and tested with BERT pretraining, and everything looks correct.
…need it Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
I think this is a great addition for applications where this type of hierarchical allreduce beats the performance of both ring and tree algorithms in NCCL, which I assume can depend on the type of networking that's available. I look forward to giving this a try on local hardware.
I've only left a couple of minor remarks on the PR for now.
@romerojosh, would it make sense to eventually drop NCCLHierarchicalAllreduce entirely in favor of NCCLTorusAllreduce? Or does it still make sense to mix NCCL and MPI like that on some clusters?
group_torus_allreduce.add_argument('--torus-allreduce',
                                   action=make_override_true_action(override_args),
                                   help='Perform 2D NCCL torus allreduce between workers instead of '
                                        'ring allreduce. Torus allreduce is the same as hierarchical allreduce '
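As a rough standalone sketch of how such a flag can be wired up, the snippet below uses plain argparse; make_override_true_action and the surrounding horovodrun plumbing are Horovod-specific, so a plain store_true action stands in for them here (an assumption for illustration, not the actual launcher code).

```python
import argparse

parser = argparse.ArgumentParser(description='launcher sketch')
group_torus_allreduce = parser.add_argument_group('torus allreduce')
group_torus_allreduce.add_argument('--torus-allreduce',
                                   # stand-in for make_override_true_action(override_args)
                                   action='store_true',
                                   help='Perform 2D NCCL torus allreduce between '
                                        'workers instead of ring allreduce.')

args = parser.parse_args(['--torus-allreduce'])
print(args.torus_allreduce)
```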
I believe that NCCL on its own will choose between two allreduce algorithms (ring or tree), so the documentation string may be slightly misleading here.
Appreciate the suggestions! I've updated the diff accordingly and tested it. Please let me know how that looks. Thanks!
Since NCCL provides tree allreduce starting with NCCL 2.4, we might consider migrating from mixing with MPI to pure NCCL. IIUC, the original idea behind using MPI for the cross-node allreduce was to benefit from its tree topology, but I might be missing something 🧐
Thank you for the updates! I think this looks very nice.
I'm not really aware of the detailed history behind the hierarchical allreduce in Horovod. But yes, it seems to me that it was introduced because, at the time, GPU-aware MPI would scale better to a large number of nodes than NCCL. And the tree allreduce that was introduced later to NCCL appears to scale better for very large numbers of GPUs, where the latency of ring allreduce becomes expensive. So you may be very right. Mixing MPI and NCCL collectives comes with its own costs anyway.
I think so. We were already discussing dropping
LGTM, provided that there are no surprises in CI. Thanks for the great PR, @yundai424!
@romerojosh, would you mind taking another look at it to check if there's anything missed here? Thanks a lot! Also, I noticed there are two
That's currently a bug with all PRs originating from forked repos.
@yundai424 thanks for the contribution! LGTM, except for a minor nit about whitespace. Once you update that, we can merge once tests pass.
throw std::logic_error("Communicator type " +
                       std::to_string(communicator_type_) +
                       " is not supported in NCCL mode.");
throw std::logic_error("Communicator type " +
Nit: Extraneous spaces added here should be removed.
Hi @romerojosh, thanks for catching this! Updated now.
…n and update doc Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Checklist before submitting
Description
This PR implements 2D torus allreduce (https://asset-pdf.scinapse.io/prod/2920668770/2920668770.pdf) using NCCL. The algorithm and implementation are essentially the same as the NCCLHierarchicalAllreduce that's already in Horovod, but with MPI replaced by NCCL to perform the parallel cross-node allreduce. I named it NCCLTorusAllreduce to correctly reflect the algorithm.
The benefit of replacing MPI with NCCL is that the reduce computation can be performed on the GPU, which helps compute-bound cases. Torus allreduce is more bandwidth-optimal than tree allreduce and flat ring allreduce, especially with multiple NICs per host laid out to match GPU affinity, and it is more latency-optimal than flat ring allreduce.
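The latency advantage can be illustrated with a simple alpha-beta cost model. This is a back-of-the-envelope sketch with made-up constants, not a measurement from this PR; it only counts ring steps and bytes on the wire, and ignores intra-host vs cross-host link differences.

```python
# Illustrative alpha-beta cost model: flat ring vs 2D torus allreduce.
# alpha = per-message latency (s), beta = per-byte transfer time (s/byte).

def ring_allreduce_cost(p, n, alpha, beta):
    # Ring allreduce = reduce-scatter + allgather, each taking p-1 steps.
    return 2 * (p - 1) * alpha + 2 * ((p - 1) / p) * n * beta

def torus_allreduce_cost(hosts, gpus, n, alpha, beta):
    # Phase 1: intra-host ring reduce-scatter over `gpus` ranks.
    rs = (gpus - 1) * alpha + ((gpus - 1) / gpus) * n * beta
    # Phase 2: cross-host ring allreduce on a 1/gpus shard over `hosts`
    # ranks; each local GPU drives its own shard in parallel.
    cross = ring_allreduce_cost(hosts, n / gpus, alpha, beta)
    # Phase 3: intra-host ring allgather.
    ag = (gpus - 1) * alpha + ((gpus - 1) / gpus) * n * beta
    return rs + cross + ag

# Hypothetical setup resembling this PR's cluster: 32 hosts x 6 GPUs,
# a 100 MB buffer, ~20 us message latency, ~10 Gbps links.
p, n = 32 * 6, 100e6
alpha, beta = 20e-6, 1 / 1.25e9
print(ring_allreduce_cost(p, n, alpha, beta))
print(torus_allreduce_cost(32, 6, n, alpha, beta))
```

Under this model, the flat ring pays a latency term of 2(p-1) steps over all 192 ranks, while the torus pays only 2(gpus-1) intra-host plus 2(hosts-1) cross-host steps, which is where the latency win comes from.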
We tested the algorithm on a BERT-Large pretraining job and observed a 2.4x faster allreduce compared to flat ring (NCCLRingAllreduce), from 2498 ms to 978 ms, on 32 nodes, each with 6 Tesla V100 GPUs and one 10 Gbps Ethernet NIC, without GPUDirect, NVLink, or InfiniBand.
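For readers unfamiliar with the hierarchical decomposition, the sketch below simulates the data movement of a 2D torus allreduce in NumPy, assuming the usual three phases (intra-host reduce-scatter, cross-host allreduce per shard, intra-host allgather). It is an illustrative model of the algorithm, not the actual Horovod/NCCL implementation.

```python
# Minimal simulation of a 2D torus allreduce over a hosts x gpus grid.
import numpy as np

def torus_allreduce(inputs):
    """inputs[h][g] is the buffer on GPU g of host h. Returns the same
    layout with every buffer equal to the global sum."""
    hosts, gpus = len(inputs), len(inputs[0])
    n = inputs[0][0].size
    bounds = np.linspace(0, n, gpus + 1).astype(int)  # one shard per local GPU

    # Phase 1: intra-host reduce-scatter. GPU g on host h ends up owning
    # the host-local sum of shard g.
    shard = [[np.sum(inputs[h], axis=0)[bounds[g]:bounds[g + 1]]
              for g in range(gpus)] for h in range(hosts)]

    # Phase 2: cross-host allreduce, one communicator per local GPU index,
    # so all shards can move over the network in parallel (serialized here).
    for g in range(gpus):
        total = np.sum([shard[h][g] for h in range(hosts)], axis=0)
        for h in range(hosts):
            shard[h][g] = total

    # Phase 3: intra-host allgather. Every GPU reassembles the full buffer.
    return [[np.concatenate(shard[h]) for _ in range(gpus)]
            for h in range(hosts)]
```

Each rank sends roughly 2n/gpus bytes cross-host in phase 2, which is the source of the bandwidth advantage when each host has multiple NICs matched to GPU affinity.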
Review process to land