
implement 2D torus allreduce using NCCL #3608

Merged
merged 4 commits into from Aug 17, 2022
Conversation

@yundai424 (Contributor) commented Jul 18, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR implements 2D torus allreduce (https://asset-pdf.scinapse.io/prod/2920668770/2920668770.pdf) using NCCL. The algorithm and implementation are essentially the same as the NCCLHierarchicalAllreduce already in Horovod, except that NCCL replaces MPI for the parallel cross-node allreduce. I named it NCCLTorusAllreduce to reflect the algorithm correctly.

The benefit of replacing MPI with NCCL is that the reduction can be computed on the GPU, which helps compute-bound cases. Torus allreduce is more bandwidth-optimal than tree allreduce and flat ring allreduce, especially with multiple NICs per host matching GPU affinity, and it is more latency-optimal than flat ring allreduce.
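To make the latency claim concrete, here is a rough count of the latency (α) terms in the standard α-β cost model. This is a sketch with our own symbols, not taken from the PR: n nodes, k GPUs per node, α = per-message latency.

```latex
% Flat ring allreduce over all n*k ranks:
T_{\text{ring}} \approx 2(nk - 1)\,\alpha
% 2D torus: intra-node reduce-scatter + allgather, plus k parallel
% cross-node ring allreduces (one per local rank):
T_{\text{torus}} \approx 2(k - 1)\,\alpha_{\text{intra}} + 2(n - 1)\,\alpha_{\text{inter}}
```

So the latency term grows like n + k rather than n·k, which is where the advantage over the flat ring comes from on latency-bound links.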

We tested the algorithm on a BERT-Large pretraining job and observed allreduce running 2.4x as fast as flat ring (NCCLRingAllreduce), from 2498 ms down to 978 ms, on 32 nodes, each with 6 Tesla V100 GPUs and one 10 Gbps Ethernet NIC, without GPUDirect, NVLink, or InfiniBand.
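For intuition, the three phases of the torus scheme can be simulated on the host with plain lists. This is a minimal sketch only: the function and variable names are ours, and the real implementation drives NCCL collectives (reduce-scatter, allreduce, allgather) on device buffers rather than Python lists.

```python
def torus_allreduce(grid):
    """Simulate a 2D torus allreduce on an n-node x k-GPU grid.

    grid[i][j] is the list of floats held by GPU j on node i; every
    rank ends up with the elementwise sum over all n*k inputs.
    """
    n, k = len(grid), len(grid[0])
    length = len(grid[0][0])
    bounds = [length * j // k for j in range(k + 1)]  # chunk boundaries

    # Phase 1: intra-node reduce-scatter -- GPU j ends up owning
    # chunk j of its node's local sum.
    chunks = []
    for i in range(n):
        node_sum = [sum(gpu[t] for gpu in grid[i]) for t in range(length)]
        chunks.append([node_sum[bounds[j]:bounds[j + 1]] for j in range(k)])

    # Phase 2: cross-node allreduce of chunk j (one ring per local
    # rank; the k rings run in parallel in the real scheme).
    for j in range(k):
        total = [sum(chunks[i][j][t] for i in range(n))
                 for t in range(len(chunks[0][j]))]
        for i in range(n):
            chunks[i][j] = total

    # Phase 3: intra-node allgather -- every GPU on node i rebuilds
    # the fully reduced vector from its node's k chunks.
    return [[sum(chunks[i], []) for _ in range(k)] for i in range(n)]
```

Phase 2 is the part this PR moves from MPI to NCCL, so each of the k per-chunk cross-node allreduces can run on the GPU and overlap across the node's NICs.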

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@yundai424 yundai424 force-pushed the master branch 2 times, most recently from acadc18 to c431f03 Compare July 18, 2022 20:58
@yundai424 (Contributor, Author) commented Jul 18, 2022

Hi @romerojosh @maxhgerlach, could you possibly help take a look at this? Thank you so much in advance!

github-actions bot commented Jul 19, 2022

Unit Test Results

 1 093 files (+23)    1 093 suites (+23)   12h 7m 26s ⏱️ (+15s)
   814 tests (±0)       764 ✔️ (±0)       50 💤 (±0)     0 (±0)
21 978 runs (+341)   15 642 ✔️ (+101)  6 336 💤 (+240)  0 (±0)

Results for commit 7d40f82. ± Comparison against base commit ed32fdc.

♻️ This comment has been updated with latest results.

github-actions bot commented Jul 19, 2022

Unit Test Results (with flaky tests)

 1 239 files (+61)      1 239 suites (+61)   13h 2m 45s ⏱️ (+26m 12s)
   814 tests (±0)         764 ✔️ (±0)        50 💤 (±0)     0 (±0)
25 086 runs (+1 013)   17 570 ✔️ (+585)   7 516 💤 (+428)  0 (±0)

Results for commit 7d40f82. ± Comparison against base commit ed32fdc.

♻️ This comment has been updated with latest results.

Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
@maxhgerlach (Collaborator) commented:

Hi @yundai424,
there are a bunch of Spark-related failures in the latest CI run, which I think is just flakiness (I've seen that in another, unrelated PR), but there are also legitimate compiler errors:

2022-07-22T00:36:08.0138743Z #44 141.9   In file included from /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/collective_operations.h:26,
2022-07-22T00:36:08.1643489Z #44 141.9                    from /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/gpu_operations.h:39,
2022-07-22T00:36:08.1644052Z #44 141.9                    from /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.h:31,
2022-07-22T00:36:08.1644550Z #44 141.9                    from /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:18:
2022-07-22T00:36:08.1645320Z #44 141.9   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/../half.h: In function ‘void horovod::common::HalfBits2Float(const short unsigned int*, float*)’:
2022-07-22T00:36:08.1646219Z #44 141.9   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/../half.h:76:11: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
2022-07-22T00:36:08.1646674Z #44 141.9      76 |   *res = *reinterpret_cast<float const*>(&f);
2022-07-22T00:36:08.1646953Z #44 141.9         |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2022-07-22T00:36:08.3154828Z #44 142.1   In file included from /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:18:
2022-07-22T00:36:08.3155930Z #44 142.1   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.h: In constructor ‘horovod::common::NCCLTorusAllreduce::NCCLTorusAllreduce(horovod::common::NCCLContext*, horovod::common::NCCLContext*, horovod::common::GPUContext*, horovod::common::HorovodGlobalState*)’:
2022-07-22T00:36:08.3156887Z #44 142.1   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.h:266:16: warning: ‘horovod::common::NCCLTorusAllreduce::cross_nccl_context_’ will be initialized after [-Wreorder]
2022-07-22T00:36:08.3157355Z #44 142.1     266 |   NCCLContext* cross_nccl_context_;
2022-07-22T00:36:08.3157618Z #44 142.1         |                ^~~~~~~~~~~~~~~~~~~
2022-07-22T00:36:08.3158281Z #44 142.1   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.h:265:17: warning:   ‘horovod::common::NCCLOpContext horovod::common::NCCLTorusAllreduce::local_nccl_op_context_’ [-Wreorder]
2022-07-22T00:36:08.3158753Z #44 142.1     265 |   NCCLOpContext local_nccl_op_context_;
2022-07-22T00:36:08.3159016Z #44 142.1         |                 ^~~~~~~~~~~~~~~~~~~~~~
2022-07-22T00:36:08.3159491Z #44 142.1   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.h:245:3: warning:   when initialized here [-Wreorder]
2022-07-22T00:36:08.3159923Z #44 142.1     245 |   NCCLTorusAllreduce(NCCLContext* local_nccl_context, NCCLContext* cross_nccl_context,
2022-07-22T00:36:08.3160230Z #44 142.1         |   ^~~~~~~~~~~~~~~~~~
2022-07-22T00:36:08.4661797Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc: In member function ‘virtual horovod::common::Status horovod::common::NCCLTorusAllreduce::Execute(std::vector<horovod::common::TensorTableEntry>&, const horovod::common::Response&)’:
2022-07-22T00:36:08.4662645Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:759:33: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4663122Z #44 142.2     759 | bool NCCLTorusAllreduce::Enabled(const ParameterManager& param_manager,
2022-07-22T00:36:08.4663436Z #44 142.2         |                                 ^
2022-07-22T00:36:08.4663930Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:768:32: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4664366Z #44 142.2     768 | void NCCLBroadcast::WaitForData(std::vector<TensorTableEntry>& entries) {
2022-07-22T00:36:08.4664661Z #44 142.2         |                                ^
2022-07-22T00:36:08.4665157Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:784:30: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4665578Z #44 142.2     784 | Status NCCLBroadcast::Execute(std::vector<TensorTableEntry>& entries,
2022-07-22T00:36:08.4665874Z #44 142.2         |                              ^
2022-07-22T00:36:08.4666345Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:834:37: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4666787Z #44 142.2     834 | Status NCCLAllgather::AllocateOutput(std::vector<TensorTableEntry>& entries,
2022-07-22T00:36:08.4667092Z #44 142.2         |                                     ^
2022-07-22T00:36:08.4667567Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:890:32: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4667986Z #44 142.2     890 | void NCCLAllgather::WaitForData(std::vector<TensorTableEntry>& entries) {
2022-07-22T00:36:08.4668288Z #44 142.2         |                                ^
2022-07-22T00:36:08.4668761Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:906:30: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4669161Z #44 142.2     906 | Status NCCLAllgather::Execute(std::vector<TensorTableEntry>& entries,
2022-07-22T00:36:08.4669449Z #44 142.2         |                              ^
2022-07-22T00:36:08.4670172Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1052:28: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4670616Z #44 142.2    1052 | bool NCCLAllgather::Enabled(const ParameterManager& param_manager,
2022-07-22T00:36:08.4670899Z #44 142.2         |                            ^
2022-07-22T00:36:08.4671387Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1058:31: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4671830Z #44 142.2    1058 | void NCCLAlltoall::WaitForData(std::vector<TensorTableEntry>& entries) {
2022-07-22T00:36:08.4672126Z #44 142.2         |                               ^
2022-07-22T00:36:08.4672592Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1074:29: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4673020Z #44 142.2    1074 | Status NCCLAlltoall::Execute(std::vector<TensorTableEntry>& entries,
2022-07-22T00:36:08.4673317Z #44 142.2         |                             ^
2022-07-22T00:36:08.4673784Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1144:34: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4674220Z #44 142.2    1144 | Status NCCLReducescatter::Execute(std::vector<TensorTableEntry>& entries,
2022-07-22T00:36:08.4674522Z #44 142.2         |                                  ^
2022-07-22T00:36:08.4674997Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1248:32: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4675509Z #44 142.2    1248 | bool NCCLReducescatter::Enabled(const ParameterManager& param_manager,
2022-07-22T00:36:08.4675804Z #44 142.2         |                                ^
2022-07-22T00:36:08.4676273Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1254:41: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4676672Z #44 142.2    1254 | Status NCCLReducescatter::AllocateOutput(
2022-07-22T00:36:08.4676936Z #44 142.2         |                                         ^
2022-07-22T00:36:08.4677409Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1276:36: error: qualified-id in declaration before ‘(’ token
2022-07-22T00:36:08.4677838Z #44 142.2    1276 | void NCCLReducescatter::WaitForData(std::vector<TensorTableEntry>& entries) {
2022-07-22T00:36:08.4678135Z #44 142.2         |                                    ^
2022-07-22T00:36:08.4678674Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:641:11: warning: unused variable ‘total_buffer_len’ [-Wunused-variable]
2022-07-22T00:36:08.4679070Z #44 142.2     641 |   int64_t total_buffer_len = is_root_rank
2022-07-22T00:36:08.4679311Z #44 142.2         |           ^~~~~~~~~~~~~~~~
2022-07-22T00:36:08.4679714Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc: At global scope:
2022-07-22T00:36:08.4680253Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:1293:1: error: expected ‘}’ at end of input
2022-07-22T00:36:08.4680600Z #44 142.2    1293 | } // namespace horovod
2022-07-22T00:36:08.4680799Z #44 142.2         | ^
2022-07-22T00:36:08.4681214Z #44 142.2   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:24:19: note: to match this ‘{’
2022-07-22T00:36:08.4681531Z #44 142.2      24 | namespace horovod {
2022-07-22T00:36:08.4681754Z #44 142.2         |                   ^
2022-07-22T00:36:09.6440951Z #44 143.5   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc: In member function ‘virtual horovod::common::Status horovod::common::NCCLTorusAllreduce::Execute(std::vector<horovod::common::TensorTableEntry>&, const horovod::common::Response&)’:
2022-07-22T00:36:09.7745844Z #44 143.5   /tmp/pip-req-build-fv9v6wbv/horovod/common/ops/nccl_operations.cc:542:24: warning: control reaches end of non-void function [-Wreturn-type]
2022-07-22T00:36:09.7746582Z #44 143.5     542 |   std::vector<int32_t> local_nccl_device_map;
2022-07-22T00:36:09.7746864Z #44 143.5         |                        ^~~~~~~~~~~~~~~~~~~~~
2022-07-22T00:36:09.7747259Z #44 143.6   make[2]: *** [horovod/mxnet/CMakeFiles/mxnet.dir/build.make:349: horovod/mxnet/CMakeFiles/mxnet.dir/__/common/ops/nccl_operations.cc.o] Error 1

Clicking on "View raw logs" gave me this link: https://pipelines.actions.githubusercontent.com/serviceHosts/f4558236-f66f-4dfe-a2d7-34c1fd73cdf3/_apis/pipelines/1/runs/12519/signedlogcontent/37?urlExpires=2022-07-22T14%3A15%3A21.6410597Z&urlSigningMethod=HMACV1&urlSignature=jMzk2dL0akA8NUBBGGC053lIa%2BuheBe83mMFI551F88%3D

Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
@yundai424 (Contributor, Author) commented:

Hi @maxhgerlach, thank you for taking a close look! I missed a closing bracket in the most recent update. Thanks for catching it!

@romerojosh (Collaborator) left a comment:

@yundai424 Thanks a lot for the contribution! The implementation looks good to me, but I did leave a few small comments for you in this initial pass.

throw std::logic_error("Communicator type " +
std::to_string(communicator_type_) +
" is not supported in NCCL mode.");
nccl_rank = process_set.controller->GetCrossRank();
@romerojosh (Collaborator) commented:

Can you modify this to check for else if (communicator_type_ == Communicator::CROSS) explicitly, to make this clearer?

@yundai424 (Contributor, Author) replied:

Thank you so much for the review! I've updated the diff with the suggestions and tested with BERT pretraining; it all seems correct.

…need it

Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
@maxhgerlach (Collaborator) left a comment:

I think this is a great addition for applications where this type of hierarchical allreduce beats the performance of both the ring and tree algorithms in NCCL, which I assume can depend on the type of networking that's available. I look forward to giving this a try on local hardware.

I've only left a couple of minor remarks on the PR for now.

@romerojosh, would it make sense to eventually drop NCCLHierarchicalAllreduce entirely in favor of NCCLTorusAllreduce? Or does it still make sense to mix NCCL and MPI like that on some clusters?

group_torus_allreduce.add_argument('--torus-allreduce',
action=make_override_true_action(override_args),
help='Perform 2D NCCL torus allreduce between workers instead of '
'ring allreduce. Torus allreduce is the same as hierarchical allreduce '
@maxhgerlach (Collaborator) commented:

I believe that NCCL on its own will choose between two allreduce algorithms (ring or tree), so the documentation string may be slightly misleading here.

@yundai424 (Contributor, Author) commented Jul 31, 2022:

Appreciate the suggestions! I've updated the diff accordingly and tested it. Please let me know how it looks. Thanks!

I feel that since NCCL provides tree allreduce starting with NCCL 2.4, we might consider migrating from mixing with MPI to pure NCCL, since IIUC the original idea behind using MPI for the cross-node allreduce was to benefit from its tree topology, but I might be missing something 🧐

@maxhgerlach (Collaborator) commented:

Thank you for the updates! I think this looks very nice.

I'm not really aware of the detailed history behind the hierarchical allreduce in Horovod. But yes, it seems to me that it was introduced because, at the time, GPU-aware MPI would scale better to a large number of nodes than NCCL. And the tree allreduce, introduced to NCCL later, appears to scale better for very large numbers of GPUs, where the latency of ring allreduce becomes expensive. So you may well be right. Mixing MPI and NCCL collectives comes with its own costs anyway.

@romerojosh (Collaborator) commented:

> @romerojosh, would it make sense to eventually drop NCCLHierarchicalAllreduce entirely in favor of NCCLTorusAllreduce? Or does it still make sense to mix NCCL and MPI like that on some clusters?

I think so. We were already discussing dropping NCCLHierarchicalAllreduce anyway, since we aren't sure it is very useful these days. A fully NCCL-based solution seems more useful to keep around (and clearly has some recent relevance, as this PR being posted shows).

@maxhgerlach (Collaborator) left a comment:

LGTM, provided that there are no surprises in CI. Thanks for the great PR, @yundai424!

@yundai424 (Contributor, Author) commented:

@romerojosh, would you mind taking another look to check if there's anything missed here? Thanks a lot!

Also, I noticed there are two Build and Test GPU tasks that appear Pending but have in fact passed when I click Details. Is that expected? 🧐

@maxhgerlach (Collaborator) commented:

> Also, I noticed there are two Build and Test GPU tasks that appear Pending but have in fact passed when I click Details. Is that expected? 🧐

That's currently a bug affecting all PRs originating from forked repos.

@romerojosh (Collaborator) left a comment:

@yundai424 thanks for the contribution! LGTM, except for a minor nit about whitespace. Once you update that, we can merge when tests pass.

throw std::logic_error("Communicator type " +
std::to_string(communicator_type_) +
" is not supported in NCCL mode.");
throw std::logic_error("Communicator type " +
@romerojosh (Collaborator) commented:

Nit: the extraneous spaces added here should be removed.

@yundai424 (Contributor, Author) replied:

Hi @romerojosh, thanks for catching this! Updated now.

…n and update doc

Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
3 participants