Implement 2D torus allreduce using NCCL #3608
Conversation
Force-pushed from acadc18 to c431f03
Hi @romerojosh @maxhgerlach, could you possibly help take a look at this? Thank you so much in advance!
Unit Test Results (with flaky tests): 1 239 files +61, 1 239 suites +61, 13h 2m 45s ⏱️ +26m 12s. Results for commit 7d40f82. ± Comparison against base commit ed32fdc. This comment has been updated with the latest results.
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Hi @yundai424,
Clicking on "View raw logs" gave me this link: https://pipelines.actions.githubusercontent.com/serviceHosts/f4558236-f66f-4dfe-a2d7-34c1fd73cdf3/_apis/pipelines/1/runs/12519/signedlogcontent/37?urlExpires=2022-07-22T14%3A15%3A21.6410597Z&urlSigningMethod=HMACV1&urlSignature=jMzk2dL0akA8NUBBGGC053lIa%2BuheBe83mMFI551F88%3D
Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Hi @maxhgerlach, thank you for taking a close look! I missed a closing bracket during the most recent update. Thanks for catching this!
@yundai424 Thanks a lot for the contribution! The implementation looks good to me, but I did leave a few small comments for you in this initial pass.
throw std::logic_error("Communicator type " +
                       std::to_string(communicator_type_) +
                       " is not supported in NCCL mode.");
nccl_rank = process_set.controller->GetCrossRank();
Can you modify this to check for else if (communicator_type_ == Communicator::CROSS) explicitly, to make this clearer?
Thank you so much for the review! I've updated the diff with the suggestions and tested with BERT pretraining, and everything looks correct.
…need it Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
I think this is a great addition for applications where this type of hierarchical allreduce beats the performance of both ring and tree algorithms in NCCL, which I assume can depend on the type of networking that's available. I look forward to giving this a try on local hardware.
I've only left a couple of minor remarks on the PR for now.
@romerojosh, would it make sense to eventually drop NCCLHierarchicalAllreduce entirely in favor of NCCLTorusAllreduce? Or does it still make sense to mix NCCL and MPI like that on some clusters?
group_torus_allreduce.add_argument('--torus-allreduce',
                                   action=make_override_true_action(override_args),
                                   help='Perform 2D NCCL torus allreduce between workers instead of '
                                        'ring allreduce. Torus allreduce is the same as hierarchical allreduce '
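As a rough standalone sketch of how such a flag can be wired up, the snippet below uses plain argparse; make_override_true_action and the surrounding horovodrun plumbing are Horovod-specific, so a plain store_true action stands in for them here (an assumption for illustration, not the actual launcher code).

```python
import argparse

parser = argparse.ArgumentParser(description='launcher sketch')
group_torus_allreduce = parser.add_argument_group('torus allreduce')
group_torus_allreduce.add_argument('--torus-allreduce',
                                   # stand-in for make_override_true_action(override_args)
                                   action='store_true',
                                   help='Perform 2D NCCL torus allreduce between '
                                        'workers instead of ring allreduce.')

args = parser.parse_args(['--torus-allreduce'])
print(args.torus_allreduce)
```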
I believe that NCCL on its own will choose between two allreduce algorithms (ring or tree), so the documentation string may be slightly misleading here.
Appreciate the suggestions! I've updated the diff accordingly and tested it. Please let me know how that looks. Thanks!
Since NCCL provides tree allreduce starting with NCCL 2.4, we might consider migrating from mixing with MPI to pure NCCL. IIUC, the original idea behind using MPI for the cross-node allreduce was to benefit from its tree topology, but I might be missing something 🧐
Thank you for the updates! I think this looks very nice.
I'm not really aware of the detailed history behind the hierarchical allreduce in Horovod. But yes, it seems to me that it was introduced because, at the time, GPU-aware MPI would scale better to a large number of nodes than NCCL. And the tree allreduce that was introduced later to NCCL appears to scale better for very large numbers of GPUs, where the latency of ring allreduce becomes expensive. So you may be very right. Mixing MPI and NCCL collectives comes with its own costs anyway.
I think so. We were already discussing dropping
LGTM, provided that there are no surprises in CI. Thanks for the great PR, @yundai424!
@romerojosh, would you mind taking another look at it to check if there's anything missed here? Thanks a lot! Also, I noticed there are two
That's currently a bug with all PRs originating from forked repos.
@yundai424 thanks for the contribution! LGTM, except for a minor nit about whitespace. Once you update that, we can merge once tests pass.
throw std::logic_error("Communicator type " +
                       std::to_string(communicator_type_) +
                       " is not supported in NCCL mode.");
throw std::logic_error("Communicator type " +
Nit: Extraneous spaces added here should be removed.
Hi @romerojosh, thanks for catching this! Updated now.
…n and update doc Signed-off-by: Yun Dai <yudai@yudai-ld2.linkedin.biz>
Checklist before submitting
Description
This PR implements 2D torus allreduce (https://asset-pdf.scinapse.io/prod/2920668770/2920668770.pdf) using NCCL. The algorithm and implementation are essentially the same as the NCCLHierarchicalAllreduce that's already in Horovod, but with MPI replaced by NCCL to perform the parallel cross-node allreduce. I named it NCCLTorusAllreduce to correctly reflect the algorithm.
The benefit of replacing MPI with NCCL is that the reduce computation can be performed on the GPU, which helps compute-bound cases. Torus allreduce is more bandwidth-optimal than tree allreduce and flat ring allreduce, especially with multiple NICs per host laid out to match GPU affinity, and it is more latency-optimal than flat ring allreduce.
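The latency advantage can be illustrated with a simple alpha-beta cost model. This is a back-of-the-envelope sketch with made-up constants, not a measurement from this PR; it only counts ring steps and bytes on the wire, and ignores intra-host vs cross-host link differences.

```python
# Illustrative alpha-beta cost model: flat ring vs 2D torus allreduce.
# alpha = per-message latency (s), beta = per-byte transfer time (s/byte).

def ring_allreduce_cost(p, n, alpha, beta):
    # Ring allreduce = reduce-scatter + allgather, each taking p-1 steps.
    return 2 * (p - 1) * alpha + 2 * ((p - 1) / p) * n * beta

def torus_allreduce_cost(hosts, gpus, n, alpha, beta):
    # Phase 1: intra-host ring reduce-scatter over `gpus` ranks.
    rs = (gpus - 1) * alpha + ((gpus - 1) / gpus) * n * beta
    # Phase 2: cross-host ring allreduce on a 1/gpus shard over `hosts`
    # ranks; each local GPU drives its own shard in parallel.
    cross = ring_allreduce_cost(hosts, n / gpus, alpha, beta)
    # Phase 3: intra-host ring allgather.
    ag = (gpus - 1) * alpha + ((gpus - 1) / gpus) * n * beta
    return rs + cross + ag

# Hypothetical setup resembling this PR's cluster: 32 hosts x 6 GPUs,
# a 100 MB buffer, ~20 us message latency, ~10 Gbps links.
p, n = 32 * 6, 100e6
alpha, beta = 20e-6, 1 / 1.25e9
print(ring_allreduce_cost(p, n, alpha, beta))
print(torus_allreduce_cost(32, 6, n, alpha, beta))
```

Under this model, the flat ring pays a latency term of 2(p-1) steps over all 192 ranks, while the torus pays only 2(gpus-1) intra-host plus 2(hosts-1) cross-host steps, which is where the latency win comes from.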
We tested the algorithm on a BERT-Large pretraining job and observed a 2.4x faster allreduce compared to flat ring (NCCLRingAllreduce), from 2498 ms to 978 ms, on 32 nodes, each with 6 Tesla V100 GPUs and one 10 Gbps Ethernet NIC, without GPUDirect, NVLink, or InfiniBand.
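For readers unfamiliar with the hierarchical decomposition, the sketch below simulates the data movement of a 2D torus allreduce in NumPy, assuming the usual three phases (intra-host reduce-scatter, cross-host allreduce per shard, intra-host allgather). It is an illustrative model of the algorithm, not the actual Horovod/NCCL implementation.

```python
# Minimal simulation of a 2D torus allreduce over a hosts x gpus grid.
import numpy as np

def torus_allreduce(inputs):
    """inputs[h][g] is the buffer on GPU g of host h. Returns the same
    layout with every buffer equal to the global sum."""
    hosts, gpus = len(inputs), len(inputs[0])
    n = inputs[0][0].size
    bounds = np.linspace(0, n, gpus + 1).astype(int)  # one shard per local GPU

    # Phase 1: intra-host reduce-scatter. GPU g on host h ends up owning
    # the host-local sum of shard g.
    shard = [[np.sum(inputs[h], axis=0)[bounds[g]:bounds[g + 1]]
              for g in range(gpus)] for h in range(hosts)]

    # Phase 2: cross-host allreduce, one communicator per local GPU index,
    # so all shards can move over the network in parallel (serialized here).
    for g in range(gpus):
        total = np.sum([shard[h][g] for h in range(hosts)], axis=0)
        for h in range(hosts):
            shard[h][g] = total

    # Phase 3: intra-host allgather. Every GPU reassembles the full buffer.
    return [[np.concatenate(shard[h]) for _ in range(gpus)]
            for h in range(hosts)]
```

Each rank sends roughly 2n/gpus bytes cross-host in phase 2, which is the source of the bandwidth advantage when each host has multiple NICs matched to GPU affinity.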
Review process to land