New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gpu_operations] add support for batched memory copies in GPUAllgather #3590
[gpu_operations] add support for batched memory copies in GPUAllgather #3590
Conversation
Signed-off-by: Vignesh Kothapalli <k.vignesh140@gmail.com>
1eebbb9
to
9568c46
Compare
Hi @romerojosh , can you help review this? |
Unit Test Results (with flaky tests) 1 104 files + 69 1 104 suites +69 10h 49m 56s ⏱️ + 26m 36s For more details on these failures, see this check. Results for commit 9568c46. ± Comparison against base commit b67d756. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍
Thanks for the great contribution @kvignesh1420!
@romerojosh thanks for the review. As a follow up of this PR, what do you think about a new cc: @chongxiaoc |
@kvignesh1420 The benefit of using the struct with arrays rather than pointers is that it removes the need to use any additional GPU memory allocations or memcopies of the parameters, as we can pass the struct directly into the kernel as an argument (so long as the argument is under 4KB as you've discovered). |
@romerojosh that makes sense 👍 |
GPU tests on Buildkite failed for this PR because there were problems cloning the Eigen repository at the time (https://buildkite.com/horovod/horovod/builds/7951#0181bd82-600d-455e-9ce6-eba81da4e564):
I've re-triggered the |
@maxhgerlach seems like the tests failed again due to gpg key retrieval timeout: |
No reason to worry for now: those passed on retry. I think we are good to merge. |
Checklist before submitting
Description
This PR builds on the work of #2435 by switching to a batched device-to-device memory copies approach for
GPUAllgather::MemcpyInFusionBuffer
andGPUAllgather::MemcpyOutFusionBuffer
. The behavior can be reverted back by settingHOROVOD_BATCH_D2D_MEMCOPIES=false
which is similar to the existing GPUAllreduce implementation.Review process to land