
[gpu_operations] add support for batched memory copies in GPUAllgather #3590

Merged
merged 1 commit into from Jul 7, 2022

Conversation

kvignesh1420
Contributor

Checklist before submitting

  • Did you read the contributor guide? YES
  • Did you update the docs? Not yet. Open to suggestions for an appropriate doc to update.
  • Did you write any tests to validate this change? Existing GPU allgather tests validate the logic
  • Did you update the CHANGELOG, if this change affects users? There are no user facing modifications.

Description

This PR builds on the work of #2435 by switching GPUAllgather::MemcpyInFusionBuffer and GPUAllgather::MemcpyOutFusionBuffer to a batched device-to-device memory copy approach. The previous behavior can be restored by setting HOROVOD_BATCH_D2D_MEMCOPIES=false, mirroring the existing GPUAllreduce implementation.
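The batched approach replaces a loop of per-tensor device-to-device copies with a single kernel launch that receives all copy descriptors at once. The sketch below illustrates the idea host-side, with std::memcpy standing in for the device copy; the field names (out, in, sizes) and exact layout are assumptions for illustration, not taken verbatim from Horovod's headers, though the BatchedD2DParams name and the 160-entry cap come up later in this thread.

```cpp
#include <cstddef>
#include <cstring>

// Capacity cap discussed in this thread: 160 entries per batch.
constexpr int kBatchCapacity = 160;

// Struct-of-arrays holding one copy descriptor per batch slot.
// Field names are illustrative; Horovod's actual definition may differ.
struct BatchedD2DParams {
  void* out[kBatchCapacity];     // destination pointers
  void* in[kBatchCapacity];      // source pointers
  size_t sizes[kBatchCapacity];  // bytes to copy per entry
};

// Host-side stand-in for the batched D2D kernel: on the GPU this would be
// a single __global__ function taking the struct by value, with thread
// blocks cooperating on the individual copies.
void ExecuteBatch(const BatchedD2DParams& params, int count) {
  for (int i = 0; i < count; ++i) {
    std::memcpy(params.out[i], params.in[i], params.sizes[i]);
  }
}
```

Passing the struct by value as a kernel argument avoids any extra device allocation or copy for the parameters themselves, at the cost of CUDA's kernel argument-size ceiling.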

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Vignesh Kothapalli <k.vignesh140@gmail.com>
@kvignesh1420 kvignesh1420 marked this pull request as ready for review July 2, 2022 00:18
@chongxiaoc
Collaborator

Hi @romerojosh , can you help review this?

@github-actions

github-actions bot commented Jul 2, 2022

Unit Test Results

923 files (+30) · 923 suites (+30) · 9h 50m 52s ⏱️ (−1m 3s)
781 tests (±0): 737 passed ✔️ (±0), 44 skipped 💤 (±0), 0 failed (±0)
19 807 runs (+746): 14 132 passed ✔️ (+466), 5 675 skipped 💤 (+280), 0 failed (±0)

Results for commit 9568c46. ± Comparison against base commit b67d756.


@github-actions

github-actions bot commented Jul 2, 2022

Unit Test Results (with flaky tests)

1 104 files (+69) · 1 104 suites (+69) · 10h 49m 56s ⏱️ (+26m 36s)
781 tests (±0): 735 passed ✔️ (−2), 44 skipped 💤 (±0), 2 failed (+2)
23 718 runs (+1 351): 16 417 passed ✔️ (+823), 7 299 skipped 💤 (+526), 2 failed (+2)

For more details on these failures, see this check.

Results for commit 9568c46. ± Comparison against base commit b67d756.


Collaborator

@romerojosh romerojosh left a comment


LGTM! 👍
Thanks for the great contribution @kvignesh1420!

@kvignesh1420
Contributor Author

kvignesh1420 commented Jul 5, 2022

@romerojosh thanks for the review. As a follow-up to this PR, what do you think about a new BatchedD2DParams struct that can handle more than 160 entries at once (perhaps a struct holding pointers to the entries instead of the arrays themselves)? I observed that the limit of 160 entries per batch stems from the 4KB size restriction on the formal parameters of a CUDA kernel. I am not sure the performance benefit would be significant, but I am happy to discuss this further if you are interested.

cc: @chongxiaoc

@romerojosh
Collaborator

@kvignesh1420 The benefit of using the struct with arrays rather than pointers is that it removes the need for any additional GPU memory allocations or memcopies of the parameters, since we can pass the struct directly into the kernel as an argument (so long as the argument stays under 4KB, as you've discovered).
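The 4KB ceiling also accounts for where a 160-entry cap can come from. Assuming each batch entry needs two pointers and a byte count (24 bytes on a 64-bit platform; this exact layout is an assumption for illustration, not Horovod's header), the arithmetic works out as follows:

```cpp
#include <cstddef>

// One batch entry: destination pointer, source pointer, byte count.
// Assumed layout for illustration; 24 bytes on common 64-bit targets.
struct Entry {
  void* dst;
  void* src;
  size_t len;
};

// CUDA's limit on the total size of a kernel's formal parameter list.
constexpr size_t kKernelArgLimit = 4096;

// Theoretical maximum entries that fit by value: 4096 / 24 = 170, so a
// round cap of 160 (160 * 24 = 3,840 bytes) stays under the limit and
// leaves headroom for any additional scalar kernel arguments.
constexpr size_t kMaxEntries = kKernelArgLimit / sizeof(Entry);
```

A struct of pointers would lift this cap, but only by reintroducing the device allocations and parameter memcopies that the by-value struct avoids.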

@kvignesh1420
Contributor Author

@romerojosh that makes sense 👍

@maxhgerlach
Collaborator

GPU tests on Buildkite failed for this PR because there were problems cloning the Eigen repository at the time (https://buildkite.com/horovod/horovod/builds/7951#0181bd82-600d-455e-9ce6-eba81da4e564):

fatal: unable to access 'https://gitlab.com/cantonios/eigen.git/': The requested URL returned error: 503
Failed to clone 'third_party/eigen' a second time, aborting

I've re-triggered the CI (Results) workflow, hopefully it will go through now: https://github.com/horovod/horovod/actions/runs/2600228008

@kvignesh1420
Contributor Author


@maxhgerlach it seems the tests failed again, this time due to a GPG key retrieval timeout:
https://buildkite.com/horovod/horovod/builds/7970#0181d499-2c8b-4e58-a9d8-ddfb32e9c749

@maxhgerlach
Collaborator

No reason to worry for now: those passed on retry. I think we are good to merge.

@maxhgerlach maxhgerlach merged commit 9d56c5a into horovod:master Jul 7, 2022
@kvignesh1420 kvignesh1420 deleted the allgather-blockd2dmemcpy branch July 8, 2022 00:35