
[gpu_operations] add support for batched memory copies in GPUAllgather #3590

Merged
merged 1 commit into from Jul 7, 2022

Conversation

kvignesh1420
Contributor

Checklist before submitting

  • Did you read the contributor guide? YES
  • Did you update the docs? Not yet. Open to suggestions for an appropriate doc to update.
  • Did you write any tests to validate this change? Existing GPU allgather tests validate the logic
  • Did you update the CHANGELOG, if this change affects users? There are no user facing modifications.

Description

This PR builds on the work of #2435 by switching GPUAllgather::MemcpyInFusionBuffer and GPUAllgather::MemcpyOutFusionBuffer to a batched device-to-device memory copy approach. The previous behavior can be restored by setting HOROVOD_BATCH_D2D_MEMCOPIES=false, mirroring the existing GPUAllreduce implementation.
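The batched approach replaces a loop of per-tensor device-to-device copies with a single kernel launch that receives all copy descriptors at once. The sketch below illustrates the idea host-side, with std::memcpy standing in for the device copy; the field names (out, in, sizes) and exact layout are assumptions for illustration, not taken verbatim from Horovod's headers, though the BatchedD2DParams name and the 160-entry cap come up later in this thread.

```cpp
#include <cstddef>
#include <cstring>

// Capacity cap discussed in this thread: 160 entries per batch.
constexpr int kBatchCapacity = 160;

// Struct-of-arrays holding one copy descriptor per batch slot.
// Field names are illustrative; Horovod's actual definition may differ.
struct BatchedD2DParams {
  void* out[kBatchCapacity];     // destination pointers
  void* in[kBatchCapacity];      // source pointers
  size_t sizes[kBatchCapacity];  // bytes to copy per entry
};

// Host-side stand-in for the batched D2D kernel: on the GPU this would be
// a single __global__ function taking the struct by value, with thread
// blocks cooperating on the individual copies.
void ExecuteBatch(const BatchedD2DParams& params, int count) {
  for (int i = 0; i < count; ++i) {
    std::memcpy(params.out[i], params.in[i], params.sizes[i]);
  }
}
```

Passing the struct by value as a kernel argument avoids any extra device allocation or copy for the parameters themselves, at the cost of CUDA's kernel argument-size ceiling.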

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Vignesh Kothapalli <k.vignesh140@gmail.com>
@kvignesh1420 kvignesh1420 marked this pull request as ready for review July 2, 2022 00:18
@chongxiaoc
Collaborator

Hi @romerojosh , can you help review this?

@github-actions

github-actions bot commented Jul 2, 2022

Unit Test Results

923 files (+30) · 923 suites (+30) · 9h 50m 52s ⏱️ (−1m 3s)
781 tests (±0): 737 passed ✔️ (±0), 44 skipped 💤 (±0), 0 failed (±0)
19 807 runs (+746): 14 132 passed ✔️ (+466), 5 675 skipped 💤 (+280), 0 failed (±0)

Results for commit 9568c46. ± Comparison against base commit b67d756.


@github-actions

github-actions bot commented Jul 2, 2022

Unit Test Results (with flaky tests)

1 104 files (+69) · 1 104 suites (+69) · 10h 49m 56s ⏱️ (+26m 36s)
781 tests (±0): 735 passed ✔️ (−2), 44 skipped 💤 (±0), 2 failed (+2)
23 718 runs (+1 351): 16 417 passed ✔️ (+823), 7 299 skipped 💤 (+526), 2 failed (+2)

For more details on these failures, see this check.

Results for commit 9568c46. ± Comparison against base commit b67d756.


Collaborator

@romerojosh romerojosh left a comment


LGTM! 👍
Thanks for the great contribution @kvignesh1420!

@kvignesh1420
Contributor Author

kvignesh1420 commented Jul 5, 2022

@romerojosh thanks for the review. As a follow-up to this PR, what do you think about a new BatchedD2DParams struct that can handle more than 160 entries at once (perhaps a struct holding pointers to the entries instead of the arrays themselves)? I observed that the limit of 160 entries per batch stems from the 4KB size restriction on the formal parameters of a CUDA kernel. I am not sure the performance benefit would be significant, but I am happy to discuss this further if you are interested.

cc: @chongxiaoc

@romerojosh
Collaborator

@kvignesh1420 The benefit of using the struct with arrays rather than pointers is that it removes the need for any additional GPU memory allocations or memcopies of the parameters, since we can pass the struct directly into the kernel as an argument (so long as the argument stays under 4KB, as you've discovered).
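The 4KB ceiling also accounts for where a 160-entry cap can come from. Assuming each batch entry needs two pointers and a byte count (24 bytes on a 64-bit platform; this exact layout is an assumption for illustration, not Horovod's header), the arithmetic works out as follows:

```cpp
#include <cstddef>

// One batch entry: destination pointer, source pointer, byte count.
// Assumed layout for illustration; 24 bytes on common 64-bit targets.
struct Entry {
  void* dst;
  void* src;
  size_t len;
};

// CUDA's limit on the total size of a kernel's formal parameter list.
constexpr size_t kKernelArgLimit = 4096;

// Theoretical maximum entries that fit by value: 4096 / 24 = 170, so a
// round cap of 160 (160 * 24 = 3,840 bytes) stays under the limit and
// leaves headroom for any additional scalar kernel arguments.
constexpr size_t kMaxEntries = kKernelArgLimit / sizeof(Entry);
```

A struct of pointers would lift this cap, but only by reintroducing the device allocations and parameter memcopies that the by-value struct avoids.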

@kvignesh1420
Contributor Author

@romerojosh that makes sense 👍

@maxhgerlach
Collaborator

GPU tests on Buildkite failed for this PR because there were problems cloning the Eigen repository at the time (https://buildkite.com/horovod/horovod/builds/7951#0181bd82-600d-455e-9ce6-eba81da4e564):

fatal: unable to access 'https://gitlab.com/cantonios/eigen.git/': The requested URL returned error: 503
Failed to clone 'third_party/eigen' a second time, aborting

I've re-triggered the CI (Results) workflow, hopefully it will go through now: https://github.com/horovod/horovod/actions/runs/2600228008

@kvignesh1420
Contributor Author


@maxhgerlach it seems the tests failed again, this time due to a GPG key retrieval timeout:
https://buildkite.com/horovod/horovod/builds/7970#0181d499-2c8b-4e58-a9d8-ddfb32e9c749

@maxhgerlach
Collaborator

No reason to worry for now: those passed on retry. I think we are good to merge.

@maxhgerlach maxhgerlach merged commit 9d56c5a into horovod:master Jul 7, 2022
@kvignesh1420 kvignesh1420 deleted the allgather-blockd2dmemcpy branch July 8, 2022 00:35