Adding support for batched D2D memcopy kernel on GPU. #2435

romerojosh · 2020-11-10T22:18:38Z

This PR introduces a batched device-to-device (D2D) memcopy, implemented as a CUDA kernel, to replace the individual calls to cudaMemcpyAsync when packing the fusion buffer. As with any type of kernel fusion, this reduces latency by replacing many small kernel launches with fewer, larger ones. These packing D2D memcopies are generally hidden behind compute, but in high-performance cases that require low latency from Horovod, fusing them improves performance. Even in more typical cases, this can still improve performance by reducing any exposed Horovod processing time.
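To make the batching idea concrete, here is a minimal host-side sketch in plain C++. It is not the PR's CUDA kernel: `CopyEntry`, `CopyBatch`, `LaunchBatch`, `BatchedCopy`, and the capacity of 4 are all hypothetical names and values chosen for illustration, and `std::memcpy` stands in for the device-side copy. It only shows how many individual copy requests can be grouped into a fixed-size parameter block so that one launch services several tensors.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical batch capacity; the real value is not stated in this thread.
constexpr std::size_t kBatchCapacity = 4;

// One copy request: source, destination, and length in bytes.
struct CopyEntry {
  const void* src;
  void* dst;
  std::size_t bytes;
};

// Fixed-size parameter block that a single launch would receive.
struct CopyBatch {
  CopyEntry entries[kBatchCapacity];
  std::size_t count = 0;
};

// CPU stand-in for one kernel launch: performs every copy in the batch.
inline void LaunchBatch(const CopyBatch& batch) {
  for (std::size_t i = 0; i < batch.count; ++i) {
    std::memcpy(batch.entries[i].dst, batch.entries[i].src,
                batch.entries[i].bytes);
  }
}

// Packs the requests into batches and returns the number of launches issued.
inline std::size_t BatchedCopy(const std::vector<CopyEntry>& requests) {
  std::size_t launches = 0;
  for (std::size_t start = 0; start < requests.size();
       start += kBatchCapacity) {
    CopyBatch batch;
    batch.count = std::min(kBatchCapacity, requests.size() - start);
    std::copy_n(requests.begin() + start, batch.count, batch.entries);
    LaunchBatch(batch);
    ++launches;
  }
  return launches;
}
```

With this sketch, nine one-tensor requests result in three launches instead of nine separate copy calls, which is the latency effect the description refers to.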

For performance and simplicity, this implementation pads the destination fusion buffer address for each input tensor to a 16-byte-aligned address. This allows the CUDA kernel to use 16-byte loads/stores, assuming the input tensors are also 16-byte aligned. This should be true in most cases, but the kernel will adjust the load/store size if the input is not 16-byte aligned.
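The padding and width selection described above can be sketched on the host side as follows. This is an illustration under stated assumptions, not the code from this PR: `AlignUp`, `PaddedOffsets`, and `AccessWidth` are hypothetical helpers, and the fallback from 16 bytes down through 8, 4, 2, and 1 is one plausible way to "adjust the load/store size", not necessarily the one the kernel uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Alignment applied to each tensor's destination offset (from the PR text).
constexpr std::size_t kAlign = 16;

// Round an offset up to the next multiple of kAlign.
inline std::size_t AlignUp(std::size_t offset) {
  return (offset + kAlign - 1) / kAlign * kAlign;
}

// Given tensor sizes in bytes, return the padded start offset of each tensor
// in the fusion buffer. total_bytes receives the padded buffer length.
inline std::vector<std::size_t> PaddedOffsets(
    const std::vector<std::size_t>& sizes, std::size_t* total_bytes) {
  std::vector<std::size_t> offsets;
  offsets.reserve(sizes.size());
  std::size_t cursor = 0;
  for (std::size_t size : sizes) {
    offsets.push_back(cursor);
    cursor = AlignUp(cursor + size);
  }
  *total_bytes = cursor;
  return offsets;
}

// Widest power-of-two access width (at most 16 bytes) to which both
// addresses are aligned. Assumed fallback order: 16, 8, 4, 2, 1.
inline std::size_t AccessWidth(std::uintptr_t src, std::uintptr_t dst) {
  std::size_t width = 16;
  while (width > 1 && ((src | dst) % width != 0)) {
    width /= 2;
  }
  return width;
}
```

For tensors of 10, 32, and 1 bytes this yields destination offsets 0, 16, and 48 with a padded total of 64 bytes, so every destination start is eligible for 16-byte stores. The sketch ignores tail bytes when a tensor's size is not a multiple of the chosen width.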

Currently, the feature can be toggled via the environment variable HOROVOD_BATCH_D2D_MEMCOPIES, but that can be removed. I've left it in for now to enable simpler performance testing/comparison.

Signed-off-by: Josh Romero <joshr@nvidia.com>

tgaddair

LGTM! What are your thoughts on making this the default behavior?

horovod/common/ops/cuda/cuda_kernels.h

tgaddair · 2020-11-13T16:33:13Z

horovod/common/global_state.h

@@ -110,6 +110,9 @@ struct HorovodGlobalState {
  // benefit from a smaller chunk size.
  int64_t adasum_mpi_chunk_size = 1<<30;

+  // Enable use of batched d2d memcopy kernel on GPU
+  bool batch_d2d_memcopies = false;


Is the plan to change the default to true after more testing is done to verify performance improvements? Is there a chance that performance could degrade with this setting?

I think we should enable this by default, though it might be good to allow it to be disabled via the env var just in case people see performance degradations. I haven't seen any performance degradations in my testing on V100 cards. I will update the PR to default this to true. This will also make sure testing is run using the new kernel, since right now it is not picking it up.

Signed-off-by: Josh Romero <joshr@nvidia.com>

tgaddair · 2020-11-13T18:23:54Z

horovod/common/operations.cc

+      std::getenv(HOROVOD_BATCH_D2D_MEMCOPIES);
+  if (horovod_batch_d2d_memcopies != nullptr &&
+      std::strtol(horovod_batch_d2d_memcopies, nullptr, 10) > 0) {
+    state.batch_d2d_memcopies = true;


Since the default is now true, we should also change this behavior such that if the user specifies HOROVOD_BATCH_D2D_MEMCOPIES=0 it sets the value to false.

Yep, good catch. Just fixed this.
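The behavior requested above (default on, explicit `0` turns it off) could be expressed roughly like this. The fix commit itself is not shown in this thread, so this is only a sketch: `ParseBoolEnv` is a hypothetical helper, and the variable name is passed as a string literal here as an assumption, whereas the diff above uses a `HOROVOD_BATCH_D2D_MEMCOPIES` identifier.

```cpp
#include <cstdlib>

// Hypothetical helper: returns default_value when the variable is unset;
// otherwise returns true only if the value parses to a positive integer.
inline bool ParseBoolEnv(const char* value, bool default_value) {
  if (value == nullptr) {
    return default_value;
  }
  return std::strtol(value, nullptr, 10) > 0;
}

// Reads the toggle with a default of true, so only an explicit
// non-positive value such as "0" disables the batched kernel.
inline bool BatchD2DMemcopiesEnabled() {
  return ParseBoolEnv(std::getenv("HOROVOD_BATCH_D2D_MEMCOPIES"), true);
}
```

Note that with `std::strtol`, a non-numeric value such as `true` parses as 0 and would therefore disable the feature in this sketch.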

Signed-off-by: Josh Romero <joshr@nvidia.com>

github-actions · 2020-11-13T22:55:16Z

Unit Test Results

    542 files +  21   542 suites +21 4h 27m 42s ⏱️ +59s
    521 tests ±    0   494 ✔️ -     1     26 💤 ±    0 1 ❌ +1
10 540 runs +426 8 406 ✔️ +295 2 133 💤 +130 1 ❌ +1

For more details on these failures, see this check.

Results for commit 18cfedd. ± Comparison against base commit c7a48a0.

Adding support for batched D2D memcopy kernel on GPU.

5446245

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh force-pushed the cuda_batch_d2d_pr branch from fbb8ce2 to 5446245 Compare November 11, 2020 19:37

romerojosh requested a review from tgaddair November 12, 2020 20:08

tgaddair approved these changes Nov 13, 2020

Set batch d2d default to true.

1158677

Signed-off-by: Josh Romero <joshr@nvidia.com>

tgaddair reviewed Nov 13, 2020

Fix env var logic to disable batch d2d memcopies correctly.

7ba209e

Signed-off-by: Josh Romero <joshr@nvidia.com>

tgaddair merged commit 18cfedd into horovod:master Nov 13, 2020

kvignesh1420 mentioned this pull request Jul 2, 2022

[gpu_operations] add support for batched memory copies in GPUAllgather #3590

Merged
