
Adding support for batched D2D memcopy kernel on GPU. #2435

Merged · 3 commits into horovod:master · Nov 13, 2020

Conversation

romerojosh (Collaborator)

This PR introduces a batched device-to-device (D2D) memcopy via a CUDA kernel to replace the individual calls to cudaMemcpyAsync when packing the fusion buffer. As with any kernel fusion, this reduces latency by replacing many small kernel launches with fewer, larger ones. While these packing D2D memcopies are generally hidden behind compute, in high-performance cases that require low latency from Horovod, fusing them improves performance. Even in more typical cases, it can still help by reducing any exposed Horovod processing time.

For performance and simplicity, this implementation pads the destination fusion buffer offset for each input tensor to a 16-byte-aligned address, allowing the CUDA kernel to use 16-byte loads/stores, assuming the input tensors are also 16-byte aligned. This should hold in most cases, but the kernel will fall back to smaller load/store sizes if an input is not 16-byte aligned.

Currently, the feature can be toggled via the environment variable HOROVOD_BATCH_D2D_MEMCOPIES, though that toggle could be removed. I've left it in for now to enable simpler performance testing and comparison.


Signed-off-by: Josh Romero <joshr@nvidia.com>

@tgaddair (Collaborator) left a comment

LGTM! What are your thoughts on making this the default behavior?

horovod/common/ops/cuda/cuda_kernels.h
@@ -110,6 +110,9 @@ struct HorovodGlobalState {
// benefit from a smaller chunk size.
int64_t adasum_mpi_chunk_size = 1<<30;

// Enable use of batched d2d memcopy kernel on GPU
bool batch_d2d_memcopies = false;
Collaborator:

Is the plan to change the default to true after more testing is done to verify performance improvements? Is there a chance that performance could degrade with this setting?

Collaborator (Author):

I think we should enable this by default, though it might be good to keep the env var so it can be disabled just in case people see performance degradations. I haven't seen any degradation in my testing on V100 cards. I will update the PR to default this to true, which will also ensure testing is run using the new kernel, since right now CI is not picking it up.

Signed-off-by: Josh Romero <joshr@nvidia.com>
auto horovod_batch_d2d_memcopies =
    std::getenv(HOROVOD_BATCH_D2D_MEMCOPIES);
if (horovod_batch_d2d_memcopies != nullptr &&
    std::strtol(horovod_batch_d2d_memcopies, nullptr, 10) > 0) {
  state.batch_d2d_memcopies = true;
Collaborator:

Since the default is now true, we should also change this behavior such that if the user specifies HOROVOD_BATCH_D2D_MEMCOPIES=0 it sets the value to false.

Collaborator (Author):

Yep, good catch. Just fixed this.

Signed-off-by: Josh Romero <joshr@nvidia.com>

@tgaddair tgaddair merged commit 18cfedd into horovod:master Nov 13, 2020
@github-actions
Unit Test Results

542 files (+21) · 542 suites (+21) · 4h 27m 42s ⏱️ (+59s)
521 tests (±0): 494 ✔️ (−1) · 26 💤 (±0) · 1 ❌ (+1)
10 540 runs (+426): 8 406 ✔️ (+295) · 2 133 💤 (+130) · 1 ❌ (+1)

For more details on these failures, see this check.

Results for commit 18cfedd. ± Comparison against base commit c7a48a0.

Labels: none yet
2 participants