New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for batched D2D memcopy kernel on GPU. #2435
Conversation
This comment has been minimized.
This comment has been minimized.
Signed-off-by: Josh Romero <joshr@nvidia.com>
fbb8ce2
to
5446245
Compare
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! What are your thoughts on making this the default behavior?
horovod/common/global_state.h
Outdated
@@ -110,6 +110,9 @@ struct HorovodGlobalState { | |||
// benefit from a smaller chunk size. | |||
int64_t adasum_mpi_chunk_size = 1<<30; | |||
|
|||
// Enable use of batched d2d memcopy kernel on GPU | |||
bool batch_d2d_memcopies = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the plan to change the default to true after more testing is done to verify performance improvements? Is there a chance that performance could degrade with this setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should enable this by default, though it might be good to allow it to be disabled via the env var just in case people see performance degradations. I haven't seen any performance degradations in my testing on V100 cards. I will update the PR to default this to true. This will also make sure testing is run using the new kernel, since right now it is not picking it up.
Signed-off-by: Josh Romero <joshr@nvidia.com>
horovod/common/operations.cc
Outdated
std::getenv(HOROVOD_BATCH_D2D_MEMCOPIES); | ||
if (horovod_batch_d2d_memcopies != nullptr && | ||
std::strtol(horovod_batch_d2d_memcopies, nullptr, 10) > 0) { | ||
state.batch_d2d_memcopies = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the default is now true
, we should also change this behavior such that if the user specifies HOROVOD_BATCH_D2D_MEMCOPIES=0
it sets the value to false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, good catch. Just fixed this.
Signed-off-by: Josh Romero <joshr@nvidia.com>
This comment has been minimized.
This comment has been minimized.
Unit Test Results 542 files + 21 542 suites +21 4h 27m 42s ⏱️ +59s For more details on these failures, see this check. Results for commit 18cfedd. ± Comparison against base commit c7a48a0. |
This PR introduces a batched device-to-device (D2D) memcopy via a CUDA kernel to replace the individual calls to
cudaMemcpyAsync
when packing the fusion buffer. As with any type of kernel fusion, this reduces latency by limiting the number of small kernels calls and instead launching fewer, larger kernels. While these packing D2D memcopies are generally hidden behind compute, in high performance cases requiring low latency from Horovod, fusing these D2D memcopies improves performance. Even in more typical cases, this can still improve the performance by reducing any exposed Horovod processing time.For performance and simplicity, this implementation will pad the destination fusion buffer address for each input tensor to a 16 byte aligned address to allow for using the 16 byte loads/stores in the CUDA kernel, assuming the input tensors are also 16 byte aligned. This should be true in most cases, but the kernel will adjust the load/store size if the input is not 16 byte aligned.
Currently, the feature is toggle-able via environment variable
HOROVOD_BATCH_D2D_MEMCOPIES
but that can be removed. I've left it in for now to enable simpler performance testing/comparison.