New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence #6753
Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence #6753
Conversation
Retest this please |
Retest this please. |
The failures in two Cuda build are appear to be unrelated to this pull request and more of a machine issue. |
inline static std::set<int> cuda_devices = {}; | ||
inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice = | ||
{}; | ||
inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {}; | ||
inline static std::map<int, std::mutex> constantMemMutexPerDevice; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you put much thought whether to use std::{map,set} versus their unordered counterpart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I assume these are holding few elements and would never be called in a performance-critical code path.
If we cared about performance, we would rather use a std::vector
(or small_vector
) here anyway and manually search through it where necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(with the = {}
change for consistency)
Co-authored-by: Damien L-G <dalg24@gmail.com>
Failures are unrelated. I intend to merge when/if the last CUDA build passes
|
Part of #6091. Some of the
static
CudaInternal
variables should be created per device as the comments in the code are already saying. Namely, these areconstantMemHostStaging
,constantMemReusable
, andconstantMemMutex
.This pull request turns each of them into a
static std::map
that is lazily initialized. To properly deallocate the respective memory inimpl_finalize()
, we also have to store the devices thatKokkos
is using inside another lazily initializedstd::set
.This allows us to also update
cuda_device_synchronize
resp.Kokkos::fence
already to fence all devices that we know of.I decided to
fence
all devices incuda_device_synchronize
instead of adding adevice id
parameter and run through all of them at the call site since it appeared that this matches the implied behavior the best and requires fewer changes.