Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence #6753

Merged

masterleinad
Contributor

@masterleinad masterleinad commented Jan 26, 2024

Part of #6091. Some of the static CudaInternal variables should be created per device, as the comments in the code already say: constantMemHostStaging, constantMemReusable, and constantMemMutex.
This pull request turns each of them into a lazily initialized static std::map. To properly deallocate the respective memory in impl_finalize(), we also have to store the devices that Kokkos is using in another lazily initialized std::set.
This also allows us to update cuda_device_synchronize (and thus Kokkos::fence) to fence all devices that we know of.
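
As a rough illustration of the per-device bookkeeping described here, a minimal sketch in plain C++ follows. It is not the Kokkos implementation: DeviceRegistry, staging_for, and finalize are hypothetical names, a std::vector stands in for the device staging memory, and no CUDA runtime calls are made.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical stand-in for per-device state; these names are illustrative
// and are NOT the identifiers used in Kokkos.
struct DeviceRegistry {
  // Device ids seen so far; an id is recorded the first time it is used.
  inline static std::set<int> devices = {};
  // One staging buffer and one mutex per device id, created on demand.
  inline static std::map<int, std::vector<unsigned long>> staging = {};
  inline static std::map<int, std::mutex> mutexes = {};

  // Create the entries for `device` on first use and return its buffer.
  static std::vector<unsigned long>& staging_for(int device) {
    assert(device >= 0);
    devices.insert(device);
    // operator[] default-constructs the (non-movable) mutex in place.
    (void)mutexes[device];
    auto it = staging.find(device);
    if (it == staging.end())
      it = staging.emplace(device, std::vector<unsigned long>(16, 0ul)).first;
    return it->second;
  }

  // Release everything that was created, walking the recorded device ids.
  // Returns the number of devices that were cleaned up.
  static int finalize() {
    int released = 0;
    for (int device : devices) {
      staging.erase(device);
      mutexes.erase(device);
      ++released;
    }
    devices.clear();
    return released;
  }
};
```

The set of device ids is what lets the cleanup step find every entry without knowing in advance which devices were touched.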

I decided to fence all devices in cuda_device_synchronize instead of adding a device id parameter and looping over all devices at each call site, since this appeared to match the implied behavior best and requires fewer changes.
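
The design choice can be sketched as follows, under stated assumptions: synchronize_all and sync_one are hypothetical names, and sync_one stands in for "select the device, then wait for it" (with the CUDA runtime that could be, for example, cudaSetDevice followed by cudaDeviceSynchronize). The actual Kokkos code may differ.

```cpp
#include <functional>
#include <set>

// Hypothetical helper: synchronize every known device in turn, so that
// callers do not have to pass a device id or write the loop themselves.
inline void synchronize_all(const std::set<int>& known_devices,
                            const std::function<void(int)>& sync_one) {
  // std::set iterates in ascending order of device id.
  for (int device : known_devices) sync_one(device);
}
```

Keeping the loop inside one function means existing call sites stay unchanged, whereas a device id parameter would require each of them to iterate over the devices.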

@masterleinad masterleinad changed the title Kokkos::fence should fence all devices Cuda mult-GPU support: Make some variables device-specific, update Kokkos::fence Jan 26, 2024
@masterleinad
Contributor Author

Retest this please

@masterleinad
Contributor Author

Retest this please.

@masterleinad masterleinad changed the title Cuda mult-GPU support: Make some variables device-specific, update Kokkos::fence Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence Jan 31, 2024
@masterleinad masterleinad marked this pull request as ready for review January 31, 2024 16:53
@masterleinad
Contributor Author

The failures in two Cuda builds appear to be unrelated to this pull request and look more like a machine issue.

Comment on lines 123 to 127
inline static std::set<int> cuda_devices = {};
inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice =
{};
inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {};
inline static std::map<int, std::mutex> constantMemMutexPerDevice;
Member

Did you put much thought into whether to use std::{map,set} versus their unordered counterparts?

Contributor Author

No, I assume these hold only a few elements and would never be accessed in a performance-critical code path.
If we cared about performance, we would rather use a std::vector (or small_vector) here anyway and manually search through it where necessary.
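
For illustration only, the vector-plus-manual-search alternative mentioned here could look like the sketch below. find_or_insert is a hypothetical helper, not something from the pull request, and it assumes the mapped type is default-constructible and movable (so it would not work directly for std::mutex).

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical flat alternative to std::map<int, T> for a handful of devices:
// a linear search over (device id, value) pairs.
template <class T>
T& find_or_insert(std::vector<std::pair<int, T>>& entries, int device) {
  auto it = std::find_if(
      entries.begin(), entries.end(),
      [device](const std::pair<int, T>& e) { return e.first == device; });
  if (it != entries.end()) return it->second;
  // Not found: append a default-constructed value for this device.
  entries.emplace_back(device, T{});
  return entries.back().second;
}
```

Unlike std::map, a later insertion can reallocate the vector and invalidate references returned earlier, which is one reason the map is the simpler choice when performance is not a concern.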

Member

@dalg24 dalg24 left a comment

(with the = {} change for consistency)

Co-authored-by: Damien L-G <dalg24@gmail.com>
@dalg24
Member

dalg24 commented Feb 1, 2024

Failures are unrelated. I intend to merge when/if the last CUDA build passes

  • massive runtime failures in HIP ROCm 5.6 build that are clearly unrelated
  • machine issue in gcc build
    [ 28%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_Serial1.dir/serial/TestSerial_HostSharedPtr.cpp.o
    Cannot contact CpuNode6: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@12972753:JNLP4-connect connection from 10.64.193.57/10.64.193.57:34306": Remote call on JNLP4-connect connection from 10.64.193.57/10.64.193.57:34306 failed. The channel is closing down or has closed down
    wrapper script does not seem to be touching the log file in /var/jenkins/workspace/Kokkos_PR-6753@tmp/durable-05336f30
    (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
  • in openmptarget build
    49: [ RUN      ] simd.device_reductions
    49: "PluginInterface" error: Failure to synchronize stream (nil): Error in cuStreamSynchronize: an illegal memory access was encountered
    49: Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
    49: Kokkos_OpenMPTarget_ParallelFor_Range.hpp:54:1: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
    49/49 Test #49: Kokkos_UnitTest_SIMD ........................Subprocess aborted***Exception:   1.27 sec

@dalg24 dalg24 merged commit 7d2ea72 into kokkos:develop Feb 2, 2024
30 of 31 checks passed

3 participants