Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence #6753

masterleinad · 2024-01-26T21:02:37Z

Part of #6091. Some of the static CudaInternal variables should be created per device, as the comments in the code already say. Namely, these are constantMemHostStaging, constantMemReusable, and constantMemMutex.
This pull request turns each of them into a static std::map that is lazily initialized. To properly deallocate the respective memory in impl_finalize(), we also have to store the devices that Kokkos is using in another lazily initialized container, a std::set.
This also allows us to update cuda_device_synchronize, and thereby Kokkos::fence, to fence all devices that we know of.
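The per-device bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `PerDeviceState`, `ensure_initialized`, and `finalize` are hypothetical names, and plain host memory stands in for the CUDA allocations so that the sketch compiles without the CUDA toolkit (C++17 is assumed for `inline static`).

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <set>

// Hypothetical stand-in for the per-device members of CudaInternal.
// Host memory replaces the real staging buffers so this runs without CUDA.
struct PerDeviceState {
  inline static std::set<int> devices = {};
  inline static std::map<int, std::unique_ptr<unsigned long[]>> staging = {};
  inline static std::map<int, std::mutex> mutexes;

  // Called when an instance starts using `device_id`.
  // Resources are created only the first time a device id is seen.
  static void ensure_initialized(int device_id, std::size_t n_words) {
    if (devices.insert(device_id).second) {
      staging[device_id] = std::make_unique<unsigned long[]>(n_words);
      mutexes[device_id];  // operator[] default-constructs the mutex in place
    }
  }

  // Plays the role of impl_finalize(): release everything for known devices.
  static void finalize() {
    for (int id : devices) {
      staging.erase(id);
      mutexes.erase(id);
    }
    devices.clear();
  }
};
```

The set of device ids is what makes the cleanup possible: without it, the finalization step would not know which devices own resources. Note that this sketch does not guard the lazy initialization itself against concurrent callers.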

I decided to fence all devices in cuda_device_synchronize instead of adding a device id parameter and running through all devices at the call site, since this appeared to match the implied behavior best and requires fewer changes.
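To illustrate the design choice, here is a hedged sketch of a "fence everything we know of" helper. The names `fence_all_devices` and `sync_one` are hypothetical; `sync_one` stands in for selecting a device and synchronizing it, which with the CUDA runtime would typically be `cudaSetDevice(id)` followed by `cudaDeviceSynchronize()`. A callback is used here only so the sketch runs without a GPU.

```cpp
#include <functional>
#include <set>

// Hypothetical helper: synchronize every device recorded in `known_devices`.
// Callers do not pass a device id, so existing call sites stay unchanged.
inline void fence_all_devices(const std::set<int>& known_devices,
                              const std::function<void(int)>& sync_one) {
  for (int id : known_devices) {
    sync_one(id);  // e.g. cudaSetDevice(id); cudaDeviceSynchronize();
  }
}
```

The alternative mentioned above, a device id parameter, would move this loop to every call site that needs a global fence.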

masterleinad · 2024-01-28T16:30:00Z

Retest this please

masterleinad · 2024-01-29T13:30:25Z

Retest this please.

masterleinad · 2024-01-31T16:54:34Z

The failures in two Cuda builds appear to be unrelated to this pull request and look more like a machine issue.

dalg24 · 2024-02-01T16:26:18Z

core/src/Cuda/Kokkos_Cuda_Instance.hpp

+  inline static std::set<int> cuda_devices = {};
+  inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice =
+      {};
+  inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {};
+  inline static std::map<int, std::mutex> constantMemMutexPerDevice;


Did you put much thought into whether to use std::{map,set} versus their unordered counterparts?

No, I assume these hold few elements and are never accessed in a performance-critical code path.
If we cared about performance, we would rather use a std::vector (or small_vector) here anyway and manually search through it where necessary.
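For reference, the flat-container alternative mentioned in this reply could look roughly like the sketch below. It is not part of the pull request; `find_for_device` is a hypothetical helper that does a linear search over (device id, value) pairs, which tends to be adequate for a handful of devices.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical lookup: returns a pointer to the value stored for
// `device_id`, or nullptr if that device has no entry yet.
template <class T>
T* find_for_device(std::vector<std::pair<int, T>>& entries, int device_id) {
  auto it = std::find_if(
      entries.begin(), entries.end(),
      [device_id](const auto& e) { return e.first == device_id; });
  return it == entries.end() ? nullptr : &it->second;
}
```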

dalg24

(with the = {} change for consistency)

Co-authored-by: Damien L-G <dalg24@gmail.com>

dalg24 · 2024-02-01T23:48:14Z

Failures are unrelated. I intend to merge when/if the last CUDA build passes.

massive runtime failures in HIP ROCm 5.6 build that are clearly unrelated
machine issue in gcc build

[ 28%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_Serial1.dir/serial/TestSerial_HostSharedPtr.cpp.o
Cannot contact CpuNode6: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@12972753:JNLP4-connect connection from 10.64.193.57/10.64.193.57:34306": Remote call on JNLP4-connect connection from 10.64.193.57/10.64.193.57:34306 failed. The channel is closing down or has closed down
wrapper script does not seem to be touching the log file in /var/jenkins/workspace/Kokkos_PR-6753@tmp/durable-05336f30
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

in openmptarget build

49: [ RUN      ] simd.device_reductions
49: "PluginInterface" error: Failure to synchronize stream (nil): Error in cuStreamSynchronize: an illegal memory access was encountered
49: Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
49: Kokkos_OpenMPTarget_ParallelFor_Range.hpp:54:1: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
49/49 Test #49: Kokkos_UnitTest_SIMD ........................Subprocess aborted***Exception:   1.27 sec

masterleinad added 2 commits January 26, 2024 16:39

Kokkos::fence should fence all devices

0794ade

Create a couple more variables per device

c473b3f

masterleinad changed the title ~~Kokkos::fence should fence all devices~~ Cuda mult-GPU support: Make some variables device-specific, update Kokkos::fence Jan 26, 2024

Don't forget desul::Impl::init_lock_arrays();

191003f

masterleinad mentioned this pull request Jan 31, 2024

Cuda multi-GPU support: Pass the correct device id to get_cuda_kernel_func_attributes #6767

Merged

masterleinad changed the title ~~Cuda mult-GPU support: Make some variables device-specific, update Kokkos::fence~~ Cuda multi-GPU support: Make some variables device-specific, update Kokkos::fence Jan 31, 2024

masterleinad marked this pull request as ready for review January 31, 2024 16:53

masterleinad added the Backend - CUDA label Feb 1, 2024

dalg24 reviewed Feb 1, 2024

Address reviewer comments

1b9118c

dalg24 reviewed Feb 1, 2024

dalg24 approved these changes Feb 1, 2024

Add {} for std::map initialization

871ab8f

Co-authored-by: Damien L-G <dalg24@gmail.com>

tcclevenger approved these changes Feb 1, 2024

dalg24 merged commit 7d2ea72 into kokkos:develop Feb 2, 2024
30 of 31 checks passed
