Allow using multiple CUDA devices #6091

Closed
wants to merge 25 commits

Conversation

@masterleinad (Contributor) commented Apr 28, 2023

Built on top of #5989. This pull request explores what multi-device support could look like, using TestCuda_InterOp_Streams. The proposed constructor for the execution space instance is

Cuda(int device_id, cudaStream_t stream);

assuming a cudaStream_t is provided as well. It shouldn't be a problem to allow for omitting that argument, though.
Of course, the tricky part is sharing device allocations between devices. deep_copy should work independently of the device, and we can allocate on the correct device if an execution space instance is provided in the View allocation.
For accessing allocations on a different device, the example uses cudaDeviceEnablePeerAccess, which of course comes at a cost. In the end, I think the user should be responsible for opting in if they want cross-device access.
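
To make the intended workflow concrete, here is a minimal sketch of how the proposed constructor might be used with two devices. It assumes the constructor lands as proposed above; error checking is omitted, and the labels, sizes, and device ids are only for illustration.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // One stream per device, created via plain CUDA runtime calls
    // (error checking omitted for brevity).
    cudaStream_t stream0, stream1;
    cudaSetDevice(0);
    cudaStreamCreate(&stream0);
    cudaSetDevice(1);
    cudaStreamCreate(&stream1);

    // Proposed constructor: tie each execution space instance to a device.
    Kokkos::Cuda cuda0(0, stream0);
    Kokkos::Cuda cuda1(1, stream1);

    // Passing an execution space instance in the allocation lets Kokkos
    // allocate on the corresponding device.
    Kokkos::View<double*, Kokkos::CudaSpace> a(Kokkos::view_alloc(cuda0, "a"), 100);
    Kokkos::View<double*, Kokkos::CudaSpace> b(Kokkos::view_alloc(cuda1, "b"), 100);

    // deep_copy is expected to work independently of the devices involved.
    Kokkos::deep_copy(cuda0, a, 1.0);
    Kokkos::deep_copy(cuda1, b, a);

    // Only needed if a kernel running on device 1 dereferences device-0
    // memory directly; this is the opt-in, potentially costly part.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/0, /*flags=*/0);

    cuda0.fence();
    cuda1.fence();
    // (Stream cleanup omitted for brevity.)
  }
  Kokkos::finalize();
}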

Of course, we need a corresponding pull request on the desul side if we decide that we care about supporting the use of arbitrary-size atomics on multiple GPUs within this pull request.

The other changes in this pull request show that it's crucial to only set the device in #5989 if we actually know which device to use. Setting the device to the default is more harmful than useful.

@crtrott (Member) commented May 4, 2023

I don't get your point, @masterleinad, regarding the default device. We always set a device, don't we? So the default execution space instance is going to be associated with a device based on our common decision process (kokkos-device-id, map-by, etc.). That default execution space and the associated device need to be used for any operation which doesn't get an execution space instance passed.

@masterleinad (Contributor, Author)

> I don't get your point, @masterleinad, regarding the default device. We always set a device, don't we? So the default execution space instance is going to be associated with a device based on our common decision process (kokkos-device-id, map-by, etc.). That default execution space and the associated device need to be used for any operation which doesn't get an execution space instance passed.

In the end, we have (or could infer or pass) an execution space instance, or at least the correct device, in most cases anyway, so the question is really only about the remaining cases.
While looking into making a simple example work with multiple devices, I discovered places where setting the device to the default device would be wrong, see https://github.com/kokkos/kokkos/pull/6091/files/a44c11735437a56eab8205b92d3ce5f02637dd94..63270b8f1b9208e130f3fc160a039831b8f5f963, or where it just makes the implementation more difficult because we have to figure out at which point we actually set m_cudaDev (cuda_kernel_arch). A pretty annoying case was get_cuda_kernel_func_attributes, where we don't have the correct device available and just not changing the device would (likely) do the right thing.

Of course, I see the failure in #5713, but we should really check how many places are left where we can't know the correct device to use. Off the top of my head, there is Kokkos::fence(), which should then just call cudaDeviceSynchronize on all devices we know about.
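
For illustration, such a global fence might look roughly like the following. The helper name is hypothetical, and the device registry is assumed to be something like the static cuda_devices set added elsewhere in this pull request; this is a sketch, not the actual implementation.

#include <set>
#include <cuda_runtime.h>

// Hypothetical helper: synchronize every CUDA device Kokkos knows about
// and restore the caller's active device afterwards.
void fence_all_known_devices(const std::set<int>& devices) {
  int original_device;
  cudaGetDevice(&original_device);
  for (int device_id : devices) {
    cudaSetDevice(device_id);
    cudaDeviceSynchronize();
  }
  cudaSetDevice(original_device);
}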

Comment on lines 87 to 91
Kokkos::Tools::Experimental::Impl::profile_fence_event<Kokkos::Cuda>(
"Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync",
Kokkos::Tools::Experimental::SpecialSynchronizationCases::
DeepCopyResourceSynchronization,
"Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync");
[&]() { KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(s)); });
masterleinad (Contributor, Author)

This was the only case where this overload of cuda_stream_synchronize was used and it felt easier to just inline the call explicitly.

Comment on lines 182 to 183
error_code = Impl::CudaInternal::singleton().cuda_malloc_async_wrapper(
&ptr, arg_alloc_size);
masterleinad (Contributor, Author)

I don't see a good reason to use the singleton here instead of the provided execution space instance. The default constructed execution space instance corresponds to the singleton anyway.
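
In other words, the suggestion is roughly the following. This is a sketch only; the accessor spelling and the exec_space variable name are assumptions based on the surrounding code, not the actual diff.

// Route the asynchronous allocation through the execution space instance
// that was passed in, rather than through the global singleton.
// 'exec_space' stands for the provided instance (hypothetical name).
error_code = exec_space.impl_internal_space_instance()->cuda_malloc_async_wrapper(
    &ptr, arg_alloc_size);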

@@ -183,6 +183,8 @@ class Cuda {

Cuda(cudaStream_t stream, bool manage_stream = false);

Cuda(int device_id, cudaStream_t stream, bool manage_stream = false);
masterleinad (Contributor, Author)

That's the proposed new constructor.
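
A hedged usage sketch follows; the manage_stream semantics are assumed to mirror the existing Cuda(cudaStream_t, bool) constructor, i.e. Kokkos takes over the stream's lifetime when manage_stream is true.

cudaStream_t stream;
cudaSetDevice(1);
cudaStreamCreate(&stream);

// Instance tied to device 1; with manage_stream = true the stream is
// assumed to be owned (and eventually destroyed) by Kokkos.
Kokkos::Cuda cuda1(1, stream, /*manage_stream=*/true);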

Comment on lines 30 to 32
void cuda_stream_synchronize(
const cudaStream_t stream,
Kokkos::Tools::Experimental::SpecialSynchronizationCases reason,
const std::string& name);
void cuda_device_synchronize(const std::string& name);
void cuda_stream_synchronize(const cudaStream_t stream,
const std::string& name);
masterleinad (Contributor, Author)

Ultimately, these functions don't need to be exposed since they are only used internally. Also note that the last overload didn't have an implementation.

Comment on lines 104 to 108
KOKKOS_IMPL_CUDA_SAFE_CALL(
cudaMalloc(reinterpret_cast<void **>(&d_arch), sizeof(int)));
KOKKOS_IMPL_CUDA_SAFE_CALL(
cudaMemcpy(d_arch, &arch, sizeof(int), cudaMemcpyDefault));

masterleinad (Contributor, Author)

We shouldn't change the device here.

Comment on lines +151 to +155
inline static std::set<int> cuda_devices = {};
inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice =
{};
inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {};
inline static std::map<int, std::mutex> constantMemMutexPerDevice;
masterleinad (Contributor, Author)

It's convenient to store cuda_devices in a static variable so we can iterate over them. std::set might not be the most performant choice here, but it is convenient, and it shouldn't matter much since we don't use it in performance-critical code. Similarly, std::map is mostly for convenience.

Comment on lines -253 to -229
template <bool setCudaDevice = true>
cudaError_t cuda_device_synchronize_wrapper() const {
if constexpr (setCudaDevice) set_cuda_device();
return cudaDeviceSynchronize();
}

masterleinad (Contributor, Author)

Not used anymore.

if constexpr (setCudaDevice) set_cuda_device();
return cudaStreamSynchronize(stream);
return cudaStreamSynchronize(get_stream<false>());
masterleinad (Contributor, Author)

We always have a valid stream and that is the only one to use here.

Comment on lines +139 to +140
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFuncGetAttributes(&attr, func));
masterleinad (Contributor, Author)

It turned out to be problematic to call this function with the wrong device active.
Again, the assumption is that all devices have the same properties, so it doesn't matter that this function uses a static variable.

masterleinad (Contributor, Author)

The test is basically the same as TestCuda_InterOp_Streams.cpp but duplicates all kernels so that they run interleaved on two different devices.
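
Schematically, the interleaving looks like this (a sketch, not the actual test code; cuda0/cuda1, view0/view1, and N are placeholders):

// Launch the same kernel on two devices back to back, then fence both.
Kokkos::parallel_for(
    Kokkos::RangePolicy<Kokkos::Cuda>(cuda0, 0, N),
    KOKKOS_LAMBDA(int i) { view0(i) = i; });
Kokkos::parallel_for(
    Kokkos::RangePolicy<Kokkos::Cuda>(cuda1, 0, N),
    KOKKOS_LAMBDA(int i) { view1(i) = i; });
cuda0.fence();
cuda1.fence();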

@masterleinad (Contributor, Author)

With this pull request, we still have

$ git grep -n "FIXME_CUDA_MULTIPLE_DEVICES"
core/src/Cuda/Kokkos_Cuda_Task.hpp:225:  // FIXME_CUDA_MULTIPLE_DEVICES
core/src/Cuda/Kokkos_Cuda_Task.hpp:465:  // FIXME_CUDA_MULTIPLE_DEVICES
core/src/Cuda/Kokkos_Cuda_ZeroMemset.hpp:39:    // FIXME_CUDA_MULTIPLE_DEVICES

I think we can discuss fixing tasks later, and ZeroMemset called without arguments just replaces running a functor on the default execution space instance anyway. Selecting the correct device matters here, though.

@masterleinad marked this pull request as ready for review on August 2, 2023 at 14:49.