Avoid calling wrapper functions with singleton for Cuda #6737
Conversation
Force-pushed from 3b6a6e5 to 69dce84.
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(Cuda().cuda_device()));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
How is default constructing Cuda better?
To me that indicates a conscious choice (as opposed to just not changing those instances), but I agree that it's debatable. I'm fine with whatever finds more support.
Cuda().cuda_device() == CudaInternal::singleton().m_cudaDev, correct? So it is the same.
We could call API functions from Cuda().impl_internal_space_instance()->cuda_..._wrapper().
My opinion is that having all calls run through the wrappers (with the exception of Cuda/CudaInternal initialization) makes it less likely to miss setting a device ID, since we need the device ID set to Cuda().impl_internal_space_instance()->m_cudaDev anyway, which is exactly what the wrappers do.
@tcclevenger I reverted changes to places where it doesn't make a difference or it's appropriate to use the default execution space instance.
Force-pushed from 75f3ea3 to c902473.
Is the description out of date, or did you miss these occurrences?
core/src/Cuda/Kokkos_CudaSpace.cpp:47: (CudaInternal::singleton().cuda_stream_create_wrapper(&s)));
core/src/Cuda/Kokkos_CudaSpace.cpp:70: KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper(
core/src/Cuda/Kokkos_CudaSpace.cpp:84: (CudaInternal::singleton().cuda_memcpy_async_wrapper(
core/src/Cuda/Kokkos_Cuda_Instance.cpp:149: (CudaInternal::singleton().cuda_device_synchronize_wrapper()));
core/src/Cuda/Kokkos_Cuda_Instance.cpp:154: (CudaInternal::singleton().cuda_device_synchronize_wrapper()));
What is the point of these cudaFuncSetAttributes changes if they do not resolve the issue?
@@ -468,10 +465,12 @@ class TaskQueueSpecializationConstrained<
  static void execute(scheduler_type const& scheduler) {
    const int shared_per_warp = 2048;
    const int warps_per_block = 4;
    const Kokkos::Cuda exec = Cuda();  // FIXME_CUDA_MULTIPLE_DEVICES
Why doesn't this one use scheduler.get_execution_space()?
Oh it was commented :/
And Daniel checked and that still does not work
@@ -168,18 +168,6 @@ void cuda_stream_synchronize(const cudaStream_t stream, const CudaInternal *ptr,
  });
}

void cuda_stream_synchronize(
Why did you inline this one?
        dst.data(), 0,
        dst.size() * sizeof(typename View<T, P...>::value_type))));
    cudaMemset(dst.data(), 0,
        dst.size() * sizeof(typename View<T, P...>::value_type)));
Wasn't the previous version running on the stream of the default exec space while this one is a blocking call?
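The semantic difference being raised can be sketched as follows. This is an illustrative CUDA fragment, not the Kokkos source; zero_fill and its parameters are hypothetical names. The two runtime calls differ in which stream the fill is ordered on, which changes its synchronization behavior relative to other streams.

```cuda
#include <cuda_runtime.h>

void zero_fill(void* dev_ptr, size_t bytes, cudaStream_t stream) {
  // Enqueued on the given stream: returns to the host immediately and is
  // ordered only with respect to other work on that stream.
  cudaMemsetAsync(dev_ptr, 0, bytes, stream);

  // Issued on the legacy default (NULL) stream, which synchronizes with
  // work on other blocking streams, so its ordering guarantees differ
  // from the async version above.
  cudaMemset(dev_ptr, 0, bytes);
}
```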
Daniel pointed out #6187.
We decided to split this pull request up.
Related to #6091. This pull request replaces all instances of calling CudaInternal::singleton with a wrapper, since these places are possibly using the wrong device. It seems better to be explicit even in cases where we want to use the default execution space instance/the default device. I would go through the unused wrapper functions and remove them in a follow-up pull request.