Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calling wrapper functions with singleton for Cuda #6737

Conversation

masterleinad
Copy link
Contributor

@masterleinad masterleinad commented Jan 23, 2024

Related to #6091. This pull request replaces all instances of calling CudaInternal::singleton with a wrapper since these places are possibly using the wrong device. It seems better to be explicit even in cases where we want to use the default execution space instance/the default device.

I would go through unused wrapper functions and remove them in a follow-up pull request.

@masterleinad masterleinad force-pushed the cuda_dont_use_singleton_wrapper branch from 3b6a6e5 to 69dce84 Compare January 23, 2024 18:55
@masterleinad masterleinad marked this pull request as ready for review January 23, 2024 18:56
Comment on lines 148 to 149
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(Cuda().cuda_device()));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is default constructing Cuda better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me that indicates a concise choice (as opposed to just not changing those instances) but I agree that it's debatable. I'm fine with whatever finds more support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cuda().cuda_device() = CudaInternal::singleton().m_cudaDev, correct? So it is the same.

We could call API functions from

Cuda().impl_internal_space_instance()->cuda_..._wrapper()

My opinion is having all calls run through the wrappers (with exception of Cuda/CudaInternal initialization) makes it less likely to miss setting a device ID, since we need the device ID set to Cuda().impl_internal_space_instance()->m_cudaDev anyways, which is exactly what the wrappers do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tcclevenger I reverted changes to places where it doesn't make a difference or it's appropriate to use the default execution space instance.

core/src/Cuda/Kokkos_Cuda_Task.hpp Outdated Show resolved Hide resolved
core/src/Cuda/Kokkos_Cuda_KernelLaunch.hpp Show resolved Hide resolved
Comment on lines 148 to 149
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(Cuda().cuda_device()));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cuda().cuda_device() = CudaInternal::singleton().m_cudaDev, correct? So it is the same.

We could call API functions from

Cuda().impl_internal_space_instance()->cuda_..._wrapper()

My opinion is having all calls run through the wrappers (with exception of Cuda/CudaInternal initialization) makes it less likely to miss setting a device ID, since we need the device ID set to Cuda().impl_internal_space_instance()->m_cudaDev anyways, which is exactly what the wrappers do.

@masterleinad masterleinad force-pushed the cuda_dont_use_singleton_wrapper branch from 75f3ea3 to c902473 Compare January 24, 2024 17:49
Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the description out of date or did you miss these occurrences

core/src/Cuda/Kokkos_CudaSpace.cpp:47:        (CudaInternal::singleton().cuda_stream_create_wrapper(&s)));
core/src/Cuda/Kokkos_CudaSpace.cpp:70:  KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper(
core/src/Cuda/Kokkos_CudaSpace.cpp:84:      (CudaInternal::singleton().cuda_memcpy_async_wrapper(
core/src/Cuda/Kokkos_Cuda_Instance.cpp:149:            (CudaInternal::singleton().cuda_device_synchronize_wrapper()));
core/src/Cuda/Kokkos_Cuda_Instance.cpp:154:            (CudaInternal::singleton().cuda_device_synchronize_wrapper()));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of these cudaFuncSetAttributes changes if they do not resolve the issue?

@@ -468,10 +465,12 @@ class TaskQueueSpecializationConstrained<
static void execute(scheduler_type const& scheduler) {
const int shared_per_warp = 2048;
const int warps_per_block = 4;
const Kokkos::Cuda exec = Cuda(); // FIXME_CUDA_MULTIPLE_DEVICES
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this one do scheduler.get_execution_space()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it was commented :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Daniel checked and that still does not work

@@ -168,18 +168,6 @@ void cuda_stream_synchronize(const cudaStream_t stream, const CudaInternal *ptr,
});
}

void cuda_stream_synchronize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you inline this one?

dst.data(), 0,
dst.size() * sizeof(typename View<T, P...>::value_type))));
cudaMemset(dst.data(), 0,
dst.size() * sizeof(typename View<T, P...>::value_type)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't the previous version running on the stream of the default exec space while this one is a blocking call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Daniel pointed out #6187

@masterleinad
Copy link
Contributor Author

We decided to split this pull request up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants