Avoid calling wrapper functions with singleton for Cuda #6737
Conversation
Force-pushed from 3b6a6e5 to 69dce84.
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(Cuda().cuda_device()));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
How is default constructing Cuda better?
To me that indicates a conscious choice (as opposed to just not changing those instances), but I agree that it's debatable. I'm fine with whatever finds more support.
Cuda().cuda_device() == CudaInternal::singleton().m_cudaDev, correct? So it is the same.
We could call API functions from Cuda().impl_internal_space_instance()->cuda_..._wrapper().
My opinion is that having all calls run through the wrappers (with the exception of Cuda/CudaInternal initialization) makes it less likely to miss setting a device ID, since we need the device ID set to Cuda().impl_internal_space_instance()->m_cudaDev anyway, which is exactly what the wrappers do.
@tcclevenger I reverted changes to places where it doesn't make a difference or it's appropriate to use the default execution space instance.
Force-pushed from 75f3ea3 to c902473.
Is the description out of date, or did you miss these occurrences?
core/src/Cuda/Kokkos_CudaSpace.cpp:47: (CudaInternal::singleton().cuda_stream_create_wrapper(&s)));
core/src/Cuda/Kokkos_CudaSpace.cpp:70: KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper(
core/src/Cuda/Kokkos_CudaSpace.cpp:84: (CudaInternal::singleton().cuda_memcpy_async_wrapper(
core/src/Cuda/Kokkos_Cuda_Instance.cpp:149: (CudaInternal::singleton().cuda_device_synchronize_wrapper()));
core/src/Cuda/Kokkos_Cuda_Instance.cpp:154: (CudaInternal::singleton().cuda_device_synchronize_wrapper()));
What is the point of these cudaFuncSetAttributes changes if they do not resolve the issue?
@@ -468,10 +465,12 @@ class TaskQueueSpecializationConstrained<
  static void execute(scheduler_type const& scheduler) {
    const int shared_per_warp = 2048;
    const int warps_per_block = 4;
    const Kokkos::Cuda exec = Cuda();  // FIXME_CUDA_MULTIPLE_DEVICES
Why doesn't this one use scheduler.get_execution_space()?
Oh it was commented :/
And Daniel checked and that still does not work
@@ -168,18 +168,6 @@ void cuda_stream_synchronize(const cudaStream_t stream, const CudaInternal *ptr,
  });
}

void cuda_stream_synchronize(
Why did you inline this one?
        dst.data(), 0,
        dst.size() * sizeof(typename View<T, P...>::value_type))));
    cudaMemset(dst.data(), 0,
        dst.size() * sizeof(typename View<T, P...>::value_type)));
Wasn't the previous version running on the stream of the default exec space while this one is a blocking call?
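The semantic difference being raised can be sketched as follows. This is an illustrative CUDA fragment, not the Kokkos source; zero_fill and its parameters are hypothetical names. The two runtime calls differ in which stream the fill is ordered on, which changes its synchronization behavior relative to other streams.

```cuda
#include <cuda_runtime.h>

void zero_fill(void* dev_ptr, size_t bytes, cudaStream_t stream) {
  // Enqueued on the given stream: returns to the host immediately and is
  // ordered only with respect to other work on that stream.
  cudaMemsetAsync(dev_ptr, 0, bytes, stream);

  // Issued on the legacy default (NULL) stream, which synchronizes with
  // work on other blocking streams, so its ordering guarantees differ
  // from the async version above.
  cudaMemset(dev_ptr, 0, bytes);
}
```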
Daniel pointed out #6187.
We decided to split this pull request up.
Related to #6091. This pull request replaces all instances of calling CudaInternal::singleton with a wrapper, since these places are possibly using the wrong device. It seems better to be explicit even in cases where we want to use the default execution space instance/the default device. I would go through the unused wrapper functions and remove them in a follow-up pull request.