Create cudaAPI function wrappers #6299
I personally like this approach better.
I'm starting to agree. Extra ~100 lines of code are worth the simplicity. Any opinion on putting the function definitions in a separate header or the cpp file?
Looks reasonable to me.
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
-      ptr, bytes, space.cuda_device(), space.cuda_stream()));
+  KOKKOS_IMPL_CUDA_SAFE_CALL(
+      (space.impl_internal_space_instance()->cuda_mem_prefetch_async_wrapper(
+          ptr, bytes, space.cuda_device(), space.cuda_stream())));
@masterleinad Does it make sense to use m_cudaDev here? Or do we ever need a different CUDA device than the instance's device?
In my opinion, we should always use the instance's member variables in these wrappers and not even expose a device id or a stream in the interface.
Looks good to me. When adding support for multiple devices we will need to again review all the places where the singleton is used anyway.
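To illustrate the suggestion above, here is a minimal sketch of a wrapper that takes the device id and stream from the instance's members instead of from its arguments. It is not the code in this PR: the CUDA runtime is replaced by mock functions so the snippet compiles without CUDA, and `InstanceSketch`, `mem_prefetch_async_wrapper`, and every `mock_*` name are hypothetical. Only `m_cudaDev` is taken from the comment above.

```cpp
#include <cstddef>
#include <vector>

namespace member_sketch {

// Mock stand-ins for the CUDA runtime (assumption: real code would use
// cudaError_t, cudaStream_t, cudaSetDevice and cudaMemPrefetchAsync).
using mock_error_t  = int;
using mock_stream_t = int;
const mock_error_t mock_success = 0;

// One record per mocked prefetch call, so the effect can be inspected.
struct MockCall {
  int device;
  mock_stream_t stream;
  std::size_t bytes;
};

inline std::vector<MockCall>& mock_log() {
  static std::vector<MockCall> log;
  return log;
}

inline int& mock_current_device() {
  static int device = 0;
  return device;
}

inline mock_error_t mock_set_device(int device) {
  mock_current_device() = device;
  return mock_success;
}

inline mock_error_t mock_mem_prefetch_async(const void*, std::size_t bytes,
                                            int device, mock_stream_t stream) {
  mock_log().push_back(MockCall{device, stream, bytes});
  return mock_success;
}

// Hypothetical instance class, loosely modeled on the CudaInternal discussed
// in this thread.
class InstanceSketch {
 public:
  InstanceSketch(int device, mock_stream_t stream)
      : m_cudaDev(device), m_stream(stream) {}

  // The interface exposes neither a device id nor a stream: both come from
  // the instance's members.
  mock_error_t mem_prefetch_async_wrapper(const void* ptr,
                                          std::size_t bytes) const {
    mock_error_t err = mock_set_device(m_cudaDev);
    if (err != mock_success) return err;
    return mock_mem_prefetch_async(ptr, bytes, m_cudaDev, m_stream);
  }

 private:
  int m_cudaDev;
  mock_stream_t m_stream;
};

}  // namespace member_sketch
```

With this shape a call site cannot pass a device that disagrees with the instance, which is the property the comment above argues for.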
@@ -43,7 +43,8 @@
 cudaStream_t Kokkos::Impl::cuda_get_deep_copy_stream() {
   static cudaStream_t s = nullptr;
   if (s == nullptr) {
-    cudaStreamCreate(&s);
+    KOKKOS_IMPL_CUDA_SAFE_CALL(
+        (CudaInternal::singleton().cuda_stream_create_wrapper(&s)));
Based on the discussion of Wed 26 July, can we please open an issue to follow up on where we can change the name of this singleton to something else, since it is not a singleton?
You will have to resolve conflicts.
Since all calls to cudaAPI are where wrappers are defined, remove includes from other parts of Cuda
- allow for different stream as input
- default to stream from instance
- add helper function "get_input_stream" for selecting correct stream
Variable will soon become non-static. Static information is unnecessary.
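The stream-selection commit above can be pictured with the following sketch. It assumes a null stream argument is the sentinel for "use the instance's stream". The commit message does not say how the selection is made, so that rule and all names except `get_input_stream` are hypothetical, and the CUDA call is mocked so the snippet compiles without CUDA.

```cpp
#include <cstddef>

namespace stream_sketch {

// Mock stream handle (assumption: real code would use cudaStream_t).
struct MockStream {
  int id;
};
using mock_stream_t = MockStream*;

// Remembers which stream the last mocked async call was issued on.
inline mock_stream_t& last_stream_used() {
  static mock_stream_t stream = nullptr;
  return stream;
}

// Stand-in for an asynchronous runtime call such as cudaMemsetAsync.
inline int mock_memset_async(void*, int, std::size_t, mock_stream_t stream) {
  last_stream_used() = stream;
  return 0;  // 0 plays the role of cudaSuccess
}

class InstanceSketch {
 public:
  explicit InstanceSketch(mock_stream_t stream) : m_stream(stream) {}

  // Select the stream to use: the caller's stream if one was given,
  // otherwise the stream owned by this instance.
  mock_stream_t get_input_stream(mock_stream_t s) const {
    return s == nullptr ? m_stream : s;
  }

  // The stream parameter is optional and defaults to the instance's stream.
  int memset_async_wrapper(void* ptr, int value, std::size_t count,
                           mock_stream_t stream = nullptr) const {
    return mock_memset_async(ptr, value, count, get_input_stream(stream));
  }

 private:
  mock_stream_t m_stream;
};

}  // namespace stream_sketch
```

One consequence of a null sentinel is that a caller cannot explicitly request the CUDA default stream, since that is also represented by a null handle. Whether that matters depends on how the wrappers are used.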
89a1c4c to c893105
As a follow-on optimization, can we identify calls that will always be preceded by another call, so one wouldn't need to call setCudaDevice?
I thought we already convinced ourselves that the performance impact of this pull request is negligible?
Yeah I guess so.
I had that at one point in a previous commit on the other PR, but like Daniel said there wasn't a real performance impact, so I removed it since it made the already complicated-looking template params even more complicated. Now that the calls are much simpler (with no templates needed for inputs), I'll add it back in a follow-up branch and get some performance numbers.
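One way the optimization discussed above could look is a compile-time flag on each wrapper that lets a call site skip the set-device step when an earlier wrapper call has already selected the device. This is only a sketch of the idea. The thread does not show how the earlier commit did it, the runtime is mocked, and `SetDevice`, `InstanceSketch`, `record_then_sync`, and the `mock_*` functions are hypothetical names.

```cpp
namespace setdev_sketch {

// Counters so the effect of skipping the set-device call can be observed.
inline int& set_device_calls() {
  static int n = 0;
  return n;
}

inline int& api_calls() {
  static int n = 0;
  return n;
}

// Mock stand-ins for cudaSetDevice and two arbitrary runtime calls.
inline int mock_set_device(int) {
  ++set_device_calls();
  return 0;  // 0 plays the role of cudaSuccess
}

inline int mock_event_record() {
  ++api_calls();
  return 0;
}

inline int mock_event_synchronize() {
  ++api_calls();
  return 0;
}

class InstanceSketch {
 public:
  explicit InstanceSketch(int device) : m_cudaDev(device) {}

  // By default every wrapper selects the instance's device first.
  template <bool SetDevice = true>
  int event_record_wrapper() const {
    if (SetDevice) {  // compile-time constant; the branch folds away
      int err = mock_set_device(m_cudaDev);
      if (err != 0) return err;
    }
    return mock_event_record();
  }

  template <bool SetDevice = true>
  int event_synchronize_wrapper() const {
    if (SetDevice) {
      int err = mock_set_device(m_cudaDev);
      if (err != 0) return err;
    }
    return mock_event_synchronize();
  }

  // A call site that knows the first wrapper already set the device,
  // so the second one opts out of setting it again.
  int record_then_sync() const {
    int err = event_record_wrapper();
    if (err != 0) return err;
    return event_synchronize_wrapper<false>();
  }

 private:
  int m_cudaDev;
};

}  // namespace setdev_sketch
```

In this sketch `record_then_sync` issues two runtime calls but only one set-device call. Whether the saving is measurable is exactly what the performance numbers mentioned above would need to show.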
CUDA 11.6 failed with non-available container, but all other CUDA builds passed so I am merging.
Thanks @crtrott. Setting up an issue for HIP version.
An alternative to #5989. Here I use individual wrappers for each cudaAPI call.
Benefits
Negative
Kokkos_Cuda_Instance.hpp where wrappers are defined
Notes
CudaInternal class much more readable (at the cost of more code lines)
Ping @crtrott.