Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create cudaAPI function wrappers #6299

Conversation

tcclevenger
Copy link
Contributor

An alternative to #5989. Here I use individual wrappers for each cudaAPI call.

Benefits

  • Much easier to read where wrapper is called (vs. function ptrs)
  • Reading runtime errors give location of call with name matching cudaAPI function name (difference only in case)
  • Less code where function is called (vs. function ptrs where type casts are required)

Negative

  • Much more code in Kokkos_Cuda_Instance.hpp where wrappers are defined

Notes

  • I did not use the wrappers in the unit tests
  • A separate impl file could be added for the wrappers to make CudaInternal class much more readable (at the cost of more code lines)

Ping @crtrott.

@crtrott
Copy link
Member

crtrott commented Jul 21, 2023

I personally like this approach better.

@tcclevenger
Copy link
Contributor Author

I personally like this approach better.

I'm starting to agree. Extra ~100 lines of code are worth the simplicity. Any opinion on putting the function definitions in a separate header or the cpp file?

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

ptr, bytes, space.cuda_device(), space.cuda_stream()));
KOKKOS_IMPL_CUDA_SAFE_CALL(
(space.impl_internal_space_instance()->cuda_mem_prefetch_async_wrapper(
ptr, bytes, space.cuda_device(), space.cuda_stream())));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masterleinad Does it make sense to use m_cudaDev here? Or do we ever need a different cuda device than the Instance device.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should always use the instance's member variables in these wrappers and not even expose a device id or a stream in the interface.

@tcclevenger tcclevenger marked this pull request as ready for review July 26, 2023 16:42
Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. When adding support for multiple devices we will need to again review all the places where the singleton is used anyway.

@@ -43,7 +43,8 @@
cudaStream_t Kokkos::Impl::cuda_get_deep_copy_stream() {
static cudaStream_t s = nullptr;
if (s == nullptr) {
cudaStreamCreate(&s);
KOKKOS_IMPL_CUDA_SAFE_CALL(
(CudaInternal::singleton().cuda_stream_create_wrapper(&s)));
Copy link
Contributor

@fnrizzi fnrizzi Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the discussion of Wed 26 july, can we please open an issue to follow up to where we can the name of this singleton into something else since it is not a singleton?

@masterleinad
Copy link
Contributor

You will have to resolve conflicts.

Thomas Conrad Clevenger and others added 7 commits July 27, 2023 09:11
Since all calls to cudaAPI are where wrappers are defined, remove includes from other parts of Cuda
- allow for different stream as input
- default to stream from instance
- add helper function "get_input_stream" for selecting correct stream
Variable will soon become non-static. Static information is unecessary.
@tcclevenger tcclevenger force-pushed the thread_saftey_for_cuda_api_calls_individual_wrappers branch from 89a1c4c to c893105 Compare July 27, 2023 15:12
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on optimization can we identify calls which always will be preceded by another call so one wouldn't need to call setCudaDevice?

@masterleinad
Copy link
Contributor

As a follow on optimization can we identify calls which always will be preceded by another call so one wouldn't need to call setCudaDevice?

I thought we already convinced ourselves that the performance impact of this pull request is negligible?

@crtrott
Copy link
Member

crtrott commented Jul 27, 2023

Yeah I guess so.

@tcclevenger
Copy link
Contributor Author

As a follow on optimization can we identify calls which always will be preceded by another call so one wouldn't need to call setCudaDevice?

I had that at one point in a previous commit on the other PR, but like Daniel said there wasn't a real performance impact, so I removed it since it made the already complicated looking template params even more complicated. Now that the calls are much simpler (with no templates needed for inputs), I'll add back in a follow up branch and get some performance numbers.

@crtrott
Copy link
Member

crtrott commented Jul 28, 2023

CUDA 11.6 failed with non-available container, but all other CUDA builds passed so I am merging.

@crtrott crtrott merged commit 39de959 into kokkos:develop Jul 28, 2023
27 of 28 checks passed
@tcclevenger tcclevenger deleted the thread_saftey_for_cuda_api_calls_individual_wrappers branch July 31, 2023 14:06
@tcclevenger
Copy link
Contributor Author

Thanks @crtrott. Setting up an issue for HIP version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants