Cuda: Detect device from stream for multi-GPU support #6361

Merged: 14 commits merged into kokkos:develop from the cuda_multiple_devices_constructor branch on Dec 21, 2023

Conversation

@masterleinad (Contributor) commented on Aug 16, 2023

Part of #6091. We agreed that

Cuda(int device_id, cudaStream_t stream);

is the constructor we want (without an option to let Kokkos manage the stream).

We decided to query the device id to use from the stream passed in.
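For illustration, here is a minimal usage sketch of the stream-only construction implied by that decision (a hypothetical example: the device number and the kernel body are placeholders). The user creates a stream on a particular device and hands only that stream to Kokkos, which detects the device from it.

#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Create a stream on device 1 (assumes at least two GPUs are visible).
    cudaSetDevice(1);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    {
      // The device id is queried from the stream; no device argument is passed.
      Kokkos::Cuda exec(stream);

      Kokkos::parallel_for(
          Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, 1000),
          KOKKOS_LAMBDA(const int /*i*/) {});
      exec.fence();
    }
    cudaStreamDestroy(stream);
  }
  Kokkos::finalize();
  return 0;
}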

@masterleinad marked this pull request as ready for review on August 17, 2023 02:24
@crtrott (Member) left a comment

I would like to see a test for this. Specifically, I would like to see a CUDA-specific test where we query the number of GPUs, create a std::thread for each GPU, and have each thread run an independent kernel. As a sanity check, that test should use a kernel that occupies the entire GPU and runs reasonably long (>100 ms), and it should compare the total time with n threads against the individual times, i.e. verify that the kernels actually overlap. We probably need to guard that test somehow, since it is going to be problematic inside Trilinos testing or in any environment that doesn't guard against GPU oversubscription by multiple processes.
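For concreteness, a rough sketch of such a test might look like the following. This is a hypothetical example, not the test that was eventually added: the workload size, the inner loop, and the timing comparison are placeholders, and it assumes the stream-based Kokkos::Cuda construction discussed in this pull request.

#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Run a long kernel (intended to take well over 100 ms) on the given device.
void run_on_device(int device) {
  cudaSetDevice(device);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  {
    Kokkos::Cuda exec(stream);  // device id is detected from the stream
    double result = 0.;
    Kokkos::parallel_reduce(
        Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, 1 << 24),
        KOKKOS_LAMBDA(const int i, double& sum) {
          double x = static_cast<double>(i);
          for (int k = 0; k < 1000; ++k) x = x * 1.0000001 + 1.;
          sum += x;
        },
        result);
    exec.fence();
  }
  cudaStreamDestroy(stream);
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);

    // Time one device alone.
    auto t0 = std::chrono::steady_clock::now();
    run_on_device(0);
    double t_single =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // Time all devices running concurrently, one std::thread per GPU.
    t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int d = 0; d < n_devices; ++d) threads.emplace_back(run_on_device, d);
    for (auto& t : threads) t.join();
    double t_all =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // If the kernels overlap, the concurrent run should take much less than
    // n_devices * t_single (the exact acceptance factor would need tuning).
    std::printf("single device: %f s, %d devices concurrently: %f s\n",
                t_single, n_devices, t_all);
  }
  Kokkos::finalize();
  return 0;
}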

@masterleinad (Contributor, Author)

This really only adds the constructor and prepares CudaInternal for multi-GPU support. #6091 is a functional prototype including a test.

@masterleinad requested a review from ldh4 on August 29, 2023 03:32
@masterleinad force-pushed the cuda_multiple_devices_constructor branch from dba131b to fa1aaa7 on August 29, 2023 16:43
@masterleinad (Contributor, Author)

Only HIP-ROCm-5.6-C++20 is timing out.

@masterleinad (Contributor, Author)

CUDA-12.2-NVHPC was timing out. All other Cuda CI builds passed.

@tcclevenger (Contributor) left a comment

Just a question about the CUDA API wrappers: does it make sense to use them everywhere in core/src/Cuda (for consistency), or should we call the CUDA API directly when we want to set the device id manually (less complicated code, easier to interpret)?

I lean towards the former (consistency), since with the latter it becomes complicated to decide when you should or shouldn't use the wrappers, but I'm open to other opinions.

Edit: I see your comment in #6392 about other places that must manually set the device id and how that interacts with the wrappers. We could still direct all calls through CudaInternal::singleton() and manually set the device, but I'm not sure that would be my preference.

Three review threads on core/src/Cuda/Kokkos_Cuda_Instance.cpp were resolved (two marked outdated).
@masterleinad (Contributor, Author)

> We could still direct all calls through CudaInternal::singleton() and manually set the device, but I'm not sure that would be my preference.

I find it confusing to go through the singleton (possibly with template arguments so that it doesn't set the device) when the singleton itself isn't really what is being used. We could make the wrapper functions static (or global), but that makes them more awkward to use in places where we actually have a Cuda instance.

@masterleinad force-pushed the cuda_multiple_devices_constructor branch from 78f1134 to 1fcce69 on October 25, 2023 20:14
- KOKKOS_IMPL_CUDA_SAFE_CALL(
-     (Impl::CudaInternal::singleton().cuda_stream_create_wrapper(
-         &singleton_stream)));
+ KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamCreate(&singleton_stream));

@masterleinad (Contributor, Author)

I would like to get rid of all places where we use a wrapper with singleton.

Contributor

I think that makes sense based on your previous explanation. Should this be done in a separate PR? I don't mind creating that PR if you don't want to.

@masterleinad (Contributor, Author)

If someone requests splitting it from this pull request, I'll drop it. I would have a slight preference for getting all the other related pull requests in before systematically searching for this kind of guard, but ultimately I don't care much.

@masterleinad (Contributor, Author)

Only the hpx build is failing, with a clearly unrelated error:

/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:58:53: error: no member named 'intmax_t' in namespace 'std'
                    (sizeof(Integer) >= sizeof(std::intmax_t)),
                                               ~~~~~^
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:61:39: error: expected ';' after alias declaration
                        std::intmax_t>,
                                      ^
                                      ;
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:64:46: error: no member named 'intmax_t' in namespace 'std'
                        std::ptrdiff_t, std::intmax_t>>::type::type;
                                        ~~~~~^
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:64:55: error: expected member name or ';' after declaration specifiers
                        std::ptrdiff_t, std::intmax_t>>::type::type;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
4 errors generated.

We now see this error for all pull requests.

@masterleinad changed the title from "Cuda: Introduce constructor for multi-GPU support" to "Cuda: Detect device from stream for multi-GPU support" on Oct 27, 2023
@masterleinad dismissed crtrott’s stale review on October 30, 2023 15:21

This doesn't add full multi-GPU support and the pull request has changed significantly.

Three more review threads on core/src/Cuda/Kokkos_Cuda_Instance.cpp were resolved (two marked outdated).
@masterleinad added this to the Release 4.3 milestone on Nov 7, 2023
@masterleinad force-pushed the cuda_multiple_devices_constructor branch from 4332959 to d4a517f on November 7, 2023 21:16
@masterleinad (Contributor, Author)

All CUDA builds are passing.

Comment on lines +690 to +691
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device_id));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
@masterleinad (Contributor, Author)

I guess we could remove this, but it wouldn't really be related to the intent of this pull request.

Member

We can't use the wrapper here because it used to rely on the static m_cudaDev, which isn't static anymore.

@masterleinad (Contributor, Author)

Only openmptarget.partitioning_by_args is failing.

KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuStreamGetCtx(stream, &context)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxPushCurrent(context)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxGetDevice(&m_cudaDev)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_cudaDev));
Member

I couldn't find a better way of doing this (i.e., one that doesn't use the driver interface), but we should be OK as long as Kokkos::initialize was called. Otherwise, there could be issues with the context not having been created yet.
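For reference, the driver-API sequence quoted above can be read as the following standalone helper. This is only a sketch: error checking is omitted for brevity (the actual code wraps each call in KOKKOS_IMPL_CUDA_SAFE_CALL), and it assumes the driver and the stream's context have already been initialized, e.g. via Kokkos::initialize.

#include <cuda.h>
#include <cuda_runtime.h>

// Return the device that owns `stream` and make it current for the runtime API.
int device_id_from_stream(cudaStream_t stream) {
  // Every runtime-API stream belongs to a context; look that context up.
  CUcontext context;
  cuStreamGetCtx(stream, &context);
  // Make the context current so that we can ask which device it maps to.
  cuCtxPushCurrent(context);
  CUdevice device;  // CUdevice is an integer device ordinal
  cuCtxGetDevice(&device);
  // Mirror the device choice on the runtime-API side, as the quoted code does.
  cudaSetDevice(device);
  return device;
}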

reinterpret_cast<void **>(&d_arch), sizeof(int))));
KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper(
d_arch, &arch, sizeof(int), cudaMemcpyDefault)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(device_id));
Member

The singleton would query the wrong device, so we don't call through its wrapper functions here.

@crtrott merged commit f38553c into kokkos:develop on Dec 21, 2023
27 of 30 checks passed