Cuda: Detect device from stream for multi-GPU support #6361

Merged: 14 commits merged into kokkos:develop from the cuda_multiple_devices_constructor branch on Dec 21, 2023

Conversation

@masterleinad (Contributor) commented on Aug 16, 2023

Part of #6091. We agreed that

Cuda(int device_id, cudaStream_t stream);

is the constructor we want (without an option to let Kokkos manage the stream).

We decided to query the device id to use from the stream passed in.
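For illustration, here is a minimal usage sketch of the stream-only construction implied by that decision (a hypothetical example: the device number and the kernel body are placeholders). The user creates a stream on a particular device and hands only that stream to Kokkos, which detects the device from it.

#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Create a stream on device 1 (assumes at least two GPUs are visible).
    cudaSetDevice(1);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    {
      // The device id is queried from the stream; no device argument is passed.
      Kokkos::Cuda exec(stream);

      Kokkos::parallel_for(
          Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, 1000),
          KOKKOS_LAMBDA(const int /*i*/) {});
      exec.fence();
    }
    cudaStreamDestroy(stream);
  }
  Kokkos::finalize();
  return 0;
}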

@masterleinad marked this pull request as ready for review on August 17, 2023 02:24
@crtrott (Member) left a comment

I would like to see a test for this. Specifically, I would like to see a CUDA-specific test where we query the number of GPUs, create a std::thread for each GPU, and have each thread run an independent kernel. As a sanity check, that test should use a kernel that occupies the entire GPU and runs reasonably long (>100 ms), and it should compare the total time with n threads against the individual times, i.e. verify that the kernels actually overlap. We probably need to guard that test somehow, since it is going to be problematic inside Trilinos testing or in any environment that doesn't guard against GPU oversubscription by multiple processes.
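For concreteness, a rough sketch of such a test might look like the following. This is a hypothetical example, not the test that was eventually added: the workload size, the inner loop, and the timing comparison are placeholders, and it assumes the stream-based Kokkos::Cuda construction discussed in this pull request.

#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Run a long kernel (intended to take well over 100 ms) on the given device.
void run_on_device(int device) {
  cudaSetDevice(device);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  {
    Kokkos::Cuda exec(stream);  // device id is detected from the stream
    double result = 0.;
    Kokkos::parallel_reduce(
        Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, 1 << 24),
        KOKKOS_LAMBDA(const int i, double& sum) {
          double x = static_cast<double>(i);
          for (int k = 0; k < 1000; ++k) x = x * 1.0000001 + 1.;
          sum += x;
        },
        result);
    exec.fence();
  }
  cudaStreamDestroy(stream);
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);

    // Time one device alone.
    auto t0 = std::chrono::steady_clock::now();
    run_on_device(0);
    double t_single =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // Time all devices running concurrently, one std::thread per GPU.
    t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int d = 0; d < n_devices; ++d) threads.emplace_back(run_on_device, d);
    for (auto& t : threads) t.join();
    double t_all =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // If the kernels overlap, the concurrent run should take much less than
    // n_devices * t_single (the exact acceptance factor would need tuning).
    std::printf("single device: %f s, %d devices concurrently: %f s\n",
                t_single, n_devices, t_all);
  }
  Kokkos::finalize();
  return 0;
}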

@masterleinad (Contributor, Author)

This really only adds the constructor and prepares CudaInternal for multi-GPU support. #6091 is a functional prototype including a test.

@masterleinad requested a review from ldh4 on August 29, 2023 03:32
@masterleinad force-pushed the cuda_multiple_devices_constructor branch from dba131b to fa1aaa7 on August 29, 2023 16:43
@masterleinad (Contributor, Author)

Only HIP-ROCm-5.6-C++20 is timing out.

@masterleinad (Contributor, Author)

CUDA-12.2-NVHPC was timing out. All other Cuda CI builds passed.

@tcclevenger (Contributor) left a comment

Just a question about the CUDA API wrappers: does it make sense to use them everywhere in core/src/Cuda (for consistency), or should we call the CUDA API directly when we want to set the device id manually (less complicated code, easier to interpret)?

I lean towards the former (consistency), since with the latter it becomes complicated to decide when you should or shouldn't use the wrappers, but I'm open to other opinions.

Edit: I see your comment in #6392 about other places that must manually set the device id and how that interacts with the wrappers. We could still direct all calls through CudaInternal::singleton() and manually set the device, but I'm not sure that would be my preference.

Three review threads on core/src/Cuda/Kokkos_Cuda_Instance.cpp were resolved (two marked outdated).
@masterleinad (Contributor, Author)

> We could still direct all calls through CudaInternal::singleton() and manually set the device, but I'm not sure that would be my preference.

I find it confusing to go through the singleton (possibly with template arguments so that it doesn't set the device) when the singleton itself isn't really what is being used. We could make the wrapper functions static (or global), but that makes them more awkward to use in places where we actually have a Cuda instance.

@masterleinad force-pushed the cuda_multiple_devices_constructor branch from 78f1134 to 1fcce69 on October 25, 2023 20:14
- KOKKOS_IMPL_CUDA_SAFE_CALL(
-     (Impl::CudaInternal::singleton().cuda_stream_create_wrapper(
-         &singleton_stream)));
+ KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamCreate(&singleton_stream));

@masterleinad (Contributor, Author)

I would like to get rid of all places where we use a wrapper with singleton.

Contributor

I think that makes sense based on your previous explanation. Should this be done in a separate PR? I don't mind creating that PR if you don't want to.

@masterleinad (Contributor, Author)

If someone requests splitting it from this pull request, I'll drop it. I would have a slight preference for getting all the other related pull requests in before systematically searching for this kind of guard, but ultimately I don't care much.

@masterleinad (Contributor, Author)

Only the hpx build is failing, with a clearly unrelated error:

/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:58:53: error: no member named 'intmax_t' in namespace 'std'
                    (sizeof(Integer) >= sizeof(std::intmax_t)),
                                               ~~~~~^
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:61:39: error: expected ';' after alias declaration
                        std::intmax_t>,
                                      ^
                                      ;
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:64:46: error: no member named 'intmax_t' in namespace 'std'
                        std::ptrdiff_t, std::intmax_t>>::type::type;
                                        ~~~~~^
/home/runner/work/kokkos/kokkos/hpx/libs/core/iterator_support/include/hpx/iterator_support/counting_iterator.hpp:64:55: error: expected member name or ';' after declaration specifiers
                        std::ptrdiff_t, std::intmax_t>>::type::type;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
4 errors generated.

We now see this error for all pull requests.

@masterleinad changed the title from "Cuda: Introduce constructor for multi-GPU support" to "Cuda: Detect device from stream for multi-GPU support" on Oct 27, 2023
@masterleinad dismissed crtrott’s stale review on October 30, 2023 15:21

This doesn't add full multi-GPU support and the pull request has changed significantly.

Three more review threads on core/src/Cuda/Kokkos_Cuda_Instance.cpp were resolved (two marked outdated).
@masterleinad added this to the Release 4.3 milestone on Nov 7, 2023
@masterleinad force-pushed the cuda_multiple_devices_constructor branch from 4332959 to d4a517f on November 7, 2023 21:16
@masterleinad (Contributor, Author)

All CUDA builds are passing.

Comment on lines +690 to +691
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device_id));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize());
@masterleinad (Contributor, Author)

I guess we could remove this, but it wouldn't really be related to the intent of this pull request.

Member

We can't use the wrapper here because it used to rely on the static m_cudaDev, which isn't static anymore.

@masterleinad (Contributor, Author)

Only openmptarget.partitioning_by_args is failing.

KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuStreamGetCtx(stream, &context)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxPushCurrent(context)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxGetDevice(&m_cudaDev)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_cudaDev));
Member

I couldn't find a better way of doing this (i.e., one that doesn't use the driver interface), but we should be OK as long as Kokkos::initialize was called. Otherwise, there could be issues with the context not having been created yet.
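For reference, the driver-API sequence quoted above can be read as the following standalone helper. This is only a sketch: error checking is omitted for brevity (the actual code wraps each call in KOKKOS_IMPL_CUDA_SAFE_CALL), and it assumes the driver and the stream's context have already been initialized, e.g. via Kokkos::initialize.

#include <cuda.h>
#include <cuda_runtime.h>

// Return the device that owns `stream` and make it current for the runtime API.
int device_id_from_stream(cudaStream_t stream) {
  // Every runtime-API stream belongs to a context; look that context up.
  CUcontext context;
  cuStreamGetCtx(stream, &context);
  // Make the context current so that we can ask which device it maps to.
  cuCtxPushCurrent(context);
  CUdevice device;  // CUdevice is an integer device ordinal
  cuCtxGetDevice(&device);
  // Mirror the device choice on the runtime-API side, as the quoted code does.
  cudaSetDevice(device);
  return device;
}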

reinterpret_cast<void **>(&d_arch), sizeof(int))));
KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper(
d_arch, &arch, sizeof(int), cudaMemcpyDefault)));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(device_id));
Member

The singleton would query the wrong device, so we don't call through its wrapper functions here.

@crtrott merged commit f38553c into kokkos:develop on Dec 21, 2023
27 of 30 checks passed