-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cuda: Detect device from stream for multi-GPU support #6361
Cuda: Detect device from stream for multi-GPU support #6361
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see a test for this specifically I would like to see a CUDA specific test where we query the number of GPUs and create a std::thread for each GPU and then have them run independent kernels. As a sanity check I would like that test to be a kernel which uses the entire GPU and runs reasonably long (>100ms) and compare the total time with the n-threads with the individual time: i.e. making sure it overlaps. We probably need to guard that test somehow since its gonna be problematic inside of Trilinos testing or an environment which doesn't guard GPU overuse by multiple processes.
This really only adds the constructor and prepares |
069e2ac
to
e627b2e
Compare
dba131b
to
fa1aaa7
Compare
Only HIP-ROCm-5.6-C++20 is timing out. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question about using the cudaAPI wrapper: Does it make sense to use them everywhere in core/src/Cuda
(for consistency), or should we directly call cudaAPI when we want to manually set the device id (less complicated code, easier to interpret)?
I lean towards the former (consistency) since I think with the latter it then becomes complicated as to when you should or shouldn't use them, but I'm open to other opinions.
Edit: I see your comment in #6392 about other places that must manually set the device id and the interaction with the wrapper. We still could direct all calls through the CudaInternal::singleton() and manually set the device, but I'm not as convinced that would be my opinion.
I find it confusing to use the singleton (possibly with template arguments to not set the device) if it's not used. We could make the wrapper functions |
e489742
to
e156d58
Compare
ec133fd
to
78f1134
Compare
78f1134
to
1fcce69
Compare
KOKKOS_IMPL_CUDA_SAFE_CALL( | ||
(Impl::CudaInternal::singleton().cuda_stream_create_wrapper( | ||
&singleton_stream))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamCreate(&singleton_stream)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to get rid of all places where we use a wrapper with singleton
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense based on your previous explanation. Should this be done in a separate PR? I don't mind creating that PR if you don't want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone requests to split it from this pull request, I'll drop it. I would have a slight preference to get all the other related pull requests in before systematically searching for this kind of guard but ultimately I don't care much.
Only
that we now see for all pull requests. |
This doesn't add full multi-GPU support and the pull request has changed significantly.
4332959
to
d4a517f
Compare
All |
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device_id)); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could remove but it wouldn't really be related to the intent of this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't use the wrapper here because it used to use the static m_cudaDev which isn't static anymore
Only |
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuStreamGetCtx(stream, &context))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxPushCurrent(context))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaError_t(cuCtxGetDevice(&m_cudaDev))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(m_cudaDev)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't find a better way of doing this (i.e. not using driver interface) but we should be ok as long as Kokkos::initialize was called. Otherwise there could be issues with the context not being created yet.
reinterpret_cast<void **>(&d_arch), sizeof(int)))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL((CudaInternal::singleton().cuda_memcpy_wrapper( | ||
d_arch, &arch, sizeof(int), cudaMemcpyDefault))); | ||
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(device_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the singleton would query the wrong device so not call through its wrapper functions.
Part of #6091.
We agreed thatis the construction we want (without an option to letKokkos
manage the stream).We decided to query the device id to use from the stream passed in.