Allow using multiple CUDA devices #6091
Conversation
I don't get your point @masterleinad regarding the default device. We always set a device, don't we? So the default execution space instance is going to be associated with a device chosen by our common decision process (kokkos-device-id, map-by, etc.). That default execution space and its associated device need to be used for any operation that doesn't get an execution space instance passed.
In the end, we have (or could infer/pass) an execution space instance (or at least the correct device) in most cases anyway, so the question is really only about the cases where we don't. Of course, I see the failure in #5713, but we should really see how many places are left where we can't know the correct device to use. Off the top of my head, there would be
Force-pushed 616f483 to 2788732
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
```
Kokkos::Tools::Experimental::Impl::profile_fence_event<Kokkos::Cuda>(
    "Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync",
    Kokkos::Tools::Experimental::SpecialSynchronizationCases::
        DeepCopyResourceSynchronization,
    "Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync");
    [&]() { KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(s)); });
```
This was the only case where this overload of `cuda_stream_synchronize` was used, and it felt easier to just inline the call explicitly.
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
```
error_code = Impl::CudaInternal::singleton().cuda_malloc_async_wrapper(
    &ptr, arg_alloc_size);
```
I don't see a good reason to use the singleton here instead of the provided execution space instance. The default-constructed execution space instance corresponds to the singleton anyway.
core/src/Cuda/Kokkos_Cuda.hpp
Outdated
```
@@ -183,6 +183,8 @@ class Cuda {

  Cuda(cudaStream_t stream, bool manage_stream = false);

  Cuda(int device_id, cudaStream_t stream, bool manage_stream = false);
```
That's the proposed new constructor.
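A minimal usage sketch of the proposed constructor (assumes a CUDA-enabled Kokkos build with at least two GPUs; the kernel label is made up and error checking is elided):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    cudaStream_t stream;
    cudaSetDevice(1);           // the stream must be created on device 1
    cudaStreamCreate(&stream);

    // Proposed constructor: bind this instance to device 1 and the stream.
    Kokkos::Cuda cuda1(1, stream);

    Kokkos::parallel_for(
        "runs_on_device_1", Kokkos::RangePolicy<Kokkos::Cuda>(cuda1, 0, 128),
        KOKKOS_LAMBDA(int) {});
    cuda1.fence();

    cudaStreamDestroy(stream);  // manage_stream defaults to false
  }
  Kokkos::finalize();
}
```

Since `manage_stream` defaults to `false`, the caller remains responsible for destroying the stream after the instance is done with it.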
core/src/Cuda/Kokkos_Cuda_Error.hpp
Outdated
```
void cuda_stream_synchronize(
    const cudaStream_t stream,
    Kokkos::Tools::Experimental::SpecialSynchronizationCases reason,
    const std::string& name);
void cuda_device_synchronize(const std::string& name);
void cuda_stream_synchronize(const cudaStream_t stream,
                             const std::string& name);
```
Ultimately, these functions don't need to be exposed since they are only used internally. Also note that the last overload didn't have an implementation.
```
KOKKOS_IMPL_CUDA_SAFE_CALL(
    cudaMalloc(reinterpret_cast<void **>(&d_arch), sizeof(int)));
KOKKOS_IMPL_CUDA_SAFE_CALL(
    cudaMemcpy(d_arch, &arch, sizeof(int), cudaMemcpyDefault));
```
We shouldn't change the device here.
```
inline static std::set<int> cuda_devices = {};
inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice =
    {};
inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {};
inline static std::map<int, std::mutex> constantMemMutexPerDevice;
```
It's convenient to store the `cuda_devices` in a static variable so we can iterate over them. `std::set` might not be the most performant choice here, but it's convenient, and it shouldn't matter much since we don't use it in performance-critical code. Similarly, `std::map` is mostly convenient.
```
template <bool setCudaDevice = true>
cudaError_t cuda_device_synchronize_wrapper() const {
  if constexpr (setCudaDevice) set_cuda_device();
  return cudaDeviceSynchronize();
}
```
Not used anymore.
```
if constexpr (setCudaDevice) set_cuda_device();
return cudaStreamSynchronize(stream);
return cudaStreamSynchronize(get_stream<false>());
```
We always have a valid stream and that is the only one to use here.
```
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFuncGetAttributes(&attr, func));
```
It turned out to be problematic to call this function with the wrong device.
Again, the assumption is that all devices have the same properties, so it doesn't matter that this function uses a static variable.
The test is basically the same as TestCuda_InterOp_Streams.cpp, but duplicates all kernels to run interleaved on two different devices.
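A plain CUDA runtime sketch of that interleaved pattern (not the actual test code; assumes two GPUs and elides error checking):

```cpp
#include <cuda_runtime.h>

__global__ void fill(int* data, int value) { data[threadIdx.x] = value; }

int main() {
  int* ptr[2];
  cudaStream_t stream[2];
  for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);  // streams and allocations belong to a device
    cudaStreamCreate(&stream[dev]);
    cudaMalloc(&ptr[dev], 32 * sizeof(int));
  }
  // Interleave launches: each launch must target its stream's own device,
  // which is exactly the bookkeeping the execution space instance hides.
  for (int rep = 0; rep < 4; ++rep)
    for (int dev = 0; dev < 2; ++dev) {
      cudaSetDevice(dev);
      fill<<<1, 32, 0, stream[dev]>>>(ptr[dev], rep);
    }
  for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaStreamSynchronize(stream[dev]);
    cudaFree(ptr[dev]);
    cudaStreamDestroy(stream[dev]);
  }
  return 0;
}
```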
With this pull request, we still have
I think we can discuss fixing tasks later and
Force-pushed f3ac51a to d3a7cca
Force-pushed d3a7cca to 983702e
Force-pushed 983702e to 7d6e04a
Force-pushed 7f3df2a to ffe359c
Force-pushed c174a35 to b249700
Force-pushed b53ee3f to 8180a53
Based on top of #5989. This pull request explores what multi-device support could look like, using `TestCuda_InterOp_Streams`. The proposed constructor for the execution space instance is `Cuda(int device_id, cudaStream_t stream, bool manage_stream = false)`, assuming a `cudaStream_t` is provided as well. It shouldn't be a problem to allow omitting that argument, though.

Of course, the tricky part is sharing device allocations between devices. `deep_copy` should work independently of the device, and we can allocate on the correct device if an execution space instance is provided in the View allocation. For accessing allocations on a different device, the example uses `cudaDeviceEnablePeerAccess`, which comes at a cost, of course. In the end, I think the user should be responsible if they want cross-device access.

Of course, we need a corresponding pull request on the `desul` side if we decide that we care about supporting the arbitrary-size atomics on multiple GPUs within this pull request. The other changes in this pull request show that it's crucial to only set the device in #5989 if we actually know the device to use. Setting the device to the default is more harmful than useful.
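For illustration, a minimal peer-access sketch along those lines (plain CUDA runtime; assumes two peer-capable GPUs; error checking elided):

```cpp
#include <cuda_runtime.h>

int main() {
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, 0, 1);
  if (!can_access) return 0;  // GPUs are not peer-capable; nothing to show

  cudaSetDevice(1);
  int* on_dev1;
  cudaMalloc(&on_dev1, 256 * sizeof(int));

  cudaSetDevice(0);
  // The second argument is reserved and must be 0. After this call, kernels
  // launched on device 0 may dereference on_dev1 directly, at a bandwidth
  // and latency cost compared to device-local memory.
  cudaDeviceEnablePeerAccess(1, 0);

  cudaSetDevice(1);
  cudaFree(on_dev1);
  return 0;
}
```

This is why leaving cross-device access opt-in for the user seems reasonable: peer mappings are neither free nor universally available.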