Allow using multiple CUDA devices #6091

Closed
wants to merge 25 commits

Conversation

@masterleinad (Contributor) commented Apr 28, 2023

Built on top of #5989. This pull request explores what multi-device support could look like, using TestCuda_InterOp_Streams. The proposed constructor for the execution space instance is

Cuda(int device_id, cudaStream_t stream);

assuming a cudaStream_t is provided as well. It shouldn't be a problem to allow for omitting that argument, though.
Of course, the tricky part is sharing device allocations between devices. deep_copy should work independently of the device, and we can allocate on the correct device if an execution space instance is provided in the View allocation.
For accessing allocations on a different device, the example uses cudaDeviceEnablePeerAccess, which of course comes at a cost. In the end, I think the user should be responsible for opting in if they want cross-device access.
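
To make the intended workflow concrete, here is a minimal sketch of how the proposed constructor might be used with two devices. It assumes the constructor lands as proposed above; error checking is omitted, and the labels, sizes, and device ids are only for illustration.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // One stream per device, created via plain CUDA runtime calls
    // (error checking omitted for brevity).
    cudaStream_t stream0, stream1;
    cudaSetDevice(0);
    cudaStreamCreate(&stream0);
    cudaSetDevice(1);
    cudaStreamCreate(&stream1);

    // Proposed constructor: tie each execution space instance to a device.
    Kokkos::Cuda cuda0(0, stream0);
    Kokkos::Cuda cuda1(1, stream1);

    // Passing an execution space instance in the allocation lets Kokkos
    // allocate on the corresponding device.
    Kokkos::View<double*, Kokkos::CudaSpace> a(Kokkos::view_alloc(cuda0, "a"), 100);
    Kokkos::View<double*, Kokkos::CudaSpace> b(Kokkos::view_alloc(cuda1, "b"), 100);

    // deep_copy is expected to work independently of the devices involved.
    Kokkos::deep_copy(cuda0, a, 1.0);
    Kokkos::deep_copy(cuda1, b, a);

    // Only needed if a kernel running on device 1 dereferences device-0
    // memory directly; this is the opt-in, potentially costly part.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/0, /*flags=*/0);

    cuda0.fence();
    cuda1.fence();
    // (Stream cleanup omitted for brevity.)
  }
  Kokkos::finalize();
}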

Of course, we need a corresponding pull request on the desul side if we decide that we care about supporting the use of arbitrary-size atomics on multiple GPUs within this pull request.

The other changes in this pull request show that it's crucial to only set the device in #5989 if we actually know which device to use. Setting the device to the default is more harmful than useful.

@crtrott (Member) commented May 4, 2023

I don't get your point, @masterleinad, regarding the default device. We always set a device, don't we? So the default execution space instance is going to be associated with a device based on our common decision process (kokkos-device-id, map-by, etc.). That default execution space and the associated device need to be used for any operation which doesn't get an execution space instance passed.

@masterleinad (Contributor, Author)

> I don't get your point, @masterleinad, regarding the default device. We always set a device, don't we? So the default execution space instance is going to be associated with a device based on our common decision process (kokkos-device-id, map-by, etc.). That default execution space and the associated device need to be used for any operation which doesn't get an execution space instance passed.

In the end, we have (or could infer or pass) an execution space instance, or at least the correct device, in most cases anyway, so the question is really only about the remaining cases.
While looking into making a simple example work with multiple devices, I discovered places where setting the device to the default device would be wrong, see https://github.com/kokkos/kokkos/pull/6091/files/a44c11735437a56eab8205b92d3ce5f02637dd94..63270b8f1b9208e130f3fc160a039831b8f5f963, or where it just makes the implementation more difficult because we have to figure out at which point we actually set m_cudaDev (cuda_kernel_arch). A pretty annoying case was get_cuda_kernel_func_attributes, where we don't have the correct device available and just not changing the device would (likely) do the right thing.

Of course, I see the failure in #5713, but we should really check how many places are left where we can't know the correct device to use. Off the top of my head, there is Kokkos::fence(), which should then just call cudaDeviceSynchronize on all devices we know about.
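
For illustration, such a global fence might look roughly like the following. The helper name is hypothetical, and the device registry is assumed to be something like the static cuda_devices set added elsewhere in this pull request; this is a sketch, not the actual implementation.

#include <set>
#include <cuda_runtime.h>

// Hypothetical helper: synchronize every CUDA device Kokkos knows about
// and restore the caller's active device afterwards.
void fence_all_known_devices(const std::set<int>& devices) {
  int original_device;
  cudaGetDevice(&original_device);
  for (int device_id : devices) {
    cudaSetDevice(device_id);
    cudaDeviceSynchronize();
  }
  cudaSetDevice(original_device);
}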

Comment on lines 87 to 91
Kokkos::Tools::Experimental::Impl::profile_fence_event<Kokkos::Cuda>(
"Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync",
Kokkos::Tools::Experimental::SpecialSynchronizationCases::
DeepCopyResourceSynchronization,
"Kokkos::Impl::DeepCopyAsyncCuda: Deep Copy Stream Sync");
[&]() { KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(s)); });
masterleinad (Contributor, Author)

This was the only case where this overload of cuda_stream_synchronize was used and it felt easier to just inline the call explicitly.

Comment on lines 182 to 183
error_code = Impl::CudaInternal::singleton().cuda_malloc_async_wrapper(
&ptr, arg_alloc_size);
masterleinad (Contributor, Author)

I don't see a good reason to use the singleton here instead of the provided execution space instance. The default constructed execution space instance corresponds to the singleton anyway.
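
In other words, the suggestion is roughly the following. This is a sketch only; the accessor spelling and the exec_space variable name are assumptions based on the surrounding code, not the actual diff.

// Route the asynchronous allocation through the execution space instance
// that was passed in, rather than through the global singleton.
// 'exec_space' stands for the provided instance (hypothetical name).
error_code = exec_space.impl_internal_space_instance()->cuda_malloc_async_wrapper(
    &ptr, arg_alloc_size);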

@@ -183,6 +183,8 @@ class Cuda {

Cuda(cudaStream_t stream, bool manage_stream = false);

Cuda(int device_id, cudaStream_t stream, bool manage_stream = false);
masterleinad (Contributor, Author)

That's the proposed new constructor.
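
A hedged usage sketch follows; the manage_stream semantics are assumed to mirror the existing Cuda(cudaStream_t, bool) constructor, i.e. Kokkos takes over the stream's lifetime when manage_stream is true.

cudaStream_t stream;
cudaSetDevice(1);
cudaStreamCreate(&stream);

// Instance tied to device 1; with manage_stream = true the stream is
// assumed to be owned (and eventually destroyed) by Kokkos.
Kokkos::Cuda cuda1(1, stream, /*manage_stream=*/true);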

Comment on lines 30 to 32
void cuda_stream_synchronize(
const cudaStream_t stream,
Kokkos::Tools::Experimental::SpecialSynchronizationCases reason,
const std::string& name);
void cuda_device_synchronize(const std::string& name);
void cuda_stream_synchronize(const cudaStream_t stream,
const std::string& name);
masterleinad (Contributor, Author)

Ultimately, these functions don't need to be exposed since they are only used internally. Also note that the last overload didn't have an implementation.

Comment on lines 104 to 108
KOKKOS_IMPL_CUDA_SAFE_CALL(
cudaMalloc(reinterpret_cast<void **>(&d_arch), sizeof(int)));
KOKKOS_IMPL_CUDA_SAFE_CALL(
cudaMemcpy(d_arch, &arch, sizeof(int), cudaMemcpyDefault));

masterleinad (Contributor, Author)

We shouldn't change the device here.

Comment on lines +151 to +155
inline static std::set<int> cuda_devices = {};
inline static std::map<int, unsigned long*> constantMemHostStagingPerDevice =
{};
inline static std::map<int, cudaEvent_t> constantMemReusablePerDevice = {};
inline static std::map<int, std::mutex> constantMemMutexPerDevice;
masterleinad (Contributor, Author)

It's convenient to store cuda_devices in a static variable so we can iterate over them. std::set might not be the most performant choice here, but it is convenient, and it shouldn't matter much since we don't use it in performance-critical code. Similarly, std::map is mostly for convenience.

Comment on lines -253 to -229
template <bool setCudaDevice = true>
cudaError_t cuda_device_synchronize_wrapper() const {
if constexpr (setCudaDevice) set_cuda_device();
return cudaDeviceSynchronize();
}

masterleinad (Contributor, Author)

Not used anymore.

if constexpr (setCudaDevice) set_cuda_device();
return cudaStreamSynchronize(stream);
return cudaStreamSynchronize(get_stream<false>());
masterleinad (Contributor, Author)

We always have a valid stream and that is the only one to use here.

Comment on lines +139 to +140
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaSetDevice(cuda_device));
KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFuncGetAttributes(&attr, func));
masterleinad (Contributor, Author)

It turned out to be problematic to call this function with the wrong device active.
Again, the assumption is that all devices have the same properties, so it doesn't matter that this function uses a static variable.

masterleinad (Contributor, Author)

The test is basically the same as TestCuda_InterOp_Streams.cpp but duplicates all kernels so that they run interleaved on two different devices.
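
Schematically, the interleaving looks like this (a sketch, not the actual test code; cuda0/cuda1, view0/view1, and N are placeholders):

// Launch the same kernel on two devices back to back, then fence both.
Kokkos::parallel_for(
    Kokkos::RangePolicy<Kokkos::Cuda>(cuda0, 0, N),
    KOKKOS_LAMBDA(int i) { view0(i) = i; });
Kokkos::parallel_for(
    Kokkos::RangePolicy<Kokkos::Cuda>(cuda1, 0, N),
    KOKKOS_LAMBDA(int i) { view1(i) = i; });
cuda0.fence();
cuda1.fence();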

@masterleinad (Contributor, Author)

With this pull request, we still have

$ git grep -n "FIXME_CUDA_MULTIPLE_DEVICES"
core/src/Cuda/Kokkos_Cuda_Task.hpp:225:  // FIXME_CUDA_MULTIPLE_DEVICES
core/src/Cuda/Kokkos_Cuda_Task.hpp:465:  // FIXME_CUDA_MULTIPLE_DEVICES
core/src/Cuda/Kokkos_Cuda_ZeroMemset.hpp:39:    // FIXME_CUDA_MULTIPLE_DEVICES

I think we can discuss fixing tasks later, and ZeroMemset called without arguments just replaces running a functor on the default execution space instance anyway. Selecting the correct device matters here, though.

@masterleinad marked this pull request as ready for review on August 2, 2023 at 14:49.