[CUDA] Support IPC for allocations created by cuMemCreate and cudaMallocAsync #7110

Open
vchuravy opened this issue Jul 17, 2021 · 5 comments

vchuravy commented Jul 17, 2021

Describe the bug

CUDA 10.2 introduced a new set of memory allocation routines (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA) which allow for pooled and stream-ordered allocation.

These allocations do not support cuIpcGetMemHandle, as noted in https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/:

The new CUDA virtual memory management functions do not support the legacy cuIpc* functions with their memory. Instead, they expose a new mechanism for interprocess communication that works better with each supported platform. This new mechanism is based on manipulating system–specific handles. On Windows, these are of type HANDLE or D3DKMT_HANDLE, while on Linux-based platforms, these are file descriptors.

To get one of these operating system–specific handles, the new function cuMemExportToShareableHandle is introduced. The appropriate request handle types must be passed to cuMemCreate. By default, memory is not exportable, so shareable handles are not available with the default properties.
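
For illustration (this sketch is not part of the original report), the mechanism described above might look roughly like the following on Linux, assuming an initialized driver context; error handling and the mapping of the allocation (cuMemAddressReserve/cuMemMap/cuMemSetAccess) are omitted:

#include <cuda.h>

/* Sketch: create an exportable allocation with cuMemCreate and turn it into
   an OS-specific shareable handle (a file descriptor on Linux). */
int export_cumem_allocation(int device, size_t requested, int *fd_out)
{
    CUmemAllocationProp prop = {0};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = device;
    /* Without requestedHandleTypes the allocation is not exportable and
       cuMemExportToShareableHandle will fail. */
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    /* Sizes passed to cuMemCreate must be a multiple of the allocation
       granularity for these properties. */
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((requested + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    /* The resulting file descriptor can be sent to another process (e.g. over
       a Unix domain socket) and imported there with
       cuMemImportFromShareableHandle. */
    cuMemExportToShareableHandle(fd_out, handle,
                                 CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);
    return 0;
}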

It seems that cudaMallocAsync, introduced in CUDA 11.2, uses this new interface under the hood, as the memory pool API (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g8158cc4b2c0d2c2c771f9d1af3cf386e) takes a HandleType (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gabde707dfb8a602b917e0b177f77f365).

Steps to Reproduce

See JuliaGPU/CUDA.jl#1053 for an application failure caused by this.

The error encountered is:

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
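
For context, a minimal stand-alone reproducer of this failure mode might look like the following (a sketch, not taken from the linked issue): allocate with the stream-ordered allocator from the default pool and then ask for a legacy IPC handle.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *p = NULL;
    cudaMallocAsync(&p, 1 << 20, 0);   /* default stream, default memory pool */
    cudaStreamSynchronize(0);

    CUipcMemHandle handle;
    CUresult res = cuIpcGetMemHandle(&handle, (CUdeviceptr)p);
    /* Expected to fail, presumably with return value 1
       (CUDA_ERROR_INVALID_VALUE) as in the output above, because
       stream-ordered allocations do not support the legacy cuIpc* interface. */
    printf("cuIpcGetMemHandle returned %d\n", (int)res);

    cudaFreeAsync(p, 0);
    return 0;
}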
@vchuravy vchuravy added the Bug label Jul 17, 2021
@vchuravy vchuravy changed the title [CUDA] Support IPC for allocations created by cuMemCreate [CUDA] Support IPC for allocations created by cuMemCreate and cudaMallocAsync Jul 17, 2021
Akshay-Venkatesh (Contributor) commented

@vchuravy Is staging cuMemCreate/MallocAsync allocations through cuMemAlloc/cudaMalloc memory not an option? Is it true that JuliaGPU/CUDA.jl#1053 strictly needs to use cuMemCreate/MallocAsync?

vchuravy (Author) commented

Two notes:

  1. I am only 90% sure that cudaMallocAsync uses cuMemCreate and had to infer that from the surrounding documentation.
  2. That rather highlights the point: the user doesn't necessarily know, or need to know, which allocation method was used.

From the perspective of CUDA.jl, we currently do not expose the different allocators to the user; the only choice is whether the memory pool is managed by CUDA.jl itself or by the driver via cudaMallocAsync.

The current workaround for users who want to use UCX or MPI is to disable the use of cudaMallocAsync. At the application level, staging through cudaMalloc might be a possibility as well, but it introduces additional complexity (dealing with provenance, e.g. who allocated the buffer and with which method, and allocating unnecessary temporary memory).

From my perspective as a user of MPI or UCX, I would like to see support for cudaMallocAsync, since those allocations can be IPC-capable.

There seem to be two relevant pointer attributes:

CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE
CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES
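
For reference, these could be queried with cuPointerGetAttribute along the following lines (a sketch, not from the issue; the exact output types for these attributes should be verified against the cuda.h of the toolkit in use, they are assumed here to be an int-sized boolean and a handle-type bitmask):

#include <cuda.h>
#include <stdio.h>

void print_ipc_capabilities(CUdeviceptr ptr)
{
    /* Assumed int-sized boolean: nonzero if the legacy cuIpc* path works. */
    int legacy_ipc_capable = 0;
    cuPointerGetAttribute(&legacy_ipc_capable,
                          CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE, ptr);

    /* Assumed bitmask of CUmemAllocationHandleType values allowed for the
       underlying allocation. */
    unsigned long long allowed_handle_types = 0;
    cuPointerGetAttribute(&allowed_handle_types,
                          CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, ptr);

    printf("legacy IPC capable: %d, allowed handle types: 0x%llx\n",
           legacy_ipc_capable, allowed_handle_types);
}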

vchuravy (Author) commented

This remains an issue (https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060/20?u=vchuravy) and we have to tell users to explicitly disable CUDA memory pool support.

jrhemstad commented

cudaMallocAsync supports CUDA IPC, but requires configuring an explicit pool handle.

See the "Interprocess communication support" section here: https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/

pentschev (Contributor) commented

From the discussions I had with @Akshay-Venkatesh, it seems that using an explicit pool handle for CUDA IPC may not be possible in UCX at the moment, but it will probably become possible in protov2. Meanwhile, support for cudaMallocAsync has been added in #8623. Given the lack of direct support for CUDA IPC, one intermediate solution is to use staging buffers by setting UCX_RNDV_FRAG_MEM_TYPE=cuda; in our preliminary performance tests in UCX-Py this reached about 90% of CUDA IPC performance with default CUDA pinned memory, with the advantage of being able to prevent fragmentation. We still have some open issues, though: #8639 and #8669 still prevent us from using async memory allocations for specific use cases.
