
call ibv_reg_mr failed using mapped memory #266

Open

tangrc99 opened this issue Jun 20, 2023 · 7 comments
    // g_t is a gdr_t obtained from gdr_open(); gpu_mem_handle_t and
    // gpu_mem_alloc() come from gdrcopy/test/common.h
    gpu_mem_handle_t m_t;
    int ret;
    if ((ret = gpu_mem_alloc(&m_t, 10000, 1, 1)) != CUDA_SUCCESS) {
        return -1;
    }
    gdr_mh_t handle;
    char *gpu_mapped_mem = NULL;

    if ((ret = gdr_pin_buffer(g_t, m_t.ptr, m_t.allocated_size, 0, 0, &handle)) != 0) {
        return -1;
    }
    if ((ret = gdr_map(g_t, handle, (void **)&gpu_mapped_mem, m_t.allocated_size)) != 0) {
        return -1;
    }

    char *gdr_mem = gpu_mapped_mem;  // the pointer I try to register

I try to register gdr_mem using ibv_reg_mr, but it fails with errno EFAULT.
I am using an A10 GPU on CentOS 8.5.


drossetti commented Jun 20, 2023

@tangrc99 This is expected, as the implementation of ibv_reg_mr in the Linux kernel requires the virtual address range to be backed by CPU memory pages.

More precisely, pin_user_pages() does not work on CPU mappings of PCIe resources created via io_remap_pfn_range().

The official way of enabling RDMA on GPU memory is the dma-buf path: export the GPU virtual address range as a dma-buf file descriptor with cuMemGetHandleForAddressRange() and register that descriptor with ibv_reg_dmabuf_mr(), instead of registering the gdrcopy CPU mapping.

For a full deployment case, see for example https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/cuda/cuda_copy/cuda_copy_md.c#L410 and https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/ib/base/ib_md.c#L480.
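The dma-buf path above can be sketched roughly as follows. This is a sketch, not the UCX code: it assumes rdma-core with ibv_reg_dmabuf_mr(), a CUDA 11.7+ driver API, an existing protection domain `pd`, and a suitably page-aligned range; the helper name `register_gpu_dmabuf` is mine.

```c
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdint.h>
#include <unistd.h>

struct ibv_mr *register_gpu_dmabuf(struct ibv_pd *pd, CUdeviceptr ptr, size_t size)
{
    int dmabuf_fd = -1;

    /* Export the GPU VA range as a dma-buf file descriptor (CUDA 11.7+).
     * ptr/size should be host-page aligned. */
    if (cuMemGetHandleForAddressRange(&dmabuf_fd, ptr, size,
                                      CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD,
                                      0) != CUDA_SUCCESS)
        return NULL;

    /* Register the dma-buf with the RDMA device instead of calling
     * ibv_reg_mr() on a gdr_map() CPU mapping. */
    struct ibv_mr *mr = ibv_reg_dmabuf_mr(pd, 0 /* offset */, size,
                                          (uint64_t)ptr /* iova */, dmabuf_fd,
                                          IBV_ACCESS_LOCAL_WRITE |
                                          IBV_ACCESS_REMOTE_READ |
                                          IBV_ACCESS_REMOTE_WRITE);
    close(dmabuf_fd); /* the MR keeps its own reference to the dma-buf */
    return mr;
}
```

The fd can be closed once registration succeeds, since the MR pins the underlying buffer for its lifetime.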


tangrc99 commented Jun 20, 2023

Thanks. It seems the A10 doesn't support dma-buf file descriptors. Can I use GDR on the A10 with other methods?

@drossetti
Member

It should support it. Are you using the openrm (open source) variant of the GPU kernel-mode driver? See https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/.


tangrc99 commented Jun 21, 2023

The function cuMemGetHandleForAddressRange requires CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, and the dma-buf support attribute reports 0 on my A10. nv_peer_mem and nvidia-peermem are already loaded; is there any other requirement?


pakmarkthub commented Jun 21, 2023

Hi @tangrc99,

Neither nvidia-peermem nor nv_peer_mem is involved in dmabuf. The A10 should support dmabuf. Could you check whether your SW stack is new enough to support dmabuf?

  • NVIDIA driver with the open variant version 515 or later.
  • CUDA 11.7 or later.
  • Linux kernel version 5.12 or later. This is for the NIC stack. The GPU stack does not have this requirement.
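Given those requirements, dma-buf support can be probed from the CUDA driver API with the CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED attribute (added in CUDA 11.7). A minimal sketch, which needs a GPU and a driver new enough to define the attribute:

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    int supported = 0;

    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS)
        return 1;

    /* Reports 1 when the driver/kernel stack can export GPU memory
     * as a dma-buf, 0 otherwise. */
    cuDeviceGetAttribute(&supported, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev);
    printf("dma-buf supported: %d\n", supported);
    return 0;
}
```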

@tangrc99
Author

Thanks. My Linux kernel (4.18.0) is too old.


drossetti commented Jul 13, 2023

In that case you can use the legacy RDMA memory registration path, i.e. ibv_reg_mr, which involves the peer-direct kernel infrastructure (for example provided by MLNX_OFED) and nvidia-peermem.
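A rough sketch of that legacy path, assuming MLNX_OFED with the peer-direct infrastructure and the nvidia-peermem module loaded: the cudaMalloc'd device pointer itself is passed to ibv_reg_mr(), rather than a gdr_map() CPU mapping (the helper name `register_gpu_legacy` is mine).

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_legacy(struct ibv_pd *pd, size_t size)
{
    void *dptr = NULL;

    if (cudaMalloc(&dptr, size) != cudaSuccess)
        return NULL;

    /* With peer-direct + nvidia-peermem, the kernel recognizes the GPU
     * virtual address range and obtains the DMA mappings from the GPU
     * driver, so plain ibv_reg_mr() succeeds on the device pointer. */
    return ibv_reg_mr(pd, dptr, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```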
