UCT/CUDA_IPC: Add VMM/mallocasync support over fabric handles #9787
base: master
Conversation
Force-pushed from 93e718c to 9acd2de
@rakhmets I had to squash commits because the build was complaining about one of the commits' formatting. Sorry about that, but I've addressed your comments.
        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT)) {
        *cuda_mem_type = CU_MEMORYTYPE_DEVICE;
    } else {
        status = UCS_ERR_INVALID_ADDR;
Please add an error message, e.g.:
ucs_error("invalid memory location type %u for host memory type for address %p",
prop.location.type, address);
This isn't actually an error. It's just that we don't want cuda_copy UCT to handle it. So I don't think we should print this error.
Maybe ucs_debug in this case. Other code paths returning an error have a debug log in this method.
        return status;
    }

    if (cuda_mem_type != CU_MEMORYTYPE_DEVICE) {
Maybe it would be useful to add an error message here as well, e.g.:
ucs_error("invalid memory type %u for address %p", cuda_mem_type, address);
I can add a trace if needed but it shouldn't be a ucs_error.
I see. How about ucs_debug?
    if ((prop.location.type == CU_MEM_LOCATION_TYPE_HOST) ||
        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA) ||
        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT)) {
        *cuda_mem_type = CU_MEMORYTYPE_DEVICE;
why do we modify it to device here?
@brminich This ensures that cuda_ipc considers buffers of this memory type to be transferred over nvlinks for instance.
    cu_err = cuCtxSetFlags(CU_CTX_SYNC_MEMOPS);
    if (cu_err != CUDA_SUCCESS) {
        ucs_warn("cuCtxSetFlags(CU_CTX_SYNC_MEMOPS) for %p error: %s",
                 address, uct_cuda_base_cu_get_error_string(cu_err));
    }
maybe use UCT_CUDADRV_FUNC here?
why do we set sync flag for the context for vmm only?
@brminich VMM memory cannot use the SYNC_MEMOPS pointer attribute, and the mechanism exposed for it is the above API.
ok, what about using UCT_CUDADRV_FUNC here?
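For context on the distinction being discussed: legacy allocations can opt into synchronous memory operations per pointer, while VMM mappings reject that attribute, leaving only the context-wide flag. The sketch below is illustrative only (not the PR's code) and assumes a CUDA version that provides cuCtxSetFlags (12.1 or newer); error handling is omitted.

```c
#include <cuda.h>

/* Sketch: the two ways to request synchronous memory operations.
 * enable_sync_memops and is_vmm are hypothetical names for illustration. */
static void enable_sync_memops(CUdeviceptr ptr, int is_vmm)
{
    if (!is_vmm) {
        /* Legacy (cudaMalloc/cuMemAlloc) memory: per-pointer attribute. */
        unsigned int value = 1;
        cuPointerSetAttribute(&value, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, ptr);
    } else {
        /* VMM mappings do not accept the pointer attribute; the only
         * available knob is the context-wide flag. */
        cuCtxSetFlags(CU_CTX_SYNC_MEMOPS);
    }
}
```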
uct_cuda_ipc_open_memhandle_mempool(const uct_cuda_ipc_rkey_t *key,
                                    CUdeviceptr *mapped_addr)
{
    CUmemoryPool *imported_mpool = 0;
Suggested change:
    CUmemoryPool *imported_mpool = NULL;
ucs_status_t uct_cuda_ipc_close_memhandle(uct_cuda_ipc_cache_region_t *region)
{
    ucs_status_t status;
#if HAVE_CUDA_FABRIC
    if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM) {
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
                (CUdeviceptr)region->mapped_addr, region->key.b_len));
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(
                (CUdeviceptr)region->mapped_addr, region->key.b_len));
    } else if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_LEGACY) {
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
                (CUdeviceptr)region->mapped_addr));
    } else {
        /* Ideally we call cuMemFreeAsync on imported pointer region here but
         * handles can be closed after device context has been destroyed which
         * would cause cuMemFreeAsync to fail for a given stream */
        status = UCS_OK;
    }
#else
    status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
            (CUdeviceptr)region->mapped_addr));
#endif

    return status;
}
Suggested change:
ucs_status_t uct_cuda_ipc_close_memhandle(uct_cuda_ipc_cache_region_t *region)
{
    ucs_status_t status;
#if HAVE_CUDA_FABRIC
    if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM) {
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
                (CUdeviceptr)region->mapped_addr, region->key.b_len));
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(
                (CUdeviceptr)region->mapped_addr, region->key.b_len));
    } else if (region->key.ph.handle_type != UCT_CUDA_IPC_KEY_HANDLE_TYPE_LEGACY) {
        /* Ideally we call cuMemFreeAsync on imported pointer region here but
         * handles can be closed after device context has been destroyed which
         * would cause cuMemFreeAsync to fail for a given stream */
        status = UCS_OK;
    }
#endif
    status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
            (CUdeviceptr)region->mapped_addr));
    return status;
}
#if HAVE_CUDA_FABRIC
    uct_device_type_t dev_type = UCT_DEVICE_TYPE_NET;
#else
    uct_device_type_t dev_type = UCT_DEVICE_TYPE_SHM;
#endif
what if we always define it as NET?
@brminich We could do that but we should probably wait until rkey_unpack failures are gracefully handled as cuda_ipc would become eligible for all endpoints.
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated)
    status = UCT_CUDADRV_FUNC(cuIpcGetMemHandle(&key->ph, (CUdeviceptr)addr),
                              UCS_LOG_LEVEL_ERROR);
-   if (UCS_OK != status) {
+   if (status != UCS_OK) {
minor: no need to change
                                      alloc_handle->length, granularity,
                                      0, 0));
    if (status != UCS_OK) {
        return status;
we need to release the resources allocated before (here and in other error flows of this function), i.e. use cuMemUnmap, cuMemAddressFree, etc.
typedef enum {
    UCT_CUDA_ALLOC_TYPE_FABRIC,
    UCT_CUDA_ALLOC_TYPE_PINNED,
    UCT_CUDA_ALLOC_TYPE_MANAGED,
    UCT_CUDA_ALLOC_TYPE_LAST = UCT_CUDA_ALLOC_TYPE_MANAGED
} uct_cuda_alloc_type_t;
maybe move it to cuda_copy_md.c?
I prefer to leave it here as it is referenced/used in cuda_ipc transport and having it in the header file makes it convenient.
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
                (CUdeviceptr)region->mapped_addr, region->key.b_len));
maybe add an assert, or check that status is UCS_OK here, before reusing it
What
Allow VMM/mallocAsync memory created by the user with FABRIC handles to take advantage of the cuda_ipc transport.
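For readers unfamiliar with the feature, the following is a rough sketch of how a user might create VMM memory that is exportable over a fabric handle. It is illustrative only, not code from this PR: it assumes a CUDA toolkit and fabric-capable system exposing CU_MEM_HANDLE_TYPE_FABRIC, and omits error handling and granularity rounding.

```c
#include <cuda.h>

/* Sketch: allocate device memory via the VMM API with a fabric-exportable
 * handle type. alloc_fabric_vmm is a hypothetical helper name. */
static CUdeviceptr alloc_fabric_vmm(size_t size, int dev)
{
    CUmemAllocationProp          prop   = {0};
    CUmemAccessDesc              access = {0};
    CUmemGenericAllocationHandle handle;
    CUdeviceptr                  ptr;

    prop.type                 = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type        = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id          = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    /* Create the physical allocation, reserve a VA range, and map it. */
    cuMemCreate(&handle, size, &prop, 0);
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    /* Grant read/write access from the owning device. */
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    return ptr;
}
```

Memory created this way (or from a mallocAsync pool with the same handle type) is what this PR teaches the cuda_ipc transport to open on the remote side.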