
UCT/CUDA_IPC: Add VMM/mallocasync support over fabric handles #9787

Open · wants to merge 10 commits into master
Conversation

Akshay-Venkatesh (Contributor)

What

Allow VMM/mallocAsync allocations created by the user with FABRIC handles to take advantage of the cuda-ipc transport.
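
For context, a minimal sketch of the kind of user allocation this targets: VMM memory created through the CUDA driver API with CU_MEM_HANDLE_TYPE_FABRIC requested as the handle type. The device id, size, and lack of error handling below are illustrative assumptions, not part of this PR:

    #include <cuda.h>

    CUmemAllocationProp prop = {0};
    CUmemGenericAllocationHandle handle;
    size_t granularity, size = 1 << 20;                     /* illustrative size */

    prop.type                 = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type        = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id          = 0;                          /* assume device 0 */
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  /* exportable over fabric */

    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = (size + granularity - 1) / granularity * granularity;
    cuMemCreate(&handle, size, &prop, 0);
    /* followed by cuMemAddressReserve + cuMemMap + cuMemSetAccess as usual for VMM */

Buffers allocated this way (or from a mallocAsync memory pool created with fabric handle support) become eligible for the cuda-ipc transport with this change.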

Resolved review threads (outdated): config/m4/cuda.m4, configure.ac, src/uct/cuda/cuda_copy/cuda_copy_md.c, src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Akshay-Venkatesh (Contributor Author)

@rakhmets I had to squash the commits because the build was complaining about one of the commits' formatting. Sorry about that, but I've addressed your comments.

Akshay-Venkatesh (Contributor Author)

@yosefe / @brminich Are these actually wire compatibility issues here?

Akshay-Venkatesh (Contributor Author)

@yosefe / @brminich Do you think the errors here are coming from the PR?

Resolved review threads (outdated): src/uct/cuda/cuda_copy/cuda_copy_md.c, config/m4/cuda.m4, configure.ac
Review thread:

        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT)) {
        *cuda_mem_type = CU_MEMORYTYPE_DEVICE;
    } else {
        status = UCS_ERR_INVALID_ADDR;
Collaborator:
Please add an error message, e.g.:

ucs_error("invalid memory location type %u for host memory type for address %p",
          prop.location.type, address);

Contributor Author:
This isn't actually an error. It's just that we don't want cuda_copy UCT to handle it. So I don't think we should print this error.

Collaborator:
Maybe ucs_debug in this case. The other code paths that return an error have a debug log in this method.
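
For illustration only, a sketch of the debug-level variant; the message wording is an assumption, not the PR's actual code:

    } else {
        ucs_debug("unsupported memory location type %d for address %p",
                  prop.location.type, address);
        status = UCS_ERR_INVALID_ADDR;
    }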

Resolved review thread (outdated): src/uct/cuda/cuda_copy/cuda_copy_md.c
Review thread:

        return status;
    }

    if (cuda_mem_type != CU_MEMORYTYPE_DEVICE) {
Collaborator:
Maybe it would be useful to add an error message here as well, e.g.:

ucs_error("invalid memory type %u for address %p", cuda_mem_type, address);

Contributor Author:
I can add a trace if needed but it shouldn't be a ucs_error.

Collaborator:
I see. How about ucs_debug?
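
Purely illustrative, a debug-level message for this check might look like the following (wording assumed):

    ucs_debug("memory type %u for address %p is not CU_MEMORYTYPE_DEVICE",
              cuda_mem_type, address);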

Resolved review threads: src/uct/cuda/cuda_copy/cuda_copy_md.c, src/uct/cuda/cuda_ipc/cuda_ipc_cache.c, src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Akshay-Venkatesh modified the milestone: v1.12.1 (Apr 8, 2024)
Akshay-Venkatesh (Contributor Author)

@brminich @yosefe I've incorporated feedback from @rakhmets. I would appreciate reviews from you as well.

Review thread:

    if ((prop.location.type == CU_MEM_LOCATION_TYPE_HOST) ||
        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA) ||
        (prop.location.type == CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT)) {
        *cuda_mem_type = CU_MEMORYTYPE_DEVICE;
Contributor:
why do we modify it to device here?

Contributor Author:
@brminich This ensures that cuda_ipc considers buffers of this memory type for transfer over NVLink, for instance.

Comment on lines 347 to 351
    cu_err = cuCtxSetFlags(CU_CTX_SYNC_MEMOPS);
    if (cu_err != CUDA_SUCCESS) {
        ucs_warn("cuCtxSetFlags(CU_CTX_SYNC_MEMOPS) for %p error: %s",
                 address, uct_cuda_base_cu_get_error_string(cu_err));
    }
Contributor:
maybe use UCT_CUDADRV_FUNC here?

Contributor:
Why do we set the sync flag on the context for VMM only?

Contributor Author:
@brminich VMM memory cannot use the SYNC_MEMOPS pointer attribute, and the mechanism exposed for it is the API above.

Contributor:
OK, what about using UCT_CUDADRV_FUNC here?
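
As a sketch of that suggestion, keeping the warning log level of the original code (not necessarily the final PR code):

    UCT_CUDADRV_FUNC(cuCtxSetFlags(CU_CTX_SYNC_MEMOPS), UCS_LOG_LEVEL_WARN);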

Resolved review thread: src/uct/cuda/cuda_ipc/cuda_ipc_cache.c
Review thread on uct_cuda_ipc_open_memhandle_mempool:

    uct_cuda_ipc_open_memhandle_mempool(const uct_cuda_ipc_rkey_t *key,
                                        CUdeviceptr *mapped_addr)
    {
        CUmemoryPool *imported_mpool = 0;
Contributor:
Suggested change:
-    CUmemoryPool *imported_mpool = 0;
+    CUmemoryPool *imported_mpool = NULL;

Comment on lines 366 to 390
    ucs_status_t uct_cuda_ipc_close_memhandle(uct_cuda_ipc_cache_region_t *region)
    {
        ucs_status_t status;
    #if HAVE_CUDA_FABRIC
        if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM) {
            status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
                    (CUdeviceptr)region->mapped_addr, region->key.b_len));
            status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(
                    (CUdeviceptr)region->mapped_addr, region->key.b_len));
        } else if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_LEGACY) {
            status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
                    (CUdeviceptr)region->mapped_addr));
        } else {
            /* Ideally we call cuMemFreeAsync on imported pointer region here but
             * handles can be closed after device context has been destroyed which
             * would cause cuMemFreeAsync to fail for a given stream */
            status = UCS_OK;
        }
    #else
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
                (CUdeviceptr)region->mapped_addr));
    #endif

        return status;
    }
Contributor:

Suggested change:

    ucs_status_t uct_cuda_ipc_close_memhandle(uct_cuda_ipc_cache_region_t *region)
    {
        ucs_status_t status;
    #if HAVE_CUDA_FABRIC
        if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM) {
            status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
                    (CUdeviceptr)region->mapped_addr, region->key.b_len));
            status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(
                    (CUdeviceptr)region->mapped_addr, region->key.b_len));
        } else if (region->key.ph.handle_type != UCT_CUDA_IPC_KEY_HANDLE_TYPE_LEGACY) {
            /* Ideally we call cuMemFreeAsync on imported pointer region here but
             * handles can be closed after device context has been destroyed which
             * would cause cuMemFreeAsync to fail for a given stream */
            status = UCS_OK;
        }
    #endif
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuIpcCloseMemHandle(
                (CUdeviceptr)region->mapped_addr));
        return status;
    }

Comment on lines +565 to +569
#if HAVE_CUDA_FABRIC
uct_device_type_t dev_type = UCT_DEVICE_TYPE_NET;
#else
uct_device_type_t dev_type = UCT_DEVICE_TYPE_SHM;
#endif
Contributor:
what if we always define it as NET?

Contributor Author:
@brminich We could do that, but we should probably wait until rkey_unpack failures are handled gracefully, since cuda_ipc would then become eligible for all endpoints.

Resolved review thread: src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
Review thread:

        status = UCT_CUDADRV_FUNC(cuIpcGetMemHandle(&key->ph, (CUdeviceptr)addr),
                                  UCS_LOG_LEVEL_ERROR);
    -   if (UCS_OK != status) {
    +   if (status != UCS_OK) {
Contributor:
minor: no need to change

Review thread:

                                           alloc_handle->length, granularity,
                                           0, 0));
    if (status != UCS_OK) {
        return status;
Contributor:
We need to release the resources allocated earlier (here and in the other error flows of this function), i.e. use cuMemUnmap, cuMemAddressFree, etc.
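
For illustration, a minimal sketch of the goto-style cleanup pattern being asked for; the function name, labels, and device id 0 are hypothetical, not the PR's actual code:

    /* sketch: map a VMM allocation and unwind on any failure */
    static ucs_status_t map_vmm_handle_sketch(CUmemGenericAllocationHandle handle,
                                              size_t length, size_t granularity,
                                              CUdeviceptr *dptr_p)
    {
        CUmemAccessDesc access = {
            .location = {.type = CU_MEM_LOCATION_TYPE_DEVICE, .id = 0}, /* assume device 0 */
            .flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE
        };
        CUdeviceptr dptr;
        ucs_status_t status;

        status = UCT_CUDADRV_FUNC_LOG_ERR(
                cuMemAddressReserve(&dptr, length, granularity, 0, 0));
        if (status != UCS_OK) {
            return status;
        }

        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemMap(dptr, length, 0, handle, 0));
        if (status != UCS_OK) {
            goto err_free_addr;
        }

        status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemSetAccess(dptr, length, &access, 1));
        if (status != UCS_OK) {
            goto err_unmap;
        }

        *dptr_p = dptr;
        return UCS_OK;

    err_unmap:
        UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(dptr, length));
    err_free_addr:
        UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(dptr, length));
        return status;
    }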


Comment on lines +22 to +27
    typedef enum {
        UCT_CUDA_ALLOC_TYPE_FABRIC,
        UCT_CUDA_ALLOC_TYPE_PINNED,
        UCT_CUDA_ALLOC_TYPE_MANAGED,
        UCT_CUDA_ALLOC_TYPE_LAST = UCT_CUDA_ALLOC_TYPE_MANAGED
    } uct_cuda_alloc_type_t;
Contributor:
Maybe move it to cuda_copy_md.c?

Contributor Author:
I prefer to leave it here as it is referenced/used in cuda_ipc transport and having it in the header file makes it convenient.

Comment on lines +372 to +373
    status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
            (CUdeviceptr)region->mapped_addr, region->key.b_len));
Contributor:
Maybe add an assert, or a check that status is UCS_OK here, before reusing it.
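
One possible reading of that suggestion, shown as an illustrative sketch rather than the PR's final code:

    status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemUnmap(
            (CUdeviceptr)region->mapped_addr, region->key.b_len));
    if (status != UCS_OK) {
        return status; /* avoid overwriting the failure with the next call */
    }

    status = UCT_CUDADRV_FUNC_LOG_ERR(cuMemAddressFree(
            (CUdeviceptr)region->mapped_addr, region->key.b_len));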
