
Conversation

@rakhmets
Contributor

@rakhmets commented Apr 3, 2025

What?

Added MPI+CUDA example.

@rakhmets force-pushed the topic/test-mpi-cuda branch 2 times, most recently from e50b97e to ff5b829 on April 3, 2025 18:55
@rakhmets force-pushed the topic/test-mpi-cuda branch 5 times, most recently from e0995a6 to e3ccd4c on April 4, 2025 16:00
@rakhmets force-pushed the topic/test-mpi-cuda branch from e3ccd4c to 9011fcd on April 4, 2025 16:33
@rakhmets marked this pull request as ready for review on April 4, 2025 16:51
@rakhmets force-pushed the topic/test-mpi-cuda branch from 9011fcd to a15d178 on April 7, 2025 09:37
@rakhmets added the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
@rakhmets force-pushed the topic/test-mpi-cuda branch from a15d178 to 4641483 on April 7, 2025 11:29
@rakhmets removed the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
Contributor

@Akshay-Venkatesh left a comment


@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with that memory.
  2. Using the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break from testing only the driver API.

The above two cases are supported by multi-GPU support, right? If so, will they be tested in separate tests?
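A minimal sketch of the first case, not taken from this PR (helper names, sizes, and the absence of error checking and cleanup are all illustrative): one thread retains the primary context and allocates device memory, and the main thread later performs the MPI exchange with no CUDA context bound.

```c
/* Illustrative sketch only: a worker thread binds the primary context and
 * allocates device memory; the main thread then issues MPI calls on that
 * memory without any context bound. Requires a CUDA-aware MPI and 2 ranks. */
#include <cuda.h>
#include <mpi.h>
#include <pthread.h>

static CUdeviceptr buf;
static CUcontext   primary_ctx;

static void *alloc_thread(void *arg)
{
    size_t size = *(size_t*)arg;
    CUdevice dev;

    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&primary_ctx, dev);
    cuCtxSetCurrent(primary_ctx);          /* context is bound to this thread only */
    cuMemAlloc(&buf, size);
    cuCtxSetCurrent(NULL);                 /* leave the thread without a context */
    return NULL;
}

int main(int argc, char **argv)
{
    size_t size = 4096;
    pthread_t tid;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cuInit(0);

    pthread_create(&tid, NULL, alloc_thread, &size);
    pthread_join(tid, NULL);

    /* The main thread has no CUDA context bound, yet the MPI library must
     * still recognize and handle the device memory. */
    if (rank == 0) {
        MPI_Send((void*)buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv((void*)buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```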

@rakhmets
Contributor Author

rakhmets commented Apr 10, 2025

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with that memory.
  2. Using the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break from testing only the driver API.

The above two cases are supported by multi-GPU support, right? If so, will they be tested in separate tests?

  1. test_alloc_prim_send_no does pretty much the same thing (from the perspective of the feature implementation): there is an active (retained) primary device context, but it is not bound to the thread at the moment of the MPI send/recv.
  2. I think it would be better to have a separate test for the CUDA Runtime API only, because some scenarios (e.g. creating a user context) are only valid for the Driver API.

I will add a separate test using the CUDA Runtime API (probably in another PR), and I will include the test case described in the first bullet in that new test.
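For reference, a runtime-API-only variant could look roughly like the sketch below (illustrative only, not the planned test; names and structure are assumptions):

```c
/* Illustrative runtime-API-only sketch: no explicit driver-API contexts,
 * just cudaSetDevice/cudaDeviceReset around the MPI exchange.
 * Requires a CUDA-aware MPI and 2 ranks; error checking omitted. */
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const size_t size = 4096;
    void *buf;
    int rank, dev_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&dev_count);
    cudaSetDevice(rank % dev_count);   /* implicitly creates the primary context */
    cudaMalloc(&buf, size);
    cudaMemset(buf, rank == 0 ? 0xab : 0x00, size);

    if (rank == 0) {
        MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    cudaDeviceReset();                 /* destroys the primary context */
    MPI_Finalize();
    return 0;
}
```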

export LD_LIBRARY_PATH=${ucx_inst}/lib:${MPI_HOME}/lib:${prev_LD_LIBRARY_PATH}

build release --disable-gtest --with-mpi
build release-mt --with-mpi
Contributor

I'd include asserts.

Contributor Author

Added --enable-assertions

if (__err != CUDA_SUCCESS) { \
const char *__err_str; \
cuGetErrorString(__err, &__err_str); \
fprintf(stderr, "test_mpi_cuda.c:%-3u %s failed: %d (%s)\n", \
Contributor

can you use __FILE__ instead?

Contributor Author

It will print the full path, which is not needed here. And I don't want to add code to extract the file name from __FILE__, since it overcomplicates simple code.

Contributor

will it be just something like below?

const char *filename = strrchr(__FILE__, '/');
filename = filename ? filename + 1 : __FILE__;

IMO it is better, since then there is no need to change the code if the file is renamed for some reason. Another alternative is to just not print a file name at all, because it is a single-file test anyway.

Contributor Author

Removed.
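For context, a self-contained sketch of what such an error-checking wrapper looks like once the file name is dropped (the exact macro in the test may differ in detail):

```c
/* Illustrative CUDA driver API error-checking macro: prints the line number,
 * the failing call, and the error string, then aborts. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CALL(_call) \
    do { \
        CUresult __err = (_call); \
        if (__err != CUDA_SUCCESS) { \
            const char *__err_str; \
            cuGetErrorString(__err, &__err_str); \
            fprintf(stderr, "line %-3u %s failed: %d (%s)\n", \
                    __LINE__, #_call, (int)__err, __err_str); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
```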

Comment on lines 70 to 71
CUdeviceptr d_send;
CUdeviceptr d_recv;
Contributor

minor: maybe use send_ptr and recv_ptr or use some other consistent names for CUdeviceptr variables

Contributor Author

Renamed the fields.

alloc_mem_send = allocator_send->alloc(size);
alloc_mem_recv = allocator_recv->alloc(size);

cuda_memcpy((void*)alloc_mem_send.ptr, gold_data, size);
Contributor

Maybe it is worth memsetting the recv buffer to zeros or something.

Contributor Author

Done.
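A minimal sketch of the zero-fill, assuming a driver-API allocation and the CUDA_CALL wrapper shown earlier (the helper name is hypothetical):

```c
/* Illustrative helper: allocate a device receive buffer and zero it so stale
 * contents cannot mask a failed transfer. */
static CUdeviceptr alloc_zeroed_recv_buf(size_t size)
{
    CUdeviceptr recv_ptr;

    CUDA_CALL(cuMemAlloc(&recv_ptr, size));
    CUDA_CALL(cuMemsetD8(recv_ptr, 0, size));  /* device-side memset to zero */
    return recv_ptr;
}
```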


cuda_memcpy((void*)alloc_mem_send.ptr, gold_data, size);

CUDA_CALL(cuCtxPopCurrent(&primary_ctx));
Contributor

can we add a check that there is no other context set currently?

Contributor Author

Added.
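A minimal sketch of such a check, assuming an assert-style verification rather than the exact code added in the PR:

```c
#include <assert.h>

/* Illustrative: after popping the primary context, verify that no other
 * context remains bound to the calling thread. */
static void pop_ctx_and_check(CUcontext *primary_ctx)
{
    CUcontext current;

    CUDA_CALL(cuCtxPopCurrent(primary_ctx));
    CUDA_CALL(cuCtxGetCurrent(&current));
    assert(current == NULL);  /* the thread must be left without a context */
}
```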

CUDA_CALL(cuDeviceGetCount(&dev_count));

for (i = 0; i < dev_count; ++i) {
CUDA_CALL(cuDeviceGet(&cu_dev_alloc, (i + rank) % dev_count));
Contributor

maybe iterate over all possible combinations instead?

Contributor Author

Left only one combination, since the test allocates on one GPU and sends using another. Iterating over all GPUs doesn't improve coverage.
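A minimal sketch of that combination, assuming allocation on one device while a different device's primary context is made current for the subsequent send (structure and names are illustrative, not the test's actual code):

```c
/* Illustrative: allocate on one device, then make another device's primary
 * context current before the MPI send, so the transfer path has to handle
 * memory owned by a non-current device. */
static CUdeviceptr alloc_on_other_device(int rank, size_t size)
{
    int dev_count;
    CUdevice dev_alloc, dev_send;
    CUcontext ctx_alloc, ctx_send;
    CUdeviceptr send_ptr;

    CUDA_CALL(cuDeviceGetCount(&dev_count));
    CUDA_CALL(cuDeviceGet(&dev_alloc, rank % dev_count));
    CUDA_CALL(cuDeviceGet(&dev_send, (rank + 1) % dev_count));

    /* Allocate while the allocating device's primary context is current. */
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx_alloc, dev_alloc));
    CUDA_CALL(cuCtxSetCurrent(ctx_alloc));
    CUDA_CALL(cuMemAlloc(&send_ptr, size));

    /* Switch to the other device's context for the MPI send that follows. */
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx_send, dev_send));
    CUDA_CALL(cuCtxSetCurrent(ctx_send));
    return send_ptr;
}
```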

@brminich
Contributor

brminich commented May 5, 2025

@rakhmets, can you please squash?

@rakhmets force-pushed the topic/test-mpi-cuda branch from c051257 to 4d06b0e on May 5, 2025 14:25
@brminich enabled auto-merge May 5, 2025 16:08
@brminich merged commit ae68e72 into openucx:master May 6, 2025
151 checks passed
@rakhmets deleted the topic/test-mpi-cuda branch July 18, 2025 14:17