
Conversation

@rakhmets
Contributor

@rakhmets commented Apr 3, 2025

What?

Added MPI+CUDA example.

@rakhmets force-pushed the topic/test-mpi-cuda branch 2 times, most recently from e50b97e to ff5b829 on April 3, 2025 18:55
@rakhmets force-pushed the topic/test-mpi-cuda branch 5 times, most recently from e0995a6 to e3ccd4c on April 4, 2025 16:00
@rakhmets force-pushed the topic/test-mpi-cuda branch from e3ccd4c to 9011fcd on April 4, 2025 16:33
@rakhmets marked this pull request as ready for review on April 4, 2025 16:51
@rakhmets force-pushed the topic/test-mpi-cuda branch from 9011fcd to a15d178 on April 7, 2025 09:37
@rakhmets added the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
@rakhmets force-pushed the topic/test-mpi-cuda branch from a15d178 to 4641483 on April 7, 2025 11:29
@rakhmets removed the WIP-DNM (Work in progress / Do not review) label on Apr 7, 2025
Contributor

@Akshay-Venkatesh left a comment


@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with that memory.
  2. Using the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break from testing only the driver API.

The above two cases are supported by multi-GPU support, right? If so, will they be tested in separate tests?
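A minimal sketch of the first case, not taken from this PR (helper names, sizes, and the absence of error checking and cleanup are all illustrative): one thread retains the primary context and allocates device memory, and the main thread later performs the MPI exchange with no CUDA context bound.

```c
/* Illustrative sketch only: a worker thread binds the primary context and
 * allocates device memory; the main thread then issues MPI calls on that
 * memory without any context bound. Requires a CUDA-aware MPI and 2 ranks. */
#include <cuda.h>
#include <mpi.h>
#include <pthread.h>

static CUdeviceptr buf;
static CUcontext   primary_ctx;

static void *alloc_thread(void *arg)
{
    size_t size = *(size_t*)arg;
    CUdevice dev;

    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&primary_ctx, dev);
    cuCtxSetCurrent(primary_ctx);          /* context is bound to this thread only */
    cuMemAlloc(&buf, size);
    cuCtxSetCurrent(NULL);                 /* leave the thread without a context */
    return NULL;
}

int main(int argc, char **argv)
{
    size_t size = 4096;
    pthread_t tid;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cuInit(0);

    pthread_create(&tid, NULL, alloc_thread, &size);
    pthread_join(tid, NULL);

    /* The main thread has no CUDA context bound, yet the MPI library must
     * still recognize and handle the device memory. */
    if (rank == 0) {
        MPI_Send((void*)buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv((void*)buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```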

@rakhmets
Contributor Author

rakhmets commented Apr 10, 2025

@rakhmets Overall the tests stress multi-GPU support in a good way.

I see the following not being tested:

  1. The case where one thread has a device context bound to it and allocates device memory, and another thread later issues the MPI operations with that memory.
  2. Using the cudaSetDevice/cudaDeviceReset runtime API instead of explicitly using ctxRetain/Release. This is the API more commonly exercised by high-level applications, so it would be good to ensure that no cases break from testing only the driver API.

The above two cases are supported by multi-GPU support, right? If so, will they be tested in separate tests?

  1. test_alloc_prim_send_no does pretty much the same thing (from the perspective of the feature implementation): there is an active (retained) primary device context, but it is not bound to the thread at the moment of the MPI send/recv.
  2. I think it would be better to have a separate test for the CUDA Runtime API only, because some scenarios (e.g. creating a user context) are only valid for the Driver API.

I will add a separate test using the CUDA Runtime API (probably in another PR), and I will include the test case described in the first bullet in that new test.
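For reference, a runtime-API-only variant could look roughly like the sketch below (illustrative only, not the planned test; names and structure are assumptions):

```c
/* Illustrative runtime-API-only sketch: no explicit driver-API contexts,
 * just cudaSetDevice/cudaDeviceReset around the MPI exchange.
 * Requires a CUDA-aware MPI and 2 ranks; error checking omitted. */
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const size_t size = 4096;
    void *buf;
    int rank, dev_count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&dev_count);
    cudaSetDevice(rank % dev_count);   /* implicitly creates the primary context */
    cudaMalloc(&buf, size);
    cudaMemset(buf, rank == 0 ? 0xab : 0x00, size);

    if (rank == 0) {
        MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    cudaDeviceReset();                 /* destroys the primary context */
    MPI_Finalize();
    return 0;
}
```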

export LD_LIBRARY_PATH=${ucx_inst}/lib:${MPI_HOME}/lib:${prev_LD_LIBRARY_PATH}

build release --disable-gtest --with-mpi
build release-mt --with-mpi
Contributor

I'd include asserts.

Contributor Author

Added --enable-assertions

if (__err != CUDA_SUCCESS) { \
const char *__err_str; \
cuGetErrorString(__err, &__err_str); \
fprintf(stderr, "test_mpi_cuda.c:%-3u %s failed: %d (%s)\n", \
Contributor

can you use __FILE__ instead?

Contributor Author

It will print the full path, which is not needed here. And I don't want to add code to extract the file name from __FILE__, since it overcomplicates simple code.

Contributor

will it be just something like below?

const char *filename = strrchr(__FILE__, '/');
filename = filename ? filename + 1 : __FILE__;

IMO it is better, since then there is no need to change the code if the file is renamed for some reason. Another alternative is to just not print a file name at all, because it is a single-file test anyway.

Contributor Author

Removed.
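For context, a self-contained sketch of what such an error-checking wrapper looks like once the file name is dropped (the exact macro in the test may differ in detail):

```c
/* Illustrative CUDA driver API error-checking macro: prints the line number,
 * the failing call, and the error string, then aborts. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CALL(_call) \
    do { \
        CUresult __err = (_call); \
        if (__err != CUDA_SUCCESS) { \
            const char *__err_str; \
            cuGetErrorString(__err, &__err_str); \
            fprintf(stderr, "line %-3u %s failed: %d (%s)\n", \
                    __LINE__, #_call, (int)__err, __err_str); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
```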

Comment on lines 70 to 71
CUdeviceptr d_send;
CUdeviceptr d_recv;
Contributor

minor: maybe use send_ptr and recv_ptr or use some other consistent names for CUdeviceptr variables

Contributor Author

Renamed the fields.

alloc_mem_send = allocator_send->alloc(size);
alloc_mem_recv = allocator_recv->alloc(size);

cuda_memcpy((void*)alloc_mem_send.ptr, gold_data, size);
Contributor

Maybe it is worth memsetting the recv buffer to zeros or something.

Contributor Author

Done.
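A minimal sketch of the zero-fill, assuming a driver-API allocation and the CUDA_CALL wrapper shown earlier (the helper name is hypothetical):

```c
/* Illustrative helper: allocate a device receive buffer and zero it so stale
 * contents cannot mask a failed transfer. */
static CUdeviceptr alloc_zeroed_recv_buf(size_t size)
{
    CUdeviceptr recv_ptr;

    CUDA_CALL(cuMemAlloc(&recv_ptr, size));
    CUDA_CALL(cuMemsetD8(recv_ptr, 0, size));  /* device-side memset to zero */
    return recv_ptr;
}
```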


cuda_memcpy((void*)alloc_mem_send.ptr, gold_data, size);

CUDA_CALL(cuCtxPopCurrent(&primary_ctx));
Contributor

can we add a check that there is no other context set currently?

Contributor Author

Added.
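A minimal sketch of such a check, assuming an assert-style verification rather than the exact code added in the PR:

```c
#include <assert.h>

/* Illustrative: after popping the primary context, verify that no other
 * context remains bound to the calling thread. */
static void pop_ctx_and_check(CUcontext *primary_ctx)
{
    CUcontext current;

    CUDA_CALL(cuCtxPopCurrent(primary_ctx));
    CUDA_CALL(cuCtxGetCurrent(&current));
    assert(current == NULL);  /* the thread must be left without a context */
}
```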

CUDA_CALL(cuDeviceGetCount(&dev_count));

for (i = 0; i < dev_count; ++i) {
CUDA_CALL(cuDeviceGet(&cu_dev_alloc, (i + rank) % dev_count));
Contributor

maybe iterate over all possible combinations instead?

Contributor Author

Left only one combination, since the test allocates on one GPU and sends using another. Iterating over all GPUs doesn't improve coverage.
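A minimal sketch of that combination, assuming allocation on one device while a different device's primary context is made current for the subsequent send (structure and names are illustrative, not the test's actual code):

```c
/* Illustrative: allocate on one device, then make another device's primary
 * context current before the MPI send, so the transfer path has to handle
 * memory owned by a non-current device. */
static CUdeviceptr alloc_on_other_device(int rank, size_t size)
{
    int dev_count;
    CUdevice dev_alloc, dev_send;
    CUcontext ctx_alloc, ctx_send;
    CUdeviceptr send_ptr;

    CUDA_CALL(cuDeviceGetCount(&dev_count));
    CUDA_CALL(cuDeviceGet(&dev_alloc, rank % dev_count));
    CUDA_CALL(cuDeviceGet(&dev_send, (rank + 1) % dev_count));

    /* Allocate while the allocating device's primary context is current. */
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx_alloc, dev_alloc));
    CUDA_CALL(cuCtxSetCurrent(ctx_alloc));
    CUDA_CALL(cuMemAlloc(&send_ptr, size));

    /* Switch to the other device's context for the MPI send that follows. */
    CUDA_CALL(cuDevicePrimaryCtxRetain(&ctx_send, dev_send));
    CUDA_CALL(cuCtxSetCurrent(ctx_send));
    return send_ptr;
}
```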

@brminich
Contributor

brminich commented May 5, 2025

@rakhmets, can you please squash?

@rakhmets force-pushed the topic/test-mpi-cuda branch from c051257 to 4d06b0e on May 5, 2025 14:25
@brminich enabled auto-merge May 5, 2025 16:08
@brminich merged commit ae68e72 into openucx:master May 6, 2025
151 checks passed
@rakhmets deleted the topic/test-mpi-cuda branch July 18, 2025 14:17