
Improve pack/unpack performance if CUDA/ROCm enabled #10364

Closed
bartoldeman wants to merge 1 commit

Conversation


@bartoldeman bartoldeman commented May 10, 2022

Instead of replacing individual memcpy()s with function pointer
calls, use the existing checksum-related logic to compile two
versions of every pack/unpack function, one without GPU and one with
GPU support (using -DOPAL_DATATYPE_PACK_UNPACK_GPU).

Signed-off-by: Bart Oldeman bart.oldeman@calculquebec.ca
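
For context, a minimal sketch of that double-compilation idea, with illustrative names only (MEM_COPY, PACK_FN, pack_contiguous, and my_convertor_t are placeholders, not the actual Open MPI symbols): the same pack/unpack source is compiled twice, as the checksum logic already does, and the GPU variant selected by -DOPAL_DATATYPE_PACK_UNPACK_GPU routes copies through the convertor's callback while the host variant calls memcpy directly.

```c
#include <stddef.h>
#include <string.h>

typedef struct my_convertor {
    /* In the real code this would point to opal_cuda_memcpy or mca_common_rocm_memcpy. */
    void *(*cbmemcpy)(void *dst, const void *src, size_t len, struct my_convertor *conv);
} my_convertor_t;

#if defined(OPAL_DATATYPE_PACK_UNPACK_GPU)
/* GPU build: every copy goes through the convertor's copy callback. */
#define MEM_COPY(dst, src, len, conv)  (conv)->cbmemcpy((dst), (src), (len), (conv))
#define PACK_FN(name)                  name##_gpu
#else
/* Host-only build: a direct memcpy, no function-pointer indirection. */
#define MEM_COPY(dst, src, len, conv)  ((void)(conv), memcpy((dst), (src), (len)))
#define PACK_FN(name)                  name
#endif

/* Depending on the build flag this becomes pack_contiguous or pack_contiguous_gpu;
 * compiling the file twice yields both variants, and the caller picks one per
 * convertor, so the host path never pays a GPU check. */
void PACK_FN(pack_contiguous)(my_convertor_t *conv, void *dst,
                              const void *src, size_t len)
{
    MEM_COPY(dst, src, len, conv);
}
```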

@ompiteam-bot

Can one of the admins verify this patch?

@bartoldeman
Author

The reason for this PR is that #9906 (comment) piqued my interest: I was wondering where the runtime penalty comes from, and the pack/unpack routines are the main place, though there are a few others (opal_datatype_copy_content_same_ddt(), the inline code in opal_convertor_pack/unpack(), and some I may have missed).

I tested with osu_alltoall on 64 cores with shared memory, where this brings a small improvement: for small messages it closes roughly half of the gap between the builds with and without CUDA, without actually involving a GPU. There may be better ways to test this!

I am aware of #10069, which is mostly orthogonal to this patch. However, it uses the term "accelerator", whereas in this patch I used "gpu" to be consistent with the MEMCPY_GPU notation and to cover both ROCm and CUDA. Of course I'm happy to change "gpu" to "accelerator" if that's preferred.

@awlauria
Contributor

ok to test

@bosilca
Member

bosilca commented May 12, 2022

I am curious how much improvement you see with this approach?

@bartoldeman
Author

@bosilca, sorry for the late reply. Here is what osu_alltoall reports with:
mpirun -n 64 --mca pml ob1 --mca btl self,sm --map-by core ./osu_alltoall
on a 64-core CPU-only node.

Open MPI compiled without CUDA:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      14.48
2                      15.57
4                      16.27
8                      16.70
16                     21.49
32                     21.93
64                     25.31
128                    56.45
256                    70.62
512                   111.49
1024                  154.74
2048                  222.99
4096                  410.54
8192                  511.01
16384                 777.42
32768                1320.40
65536                2144.62
131072               6361.49
262144              17247.35
524288              32161.33
1048576             55366.66

unpatched Open MPI compiled with CUDA:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      15.28
2                      16.19
4                      16.76
8                      17.29
16                     22.99
32                     24.78
64                     26.56
128                    57.73
256                    72.75
512                   113.00
1024                  154.58
2048                  229.20
4096                  402.53
8192                  507.11
16384                 805.57
32768                1415.23
65536                1892.89
131072               5956.96
262144              16340.17
524288              30998.18
1048576             54504.33

patched Open MPI compiled with CUDA support:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      14.85
2                      15.61
4                      16.49
8                      16.70
16                     22.28
32                     25.07
64                     26.13
128                    56.41
256                    72.93
512                   114.54
1024                  155.03
2048                  222.94
4096                  398.81
8192                  546.31
16384                 801.55
32768                1397.80
65536                2322.90
131072               6240.16
262144              16789.86
524288              31162.64
1048576             57857.76

If we look at message sizes 1 through 16 (columns: no-CUDA, CUDA, patched):

 1 14.48 15.28 14.85
 2 15.57 16.19 15.61
 4 16.27 16.76 16.49
 8 16.70 17.29 16.70
16 21.49 22.99 22.28

There is some noise in there, but the optimized version consistently falls between the non-CUDA and CUDA-enabled baselines.

It's mainly about the overhead of calling convertor->cbmemcpy, which always points to either opal_cuda_memcpy or mca_common_rocm_memcpy and then does a runtime check, instead of calling memcpy directly without that check.
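
To make that concrete, here is a hypothetical sketch of the two copy paths being compared; buffer_is_on_device() and device_memcpy() are stand-ins for the checks the real opal_cuda_memcpy / mca_common_rocm_memcpy callbacks perform, not actual Open MPI or CUDA APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder stubs, not real Open MPI or CUDA functions. */
static bool buffer_is_on_device(const void *ptr) { (void)ptr; return false; }
static void *device_memcpy(void *dst, const void *src, size_t len) { return memcpy(dst, src, len); }

/* Roughly what convertor->cbmemcpy does in a CUDA/ROCm build: a call through a
 * function pointer plus a runtime device-pointer check on every fragment, even
 * when both buffers are plain host memory. */
void *gpu_aware_memcpy(void *dst, const void *src, size_t len)
{
    if (buffer_is_on_device(dst) || buffer_is_on_device(src)) {
        return device_memcpy(dst, src, len);
    }
    return memcpy(dst, src, len);
}

/* What the non-CUDA build (and the GPU-free variant produced by this patch)
 * does: a direct memcpy, no indirection, no check. */
void *plain_memcpy(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}
```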

@edgargabriel
Member

Not sure whether I am reading the data correctly, but it looks like this approach introduces some overhead for large message sizes (columns: no-CUDA, unpatched CUDA, patched):

262144 17247.35 16340.17 16789.86
524288 32161.33 30998.18 31162.64
1048576 55366.66 54504.33 57857.76

@bartoldeman
Author

There's too much noise there for the larger messages.
I actually ran it twice for each case, so let me give the full scoop.
small messages:

         1st               2nd
   nocuda cuda patch nocuda cuda patch
 1 14.48 15.28 14.85  14.63 15.40 15.15
 2 15.57 16.19 15.61  15.37 16.08 15.69
 4 16.27 16.76 16.49  16.32 16.94 16.47
 8 16.70 17.29 16.70  16.68 17.24 16.71
16 21.49 22.99 22.28  21.59 22.75 22.37

large messages

         1st                          2nd
          nocuda  cuda    patch     nocuda  cuda     patch
 262144 17247.35 16340.17 16789.86 16588.20 16828.91 16977.03
 524288 32161.33 30998.18 31162.64 31247.85 32826.90 32874.95
1048576 55366.66 54504.33 57857.76 54576.68 57973.39 58715.68

The numbers are not consistent here: the CUDA path for large messages is sometimes slower than the non-CUDA path (without the patch!), which theoretically shouldn't happen. For bandwidth the changes fall within the expected noise, but for latency there is a clear, consistent difference.

@bartoldeman
Author

In the end #10069 made this pretty much obsolete, so I'll close this.
If I understand things correctly, there may be some improvement possible via vector copies instead of memcpy, but such a patch would look very different from this PR (see #10069 (comment)).
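
For illustration only, here is a rough sketch (my assumptions, not code from this PR or #10069) of what replacing a generic per-element memcpy with a fixed-size copy the compiler can vectorize might look like for a strided pack:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strided pack of double-sized elements: because the inner copy
 * has a compile-time-constant size, compilers typically inline it as plain
 * (often SIMD) loads/stores instead of emitting a memcpy call per element. */
void pack_strided_doubles(void *restrict packed, const void *restrict src,
                          size_t count, size_t stride)
{
    char *d = packed;
    const char *s = src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d, s, sizeof(double));  /* fixed 8-byte copy, usually vectorized */
        d += sizeof(double);
        s += stride;
    }
}
```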
