
Improve pack/unpack performance if CUDA/ROCm enabled #10364

Closed
bartoldeman wants to merge 1 commit

Conversation


@bartoldeman bartoldeman commented May 10, 2022

Instead of replacing individual memcpy()s with function pointer
calls, use the existing checksum-related logic to compile two
versions of every pack/unpack function, one without GPU and one with
GPU support (using -DOPAL_DATATYPE_PACK_UNPACK_GPU).

Signed-off-by: Bart Oldeman bart.oldeman@calculquebec.ca
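
For context, a minimal sketch of that double-compilation idea, with illustrative names only (MEM_COPY, PACK_FN, pack_contiguous, and my_convertor_t are placeholders, not the actual Open MPI symbols): the same pack/unpack source is compiled twice, as the checksum logic already does, and the GPU variant selected by -DOPAL_DATATYPE_PACK_UNPACK_GPU routes copies through the convertor's callback while the host variant calls memcpy directly.

```c
#include <stddef.h>
#include <string.h>

typedef struct my_convertor {
    /* In the real code this would point to opal_cuda_memcpy or mca_common_rocm_memcpy. */
    void *(*cbmemcpy)(void *dst, const void *src, size_t len, struct my_convertor *conv);
} my_convertor_t;

#if defined(OPAL_DATATYPE_PACK_UNPACK_GPU)
/* GPU build: every copy goes through the convertor's copy callback. */
#define MEM_COPY(dst, src, len, conv)  (conv)->cbmemcpy((dst), (src), (len), (conv))
#define PACK_FN(name)                  name##_gpu
#else
/* Host-only build: a direct memcpy, no function-pointer indirection. */
#define MEM_COPY(dst, src, len, conv)  ((void)(conv), memcpy((dst), (src), (len)))
#define PACK_FN(name)                  name
#endif

/* Depending on the build flag this becomes pack_contiguous or pack_contiguous_gpu;
 * compiling the file twice yields both variants, and the caller picks one per
 * convertor, so the host path never pays a GPU check. */
void PACK_FN(pack_contiguous)(my_convertor_t *conv, void *dst,
                              const void *src, size_t len)
{
    MEM_COPY(dst, src, len, conv);
}
```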

@ompiteam-bot

Can one of the admins verify this patch?

@bartoldeman
Author

The reason for this PR is that #9906 (comment) piqued my interest: I was wondering where the runtime penalty comes from, and the pack/unpack routines are the main place, though there are a few others (opal_datatype_copy_content_same_ddt(), the inline code in opal_convertor_pack/unpack(), and some I may have missed).

I tested with osu_alltoall on 64 cores with shared memory, where this brings a small improvement: for small messages it closes roughly half of the gap between the builds with and without CUDA, without actually involving a GPU. There may be better ways to test this!

I am aware of #10069, which is mostly orthogonal to this patch. However, it uses the term "accelerator", whereas in this patch I used "gpu" to be consistent with the MEMCPY_GPU notation and to cover both ROCm and CUDA. Of course I'm happy to change "gpu" to "accelerator" if that's preferred.

@awlauria
Contributor

ok to test

@bosilca
Member

bosilca commented May 12, 2022

I am curious how much improvement you see with this approach?

@bartoldeman
Author

@bosilca, sorry for the late reply. Here is what osu_alltoall reports with:
mpirun -n 64 --mca pml ob1 --mca btl self,sm --map-by core ./osu_alltoall
on a 64-core CPU-only node.

Open MPI compiled without CUDA:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      14.48
2                      15.57
4                      16.27
8                      16.70
16                     21.49
32                     21.93
64                     25.31
128                    56.45
256                    70.62
512                   111.49
1024                  154.74
2048                  222.99
4096                  410.54
8192                  511.01
16384                 777.42
32768                1320.40
65536                2144.62
131072               6361.49
262144              17247.35
524288              32161.33
1048576             55366.66

unpatched Open MPI compiled with CUDA:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      15.28
2                      16.19
4                      16.76
8                      17.29
16                     22.99
32                     24.78
64                     26.56
128                    57.73
256                    72.75
512                   113.00
1024                  154.58
2048                  229.20
4096                  402.53
8192                  507.11
16384                 805.57
32768                1415.23
65536                1892.89
131072               5956.96
262144              16340.17
524288              30998.18
1048576             54504.33

patched Open MPI compiled with CUDA support:

# OSU MPI All-to-All Personalized Exchange Latency Test v5.9
# Size       Avg Latency(us)
1                      14.85
2                      15.61
4                      16.49
8                      16.70
16                     22.28
32                     25.07
64                     26.13
128                    56.41
256                    72.93
512                   114.54
1024                  155.03
2048                  222.94
4096                  398.81
8192                  546.31
16384                 801.55
32768                1397.80
65536                2322.90
131072               6240.16
262144              16789.86
524288              31162.64
1048576             57857.76

If we look at message sizes 1 through 16 (columns: no-CUDA, CUDA, patched):

 1 14.48 15.28 14.85
 2 15.57 16.19 15.61
 4 16.27 16.76 16.49
 8 16.70 17.29 16.70
16 21.49 22.99 22.28

There is some noise in there, but the optimized version consistently falls between the non-CUDA and CUDA-enabled baselines.

It's mainly about the overhead of calling convertor->cbmemcpy, which always points to either opal_cuda_memcpy or mca_common_rocm_memcpy and then does a runtime check, instead of calling memcpy directly without that check.
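
To make that concrete, here is a hypothetical sketch of the two copy paths being compared; buffer_is_on_device() and device_memcpy() are stand-ins for the checks the real opal_cuda_memcpy / mca_common_rocm_memcpy callbacks perform, not actual Open MPI or CUDA APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder stubs, not real Open MPI or CUDA functions. */
static bool buffer_is_on_device(const void *ptr) { (void)ptr; return false; }
static void *device_memcpy(void *dst, const void *src, size_t len) { return memcpy(dst, src, len); }

/* Roughly what convertor->cbmemcpy does in a CUDA/ROCm build: a call through a
 * function pointer plus a runtime device-pointer check on every fragment, even
 * when both buffers are plain host memory. */
void *gpu_aware_memcpy(void *dst, const void *src, size_t len)
{
    if (buffer_is_on_device(dst) || buffer_is_on_device(src)) {
        return device_memcpy(dst, src, len);
    }
    return memcpy(dst, src, len);
}

/* What the non-CUDA build (and the GPU-free variant produced by this patch)
 * does: a direct memcpy, no indirection, no check. */
void *plain_memcpy(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}
```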

@edgargabriel
Member

Not sure whether I am reading the data correctly, but it looks like this approach introduces some overhead for large message sizes (columns: no-CUDA, unpatched CUDA, patched):

262144 17247.35 16340.17 16789.86
524288 32161.33 30998.18 31162.64
1048576 55366.66 54504.33 57857.76

@bartoldeman
Author

There's too much noise there for the larger messages.
I actually ran it twice for each case, so let me give the full scoop.
small messages:

         1st               2nd
   nocuda cuda patch nocuda cuda patch
 1 14.48 15.28 14.85  14.63 15.40 15.15
 2 15.57 16.19 15.61  15.37 16.08 15.69
 4 16.27 16.76 16.49  16.32 16.94 16.47
 8 16.70 17.29 16.70  16.68 17.24 16.71
16 21.49 22.99 22.28  21.59 22.75 22.37

large messages

         1st                          2nd
          nocuda  cuda    patch     nocuda  cuda     patch
 262144 17247.35 16340.17 16789.86 16588.20 16828.91 16977.03
 524288 32161.33 30998.18 31162.64 31247.85 32826.90 32874.95
1048576 55366.66 54504.33 57857.76 54576.68 57973.39 58715.68

The numbers are not consistent here: the CUDA path for large messages is sometimes slower than the non-CUDA path (without the patch!), which theoretically shouldn't happen. For bandwidth the changes fall within the expected noise, but for latency there is a clear, consistent difference.

@bartoldeman
Author

In the end #10069 made this pretty much obsolete, so I'll close this.
If I understand things correctly, there may be some improvement possible via vector copies instead of memcpy, but such a patch would look very different from this PR (see #10069 (comment)).
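
For illustration only, here is a rough sketch (my assumptions, not code from this PR or #10069) of what replacing a generic per-element memcpy with a fixed-size copy the compiler can vectorize might look like for a strided pack:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strided pack of double-sized elements: because the inner copy
 * has a compile-time-constant size, compilers typically inline it as plain
 * (often SIMD) loads/stores instead of emitting a memcpy call per element. */
void pack_strided_doubles(void *restrict packed, const void *restrict src,
                          size_t count, size_t stride)
{
    char *d = packed;
    const char *s = src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d, s, sizeof(double));  /* fixed 8-byte copy, usually vectorized */
        d += sizeof(double);
        s += stride;
    }
}
```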
