
CUDA + MPI_TYPE_INDEXED + MPI_SEND/RECV super slow #12202

Open
chhu opened this issue Jan 2, 2024 · 5 comments


chhu commented Jan 2, 2024

Background information

We developed a CFD code some years ago for arbitrary 3D shapes and parallelized it via domain decomposition. To exchange halo/ghost-cell data we used gather/scatter loops to combine the data into a linear buffer before sending. That worked very well, surprisingly so even for GPU-to-GPU, despite the detour of copying that buffer into host memory first, which frees MPI from knowing or caring about device memory. Now that multi-GPU environments are becoming standard, we wanted to see if we can improve parallelism by transferring device-to-device memory with CUDA-aware Open MPI.

To make the code even simpler we switched to MPI_TYPE_INDEXED and sent/received directly with that type instead of first creating a linear buffer. This works very well in a classic HPC environment (about 5% faster than the previous method), but simply applying the same logic to device memory has a tremendous impact on speed (a factor of 500 slower). The detour of linearizing first in device memory via cusparseGather() and cusparseScatter() yielded at least the same speed as before (with or without the additional detour to host memory).
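(For context, a minimal sketch of that indexed-datatype exchange, assuming a CUDA-aware Open MPI build; d_field, send_idx/recv_idx and the neighbor rank are illustrative placeholders for the application's own data.)

    #include <stdlib.h>
    #include <mpi.h>

    /* Build an indexed datatype with one double per ghost cell. */
    static MPI_Datatype make_indexed(const int *idx, int n)
    {
        MPI_Datatype t;
        int *blens = malloc(n * sizeof *blens);
        for (int i = 0; i < n; ++i) blens[i] = 1;
        MPI_Type_indexed(n, blens, idx, MPI_DOUBLE, &t);
        MPI_Type_commit(&t);
        free(blens);
        return t;
    }

    /* Exchange halo data with one neighbor directly from a cudaMalloc'ed
     * field -- the path that turned out ~500x slower than packing first. */
    void exchange_halo(double *d_field, const int *send_idx, int nsend,
                       const int *recv_idx, int nrecv, int neighbor,
                       MPI_Comm comm)
    {
        MPI_Datatype st = make_indexed(send_idx, nsend);
        MPI_Datatype rt = make_indexed(recv_idx, nrecv);
        MPI_Request req[2];

        MPI_Irecv(d_field, 1, rt, neighbor, 0, comm, &req[0]);
        MPI_Isend(d_field, 1, st, neighbor, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        MPI_Type_free(&st);
        MPI_Type_free(&rt);
    }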

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested with OMPI 4.1 and the latest 5.1 branch; same outcome.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from source on a system with the latest UCX + CUDA 12.3.1.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+088c88499514c3c55aee0c433d6f4fb46b0c3587 3rd-party/openpmix (v1.1.3-3973-g088c8849)
bc5de986e13c14db575141714a8b5e0089f4d0bf 3rd-party/prrte (psrvr-v2.0.0rc1-4741-gbc5de986e1)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RedHat8
  • Computer hardware: Dual AMD Epyc 3rd gen with 8x NVIDIA RTX A5000
  • Network type: MLX5 (not used)

Details of the problem

  • Exchanging ghost data with MPI_TYPE_INDEXED and MPI_ISEND/MPI_IRECV creates a massive slow-down (x500) compared to linearizing the buffer first.
  • To my surprise, exchanging buffers GPU-to-GPU is almost identical to copying to the host first and exchanging there. Am I doing something wrong, or am I expecting too much? (not using GDRCopy yet!)

Thanks a lot for your help!


bosilca commented Jan 2, 2024

The current approach to packing/unpacking device memory issues one cudaMemcpy per item in the indexed datatype, so the performance will be terrible. This is something I have wanted to look at for a long time, but I can't give you a time estimate for a potential fix. Meanwhile, for device memory I would suggest you pack the data yourself (with a device function if possible).
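(A minimal sketch of that self-packing workaround, assuming double-precision cells and a device-resident index list; pack_halo, d_field, d_idx and d_packed are illustrative names, not Open MPI API.)

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* One thread gathers one halo element into a contiguous device buffer. */
    __global__ void pack_halo(const double *d_field, const int *d_idx,
                              double *d_packed, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_packed[i] = d_field[d_idx[i]];
    }

    /* Pack on the device, then send one contiguous MPI_DOUBLE block instead
     * of letting the datatype engine issue one copy per index entry. */
    void pack_and_send(const double *d_field, const int *d_idx,
                       double *d_packed, int n, int neighbor, MPI_Comm comm)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        pack_halo<<<blocks, threads>>>(d_field, d_idx, d_packed, n);
        cudaDeviceSynchronize();      /* ensure packing finished before send */
        MPI_Send(d_packed, n, MPI_DOUBLE, neighbor, 0, comm);
    }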

bosilca self-assigned this Jan 2, 2024

chhu commented Jan 2, 2024

Thanks, appreciate it. Let me know if I can help. As I wrote, something like cusparseGather (https://docs.nvidia.com/cuda/cusparse/#cusparsegather) and its scatter counterpart does the trick.
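(For reference, a minimal sketch of that cusparseGather-based packing, with error checking omitted; d_field, d_idx and d_packed are illustrative device pointers, with size the field length and nnz the number of halo entries.)

    #include <cusparse.h>

    /* Gather d_field[d_idx[i]] into the contiguous buffer d_packed (i < nnz),
     * which can then be sent as a plain MPI_DOUBLE array. */
    void gather_halo(cusparseHandle_t handle, const double *d_field,
                     int64_t size, int *d_idx, double *d_packed, int64_t nnz)
    {
        cusparseSpVecDescr_t vecX;  /* sparse vector: indices + gathered values */
        cusparseDnVecDescr_t vecY;  /* dense vector: the full device field      */

        cusparseCreateSpVec(&vecX, size, nnz, d_idx, d_packed,
                            CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO,
                            CUDA_R_64F);
        cusparseCreateDnVec(&vecY, size, (void *)d_field, CUDA_R_64F);

        cusparseGather(handle, vecY, vecX);  /* d_packed[i] = d_field[d_idx[i]] */

        cusparseDestroySpVec(vecX);
        cusparseDestroyDnVec(vecY);
    }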

Is including GDRCopy worth the effort, or is it only useful in combination with InfiniBand?

Affinity handling by mpirun is also not optimal (just my opinion); each rank should auto-bind close to its matching CUDA device (e.g. rank % cuda_device_count). I don't like the manual discovery method suggested by the FAQ... but that's another topic.
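(For reference, the manual discovery/binding referred to above amounts to roughly the following sketch, using the MPI-3 shared-memory split to obtain a node-local rank; names are illustrative, not an Open MPI recommendation.)

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Bind each rank to a device based on its position within the node. */
    void bind_gpu(MPI_Comm comm)
    {
        MPI_Comm node_comm;
        int local_rank, ndev;

        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                            &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(local_rank % ndev);   /* rank % cuda_device_count */
        MPI_Comm_free(&node_comm);
    }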


rhc54 commented Jan 2, 2024

each rank should auto-bind close to its matching CUDA device (e.g. rank % cuda_device_count)

FWIW: some of us are working on that now.


jsquyres commented Jan 8, 2024

I would suggest that any fixes be applied to v5.0.x and later, and probably not be back-ported to v4.1.x.


chhu commented Jan 25, 2024

But then you would need to mark MPI_TYPE_INDEXED as an experimental feature and not expose it to users without warnings. The way it is implemented right now is worse than not implementing it at all (and forcing users to find alternatives). I was lucky to come from the other direction (packing with device functions, then a linear transfer), so I know it can be rather fast; but a user who approaches this via MPI_TYPE_INDEXED first, assuming OMPI always behaves optimally, will find it rather frustrating and discouraging.

jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024