
Specific OSU benchmarks segfault when non-CUDA-aware OpenMPI 4.1.1 is compiled with CUDA-aware UCX #9906

Closed
casparvl opened this issue Jan 21, 2022 · 14 comments

Comments

@casparvl

casparvl commented Jan 21, 2022

Background information

This question comes from the EasyBuild community, and may affect how OpenMPI is installed by this community going forward.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From sources. The configure line was:

./configure --prefix=/sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0  --build=x86_64-pc-linux-gnu  --host=x86_64-pc-linux-gnu --enable-shared --without-verbs --with-ucx=$EBROOTUCX --enable-mpirun-prefix-by-default --with-hwloc=$EBROOTHWLOC --disable-wrapper-runpath --disable-wrapper-rpath --with-slurm --with-pmix=internal --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64  --with-cuda=no  --with-libevent=/sw/arch/Centos8/EB_production/2021/software/libevent/2.1.12-GCCcore-10.3.0  --with-ofi=/sw/arch/Centos8/EB_production/2021/software/libfabric/1.12.1-GCCcore-10.3.0

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: Intel Xeon Platinum 8360Y CPUs (2x), NVIDIA A100 GPUs
  • Network type: 2xHDR100 interconnect

Details of the problem

For practical reasons (described here), the EasyBuild community decided to try using a non-CUDA-aware OpenMPI together with a CUDA-aware UCX in order to provide GPU (and GPUDirect) support in OpenMPI. The assumption was that this would work, since OpenMPI would delegate communication to UCX, which in turn would take care of 'all things GPU'. Documentation such as the list of MPI APIs that work with CUDA-aware UCX (here), and statements like 'OpenMPI v2.0.0 new features: CUDA support through UCX' (here), also seem to suggest that it should.

Unfortunately, that does not seem to be the case. Certain OSU benchmarks (e.g. osu_latency, osu_bw, osu_bcast) work, but many others (e.g. osu_gather, osu_alltoall) fail with segfaults:

mpirun -np 2 --mca pml ucx osu_alltoall -d cuda D D
...
==== backtrace (tid: 757285) ====
 0 0x000000000002a160 ucs_debug_print_backtrace()  /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
 1 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 2 0x000000000016065c __memmove_avx_unaligned_erms()  :0
 3 0x0000000000052aeb non_overlap_copy_content_same_ddt()  opal_datatype_copy.c:0
 4 0x00000000000874b9 ompi_datatype_sndrcv()  ???:0
 5 0x00000000000e2d84 ompi_coll_base_alltoall_intra_pairwise()  ???:0
 6 0x000000000000646c ompi_coll_tuned_alltoall_intra_dec_fixed()  ???:0
 7 0x0000000000089a4f MPI_Alltoall()  ???:0
 8 0x0000000000402d40 main()  ???:0
 9 0x0000000000023493 __libc_start_main()  ???:0
10 0x000000000040318e _start()  ???:0
=================================
[gcn6:757285] *** Process received signal ***
[gcn6:757285] Signal: Segmentation fault (11)
[gcn6:757285] Signal code:  (-6)
[gcn6:757285] Failing at address: 0xb155000b8e25
[gcn6:757285] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x14f53b4a9b20]
[gcn6:757285] [ 1] /lib64/libc.so.6(+0x16065c)[0x14f53b23265c]
[gcn6:757285] [ 2] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libopen-pal.so.40(+0x52aeb)[0x14f53ab5eaeb]
[gcn6:757285] [ 3] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_datatype_sndrcv+0x949)[0x14f53d5404b9]
[gcn6:757285] [ 4] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0x174)[0x14f53d59bd84]
[gcn6:757285] [ 5] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x7c)[0x14f530c4246c]
[gcn6:757285] [ 6] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Alltoall+0x15f)[0x14f53d542a4f]
[gcn6:757285] [ 7] osu_alltoall[0x402d40]
[gcn6:757285] [ 8] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14f53b0f5493]
[gcn6:757285] [ 9] osu_alltoall[0x40318e]
[gcn6:757285] *** End of error message ***

This response and this response in an earlier ticket seemed to suggest that CUDA support might be limited for anything other than point-to-point communication (e.g. collectives). I may have overlooked it, but I can't really find this in the official documentation (this seems to be the most relevant part, but I don't see it clearly mentioned there - unless I misunderstood something).

Indeed, if we recompile OpenMPI with CUDA support (leaving everything else the same), the osu_alltoall benchmark above runs fine.

The actual question...

So far, I'm still confused about the level of CUDA support I can expect when compiling UCX with, but OpenMPI without, CUDA support. The OpenMPI FAQ seems to suggest (almost) full support, and the UCX documentation also doesn't seem to suggest serious limitations. But my segfault and the other issue on the OpenMPI issue tracker suggest rather limited support.

Can any of the experts clear this up? Is compiling OpenMPI without CUDA, but with a CUDA-aware UCX a feasible approach to offer CUDA support? If not (yet), will it be at some point in the future? A clear answer on this will really help the EasyBuild community in deciding how to support building OpenMPI/UCX going forward :)

@jsquyres
Member

FYI @open-mpi/ucx

@brminich
Member

This does not work, because OMPI is not just redirecting everything to UCX. In some modules, like coll, it has a bunch of memory management routines (copies) in its stack. OMPI compiled without CUDA support is not able to detect the type of user memory and use the appropriate memory management routines.
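
As an illustration of why those copies fail, here is a minimal sketch (not Open MPI code; it assumes a CUDA-capable node and the CUDA runtime headers/library): a plain host-side memcpy on a buffer allocated with cudaMalloc dereferences device memory from the CPU, which is essentially what the non-CUDA-aware datatype copy in the backtrace above ends up doing.

/* illustrative sketch only -- compile with e.g. nvcc, or gcc plus -lcudart */
#include <cuda_runtime.h>
#include <string.h>

int main(void)
{
    void *dev_buf = NULL;
    char host_buf[1024];

    cudaMalloc(&dev_buf, sizeof(host_buf));        /* device memory */
    memcpy(host_buf, dev_buf, sizeof(host_buf));   /* host-side copy: SIGSEGV, dev_buf is not host-addressable */
    return 0;
}

A CUDA-aware build instead detects that the buffer is device memory and uses a CUDA copy routine for such cases.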

@brminich
Member

@Akshay-Venkatesh, are you aware of any plans to support such a configuration? (i.e. when OMPI is compiled without CUDA support, while UCX is built with it)

@casparvl
Author

In some modules, like coll, it has a bunch of memory management routines (copies) in its stack. OMPI compiled without CUDA support is not able to detect the type of user memory and use the appropriate memory management routines.

Thanks a lot for this clear explanation, now I also understand why we see it in some tests but not in others :)

I have to think a bit about how we can best deal with this in the EasyBuild community. Essentially, EasyBuild is a tool for providing optimized software builds on HPC systems. Since the build recipes need to support both GPU and non-GPU based systems, we were hoping to be able to do a base installation that provides the CPU support, and then 'something on top' (in a different prefix) that provides the GPU support. With UCX, we could actually make this split: the base installation is made in one prefix, and an installation in a separate prefix provides the CUDA functionality (and only needs to be installed by people with GPU-based systems). That was very attractive for this community.

I'm not sure if we can do something similar with the GPU support in OpenMPI though. If it was just the smcuda BTL, I guess we could have installed that in a separate prefix and used export OMPI_MCA_mca_component_path=/additional/path/to/smcuda. But if it's in the coll MCA, I'm not sure if that's possible...

I'm sure such considerations are a bit out of scope here, but if anyone has ideas on this, don't hesitate to share! I'd also love to hear if there are plans to support OMPI without CUDA + UCX with CUDA (since for that setup we've pretty much already solved our 'separation' problem).

@boegel

boegel commented Jan 25, 2022

What would be the impact of (always) building OpenMPI on top of CUDA, but not having CUDA available at runtime (only when it's actually needed)?
If that's OK, then CUDA could be just a build dependency, in EasyBuild terms.

I guess that may still be problematic in terms of which CUDA version we build against though...

@casparvl
Author

I was thinking the exact same thing. But maybe it's better to discuss the details of a solution here. The question of the impact is a good one though. Curious if anyone here would know this straight away. Otherwise, we'll just have to try and test (extensively) I guess...

@Micket

Micket commented Jan 27, 2022

In case it's not clear what we EasyBuild folks are trying to do here: this is really just a packaging issue on our side. It would be greatly beneficial for us to build "CUDA support" separately and allow it as an opt-in at runtime by loading an environment module. We thought UCX's plugin system would allow us to achieve that, but we (I in particular) were wrong.

Since MCAs are supposed to allow for run-time assembly of the MPI implementation, I still had some hope that this would be possible by gently prying apart the monolithic --with-cuda configure flag and building just the OPAL_CUDA_SUPPORT parts into libopen-pal.so. Given that opal_datatype_cuda doesn't actually have any compile-time dependencies on CUDA at all (it just makes OPAL check for the possibility of a GPU array using function handles), it seemed relatively straightforward to support assembling CUDA support at runtime by simply extending OMPI_mca_component_path to include lib64/openmpi/mca_coll_cuda.so (which actually contains the CUDA code).
I have only done some quick and really, really dirty testing here, but it did seem promising: libopen-pal.so can be built with all the CUDA-aware code (the opal_cuda_* functions) without actually needing a CUDA runtime library.
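
To sketch what that function-handle idea could look like (illustrative only; the names below are made up and this is not the actual opal_datatype_cuda.c code): the core library calls through a pointer that defaults to "host memory", and a separately built CUDA component could install a real checker at runtime.

/* hypothetical names, illustrating the pattern only */
#include <stdio.h>

typedef int (*mem_is_gpu_fn)(const void *buf);

static int mem_is_gpu_noop(const void *buf) { (void)buf; return 0; }  /* default: never GPU */
static mem_is_gpu_fn mem_is_gpu = mem_is_gpu_noop;

/* a CUDA-enabled component, loaded at runtime, could register a real check */
void example_register_gpu_check(mem_is_gpu_fn fn) { mem_is_gpu = fn; }

/* the datatype code only ever asks "is this buffer GPU memory?" */
int example_need_device_copy(const void *buf) { return mem_is_gpu(buf); }

int main(void)
{
    char host_buf[8];
    printf("need device copy? %d\n", example_need_device_copy(host_buf));
    return 0;
}

No CUDA headers or libraries are needed to compile this part; only the component that registers the real checker would need them.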

So I got my hopes up testing this with 4.1.1, but unfortunately this commit, deb37ac, pushes things in the other direction, bundling the opal_cuda_* functions together with the CUDA-dependent mca_common_cuda_* functions, which aren't necessary for OPAL itself.
I really can't see any benefit to this change; it just makes things more monolithic and compile-time determined, which I feel goes against the whole idea of the MCA code. Surely the core library shouldn't have any direct includes into mca/ subdirectories.

@akesandgren
Contributor

I agree deb37ac looks like a bad move for us (EasyBuild) and seems to make the MCA separation model less useful. Before that commit you could build OpenMPI with CUDA support enabled without using the CUDA include files, and then add the actual OpenMPI CUDA runtime part at a later stage (or separately, in our case).

The above commit makes that impossible.

@bwbarrett
Member

Just so we're clear (because at least one person in the OMPI community pinged me this morning about the severity of this ticket), deb37ac is not on any released branch. So it's not actually causing any problems today and isn't a step in any direction. There may be bugs in unreleased software, but let's not talk about the directionality of unreleased software.

It has always been the intent of Open MPI to be able to be built with --with-cuda and only actually provide CUDA support if the CUDA libraries are installed. This is not free, however. Compiling OMPI with --with-cuda has a hard-to-quantify (and likely application-dependent) cost, even when the CUDA libraries aren't available at runtime (it is even larger if the CUDA libraries are available). This is likely true of UCX as well, but I don't know UCX as well as I know OMPI. Libfabric solves the performance-delta problem by moving CUDA/host memory detection into the registration path, meaning that the expensive conditionals aren't in the Libfabric critical path.

So while the intent is very much that you should be able to build Open MPI with --with-cuda (assuming the CUDA libraries are available at build time) and run on a system without CUDA libraries, there's also the possibility that work remains for 5.0. And if you're worried about performance, you probably don't want to run in that configuration in the general case, because OMPI will have some (hard-to-quantify) overheads for doing so.

@bwbarrett
Member

Closing this issue, as we've identified root cause. Because Open MPI has to manipulate the user buffer in certain scenarios (collectives are a big one, but there are others), Open MPI must be built with CUDA support in order for the application to be able to pass CUDA buffers to MPI. That was the root cause of the 4.1.x problem.

I've opened #9933 to track the 5.0.x library dependency regression.

@Micket

Micket commented Jan 27, 2022

So while the intent is very much that you should be able to build Open MPI with --with-cuda (assuming the CUDA libraries are available at build time) and run on a system without CUDA libraries

I'm sorry but I must have been unclear and I would like to clarify; this isn't at all the situation, and I have no reason to believe this scenario that you describe is in any way broken in any version of OpenMPI (including 5.0).

My comments are about the OPAL functions (specifically this code: https://github.com/open-mpi/ompi/blob/v4.1.x/opal/datatype/opal_datatype_cuda.c), which do not actually need any CUDA libraries to be present.
However, this (non-CUDA-dependent) code is tied to the same --with-cuda flag as the rest of the CUDA MCAs (which of course do need CUDA libraries).
I've only done some quick and very hacky testing so far, but it seems promising: this functionality can be built into libopen-pal.so (without CUDA), which would still allow separately built components (smcuda, coll_cuda, etc., built at a later stage against the CUDA runtime) to be assembled into a CUDA-supporting MPI implementation at runtime.

The restructuring in deb37ac just makes separating these trickier, as it moves them into a translation unit with MCA-specific functions that do have direct CUDA dependencies. To me it seems like a needless change, and having libopen-pal take direct dependencies on mca/ subdirectories sticks out like a sore thumb, IMHO.

@akesandgren
Contributor

And in case it isn't clear yet: what we (EasyBuild) want to achieve is to build OpenMPI in such a way that the base build is aware of the CUDA functionality (without it trying to include cuda.h at configure/build time), and then at a later stage build the actual CUDA part, libmca_common_cuda.so, into a separate directory and add that to the paths OpenMPI searches for MCA modules.

I.e., the code prior to the deb37ac commit makes that possible: change the configure step a bit to enable the OPAL_CUDA parts, build OpenMPI without mca_common_cuda etc., and install it in dir A. Then do a second build that basically just compiles libmca_common_cuda.so, mca_btl_smcuda.so and mca_coll_cuda.so, so we can enable runtime CUDA support when needed. This induces a small overhead for the base build when it doesn't need, or have, the CUDA parts, but for us that is OK.

This would then match what we already do with UCX: we build a (slightly modified) UCX without CUDA, then add a CUDA-enabled UCX in a separate directory with pointers to where to find the CUDA part.

The combination of these two would give us the end result we are looking for.

@bwbarrett
Member

You can't achieve what you propose with Open MPI's design, at any point in Open MPI's history. What you tried may have worked for certain cases, but would not have worked in the general case.

CUDA's design is pretty invasive for something like an MPI implementation. Certain datatypes require local processing (frequently, copying), as there's no other rational way to transfer them. A 1-million-entry vector of single (sparse) elements is an example. You don't want to create a 1-million-entry iovec to describe that to IB (and IB has an iovec length limit smaller than that anyway), so the datatype engine copies that sparse array into a dense array before handing off to IB (or the SM BTL or any other network). This code is all implemented in the datatype engine, which is a combination of inline functions and functions living in the libopen-pal library. The inline bits end up being included in libmpi.so as well as in over a dozen components. At this point, you really have two different builds of Open MPI.
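
For a concrete, hedged example of such a datatype (assuming mpicc and the standard MPI datatype API; the numbers are illustrative): a vector type with a million single-element blocks, which the datatype engine packs into a contiguous staging buffer before handing it to the network. That pack is a CPU-side copy unless Open MPI was built with CUDA support.

/* illustrative sketch only */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Datatype sparse;
    /* 1,000,000 blocks of 1 double each, strided 4 doubles apart */
    MPI_Type_vector(1000000, 1, 4, MPI_DOUBLE, &sparse);
    MPI_Type_commit(&sparse);

    /* An MPI_Send(buf, 1, sparse, ...) of this type goes through the
       pack/unpack path in the datatype engine rather than a giant iovec. */

    MPI_Type_free(&sparse);
    MPI_Finalize();
    return 0;
}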

What we tried to do with CUDA support for some of these use cases is to have a two-level enablement. First, at compile time, you can decide whether or not you want CUDA support via configure flags. There is a performance penalty (hard to quantify, and differing by application) for enabling this support, even if no CUDA libraries are found at runtime. Second, we dlopen() the CUDA libraries, rather than adding a loader-time dependency to the application, and the CUDA-enabled code is all conditional on the success of that dlopen. The penalty I spoke of earlier is the cost of doing those conditionals. If you're sending mostly large messages (and infrequently), the penalty probably doesn't matter. But if you're sending frequent, very small messages, it's likely to show up (on high-performance networks like IB or GNI).
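
A rough sketch of that two-level pattern (illustrative only, not Open MPI's actual code; the function and variable names are made up):

#include <dlfcn.h>
#include <stdio.h>

static void *cuda_lib;      /* handle to the CUDA driver library, if present */
static int   cuda_enabled;  /* gates all CUDA-specific code paths */

static void maybe_enable_cuda(void)
{
    /* no loader-time dependency: try to open the driver library at runtime */
    cuda_lib = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
    cuda_enabled = (cuda_lib != NULL);
    if (!cuda_enabled) {
        fprintf(stderr, "CUDA driver library not found (%s), running host-only\n", dlerror());
    }
    /* elsewhere, CUDA code is guarded by 'if (cuda_enabled) ...'; those
       conditionals are the (small) cost paid even on systems without CUDA */
}

int main(void)
{
    maybe_enable_cuda();
    return 0;
}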

So EasyBuild's most portable solution is to build Open MPI with --with-cuda and let the runtime loading of CUDA fail in situations where the user hasn't installed CUDA support. That's all you should have to do, although, as I said, there's a performance penalty in doing that. The only way out of that performance penalty is a completely independent build of the entire Open MPI package. Otherwise, there are lots of sharp edges where you will get amorphous failures when a CUDA buffer ends up in some non-CUDA enabled piece of code.

@bartoldeman

If I understand things correctly, with 5.0+ and #10069 the performance concerns are no longer there; it's even documented that you can compile with CUDA support without CUDA being available at runtime:
https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html
"Open MPI supports building with CUDA libraries and running on systems without CUDA libraries or hardware."
