Specific OSU benchmarks segfault when non-CUDA-aware OpenMPI 4.1.1 is combined with CUDA-aware UCX #9906
Comments
FYI @open-mpi/ucx
This does not work, because OMPI is not just redirecting everything to UCX. Some modules, like coll, have a number of memory-management routines (copies) in their stack. OMPI compiled without CUDA support is not able to detect the type of user memory and use the appropriate memory-management routines.
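(For illustration: the snippet below is a hedged sketch of the kind of pointer classification a CUDA-aware copy path performs, using the CUDA driver API. It is not OMPI's actual code, and the helper name is made up.)

```c
#include <cuda.h>
#include <stdint.h>

/* Hypothetical helper: classify a user buffer as host or device memory.
 * A build without CUDA support has no such check and falls back to plain
 * memcpy(), which faults on device pointers. */
static int buffer_is_on_device(const void *buf)
{
    CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
    CUresult rc = cuPointerGetAttribute(&mem_type,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        (CUdeviceptr)(uintptr_t)buf);
    /* Pointers unknown to the CUDA driver are treated as host memory. */
    return (rc == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE);
}
```

A copy routine would then dispatch to a device copy (e.g. cuMemcpy) for device buffers and to memcpy for host buffers.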
@Akshay-Venkatesh, are you aware of any plans to support such a configuration? (i.e. when OMPI is compiled without CUDA support, while UCX is built with it)
Thanks a lot for this clear explanation; now I also understand why we see it in some tests, but not in others :) I have to think a bit about how we can best deal with this in the EasyBuild community.

Essentially, EasyBuild is a tool for providing optimized software builds on HPC systems. Since the build recipes need to support both GPU and non-GPU based systems, we were hoping to be able to do a base installation that provides the CPU support, and then 'something on top' (in a different prefix) that provides the GPU support. With UCX, we could actually make this split: the base installation is made in one prefix, and an installation in a separate prefix provides the CUDA functionality (and only needs to be installed by people with GPU-based systems). That was very attractive for this community. I'm not sure if we can do something similar with the GPU support in OpenMPI though. If it was just the …

I'm sure such considerations are a bit out of scope here, but if anyone has ideas on this, don't hesitate to share! I'd also indeed love to hear if there are plans to support OMPI without CUDA + UCX with CUDA (since for that setup we've pretty much already solved our 'separation' problem).
What would be the impact of (always) building OpenMPI on top of CUDA, but not having CUDA available at runtime (only when it's actually needed)? I guess that may still be problematic in terms of which CUDA version we build against though...
I was thinking the exact same thing. But maybe it's better to discuss the details of a solution here. The question of the impact is a good one though. Curious if anyone here would know this straight away. Otherwise, we'll just have to try and test (extensively) I guess...
In case it's not clear what we EasyBuild folks are trying to do here: this is really just a packaging issue on our side. It would be greatly beneficial for us to build the CUDA support separately and allow it as an opt-in at runtime by loading an environment module. We thought UCX's plugin system would allow us to achieve that, but we (I in particular) were wrong. Since MCAs are supposed to allow for runtime assembly of the MPI implementation, I still had some hope that this would be possible by gently prying apart the monolithic configure flag. So I got my hopes up testing this with 4.1.1, but unfortunately commit deb37ac gets in the way.
I agree deb37ac looks like a bad move for us (EasyBuild) and seems to make the MCA separation model less useful. Before that commit you could build OpenMPI with CUDA support enabled without using CUDA include files, and then add the actual OpenMPI CUDA runtime part at a later stage (or separately, in our case). The above commit makes that impossible.
Just so we're clear (because at least one person in the OMPI community pinged me this morning about the severity of this ticket): deb37ac is not on any released branch. So it's not actually causing any problems today and isn't a step in any direction. There may be bugs in unreleased software, but let's not talk about the directionality of unreleased software.

It has always been the intent of Open MPI to be able to be built with --with-cuda and only actually provide CUDA support if the CUDA libraries are installed. This is not free, however. Compiling OMPI with --with-cuda has a hard-to-quantify (and likely application-dependent) cost, even when the CUDA libraries aren't available at runtime (it is even larger if the CUDA libraries are available). This is likely true of UCX as well, but I don't know UCX as well as I know OMPI. Libfabric solves the performance-delta problem by moving CUDA/host memory detection into the registration path, meaning that the expensive conditionals aren't in the Libfabric critical path.

So while the intent is very much that you should be able to build Open MPI with --with-cuda (assuming the CUDA libraries are available at build time) and run on a system without CUDA libraries, there's also the possibility that work remains in 5.0. And if you're worried about performance, you probably don't want to run in that configuration in the general case, because OMPI will have some (hard to quantify) overheads for doing so.
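(As an aside, the "only provide CUDA support if the CUDA libraries are installed" behaviour described above is the classic lazy-dlopen pattern. The sketch below is an assumed, simplified illustration of that pattern, not Open MPI's actual code; the function and variable names are hypothetical.)

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical: load the CUDA driver library lazily instead of linking
 * against it, so the MPI library has no loader-time CUDA dependency. */
typedef int (*cu_init_fn)(unsigned int);

static int cuda_checked = 0;
static int cuda_available = 0;

static int cuda_runtime_available(void)
{
    if (!cuda_checked) {
        cuda_checked = 1;
        void *handle = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
        if (handle != NULL) {
            cu_init_fn init = (cu_init_fn)dlsym(handle, "cuInit");
            cuda_available = (init != NULL && init(0) == 0);
        }
    }
    /* Every CUDA-conditional code path starts with a check like this;
     * those branches are the (small) cost mentioned above, paid even
     * when no CUDA libraries are present at runtime. */
    return cuda_available;
}
```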
Closing this issue, as we've identified the root cause. Because Open MPI has to manipulate the user buffer in certain scenarios (collectives are a big one, but there are others), Open MPI must be built with CUDA support in order for the application to be able to pass CUDA buffers to MPI. That was the root cause of the 4.1.x problem. I've opened #9933 to track the 5.0.x library dependency regression.
I'm sorry, but I must have been unclear and would like to clarify: this isn't at all the situation, and I have no reason to believe the scenario you describe is in any way broken in any version of OpenMPI (including 5.0). My comments are about the opal functions (specifically this code: https://github.com/open-mpi/ompi/blob/v4.1.x/opal/datatype/opal_datatype_cuda.c), which do not actually need any CUDA libraries to be present. The restructuring in deb37ac just makes separation of these trickier, as it moves them into a translation unit with MCA-specific functions that do have direct CUDA dependencies. For me it seems like a needless change, and having libopen-pal use direct dependencies into mca subdirectories sticks out like a sore thumb, IMHO.
And if it isn't clear yet, what we (EasyBuild) want to achieve is to build OpenMPI in such a way that it is aware of the CUDA functionality in the base build (without it trying to include cuda.h at configure/build time), and then at a later stage build the actual CUDA part, libmca_common_cuda.so, into a separate directory and add it to the paths OpenMPI searches for MCA modules.

The code prior to the deb37ac commit makes that possible: by changing the configure step a bit to enable the OPAL_CUDA parts, we can build OpenMPI without mca_common_cuda etc. and install it in directory A. Then, with a build that basically just compiles libmca_common_cuda.so, mca_btl_smcuda.so and mca_coll_cuda.so, we can enable runtime CUDA support when needed. This will induce a small overhead for the base code when it doesn't need, or have, the CUDA parts, but for us this is OK.

This would then match what we already do with UCX: we build a (slightly modified) UCX without CUDA, then add a CUDA-enabled UCX in a separate directory with pointers to where to find the CUDA part. The combination of these two would give us the end result we are looking for.
You can't achieve what you propose with Open MPI's design, at any point in Open MPI's history. What you tried may have worked for certain cases, but would not have worked in the general case.

CUDA's design is pretty invasive for something like an MPI implementation. Certain datatypes require local processing (frequently, copying), as there's no other rational way to transfer them. A 1 million entry vector of single (sparse) elements is an example. You don't want to create a 1 million entry iovec to describe that to IB (and IB has an iovec length limit smaller than that anyway), so the datatype engine copies that sparse array into a dense array before handing off to IB (or the SM BTL or any other network). This code is all implemented in the datatype engine, which is a combination of inline functions and functions living in the libopen-pal library. The inline bits end up being included in libmpi.so as well as over a dozen components. At this point, you really have two different builds of Open MPI.

What we tried to do with CUDA support for some of these use cases is to have a two-level enablement. First, at compile time, you can decide whether or not you want CUDA support via configure flags. There is a performance penalty (of a hard-to-quantify, application-dependent amount) for enabling this support, even if no CUDA libraries are found at runtime. Second, we dlopen() the CUDA libraries, rather than adding a loader-time dependency to the application, and the CUDA-enabled code is all conditional on the success of that dlopen. The penalty I spoke of earlier is the cost of doing those conditionals. If you're sending mostly large messages (and infrequently), the penalty probably doesn't matter. But if you're sending frequent very small messages, it's likely to show up (on high-performance networks like IB or GNI).

So EasyBuild's most portable solution is to build Open MPI with --with-cuda everywhere and accept those conditionals as the cost on systems that don't have GPUs.
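(The sparse-datatype argument above can be made concrete with a small, hedged sketch: packing a strided source into a dense staging buffer is exactly the step that has to know whether the source pointer is host or device memory. This is illustrative plain C, not Open MPI's datatype engine.)

```c
#include <stddef.h>
#include <string.h>

/* Pack `count` elements of `elem_size` bytes, spaced `stride` bytes apart,
 * into a contiguous staging buffer before handing it to the network.
 * If `src` were a CUDA device pointer, the memcpy() here would have to be
 * replaced by a device copy (e.g. cuMemcpyDtoH), which is exactly the
 * branch that only exists in a CUDA-enabled build. */
static void pack_strided(void *dst, const void *src,
                         size_t count, size_t elem_size, size_t stride)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < count; i++) {
        memcpy(d + i * elem_size, s + i * stride, elem_size);
    }
}
```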
If I understand things correctly, with 5.0+ and #10069 the performance concerns are no longer there; it's even documented that you can compile with CUDA support without it being available at runtime.
Background information
This question comes from the EasyBuild community, and may affect how OpenMPI is installed by this community going forward.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From sources. The configure line was:
Please describe the system on which you are running
Details of the problem
For practical reasons (described here) the EasyBuild community decided to try and use a non-CUDA-aware OpenMPI together with a CUDA-aware UCX in order to provide support for using OpenMPI with GPU (and GPUDirect) support. The assumption was that this would work, since OpenMPI would delegate communication to UCX, which in turn would take care of 'all things GPU'. Documentation like the list of MPI APIs that work with CUDA-aware UCX (here), and statements like 'OpenMPI v2.0.0 new features: CUDA support through UCX' (here), also seem to suggest that it should.
Unfortunately, that seems not to be the case. Certain OSU benchmarks (e.g. osu_latency, osu_bw, osu_bcast) work, but many others (e.g. osu_gather, osu_alltoall) fail with segfaults.

This response and this response in an earlier ticket seemed to suggest that CUDA support might be limited for anything other than point-to-point operations. I may have overlooked it, but I can't really find this in the official documentation (this seems to be the most relevant part, but I don't clearly see it mentioned there, unless I misunderstood something).
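(To make the failing pattern concrete, the snippet below is a minimal sketch, not one of the OSU benchmarks themselves; it simply passes CUDA device buffers to a collective, which is the kind of call that crashes with the non-CUDA-aware OpenMPI build.)

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Allocate send/receive buffers in GPU device memory. */
    const int count = 1024;                    /* ints per peer */
    int *sendbuf = NULL, *recvbuf = NULL;
    cudaMalloc((void **)&sendbuf, (size_t)count * nprocs * sizeof(int));
    cudaMalloc((void **)&recvbuf, (size_t)count * nprocs * sizeof(int));
    cudaMemset(sendbuf, 0, (size_t)count * nprocs * sizeof(int));

    /* Collectives may stage data through copies inside OMPI itself;
     * without CUDA support compiled into OMPI, that copy treats the
     * device pointer as host memory and segfaults. */
    MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                 MPI_COMM_WORLD);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```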
Indeed, if we recompile OpenMPI with CUDA support (leaving all else the same), the above osu_alltoall benchmark runs fine.

The actual question...
So far, I'm still confused about the level of CUDA support I can expect when compiling UCX with CUDA support but OpenMPI without it. The OpenMPI FAQ seems to suggest (almost) full support, and the UCX documentation also doesn't seem to suggest serious limitations. But my segfaults and the other issue on the OpenMPI issue tracker suggest rather limited support.
Can any of the experts clear this up? Is compiling OpenMPI without CUDA, but with a CUDA-aware UCX a feasible approach to offer CUDA support? If not (yet), will it be at some point in the future? A clear answer on this will really help the EasyBuild community in deciding how to support building OpenMPI/UCX going forward :)