dlopen, dlclose libmpi triggers a segmentation fault if calling getenv #10142

Closed
simonbyrne opened this issue Mar 17, 2022 · 16 comments · Fixed by #10185

@simonbyrne
Contributor

simonbyrne commented Mar 17, 2022

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Using homebrew.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Mac OS
  • Computer hardware: Intel Mac
  • Network type: n/a

Details of the problem

Doing the following:

  1. dlopen libmpi
  2. dlclose it
  3. Call getenv on an unset environment variable

triggers a segmentation fault

#include <dlfcn.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  /* Load libmpi and unload it again right away. */
  dlclose(dlopen("/usr/local/lib/libmpi.dylib", RTLD_LAZY | RTLD_GLOBAL));
  /* Looking up an unset variable after the unload is what crashes. */
  getenv("NONSENSE");
  return 0;
}

$ cc mpi-dlopen.c -o mpi-dlopen
$ ./mpi-dlopen
Segmentation fault: 11

cc: @vchuravy

@vchuravy

I can also confirm this on Arch Linux with Open MPI 4.1.2.

@ggouaillardet
Contributor

As a workaround, try to

export ZES_ENABLE_SYSMAN=1

before running your program.
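
When the launch environment cannot be changed, the same variable could in principle be set from inside the program before the library is loaded. This is a minimal sketch under that assumption; the thread only confirms the `export` form, and the helper name `set_sysman_workaround` is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper, not from the thread: sets ZES_ENABLE_SYSMAN=1
 * unless the user already chose a value. Returns 0 on success and
 * -1 on failure, mirroring setenv(). */
int set_sysman_workaround(void) {
  /* overwrite = 0 keeps any value already present in the environment. */
  return setenv("ZES_ENABLE_SYSMAN", "1", 0);
}
```

The call would have to run before the first dlopen of libmpi or libhwloc. Whether that ordering is sufficient is an assumption and has not been verified here.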

@ggouaillardet
Contributor

@vchuravy did you adapt the path to libmpi.so on Arch Linux?
If not, the root cause of the crash is likely dlclose(NULL).
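
To rule out the dlclose(NULL) case, the reproducer can check the handle before closing it. The following is a sketch only; the helper name `open_then_close` and its return codes are hypothetical and not part of the original report.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if the library was loaded and unloaded,
 * -1 if dlopen failed (for example, a wrong path), -2 if dlclose failed. */
int open_then_close(const char *path) {
  void *handle = dlopen(path, RTLD_LAZY | RTLD_GLOBAL);
  if (handle == NULL) {
    /* dlerror() describes why the load failed. */
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return -1;
  }
  if (dlclose(handle) != 0) {
    fprintf(stderr, "dlclose failed: %s\n", dlerror());
    return -2;
  }
  return 0;
}
```

If this returns -1 for the libmpi path, the library was never loaded, and any crash in the original reproducer comes from passing NULL to dlclose rather than from the unload itself.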

@ggouaillardet
Contributor

@vchuravy and you probably did ...

@simonbyrne note you can reproduce the crash by using /usr/local/lib/libhwloc.15.dylib instead

feel free to refer to open-mpi/hwloc@fe363de for the gory details.
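
The thread leaves the mechanism to the linked commit. One reading, assumed here and not confirmed in this thread, is that the library registered an environment string with putenv(), which per POSIX makes the caller's string itself part of the environment instead of copying it. Once dlclose() unmaps the library, a later getenv() would then walk into memory that is no longer mapped. The sketch below shows only the putenv()/setenv() difference, with hypothetical variable names; it does not reproduce the crash.

```c
#include <stdlib.h>
#include <string.h>

/* Caller-owned storage, standing in for a string that lives inside a library. */
static char entry[] = "DEMO_PUTENV=old";

/* Returns 1 if editing the buffer after putenv() changes what getenv() sees,
 * which means the environment points at the caller's memory. */
int putenv_shares_storage(void) {
  if (putenv(entry) != 0) return -1;
  memcpy(entry + strlen("DEMO_PUTENV="), "new", 3);
  const char *v = getenv("DEMO_PUTENV");
  return v != NULL && strcmp(v, "new") == 0;
}

/* Returns 1 if editing the buffer after setenv() leaves the environment
 * unchanged, which means setenv() made its own copy. */
int setenv_copies_value(void) {
  char value[] = "old";
  if (setenv("DEMO_SETENV", value, 1) != 0) return -1;
  memcpy(value, "new", 3);
  const char *v = getenv("DEMO_SETENV");
  return v != NULL && strcmp(v, "old") == 0;
}
```

Under this reading, a user-set ZES_ENABLE_SYSMAN would help because the library would have no reason to register its own string.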

@simonbyrne
Contributor Author

@simonbyrne note you can reproduce the crash by using /usr/local/lib/libhwloc.15.dylib instead

Indeed I can, thanks! And your suggested fix works.

Will your patch make it into a release soon?

@ggouaillardet
Contributor

I am not the author of that patch.

@bgoglin any plan to release hwloc with that fix anytime soon?
per this ticket, the bug has "landed" into homebrew and Archlinux.

@jsquyres
Member

The fix is actually in hwloc, not Open MPI.

Open MPI embeds a copy of hwloc (which, for the purposes of this discussion, is just a library), but that embedded copy of hwloc is only used if hwloc is not already available on your system. The version of hwloc that is included in Open MPI v4.1.2 is hwloc v2.0.2, which is long before the ZES_ENABLE_SYSMAN issue (i.e., it won't crash because of this). This means that your Open MPI v4.1.2 install is using an external hwloc installation. So the solution is likely to upgrade your external hwloc installation and ensure that your Open MPI is using that upgraded hwloc.

@jsquyres
Member

jsquyres commented Mar 18, 2022

@bgoglin any plan to release hwloc with that fix anytime soon? per this ticket, the bug has "landed" into homebrew and Archlinux.

FWIW: I see the fix in hwloc v2.7.0, which appears to both be the latest version available, and also what homebrew pulled down and installed for me this morning.

EDIT: This ^^ turned out to be incorrect. See #10142 (comment), below.

@bgoglin
Contributor

bgoglin commented Mar 18, 2022

I wonder why I no longer receive notifications from GitHub when I am tagged.
I can do 2.7.1rc1 on Monday for sure. I didn't do it earlier because it no longer seemed urgent.

@ggouaillardet
Contributor

@jsquyres you meant the hwloc-v2.7 branch, right?
The fix is definitely not in the hwloc-2.7.0 tag

@jsquyres
Member

@jsquyres you meant the hwloc-v2.7 branch, right? The fix is definitely not in the hwloc-2.7.0 tag

@ggouaillardet You are absolutely right. Thanks for the correction!

@bgoglin
Contributor

bgoglin commented Mar 21, 2022

hwloc 2.7.1rc1 is available from https://www.open-mpi.org/software/hwloc/v2.7/
I'll release the final 2.7.1 in a couple days if you confirm the issue is gone.

@giordano
Contributor

giordano commented Mar 21, 2022

I did a local build of 2.7.1rc1 and I can confirm that with this one dlclosing libhwloc doesn't segfault badly:

julia> using Hwloc_jll, Libdl
[ Info: Precompiling Hwloc_jll [e33a78d0-f292-5ffc-b300-72abe9b543c8]

julia> dlclose(Hwloc_jll.libhwloc_handle)
true

For reference, at the moment with 2.7.0 I get

julia> using Hwloc_jll, Libdl

julia> dlclose(Hwloc_jll.libhwloc_handle)

signal (11): Segmentation fault
in expression starting at none:0
getenv at /usr/bin/../lib/libc.so.6 (unknown line)
[...]

Thanks!

@jsquyres
Member

hwloc 2.7.1rc1 is available from https://www.open-mpi.org/software/hwloc/v2.7/ I'll release the final 2.7.1 in a couple days if you confirm the issue is gone.

Thanks @bgoglin!

@bgoglin
Contributor

bgoglin commented Mar 24, 2022

I did a local build of 2.7.1rc1 and I can confirm that with this one dlclosing libhwloc doesn't segfault badly:

Thanks a lot for testing so quickly, I just released the final hwloc 2.7.1 then.

simonbyrne added a commit to simonbyrne/homebrew-core that referenced this issue Mar 24, 2022
awlauria linked a pull request Mar 29, 2022 that will close this issue
awlauria added a commit to awlauria/ompi that referenced this issue Mar 30, 2022
This pulls in a fix for a segv when doing a sequence of dlopen(libmpi) +
dlclose() + getenv().

See open-mpi#10142 for more details.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
awlauria added a commit to awlauria/ompi that referenced this issue Mar 30, 2022
This pulls in a fix for a segv when doing a sequence of dlopen(libmpi) +
dlclose() + getenv().

See open-mpi#10142 for more details.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 1b4379e)
@awlauria
Contributor

PRs merged. Closing as fixed in upstream hwloc and now in Open MPI builds that use the internal hwloc.
