orterun version 2.1.0 seg fault on tool launch via LaunchMON #3247

@lee218llnl

I am getting seg faults when trying to launch a debug session. This is with the Open MPI 2.1.0 release, using LaunchMON as the test:

bash-4.2$ ./test.launch_1
[LMON_FE] launching the job/daemons via /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun

[LMON FE] 6 RM types are supported
[rzgenie6:55304] *** Process received signal ***
[rzgenie6:55304] Signal: Segmentation fault (11)
[rzgenie6:55304] Signal code: Address not mapped (1)
[rzgenie6:55304] Failing at address: (nil)
[rzgenie6:55304] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaabc45370]
[rzgenie6:55304] [ 1] /lib64/libc.so.6(+0x133586)[0x2aaaac89c586]
[rzgenie6:55304] [ 2] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/lib/libopen-rte.so.20(orte_schizo_base_setup_fork+0x40)[0x2aaaaadc452f]
[rzgenie6:55304] [ 3] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/lib/libopen-rte.so.20(orte_odls_base_default_launch_local+0x464)[0x2aaaaad5fbea]
[rzgenie6:55304] [ 4] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/lib/libopen-pal.so.20(+0xcce34)[0x2aaaab0e2e34]
[rzgenie6:55304] [ 5] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/lib/libopen-pal.so.20(+0xcd0a6)[0x2aaaab0e30a6]
[rzgenie6:55304] [ 6] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab0e36f9]
[rzgenie6:55304] [ 7] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun[0x405090]
[rzgenie6:55304] [ 8] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun[0x4039f6]
[rzgenie6:55304] [ 9] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac78ab35]
[rzgenie6:55304] [10] /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun[0x403909]
[rzgenie6:55304] *** End of error message ***

^C
bash-4.2$ ^C

bash-4.2$ gdb /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun rzgenie6-orterun-55304.core 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun...done.
[New LWP 55304]
[New LWP 55315]
[New LWP 55316]
[New LWP 55317]
[New LWP 55314]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/orterun -mc'.
Program terminated with signal 11, Segmentation fault.
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164             movdqu  (%rdi), %xmm1
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcxgb3-1.3.1-8.el7.x86_64 libcxgb4-1.3.5-3.el7.x86_64 libhfi1-0.5-23.el7.x86_64 libibverbs-1.2.1-1.el7.x86_64 libipathverbs-1.3-2.el7.x86_64 libmlx4-1.2.1-1.el7.x86_64 libmlx5-1.2.1-8.el7.x86_64 libmthca-1.0.6-13.el7.x86_64 libnes-1.1.4-2.el7.x86_64 libnl3-3.2.28-3.el7_3.x86_64 libpciaccess-0.13.4-3.el7_3.x86_64 librdmacm-1.1.0-2.el7.x86_64 munge-libs-0.5.12-1.ch6.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 systemd-libs-219-30.el7_3.7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) where
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1  0x00002aaaaadc452f in orte_schizo_base_setup_fork (jdata=0x7b7400, 
    context=0x7b7840) at base/schizo_base_stubs.c:67
#2  0x00002aaaaad5fbea in orte_odls_base_default_launch_local (fd=-1, sd=4, 
    cbdata=0x7b83c0) at base/odls_base_default_fns.c:712
#3  0x00002aaaab0e2e34 in event_process_active_single_queue (base=0x64e070, 
    activeq=0x64e5f0) at event.c:1370
#4  0x00002aaaab0e30a6 in event_process_active (base=0x64e070) at event.c:1440
#5  0x00002aaaab0e36f9 in opal_libevent2022_event_base_loop (base=0x64e070, 
    flags=1) at event.c:1644
#6  0x0000000000405090 in orterun (argc=7, argv=0x7fffffffc768)
    at orterun.c:1083
#7  0x00000000004039f6 in main (argc=7, argv=0x7fffffffc768) at main.c:13
(gdb) frame 1
#1  0x00002aaaaadc452f in orte_schizo_base_setup_fork (jdata=0x7b7400, 
    context=0x7b7840) at base/schizo_base_stubs.c:67
67              if (0 == strcmp(jdata->personality, mod->component->mca_component_name)) {
(gdb) print jdata->personality
$1 = 0x0

Here is how you may reproduce:

git clone https://github.com/llnl/launchmon.git
cd launchmon/
export PATH=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin:$PATH
CFLAGS="-g -O0" CXXFLAGS="-g -O0" ./configure --prefix=/nfs/tmp2/lee218/prefix/launchmon-1.0.3b --with-test-rm=orte --with-test-rm-launcher=/collab/usr/global/tools/openmpi/toss_3_x86_64_ib/openmpi-2.1.0/bin/mpirun --with-test-installed --with-test-nnodes=1 && make clean && make -j 8 install && make -j 8 check
cd test/src
./test.launch_1

I'm not sure where the ultimate blame lies (Open MPI or LaunchMON), but since the seg fault occurs in orterun, which passes a NULL string (jdata->personality) to strcmp in orte_schizo_base_setup_fork, I am submitting this to Open MPI. It's also worth noting that LaunchMON works OK with the Open MPI 2.0.2 release. @rhc54 and @dongahn, let me know if you can help with this one.
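
For illustration only, here is a minimal sketch of the kind of NULL guard that would avoid the crash at the strcmp call shown in frame 1. The helper name personality_matches and its parameters are hypothetical and not part of Open MPI; this is not a proposed patch, just a standalone example of comparing a possibly-NULL string safely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of Open MPI: returns 1 when the job's
 * personality string equals the component name, 0 otherwise.
 * A NULL on either side is treated as "no match" rather than being
 * handed to strcmp(), which would dereference it and crash. */
int personality_matches(const char *personality,
                        const char *component_name)
{
    if (NULL == personality || NULL == component_name) {
        return 0;
    }
    return 0 == strcmp(personality, component_name);
}
```

With a check like this, a job whose personality was never set would simply match no component instead of crashing. Whether skipping the component or assigning a default personality earlier in the launch path is the right behavior is a question for the Open MPI maintainers.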
