Occasional segfaults and hangs when running multiple MPI jobs on one node #7642

Closed

jeorsch opened this issue Apr 17, 2020 · 11 comments

jeorsch commented Apr 17, 2020

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.3 (also occurred with v4.0.0)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a tarball with "--prefix", "--enable-debug", and "--with-hwloc=internal" configure parameters.

Please describe the system on which you are running

  • Operating system/version: Debian Buster
  • Computer hardware: 2 x Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  • Network type: Ethernet (not used since only intra-node jobs are run)

Details of the problem

We run a test suite that involves many parallel MPI jobs on a single node. The jobs are started with mpirun --bind-to none -use-hwthread-cpus -oversubscribe, since this gives us the best throughput. However, occasionally a single MPI job exits with a segmentation fault and another job hangs at the same time. I don't think that it is related to my application, since the erroneous behaviour happens only every few days or weeks, while most runs pass flawlessly. Furthermore, the test suite jobs that fail/hang are different each time. Open MPI 1.10.2, which we used before, never had this problem.

Here are the corresponding backtraces for a job with two ranks that crashed. I had to strip confidential content (hostname and the application part of the backtraces).

First process:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x563d133a0040
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f259164d730]
[ 1] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x50b2)[0x7f258dbe60b2]
[ 2] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x68)[0x7f258dbe8aef]
[ 3] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7d30)[0x7f258dbe8d30]
[ 4] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7fe0)[0x7f258dbe8fe0]
[ 5] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(opal_progress+0x30)[0x7f258fb85b43]
[ 6] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x18e)[0x7f258fb8e08a]
[ 7] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(ompi_request_default_wait_all+0x1b3)[0x7f2590dbc0e7]
[ 8] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(PMPI_Waitall+0x15d)[0x7f2590e3469f]
...

Second process:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x5576a3d611c0
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f41e7964730]
[ 1] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x50b2)[0x7f41e42fd0b2]
[ 2] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x68)[0x7f41e42ffaef]
[ 3] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7d30)[0x7f41e42ffd30]
[ 4] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7fe0)[0x7f41e42fffe0]
[ 5] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(opal_progress+0x30)[0x7f41e5e9cb43]
[ 6] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x18e)[0x7f41e5ea508a]
[ 7] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(+0x5f92b)[0x7f41e70d292b]
[ 8] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7f41e70d29ed]
[ 9] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xf0)[0x7f41e7177ea3]
[10] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x335)[0x7f41e7179e80]
[11] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xad)[0x7f41e426aa33]
[12] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(PMPI_Allreduce+0x3c8)[0x7f41e70eede2]
...

Additionally, an independent MPI job hangs forever. Here are the backtraces of all threads of such a hanging process:

Thread 3 (Thread 0x7fd2ee455700 (LWP 16320)):
#0  0x00007fd2f04387ef in epoll_wait (epfd=10, events=events@entry=0x563d132dbeb0, maxevents=32, timeout=timeout@entry=1499921) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fd2ef41302d in epoll_dispatch (base=0x563d132dbc00, tv=<optimized out>) at epoll.c:407
#2  0x00007fd2ef4170c9 in opal_libevent2022_event_base_loop (base=0x563d132dbc00, flags=1) at event.c:1630
#3  0x00007fd2ee5ca472 in progress_engine (obj=0x563d1329ef58) at runtime/pmix_progress_threads.c:232
#4  0x00007fd2f0f72fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fd2f04384cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fd2eee7e700 (LWP 16318)):
#0  0x00007fd2f042d819 in __GI___poll (fds=fds@entry=0x7fd2e8000b20, nfds=nfds@entry=1, timeout=timeout@entry=3600000) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fd2ef41fb35 in poll_dispatch (base=0x563d1327dc70, tv=0x7fd2eee7dd90) at poll.c:165
#2  0x00007fd2ef4170c9 in opal_libevent2022_event_base_loop (base=0x563d1327dc70, flags=1) at event.c:1630
#3  0x00007fd2ef3b2e33 in progress_engine (obj=0x563d1323fdb8) at runtime/opal_progress_threads.c:105
#4  0x00007fd2f0f72fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fd2f04384cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fd2ef361c40 (LWP 16315)):
#0  opal_progress () at runtime/opal_progress.c:231
#1  0x00007fd2ef3b408a in ompi_sync_wait_mt (sync=0x7ffdf8a5afa0) at threads/wait_sync.c:85
#2  0x00007fd2f072e92b in ompi_request_wait_completion (req=0x563d1339ef80) at ../ompi/request/request.h:405
#3  0x00007fd2f072e9ed in ompi_request_default_wait (req_ptr=0x7ffdf8a5b0b0, status=0x7ffdf8a5b090) at request/req_wait.c:42
#4  0x00007fd2f07d3ea3 in ompi_coll_base_sendrecv_actual (sendbuf=0x7ffdf8a5b2bc, scount=2, sdatatype=0x7fd2f08844a0 <ompi_mpi_int>, dest=1, stag=-10, recvbuf=0x563d1339b3b8, rcount=2, rdatatype=0x7fd2f08844a0 <ompi_mpi_int>, source=1, rtag=-10, 
    comm=0x7fd2f0891940 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:60
#5  0x00007fd2f07d15f3 in ompi_coll_base_sendrecv (sendbuf=0x7ffdf8a5b2bc, scount=2, sdatatype=0x7fd2f08844a0 <ompi_mpi_int>, dest=1, stag=-10, recvbuf=0x563d1339b3b8, rcount=2, rdatatype=0x7fd2f08844a0 <ompi_mpi_int>, source=1, rtag=-10, 
    comm=0x7fd2f0891940 <ompi_mpi_comm_world>, status=0x0, myid=0) at base/coll_base_util.h:98
#6  0x00007fd2f07d251f in ompi_coll_base_allgather_intra_two_procs (sbuf=0x7ffdf8a5b2bc, scount=2, sdtype=0x7fd2f08844a0 <ompi_mpi_int>, rbuf=0x563d1339b3b0, rcount=2, rdtype=0x7fd2f08844a0 <ompi_mpi_int>, comm=0x7fd2f0891940 <ompi_mpi_comm_world>, 
    module=0x563d133a51f0) at base/coll_base_allgather.c:635
#7  0x00007fd2ed37980a in ompi_coll_tuned_allgather_intra_dec_fixed (sbuf=0x7ffdf8a5b2bc, scount=2, sdtype=0x7fd2f08844a0 <ompi_mpi_int>, rbuf=0x563d1339b3b0, rcount=2, rdtype=0x7fd2f08844a0 <ompi_mpi_int>, comm=0x7fd2f0891940 <ompi_mpi_comm_world>, 
    module=0x563d133a51f0) at coll_tuned_decision_fixed.c:556
#8  0x00007fd2f07032f4 in ompi_comm_split (comm=0x7fd2f0891940 <ompi_mpi_comm_world>, color=0, key=0, newcomm=0x563d133bd8d8, pass_on_topo=false) at communicator/comm.c:445
#9  0x00007fd2f075dab2 in PMPI_Comm_split (comm=0x7fd2f0891940 <ompi_mpi_comm_world>, color=0, key=0, newcomm=0x563d133bd8d8) at pcomm_split.c:69
...

I can try to provide more information if requested, but I will have to wait until such a hang happens again. As I said, it usually takes a few days until a CI pipeline gets stuck because of this.

jeorsch commented May 5, 2020

Another hang! This time two MPI jobs crashed.
First crashed job's backtrace (printed by 2/4 ranks):

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x55ea69fed980
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f5ef1faf730]
[ 1] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x50b2)[0x7f5eee5470b2]
[ 2] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x68)[0x7f5eee549aef]
[ 3] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7d30)[0x7f5eee549d30]
[ 4] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7fe0)[0x7f5eee549fe0]
[ 5] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(opal_progress+0x30)[0x7f5ef04e6b43]
[ 6] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x18e)[0x7f5ef04ef08a]
[ 7] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(ompi_request_default_wait_all+0x1b3)[0x7f5ef171d0e7]
[ 8] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(PMPI_Waitall+0x15d)[0x7f5ef179569f]

Second crashed job's backtrace (printed by 1/2 ranks):

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x55c44e82f100
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f2dd0e03730]
[ 1] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x50b2)[0x7f2dcd2910b2]
[ 2] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x68)[0x7f2dcd293aef]
[ 3] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7d30)[0x7f2dcd293d30]
[ 4] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_btl_vader.so(+0x7fe0)[0x7f2dcd293fe0]
[ 5] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(opal_progress+0x30)[0x7f2dcf231b43]
[ 6] /opt/openmpi-4.0.3-debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x18e)[0x7f2dcf23a08a]
[ 7] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_pml_ob1.so(+0x1021c)[0x7f2dcd24721c]
[ 8] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5ac)[0x7f2dcd2496cd]
[ 9] /opt/openmpi-4.0.3-debug/lib/openmpi/mca_coll_basic.so(mca_coll_basic_gatherv_intra+0x7c)[0x7f2dcd1ec6e1]
[10] /opt/openmpi-4.0.3-debug/lib/libmpi.so.40(MPI_Gatherv+0x668)[0x7f2dd05f7486]

I don't know how to translate these backtraces into code lines since I do not know the load addresses of the shared objects that the crashed processes used. Any hints on how to do this?

Again, there is an additional MPI job that hangs forever. It has two ranks that are waiting for progress at different locations, i.e., they are out of sync and hence deadlocked.
Rank 0's backtrace of the hung job, taken at a random time:

Thread 3 (Thread 0x7f914bfff700 (LWP 3147)):
#0  0x00007f91520967ef in epoll_wait (epfd=10, events=events@entry=0x561e845af7f0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f91511c002d in epoll_dispatch (base=0x561e845af540, tv=<optimized out>) at epoll.c:407
#2  0x00007f91511c40c9 in opal_libevent2022_event_base_loop (base=0x561e845af540, flags=1) at event.c:1630
#3  0x00007f9150378472 in progress_engine (obj=0x561e845524b8) at runtime/pmix_progress_threads.c:232
#4  0x00007f9152c16fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007f91520964cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f9150c2c700 (LWP 3145)):
#0  0x00007f915208b819 in __GI___poll (fds=fds@entry=0x7f914c000b20, nfds=nfds@entry=1, timeout=timeout@entry=3600000) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f91511ccb35 in poll_dispatch (base=0x561e84551aa0, tv=0x7f9150c2bd10) at poll.c:165
#2  0x00007f91511c40c9 in opal_libevent2022_event_base_loop (base=0x561e84551aa0, flags=1) at event.c:1630
#3  0x00007f915115fe33 in progress_engine (obj=0x561e84536dc8) at runtime/opal_progress_threads.c:105
#4  0x00007f9152c16fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007f91520964cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f915110d880 (LWP 3142)):
#0  0x00007f9151158619 in opal_atomic_swap_32 (addr=0x7f915127480c <lock>, newval=1) at ../opal/include/opal/sys/gcc_builtin/atomic.h:110
#1  0x00007f91511587aa in opal_thread_swap_32 (ptr=0x7f915127480c <lock>, newvalue=1) at ../opal/threads/thread_usage.h:162
#2  0x00007f9151158a91 in opal_progress_events () at runtime/opal_progress.c:177
#3  0x00007f9151158ba9 in opal_progress () at runtime/opal_progress.c:245
#4  0x00007f915116108a in ompi_sync_wait_mt (sync=0x7ffdaa29da60) at threads/wait_sync.c:85
#5  0x00007f915238e92b in ompi_request_wait_completion (req=0x561e84673040) at ../ompi/request/request.h:405
#6  0x00007f915238e9ed in ompi_request_default_wait (req_ptr=0x7ffdaa29db70, status=0x7ffdaa29db50) at request/req_wait.c:42
#7  0x00007f9152433ea3 in ompi_coll_base_sendrecv_actual (sendbuf=0x561e8466f450, scount=1, sdatatype=0x7f91524e44a0 <ompi_mpi_int>, dest=1, stag=-12, recvbuf=0x561e8466f2cc, rcount=1, rdatatype=0x7f91524e44a0 <ompi_mpi_int>, source=1, 
    rtag=-12, comm=0x7f91524f1940 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:60
#8  0x00007f9152435e80 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x1, rbuf=0x561e8466f2cc, count=1, dtype=0x7f91524e44a0 <ompi_mpi_int>, op=0x7f91524fc840 <ompi_mpi_op_sum>, comm=0x7f91524f1940 <ompi_mpi_comm_world>, 
    module=0x561e846792d0) at base/coll_base_allreduce.c:219
#9  0x00007f9150127a33 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1, rbuf=0x561e8466f2cc, count=1, dtype=0x7f91524e44a0 <ompi_mpi_int>, op=0x7f91524fc840 <ompi_mpi_op_sum>, comm=0x7f91524f1940 <ompi_mpi_comm_world>, 
    module=0x561e846792d0) at coll_tuned_decision_fixed.c:66
#10 0x00007f91523aade2 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0x561e8466f2cc, count=1, datatype=0x7f91524e44a0 <ompi_mpi_int>, op=0x7f91524fc840 <ompi_mpi_op_sum>, comm=0x7f91524f1940 <ompi_mpi_comm_world>) at pallreduce.c:113

Rank 1's backtrace of the hung job, taken at a random time:

Thread 3 (Thread 0x7fafaac0c700 (LWP 3146)):
#0  0x00007fafaca9f7ef in epoll_wait (epfd=10, events=events@entry=0x55c44e76a7f0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fafabbc902d in epoll_dispatch (base=0x55c44e76a540, tv=<optimized out>) at epoll.c:407
#2  0x00007fafabbcd0c9 in opal_libevent2022_event_base_loop (base=0x55c44e76a540, flags=1) at event.c:1630
#3  0x00007fafaad81472 in progress_engine (obj=0x55c44e70d4b8) at runtime/pmix_progress_threads.c:232
#4  0x00007fafad61ffa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fafaca9f4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fafab635700 (LWP 3144)):
#0  0x00007fafaca94819 in __GI___poll (fds=fds@entry=0x7fafa4000b20, nfds=nfds@entry=1, timeout=timeout@entry=3600000) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fafabbd5b35 in poll_dispatch (base=0x55c44e70caa0, tv=0x7fafab634d10) at poll.c:165
#2  0x00007fafabbcd0c9 in opal_libevent2022_event_base_loop (base=0x55c44e70caa0, flags=1) at event.c:1630
#3  0x00007fafabb68e33 in progress_engine (obj=0x55c44e6f1dc8) at runtime/opal_progress_threads.c:105
#4  0x00007fafad61ffa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fafaca9f4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fafabb16880 (LWP 3143)):
#0  0x00007fafabb6162b in opal_atomic_swap_32 (addr=0x7fafabc7d80c <lock>, newval=1) at ../opal/include/opal/sys/gcc_builtin/atomic.h:112
#1  0x00007fafabb617aa in opal_thread_swap_32 (ptr=0x7fafabc7d80c <lock>, newvalue=1) at ../opal/threads/thread_usage.h:162
#2  0x00007fafabb61a91 in opal_progress_events () at runtime/opal_progress.c:177
#3  0x00007fafabb61ba9 in opal_progress () at runtime/opal_progress.c:245
#4  0x00007fafabb6a08a in ompi_sync_wait_mt (sync=0x7ffc950c5890) at threads/wait_sync.c:85
#5  0x00007fafacd9792b in ompi_request_wait_completion (req=0x55c44e82e040) at ../ompi/request/request.h:405
#6  0x00007fafacd979ed in ompi_request_default_wait (req_ptr=0x7ffc950c59a0, status=0x7ffc950c5980) at request/req_wait.c:42
#7  0x00007faface3cea3 in ompi_coll_base_sendrecv_actual (sendbuf=0x7ffc950c5bac, scount=2, sdatatype=0x7fafaceed4a0 <ompi_mpi_int>, dest=0, stag=-10, recvbuf=0x55c44e82a450, rcount=2, rdatatype=0x7fafaceed4a0 <ompi_mpi_int>, source=0, 
    rtag=-10, comm=0x7fafacefa940 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:60
#8  0x00007faface3a5f3 in ompi_coll_base_sendrecv (sendbuf=0x7ffc950c5bac, scount=2, sdatatype=0x7fafaceed4a0 <ompi_mpi_int>, dest=0, stag=-10, recvbuf=0x55c44e82a450, rcount=2, rdatatype=0x7fafaceed4a0 <ompi_mpi_int>, source=0, 
    rtag=-10, comm=0x7fafacefa940 <ompi_mpi_comm_world>, status=0x0, myid=1) at base/coll_base_util.h:98
#9  0x00007faface3b51f in ompi_coll_base_allgather_intra_two_procs (sbuf=0x7ffc950c5bac, scount=2, sdtype=0x7fafaceed4a0 <ompi_mpi_int>, rbuf=0x55c44e82a450, rcount=2, rdtype=0x7fafaceed4a0 <ompi_mpi_int>, 
    comm=0x7fafacefa940 <ompi_mpi_comm_world>, module=0x55c44e8342d0) at base/coll_base_allgather.c:635
#10 0x00007fafa9b3080a in ompi_coll_tuned_allgather_intra_dec_fixed (sbuf=0x7ffc950c5bac, scount=2, sdtype=0x7fafaceed4a0 <ompi_mpi_int>, rbuf=0x55c44e82a450, rcount=2, rdtype=0x7fafaceed4a0 <ompi_mpi_int>, 
    comm=0x7fafacefa940 <ompi_mpi_comm_world>, module=0x55c44e8342d0) at coll_tuned_decision_fixed.c:556
#11 0x00007fafacd6c2f4 in ompi_comm_split (comm=0x7fafacefa940 <ompi_mpi_comm_world>, color=0, key=1, newcomm=0x55c44e836ec8, pass_on_topo=false) at communicator/comm.c:445
#12 0x00007fafacdc6ab2 in PMPI_Comm_split (comm=0x7fafacefa940 <ompi_mpi_comm_world>, color=0, key=1, newcomm=0x55c44e836ec8) at pcomm_split.c:69

This job is at a really early stage where the initial communicators are set up shortly after MPI_Init. The application first calls MPI_Allreduce on MPI_COMM_WORLD, followed by MPI_Comm_split. From the backtraces it is obvious that rank 1 has somehow already passed the allreduce and wants to split the communicator, while rank 0 has not finished the allreduce yet.
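For illustration, the startup sequence of that job looks roughly like the following sketch (the variable names and buffer contents are made up, not taken from the actual application):

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, local = 1, global = 0;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 of the hung job is still waiting inside this allreduce... */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* ...while rank 1 has already moved on to the communicator split. */
    MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &subcomm);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}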

By the way, additional independent MPI jobs continue to run after these jobs have crashed or hung.

jsquyres commented May 6, 2020

Can you provide simple reproducers for either or both of these problems?

Per your question above: there isn't a good way to map the addresses to individual line numbers. If you are compiling Open MPI yourself, you can add -g to the CFLAGS, and that should then show line numbers in core dumps.

jeorsch commented May 7, 2020

Unfortunately, I cannot provide a reproducer since the problem only occurs every few days, while everything works fine in the meantime. IMHO the vader BTL probably runs into a race condition when multiple MPI jobs are running simultaneously. However, I also tried to explicitly stress the system with many simultaneous jobs to trigger the error, but this was not successful. Now I will try to reproduce it with a trivial program.
As I wrote, I compiled with --enable-debug, so the debug symbols are included in the libraries. The backtraces of the hung jobs that I generated with an attached gdb resolve the code lines correctly. Only the backtraces printed when the segmentation faults occur do not resolve the code lines, and I cannot resolve them manually using addr2line since I do not know the memory locations where the shared objects were loaded at runtime.

jeorsch commented May 22, 2020

Here are some new backtraces from another crash. This time the application was instrumented with GCC's AddressSanitizer (Open MPI itself was not sanitized). The sanitizer printed resolved backtraces of the crashing processes. However, this time no jobs hung. I had to demultiplex the output from the MPI processes, and since ASan unfortunately does not print the process ID in every line, the backtraces might be demultiplexed incorrectly. However, I think they are correct, and it should not make a big difference anyway.

First concurrent MPI job
This job was running on 4 ranks, of which 2 crashed. Both of them were performing an MPI_Wait on an MPI_Irecv request.
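For reference, the pattern these two ranks were executing looks roughly like this sketch (peer, tag, and buffer are placeholders, not the actual application code):

#include <mpi.h>

/* A rank posts a non-blocking receive and then waits for it; the segfault
 * fires inside MPI_Wait while opal_progress polls the vader BTL. */
void wait_for_message(int peer, int tag)
{
    int buf = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Irecv(&buf, 1, MPI_INT, peer, tag, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, &status);
}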

AddressSanitizer:DEADLYSIGNAL
=================================================================
==80566==ERROR: AddressSanitizer: SEGV on unknown address 0x000000001bf8 (pc 0x7f91ea4aaefd bp 0x7ffd0305a5d0 sp 0x7ffd0305a5b0 T0)
==80566==The signal is caused by a WRITE memory access.
    #0 0x7f91ea4aaefc in vader_fifo_write /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fifo.h:179
    #1 0x7f91ea4aaff4 in vader_fifo_write_back /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fifo.h:228
    #2 0x7f91ea4accdc in mca_btl_vader_poll_handle_frag /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:673
    #3 0x7f91ea4aaa51 in mca_btl_vader_check_fboxes /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fbox.h:231
    #4 0x7f91ea4acfa9 in mca_btl_vader_component_progress /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:763
    #5 0x7f91eea11b42 in opal_progress runtime/opal_progress.c:231
    #6 0x7f91eea1a089 in ompi_sync_wait_mt threads/wait_sync.c:85
    #7 0x7f91f0c4b92a in ompi_request_wait_completion ../ompi/request/request.h:405
    #8 0x7f91f0c4b9ec in ompi_request_default_wait request/req_wait.c:42
    #9 0x7f91f0cc437e in PMPI_Wait /tmp/openmpi-4.0.3/ompi/mpi/c/profile/pwait.c:74
    ...

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fifo.h:179 in vader_fifo_write
==80566==ABORTING


AddressSanitizer:DEADLYSIGNAL
=================================================================
==80567==ERROR: AddressSanitizer: SEGV on unknown address 0x000000001bf8 (pc 0x7f83fa48ed85 bp 0x7fff56643e20 sp 0x7fff56643e00 T0)
==80567==The signal is caused by a READ memory access.
    #0 0x7f83fa48ed84 in vader_fifo_read /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fifo.h:136
    #1 0x7f83fa490d0c in mca_btl_vader_poll_fifo /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:683
    #2 0x7f83fa490fdf in mca_btl_vader_component_progress /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:773
    #3 0x7f83fe9f5b42 in opal_progress runtime/opal_progress.c:231
    #4 0x7f83fe9fe089 in ompi_sync_wait_mt threads/wait_sync.c:85
    #5 0x7f8400c2f92a in ompi_request_wait_completion ../ompi/request/request.h:405
    #6 0x7f8400c2f9ec in ompi_request_default_wait request/req_wait.c:42
    #7 0x7f8400ca837e in PMPI_Wait /tmp/openmpi-4.0.3/ompi/mpi/c/profile/pwait.c:74
    ...

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_fifo.h:136 in vader_fifo_read
==80567==ABORTING

Second concurrent MPI job
3 out of 6 ranks of this job crashed while calling MPI_Init_thread and requesting MPI_THREAD_FUNNELED thread support. This is the second MPI call in the application (the first one is MPI_Initialized), and hardly any other instructions precede this call.
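The initialization sequence looks roughly like this sketch (reconstructed from the description above, not the actual application code):

#include <mpi.h>

void init_mpi(int *argc, char ***argv)
{
    int initialized = 0, provided = 0;

    MPI_Initialized(&initialized);   /* first MPI call in the application */
    if (!initialized) {
        /* Second MPI call; the crash happens inside MPI_Init_thread, in the
         * vader BTL progress loop called from ompi_mpi_init. */
        MPI_Init_thread(argc, argv, MPI_THREAD_FUNNELED, &provided);
    }
}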

AddressSanitizer:DEADLYSIGNAL
=================================================================
==15031==ERROR: AddressSanitizer: SEGV on unknown address 0x62900019dc00 (pc 0x7f95706da0b2 bp 0x7ffd5fb349a0 sp 0x7ffd5fb34980 T0)
==15031==The signal is caused by a READ memory access.
    #0 0x7f95706da0b1 in mca_btl_vader_frag_complete /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144
    #1 0x7f95706dcaee in mca_btl_vader_poll_handle_frag /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:641
    #2 0x7f95706dcd2f in mca_btl_vader_poll_fifo /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:688
    #3 0x7f95706dcfdf in mca_btl_vader_component_progress /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:773
    #4 0x7fdfe255eb42 in opal_progress runtime/opal_progress.c:231
    #5 0x7ffb98585b26 in ompi_mpi_init runtime/ompi_mpi_init.c:904
    #6 0x7ffb984eaf9d in PMPI_Init_thread /tmp/openmpi-4.0.3/ompi/mpi/c/profile/pinit_thread.c:67
    ...

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144 in mca_btl_vader_frag_complete
==15031==ABORTING


AddressSanitizer:DEADLYSIGNAL
=================================================================
==15032==ERROR: AddressSanitizer: SEGV on unknown address 0x6290000aa300 (pc 0x7fdfddff70b2 bp 0x7ffe0a073000 sp 0x7ffe0a072fe0 T0)
==15032==The signal is caused by a READ memory access.
    #0 0x7fdfddff70b1 in mca_btl_vader_frag_complete /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144
    #1 0x7fdfddff9aee in mca_btl_vader_poll_handle_frag /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:641
    #2 0x7fdfddff9d2f in mca_btl_vader_poll_fifo /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:688
    #3 0x7fdfddff9fdf in mca_btl_vader_component_progress /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:773
    #4 0x7f9574c41b42 in opal_progress runtime/opal_progress.c:231
    #5 0x7f9576f64b26 in ompi_mpi_init runtime/ompi_mpi_init.c:904
    #6 0x7f9576ec9f9d in PMPI_Init_thread /tmp/openmpi-4.0.3/ompi/mpi/c/profile/pinit_thread.c:67
    ...

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144 in mca_btl_vader_frag_complete
==15032==ABORTING


AddressSanitizer:DEADLYSIGNAL
=================================================================
==15033==ERROR: AddressSanitizer: SEGV on unknown address 0x629000189300 (pc 0x7ffb91cfb0b2 bp 0x7fff28717ad0 sp 0x7fff28717ab0 T0)
==15033==The signal is caused by a READ memory access.
    #0 0x7ffb91cfb0b1 in mca_btl_vader_frag_complete /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144
    #1 0x7ffb91cfdaee in mca_btl_vader_poll_handle_frag /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:641
    #2 0x7ffb91cfdd2f in mca_btl_vader_poll_fifo /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:688
    #3 0x7ffb91cfdfdf in mca_btl_vader_component_progress /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_component.c:773
    #4 0x7ffb96262b42 in opal_progress runtime/opal_progress.c:231
    #5 0x7fdfe4881b26 in ompi_mpi_init runtime/ompi_mpi_init.c:904
    #6 0x7fdfe47e6f9d in PMPI_Init_thread /tmp/openmpi-4.0.3/ompi/mpi/c/profile/pinit_thread.c:67
    ...

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/openmpi-4.0.3/opal/mca/btl/vader/btl_vader_frag.h:144 in mca_btl_vader_frag_complete
==15033==ABORTING

jeorsch commented Jun 18, 2020

Still an issue with 4.0.4! Disabling vader and using the tcp BTL works around it, but this is not a nice solution.

jeorsch commented Jan 18, 2021

I am not sure if it is really related, but I observed occasional errors in a minimal example when using the vader BTL.

Here is how to reproduce:

#include <mpi.h>

/* Minimal reproducer: just initialize and finalize MPI. */
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

Compile and run many instances in a loop:
for i in $(seq $(nproc)); do (while true; do mpirun -oversubscribe -np $(nproc) ./a.out; done)& done
Note that $(nproc) evaluates to 40 on my machine.

This results in occasional errors:

--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 0 on
node at-num-01f exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------

When using only the tcp and self BTLs, I have observed no such errors so far:
for i in $(seq $(nproc)); do (while true; do mpirun -mca btl tcp,self -oversubscribe -np $(nproc) ./a.out; done)& done

jeorsch commented Jan 19, 2021

It seems like the error triggered by the minimal reproducer is fixed in 4.1.0. However, it is still an issue for 4.0.5.

jeorsch commented Jan 19, 2021

4.0.6rc1 also seems to fix the error for the minimal reproducer. Possibly due to the PMIx update?

I will try to install 4.1.0 on our CI build system to see if the errors in production are also fixed. Since they used to happen rather infrequently, I will have to wait a few weeks to be able to assess whether the issue is solved.

jsquyres commented

@jeorsch Is this issue resolved for you?

jeorsch commented Mar 15, 2021

@jsquyres, so far it is definitely much more robust. I noticed just one strange hang in the last two months, but it seems to be related to something else. If I have any new findings, I will reopen this issue or open a new one.

jeorsch closed this as completed Mar 15, 2021
jsquyres commented

Thank you!
