TCP BTL progress thread RDMA assertion fail #3718

Description

@jsquyres

Got an interesting TCP BTL failure in Cisco's MTT today when running on master with the TCP progress thread enabled and one-sided tests:

$ mpirun   -np 32 --mca orte_startup_timeout 10000 --mca oob tcp --mca btl vader,tcp,self --mca btl_tcp_progress_thread 1 test_put6 
================ test_put6 ========== Mon Jun 19 12:39:17 2017
test_put6: pml_ob1_recvreq.c:197: mca_pml_ob1_put_completion: Assertion `(uint64_t) rdma_size == frag->rdma_length' failed.
[mpi017:20916] *** Process received signal ***
[mpi017:20916] Signal: Aborted (6)
[mpi017:20916] Signal code:  (-6)
[mpi017:20916] [ 0] /lib64/libpthread.so.0[0x3affc0f710]
[mpi017:20916] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3aff832925]
[mpi017:20916] [ 2] /lib64/libc.so.6(abort+0x175)[0x3aff834105]
[mpi017:20916] [ 3] /lib64/libc.so.6[0x3aff82ba4e]
[mpi017:20916] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3aff82bb10]
[mpi017:20916] [ 5] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libmpi.so.0(+0x238895)[0x2aaaaace5895]
[mpi017:20916] [ 6] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaace1e1e]
[mpi017:20916] [ 7] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0xeeca8)[0x2aaaab507ca8]
[mpi017:20916] [ 8] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x1017ab)[0x2aaaab51a7ab]
[mpi017:20916] [ 9] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x1018ba)[0x2aaaab51a8ba]
[mpi017:20916] [10] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x101b87)[0x2aaaab51ab87]
[mpi017:20916] [11] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x298)[0x2aaaab51b1da]
[mpi017:20916] [12] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0xeaa29)[0x2aaaab503a29]
[mpi017:20916] [13] /lib64/libpthread.so.0[0x3affc079d1]
[mpi017:20916] [14] /lib64/libc.so.6(clone+0x6d)[0x3aff8e8b6d]
[mpi017:20916] *** End of error message ***
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: mpi030
  Local PID:  7311
  Peer host:  mpi017
-----------------------------------------------------------------------------

@bosilca Can you have a look?

@hjelmn ...or is this an OSC issue?
