Got an interesting TCP BTL failure in Cisco's MTT today when running one-sided tests on master with the TCP progress thread enabled:
$ mpirun -np 32 --mca orte_startup_timeout 10000 --mca oob tcp --mca btl vader,tcp,self --mca btl_tcp_progress_thread 1 test_put6
================ test_put6 ========== Mon Jun 19 12:39:17 2017
test_put6: pml_ob1_recvreq.c:197: mca_pml_ob1_put_completion: Assertion `(uint64_t) rdma_size == frag->rdma_length' failed.
[mpi017:20916] *** Process received signal ***
[mpi017:20916] Signal: Aborted (6)
[mpi017:20916] Signal code: (-6)
[mpi017:20916] [ 0] /lib64/libpthread.so.0[0x3affc0f710]
[mpi017:20916] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3aff832925]
[mpi017:20916] [ 2] /lib64/libc.so.6(abort+0x175)[0x3aff834105]
[mpi017:20916] [ 3] /lib64/libc.so.6[0x3aff82ba4e]
[mpi017:20916] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3aff82bb10]
[mpi017:20916] [ 5] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libmpi.so.0(+0x238895)[0x2aaaaace5895]
[mpi017:20916] [ 6] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaace1e1e]
[mpi017:20916] [ 7] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0xeeca8)[0x2aaaab507ca8]
[mpi017:20916] [ 8] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x1017ab)[0x2aaaab51a7ab]
[mpi017:20916] [ 9] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x1018ba)[0x2aaaab51a8ba]
[mpi017:20916] [10] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0x101b87)[0x2aaaab51ab87]
[mpi017:20916] [11] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x298)[0x2aaaab51b1da]
[mpi017:20916] [12] /home/mpiteam/scratches/community/2017-06-19manual/r8QY/installs/Xgug/install/lib/libopen-pal.so.0(+0xeaa29)[0x2aaaab503a29]
[mpi017:20916] [13] /lib64/libpthread.so.0[0x3affc079d1]
[mpi017:20916] [14] /lib64/libc.so.6(clone+0x6d)[0x3aff8e8b6d]
[mpi017:20916] *** End of error message ***
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: mpi030
Local PID: 7311
Peer host: mpi017
-----------------------------------------------------------------------------
@bosilca Can you have a look?
@hjelmn ...or is this an OSC (one-sided) issue?