BTL TCP abort in MPI_FINALIZE with progress thread #2262

@jsquyres

Description

On Cisco master MTT runs, I'm seeing aborts during MPI_FINALIZE when the TCP BTL progress thread is enabled.

https://mtt.open-mpi.org/index.php?do_redir=2371

The failure signatures are similar to the one shown below; it looks like btl_tcp_del_procs is trying to destroy a list item whose refcount is not 0:

Warning :: opal_list_remove_item - the item 0x7c22d0 is not on the list 0x7371d8
ialltoall: class/opal_list.c:69: opal_list_item_destruct: Assertion `0 == item->opal_list_item_refcount' failed.
[mpi001:27496] *** Process received signal ***
[mpi001:27496] Signal: Aborted (6)
[mpi001:27496] Signal code:  (-6)
[mpi001:27496] [ 0] /lib64/libpthread.so.0[0x31ec20f710]
[mpi001:27496] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x31ebe32925]
[mpi001:27496] [ 2] /lib64/libc.so.6(abort+0x175)[0x31ebe34105]
[mpi001:27496] [ 3] /lib64/libc.so.6[0x31ebe2ba4e]
[mpi001:27496] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x31ebe2bb10]
[mpi001:27496] [ 5] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libopen-pal.so.0(+0x4a3e9)[0x2aaaab4fb3e9]
[mpi001:27496] [ 6] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libopen-pal.so.0(+0xed79f)[0x2aaaab59e79f]
[mpi001:27496] [ 7] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libopen-pal.so.0(mca_btl_tcp_del_procs+0x118)[0x2aaaab59f60b]
[mpi001:27496] [ 8] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libmpi.so.0(+0x1460a7)[0x2aaaaabf30a7]
[mpi001:27496] [ 9] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libmpi.so.0(mca_pml_ob1_del_procs+0x2b)[0x2aaaaad5a8df]
[mpi001:27496] [10] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libmpi.so.0(ompi_mpi_finalize+0x724)[0x2aaaaab344a3]
[mpi001:27496] [11] /home/mpiteam/scratches/community/2016-10-19cron/VmSh/installs/3bSr/install/lib/libmpi.so.0(PMPI_Finalize+0x59)[0x2aaaaab63865]
[mpi001:27496] [12] collective/ialltoall[0x4011bf]
[mpi001:27496] [13] /lib64/libc.so.6(__libc_start_main+0xfd)[0x31ebe1ed1d]
[mpi001:27496] [14] collective/ialltoall[0x400e69]
[mpi001:27496] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi001 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

@bosilca Could you have a look?
