
opal/asm changes cause MT applications to hang. #4784

@thananon

I rebased from an October version to the current master head and found that some of the multithreaded applications I use deadlock when the program is initialized with MPI_THREAD_MULTIPLE and more than one thread does communication. (A single communicating thread works fine.)

I bisected, and the problem seems to start with commit 84f63d0 from @hjelmn. Digging a little deeper, I found that the problem might be in opal_lifo_pop_atomic(). Below is the stack from a typical MT ping-ping injection rate test. It looks like the item_free field is always 1, even though this is the only thread trying to pop.

#0  opal_lifo_pop_atomic (lifo=0x7ffff7dda780 <mca_pml_base_recv_requests>) at ../../../../opal/class/opal_lifo.h:247
247             if (opal_atomic_swap_32((volatile int32_t *) &item->item_free, 1)) {
(gdb) bt
#0  opal_lifo_pop_atomic (lifo=0x7ffff7dda780 <mca_pml_base_recv_requests>) at ../../../../opal/class/opal_lifo.h:247
#1  0x00007fffe2f37002 in opal_free_list_get_mt (flist=0x7ffff7dda780 <mca_pml_base_recv_requests>) at ../../../../opal/class/opal_free_list.h:193
#2  0x00007fffe2f370ec in opal_free_list_get (flist=0x7ffff7dda780 <mca_pml_base_recv_requests>) at ../../../../opal/class/opal_free_list.h:222
#3  0x00007fffe2f38537 in mca_pml_ob1_recv (addr=0x7fffd00008c0, count=1, datatype=0x603360 <ompi_mpi_byte>, src=1, tag=2, comm=0x603160 <ompi_mpi_comm_world>, status=0x7fffdfa04dc0) at pml_ob1_irecv.c:121
#4  0x00007ffff7ae5e84 in PMPI_Recv (buf=0x7fffd00008c0, count=1, type=0x603360 <ompi_mpi_byte>, source=1, tag=2, comm=0x603160 <ompi_mpi_comm_world>, status=0x7fffdfa04dc0) at precv.c:79
#5  0x000000000040150a in thread_work (info=0x902fd0) at pairwise.c:193
#6  0x00007ffff7815e25 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00007ffff754334d in clone () from /usr/lib64/libc.so.6
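
To make the suspected spin easier to reason about, here is a minimal sketch of the pattern visible at frame #0: a pop that first claims the head item by atomically swapping its item_free flag to 1 and retries whenever the swap returns nonzero. This is not the Open MPI implementation. It uses C11 atomics instead of the opal_atomic_* calls, and every sketch_* name, the NULL-terminated list, and the max_spins bound are assumptions added only so the stuck case can be observed without hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a list item carrying an item_free flag. */
typedef struct sketch_item {
    struct sketch_item *next;
    atomic_int item_free;            /* 0 = may be claimed, 1 = claimed */
} sketch_item_t;

/* Hypothetical stand-in for the LIFO; NULL head means empty. */
typedef struct {
    _Atomic(sketch_item_t *) head;
} sketch_lifo_t;

static void sketch_lifo_init(sketch_lifo_t *lifo)
{
    atomic_init(&lifo->head, NULL);
}

static void sketch_item_init(sketch_item_t *item)
{
    item->next = NULL;
    atomic_init(&item->item_free, 0);
}

static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    assert(item != NULL);
    /* Keep the item claimed while it is being linked in. */
    atomic_store(&item->item_free, 1);
    sketch_item_t *old = atomic_load(&lifo->head);
    do {
        item->next = old;            /* old is refreshed on CAS failure */
    } while (!atomic_compare_exchange_weak(&lifo->head, &old, item));
    /* Linked: release the claim so a pop can take the item. */
    atomic_store(&item->item_free, 0);
}

/* Pop with a spin bound (an assumption of this sketch; an unbounded
 * version would loop forever on a flag that never returns to 0).
 * Returns NULL if the list is empty or the bound is hit; *gave_up
 * distinguishes the two cases. */
static sketch_item_t *sketch_lifo_pop(sketch_lifo_t *lifo, long max_spins,
                                      int *gave_up)
{
    *gave_up = 0;
    for (long spins = 0; spins < max_spins; ++spins) {
        sketch_item_t *item = atomic_load(&lifo->head);
        if (item == NULL) {
            return NULL;             /* empty list */
        }
        /* Claim the head; a nonzero old value means it is already claimed. */
        if (atomic_exchange(&item->item_free, 1)) {
            continue;
        }
        sketch_item_t *expected = item;
        if (atomic_compare_exchange_strong(&lifo->head, &expected, item->next)) {
            item->next = NULL;
            return item;             /* flag stays 1 until the next push */
        }
        /* Head changed underneath us: release the claim and retry. */
        atomic_store(&item->item_free, 0);
    }
    *gave_up = 1;
    return NULL;
}
```

In this sketch, a head item whose item_free flag is left at 1 makes sketch_lifo_pop exhaust max_spins and report gave_up even with a single caller. That would match the symptom above, where one thread spins on the swap, if the flag on the head of mca_pml_base_recv_requests is never reset to 0.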

Another application that has the same problem is GRID. The threaded-stencil benchmark deadlocks but thread-test from ompi-tests seems to run without problems.

OS: Red Hat Scientific Linux release 7.3

Any suggestions?
