
OSC rdma: deadlock if memory registration fails #6740

@devreal

Description

We ran into a situation in which Open MPI got stuck in MPI_Win_allocate while allocating a series of windows on a ConnectX-5 cluster, like in the following example:

void *bases[n];
MPI_Win wins[n];
for (int i = 0; i < n; ++i) {
  MPI_Win_allocate(1<<30, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &bases[i], &wins[i]);
}

On a node with 64GB of memory and two ranks, we expect all allocations to succeed up to n=30 (60GB in total).

Running with two ranks, the processes hang with the stack trace of process 0 looking like:

#0  mca_btl_vader_component_progress ()
    at btl_vader_component.c:701
#1  0x00002aaaab3ff3ba in opal_progress ()
    at opal_progress.c:228
#2  0x00002aaaaab301b7 in ompi_request_wait_completion (req=0x7be688)
    at request.h:413
#3  0x00002aaaaab301f5 in ompi_request_default_wait (req_ptr=0x7fffffffb890, 
    status=0x7fffffffb870)
    at req_wait.c:42
#4  0x00002aaaaabd6f60 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, 
    source=1, rtag=-16, comm=0x9b7cd0)
    at coll_base_barrier.c:64
#5  0x00002aaaaabd7620 in ompi_coll_base_barrier_intra_two_procs (
    comm=0x9b7cd0, module=0x9b9410)
    at coll_base_barrier.c:300
#6  0x00002aaaaf7fb11f in ompi_coll_tuned_barrier_intra_dec_fixed (
    comm=0x9b7cd0, module=0x9b9410)
    at coll_tuned_decision_fixed.c:196
#7  0x00002aaaaf8718b4 in ompi_osc_rdma_free (win=0x9b5230)
    at osc_rdma_module.c:61
#8  0x00002aaaaf88272a in ompi_osc_rdma_component_select (win=0x9b5230, 
    base=0x7fffffffbac0, size=1073741824, disp_unit=1, 
    comm=0x404140 <ompi_mpi_comm_world>, info=0x404340 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffbacc)
    at osc_rdma_component.c:1255
#9  0x00002aaaaabf067a in ompi_osc_base_select (win=0x9b5230, 
    base=0x7fffffffbac0, size=1073741824, disp_unit=1, 
    comm=0x404140 <ompi_mpi_comm_world>, info=0x404340 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffbacc)
    at osc_base_init.c:74
#10 0x00002aaaaab37d4f in ompi_win_allocate (size=1073741824, disp_unit=1, 
    info=0x404340 <ompi_mpi_info_null>, comm=0x404140 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffbc38, newwin=0x7fffffffbd28)
    at win.c:277
#11 0x00002aaaaabae09a in PMPI_Win_allocate (size=1073741824, disp_unit=1, 
    info=0x404340 <ompi_mpi_info_null>, comm=0x404140 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffbc38, win=0x7fffffffbd28) at pwin_allocate.c:81
#12 0x0000000000401912 in main ()

On rank 1 it is:

#0  0x00002aaaaebd0c41 in uct_mm_iface_progress () from /usr/lib64/libuct.so.0
#1  0x00002aaaae966ff2 in ucp_worker_progress () from /usr/lib64/libucp.so.0
#2  0x00002aaaad9d6e01 in mca_pml_ucx_progress ()
    at pml_ucx.c:466
#3  0x00002aaaab3ff3ba in opal_progress ()
    at opal_progress.c:228
#4  0x00002aaaaab301b7 in ompi_request_wait_completion (req=0x7be588)
    at request.h:413
#5  0x00002aaaaab301f5 in ompi_request_default_wait (req_ptr=0x7fffffffb840, 
    status=0x7fffffffb820)
    at req_wait.c:42
#6  0x00002aaaaabd6f60 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, 
    source=0, rtag=-16, comm=0x9b7350)
    at coll_base_barrier.c:64
#7  0x00002aaaaabd7620 in ompi_coll_base_barrier_intra_two_procs (
    comm=0x9b7350, module=0x9b8380)
    at coll_base_barrier.c:300
#8  0x00002aaaaf7fb11f in ompi_coll_tuned_barrier_intra_dec_fixed (
    comm=0x9b7350, module=0x9b8380)
    at coll_tuned_decision_fixed.c:196
#9  0x00002aaaaf880aed in allocate_state_shared (module=0x9b4cb0, 
    base=0x7fffffffbad0, size=1073741824)
    at osc_rdma_component.c:685
#10 0x00002aaaaf8826ea in ompi_osc_rdma_component_select (win=0x9b3300, 
    base=0x7fffffffbad0, size=1073741824, disp_unit=1, 
    comm=0x404140 <ompi_mpi_comm_world>, info=0x404340 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffbadc)
    at osc_rdma_component.c:1252
#11 0x00002aaaaabf067a in ompi_osc_base_select (win=0x9b3300, 
    base=0x7fffffffbad0, size=1073741824, disp_unit=1, 
    comm=0x404140 <ompi_mpi_comm_world>, info=0x404340 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffbadc)
    at osc_base_init.c:74
#12 0x00002aaaaab37d4f in ompi_win_allocate (size=1073741824, disp_unit=1, 
    info=0x404340 <ompi_mpi_info_null>, comm=0x404140 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffbc48, newwin=0x7fffffffbd38)
    at win.c:277
#13 0x00002aaaaabae09a in PMPI_Win_allocate (size=1073741824, disp_unit=1, 
    info=0x404340 <ompi_mpi_info_null>, comm=0x404140 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffbc48, win=0x7fffffffbd38) at pwin_allocate.c:81
#14 0x0000000000401912 in main ()

Notice that rank 0 has entered a barrier within ompi_osc_rdma_free, while rank 1 is waiting in a barrier in allocate_state_shared, but on a different communicator (the shared-memory communicator).

Debug output hints at a failed memory registration with the IB device, which caused rank 0 to exit allocate_state_shared early:

selected btl: openib
creating osc/rdma window of flavor 2 with id 32
selected btl: openib
allocating shared internal state
registering segment with btl. range: 0x2ab83be4a008 - 0x2ab8bbe4a308 (2147484416 bytes)
failed to register pointer with selected BTL. base: 0x2ab83be4a008, size: 2147484416. file: osc_rdma_component.c, line: 666
failed to allocate internal state
rdma component destroying window with id 32

When allocating windows of 1 or 2 GB per rank, the problem occurs at around 58GB of accumulated window memory on the node (the node has 64GB, so there should be some headroom). I'm not sure why the registration fails in the first place.

The problem, however, is that the error checking in allocate_state_shared is rank-local: errors are not propagated to the other processes before returning. This leads to a situation where some processes are stuck in a barrier while others have already aborted the window allocation.
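One common way to avoid this class of deadlock is to have all ranks agree on the outcome of the fallible rank-local step before any rank returns. The sketch below is a minimal illustration of that pattern, not the actual patch: agree_on_error is a hypothetical helper name, and it assumes Open MPI's convention that success is 0 and error codes are negative, so MPI_MIN yields the "worst" result seen on any rank.

```c
#include <mpi.h>

/* Hypothetical helper (not part of Open MPI): combine the per-rank result of
 * a fallible local step (e.g. BTL memory registration) into a global verdict.
 * Assuming success == 0 and errors are negative, MPI_MIN returns the worst
 * error observed across all ranks of comm. */
static int agree_on_error(int local_err, MPI_Comm comm)
{
    int global_err;
    MPI_Allreduce(&local_err, &global_err, 1, MPI_INT, MPI_MIN, comm);
    return global_err;
}

/* Sketch of how this would slot into a collective setup path like
 * allocate_state_shared:
 *
 *   int err = register_segment(...);          // rank-local, may fail
 *   err = agree_on_error(err, shared_comm);   // every rank learns the outcome
 *   if (err != 0) {
 *       // ALL ranks take the cleanup path together, so no rank is left
 *       // waiting in a barrier that its peers will never enter.
 *       return err;
 *   }
 *   MPI_Barrier(shared_comm);                 // safe: all ranks reach it
 */
```

The extra allreduce costs one collective per window creation, but window creation is already collective and not on the critical path, so the overhead is negligible compared to a hang.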

This was discovered using Open MPI 3.1.2, but the relevant code is the same on master.

I'm working on a patch that I will post soon.
