ibm/c_accumulate (and friends) all fail with master #1010

@rolfv

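When I run the ibm/onesided/c_accumulate test over the openib BTL on master, I get a SEGV. The backtrace below shows osc/rdma calling mca_btl_openib_atomic_cswap with endpoint=0x0, which is then dereferenced at btl_openib_atomic.c:85.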

[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 -np 2 -mca btl self,openib c_accumulate
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node drossetti-ivy4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[rvandevaart@drossetti-ivy4 onesided]$ 
(gdb) where
#0  0x00007fbdf4bc41d0 in ?? ()
#1  <signal handler called>
#2  0x00007fbdf76fda88 in mca_btl_openib_atomic_internal (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0, 
    opcode=IBV_WR_ATOMIC_CMP_AND_SWP, operand=0, operand2=0, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0)
    at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:85
#3  0x00007fbdf76fdca9 in mca_btl_openib_atomic_cswap (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0, compare=0, 
    value=9223372036854775808, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0) at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:138
#4  0x00007fbdf419f5f6 in ompi_osc_rdma_lock_try_acquire_exclusive (module=0x27e3f40, peer=0x280bd80, offset=16) at ../../../../../ompi/mca/osc/rdma/osc_rdma_lock.h:201
#5  0x00007fbdf419f6e1 in ompi_osc_rdma_lock_acquire_exclusive (module=0x27e3f40, peer=0x280bd80, offset=16) at ../../../../../ompi/mca/osc/rdma/osc_rdma_lock.h:244
#6  0x00007fbdf419fc17 in ompi_osc_rdma_gacc_local (source_buffer=0x7fffa8010fe4, source_count=1, source_datatype=0x602fa0, result_buffer=0x0, result_count=0, result_datatype=0x0, peer=0x280bd80, 
    target_address=140736012029920, target_handle=0x0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, module=0x27e3f40, request=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:29
#7  0x00007fbdf41a1d2e in ompi_osc_rdma_rget_accumulate_internal (sync=0x27e4130, origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, result_addr=0x0, result_count=0, result_datatype=0x0, peer=0x280bd80, 
    target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, request=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:730
#8  0x00007fbdf41a2708 in ompi_osc_rdma_accumulate (origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, win=0x270ccb0)
    at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:888
#9  0x00007fbdfd305d95 in PMPI_Accumulate (origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, win=0x270ccb0)
    at paccumulate.c:131
#10 0x00000000004010ba in main (argc=1, argv=0x7fffa80110e8) at c_accumulate.c:39
(gdb) up
#1  <signal handler called>
(gdb) up
#2  0x00007fbdf76fda88 in mca_btl_openib_atomic_internal (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0, 
    opcode=IBV_WR_ATOMIC_CMP_AND_SWP, operand=0, operand2=0, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0)
    at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:85
85      if (endpoint->endpoint_state != MCA_BTL_IB_CONNECTED) {
(gdb) print endpoint
$1 = (struct mca_btl_base_endpoint_t *) 0x0
(gdb) 

I see similar failures with:
c_fence_lock
c_fence_put_1
c_accumulate_atomic
c_fetch_and_op
c_flush
c_get
c_get_accumulate

and some others. Results are in MTT. I do not see these failures in v1.10 or v2.x.
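
For illustration only: frame #2 of the backtrace reduces to reading endpoint->endpoint_state through a NULL endpoint pointer. Below is a minimal sketch of that failure mode with a defensive check. All sketch_* names are hypothetical stand-ins, not Open MPI types or APIs, and the check only turns the crash into an error code; it does not explain why osc/rdma passes endpoint=0x0 in the first place.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real Open MPI types. Only the field
 * read at btl_openib_atomic.c:85 in the backtrace is modelled. */
enum sketch_endpoint_state {
    SKETCH_CLOSED = 0,
    SKETCH_CONNECTED = 1
};

struct sketch_endpoint {
    enum sketch_endpoint_state endpoint_state;
};

/* Hypothetical status codes, chosen for this sketch only. */
enum sketch_status {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_BAD_PARAM = -1,
    SKETCH_ERR_NOT_CONNECTED = -2
};

/* Validate the endpoint before touching any of its fields, so a NULL
 * endpoint yields an error code instead of a segmentation fault. */
int sketch_atomic_precheck(const struct sketch_endpoint *endpoint)
{
    if (NULL == endpoint) {
        return SKETCH_ERR_BAD_PARAM;
    }
    if (endpoint->endpoint_state != SKETCH_CONNECTED) {
        return SKETCH_ERR_NOT_CONNECTED;
    }
    return SKETCH_SUCCESS;
}
```

With this shape, a call such as sketch_atomic_precheck(NULL) returns SKETCH_ERR_BAD_PARAM rather than faulting, which would make the bad caller easier to spot in a test run.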
