When I run the ibm/onesided/c_accumulate test over the openib BTL, I get a SEGV.
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 -np 2 -mca btl self,openib c_accumulate
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node drossetti-ivy4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[rvandevaart@drossetti-ivy4 onesided]$
(gdb) where
#0 0x00007fbdf4bc41d0 in ?? ()
#1 <signal handler called>
#2 0x00007fbdf76fda88 in mca_btl_openib_atomic_internal (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0,
opcode=IBV_WR_ATOMIC_CMP_AND_SWP, operand=0, operand2=0, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0)
at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:85
#3 0x00007fbdf76fdca9 in mca_btl_openib_atomic_cswap (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0, compare=0,
value=9223372036854775808, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0) at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:138
#4 0x00007fbdf419f5f6 in ompi_osc_rdma_lock_try_acquire_exclusive (module=0x27e3f40, peer=0x280bd80, offset=16) at ../../../../../ompi/mca/osc/rdma/osc_rdma_lock.h:201
#5 0x00007fbdf419f6e1 in ompi_osc_rdma_lock_acquire_exclusive (module=0x27e3f40, peer=0x280bd80, offset=16) at ../../../../../ompi/mca/osc/rdma/osc_rdma_lock.h:244
#6 0x00007fbdf419fc17 in ompi_osc_rdma_gacc_local (source_buffer=0x7fffa8010fe4, source_count=1, source_datatype=0x602fa0, result_buffer=0x0, result_count=0, result_datatype=0x0, peer=0x280bd80,
target_address=140736012029920, target_handle=0x0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, module=0x27e3f40, request=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:29
#7 0x00007fbdf41a1d2e in ompi_osc_rdma_rget_accumulate_internal (sync=0x27e4130, origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, result_addr=0x0, result_count=0, result_datatype=0x0, peer=0x280bd80,
target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, request=0x0) at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:730
#8 0x00007fbdf41a2708 in ompi_osc_rdma_accumulate (origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, win=0x270ccb0)
at ../../../../../ompi/mca/osc/rdma/osc_rdma_accumulate.c:888
#9 0x00007fbdfd305d95 in PMPI_Accumulate (origin_addr=0x7fffa8010fe4, origin_count=1, origin_datatype=0x602fa0, target_rank=0, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, win=0x270ccb0)
at paccumulate.c:131
#10 0x00000000004010ba in main (argc=1, argv=0x7fffa80110e8) at c_accumulate.c:39
(gdb) up
#1 <signal handler called>
(gdb) up
#2 0x00007fbdf76fda88 in mca_btl_openib_atomic_internal (btl=0x24aa140, endpoint=0x0, local_address=0x7fbdfd786010, remote_address=42016232, local_handle=0x27ca2a0, remote_handle=0x27ca5a0,
opcode=IBV_WR_ATOMIC_CMP_AND_SWP, operand=0, operand2=0, flags=0, order=0, cbfunc=0x7fbdf41a7d94 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fffa8010ca7, cbdata=0x0)
at ../../../../../opal/mca/btl/openib/btl_openib_atomic.c:85
85 if (endpoint->endpoint_state != MCA_BTL_IB_CONNECTED) {
(gdb) print endpoint
$1 = (struct mca_btl_base_endpoint_t *) 0x0
(gdb)
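The backtrace shows ompi_osc_rdma_gacc_local going through the exclusive-lock path (ompi_osc_rdma_lock_try_acquire_exclusive) and handing mca_btl_openib_atomic_cswap an endpoint of 0x0, so the unconditional dereference at btl_openib_atomic.c:85 faults. The sketch below is illustrative only: the names ending in _sketch are hypothetical, and it is not the Open MPI code or a proposed fix; it just shows why a NULL endpoint crashes on that line.

```c
/* Illustrative sketch only -- not Open MPI source.  It mirrors the shape of
 * mca_btl_openib_atomic_internal() to show why endpoint == NULL faults at
 * btl_openib_atomic.c:85.  All _sketch names are hypothetical. */
#include <stddef.h>
#include <stdio.h>

enum endpoint_state_sketch { EP_CONNECTED, EP_CONNECTING, EP_CLOSED };

struct endpoint_sketch {
    enum endpoint_state_sketch endpoint_state;
};

static int atomic_internal_sketch(struct endpoint_sketch *endpoint)
{
    /* The real code dereferences endpoint unconditionally:
     *   if (endpoint->endpoint_state != MCA_BTL_IB_CONNECTED) { ... }
     * With endpoint == NULL, as in the backtrace above, that is the SEGV. */
    if (NULL == endpoint) {
        /* Hypothetical guard: fail the request instead of crashing.  The
         * actual fix may instead be for osc/rdma not to pass a NULL
         * endpoint to the BTL for a local/self peer. */
        return -1;
    }

    if (endpoint->endpoint_state != EP_CONNECTED) {
        return 1;  /* the real code would queue the atomic until connected */
    }
    return 0;      /* the real code would post the atomic work request */
}

int main(void)
{
    printf("NULL endpoint -> %d\n", atomic_internal_sketch(NULL));
    return 0;
}
```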
I see similar failures with:
c_fence_lock
c_fence_put_1
c_accumulate_atomic
c_fetch_and_op
c_flush
c_get
c_get_accumulate
and some others. Results are in MTT. I do not see these failures in v1.10 or v2.x.