Closed
Trying to run the 4.0.x branch on our IB cluster, I end up with a segfault in Open MPI if (a) Open MPI was compiled without UCX and (b) the openib BTL was not explicitly allowed using btl_openib_allow_ib. The message printed at startup says that UCX should be used, so it seems strange that the openib BTL is selected anyway and Open MPI then dies with a segfault.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: n110502
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
Program received signal SIGSEGV, Segmentation fault.
mca_btl_openib_register_mem (btl=0x739f70, endpoint=0xffffffffffffffff,
base=0x7875b0, size=392, flags=15)
at opal/mca/btl/openib/btl_openib.c:1953
1953 rc = openib_module->device->rcache->rcache_register (openib_module->device->rcache, base, size, mflags,
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 mca_btl_openib_register_mem (btl=0x739f70, endpoint=0xffffffffffffffff,
base=0x7875b0, size=392, flags=15)
at opal/mca/btl/openib/btl_openib.c:1953
#1 0x00002aaabe51925f in _ompi_osc_rdma_register (module=0x77fbb0,
endpoint=0xffffffffffffffff, ptr=0x7875b0, size=392, flags=15,
handle=0x77fd78, line=463,
file=0x2aaabe530650 "mca/osc/rdma/osc_rdma_component.c")
at ompi/mca/osc/rdma/osc_rdma.h:359
#2 0x00002aaabe51a89e in allocate_state_single (module=0x77fbb0,
base=0x7fffffffcba0, size=8)
at ompi/mca/osc/rdma/osc_rdma_component.c:462
#3 0x00002aaabe51ad02 in allocate_state_shared (module=0x77fbb0,
base=0x7fffffffcba0, size=8)
at ompi/mca/osc/rdma/osc_rdma_component.c:564
#4 0x00002aaabe51cf60 in ompi_osc_rdma_component_select (win=0x77f640,
base=0x7fffffffcba0, size=8, disp_unit=1,
comm=0x6023a0 <ompi_mpi_comm_world>, info=0x6020a0 <ompi_mpi_info_null>,
flavor=2, model=0x7fffffffcbac)
at ompi/mca/osc/rdma/osc_rdma_component.c:1248
#5 0x00002aaaaadfe17c in ompi_osc_base_select (win=0x77f640,
base=0x7fffffffcba0, size=8, disp_unit=1,
comm=0x6023a0 <ompi_mpi_comm_world>, info=0x6020a0 <ompi_mpi_info_null>,
flavor=2, model=0x7fffffffcbac)
at ompi/mca/osc/base/osc_base_init.c:74
#6 0x00002aaaaad37df3 in ompi_win_allocate (size=8, disp_unit=1,
info=0x6020a0 <ompi_mpi_info_null>, comm=0x6023a0 <ompi_mpi_comm_world>,
baseptr=0x7fffffffcc70, newwin=0x7fffffffcc78)
at ompi/win/win.c:277
#7 0x00002aaaaadac600 in PMPI_Win_allocate (size=8, disp_unit=1,
info=0x6020a0 <ompi_mpi_info_null>, comm=0x6023a0 <ompi_mpi_comm_world>,
baseptr=0x7fffffffcc70, win=0x7fffffffcc78) at pwin_allocate.c:81
#8 0x0000000000400c3d in main ()
(gdb) print openib_module
$1 = (mca_btl_openib_module_t *) 0x739f70
(gdb) print openib_module->device
$2 = (mca_btl_openib_device_t *) 0x0
It seems that the device is not initialized. Just peeking at the code in btl_openib_component.c, it looks like the decision not to allocate a device in init_one_port is not communicated upwards to init_one_device. Shouldn't that decision be made early on (e.g., in btl_openib_component_init), so that the openib BTL is never selected if it's not allowed?
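
To make the suggestion concrete, here is a minimal, self-contained sketch of the control flow I have in mind. It is not Open MPI code: every type and function name in it (`sketch_port`, `sketch_init_one_port`, `sketch_init_one_device`, `sketch_component_usable`) is hypothetical and only loosely mirrors init_one_port, init_one_device, and btl_openib_component_init. The idea is simply that a per-port "skipped by policy" result is counted at the device level, and the component reports itself as unusable when no port remains.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; these are NOT Open MPI's real constants. */
enum sketch_port_status {
    SKETCH_PORT_OK = 0,
    SKETCH_PORT_SKIPPED_BY_POLICY = 1
};

/* Hypothetical, heavily simplified description of a port. */
struct sketch_port {
    bool is_infiniband;
};

/* Stand-in for the per-port check: an InfiniBand port is refused unless
 * the user opted in (the role btl_openib_allow_ib plays in the report). */
static enum sketch_port_status
sketch_init_one_port(const struct sketch_port *port, bool allow_ib)
{
    if (port->is_infiniband && !allow_ib) {
        return SKETCH_PORT_SKIPPED_BY_POLICY;
    }
    return SKETCH_PORT_OK;
}

/* Stand-in for per-device init: returns how many ports are usable, so the
 * caller can tell "device has no usable port" apart from success. */
static size_t
sketch_init_one_device(const struct sketch_port *ports, size_t nports,
                       bool allow_ib)
{
    size_t usable = 0;
    for (size_t i = 0; i < nports; i++) {
        if (sketch_init_one_port(&ports[i], allow_ib) == SKETCH_PORT_OK) {
            usable++;
        }
    }
    return usable;
}

/* Stand-in for component init: the component should only be selectable
 * when at least one port survived the policy check. */
static bool
sketch_component_usable(const struct sketch_port *ports, size_t nports,
                        bool allow_ib)
{
    return sketch_init_one_device(ports, nports, allow_ib) > 0;
}
```

With a single InfiniBand port and the opt-in left off, `sketch_component_usable` returns false, so nothing downstream would ever see a module whose device pointer is NULL. Whether the real component can be restructured this way is an assumption on my part.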