4.0.x: OpenIB btl segfault if not allowed #6785

@devreal

Trying to run the 4.0.x branch on our IB cluster, I end up with a segfault in Open MPI if a) Open MPI was compiled without UCX, and b) the openib btl was not explicitly allowed using btl_openib_allow_ib. The help message below says that UCX should be used, yet the openib btl is nevertheless selected and Open MPI dies with a segfault.

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              n110502
  Local adapter:           mlx4_0
  Local port:              1

--------------------------------------------------------------------------

Program received signal SIGSEGV, Segmentation fault.
mca_btl_openib_register_mem (btl=0x739f70, endpoint=0xffffffffffffffff, 
    base=0x7875b0, size=392, flags=15)
    at opal/mca/btl/openib/btl_openib.c:1953
1953	    rc = openib_module->device->rcache->rcache_register (openib_module->device->rcache, base, size, mflags,
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  mca_btl_openib_register_mem (btl=0x739f70, endpoint=0xffffffffffffffff, 
    base=0x7875b0, size=392, flags=15)
    at opal/mca/btl/openib/btl_openib.c:1953
#1  0x00002aaabe51925f in _ompi_osc_rdma_register (module=0x77fbb0, 
    endpoint=0xffffffffffffffff, ptr=0x7875b0, size=392, flags=15, 
    handle=0x77fd78, line=463, 
    file=0x2aaabe530650 "mca/osc/rdma/osc_rdma_component.c")
    at ompi/mca/osc/rdma/osc_rdma.h:359
#2  0x00002aaabe51a89e in allocate_state_single (module=0x77fbb0, 
    base=0x7fffffffcba0, size=8)
    at ompi/mca/osc/rdma/osc_rdma_component.c:462
#3  0x00002aaabe51ad02 in allocate_state_shared (module=0x77fbb0, 
    base=0x7fffffffcba0, size=8)
    at ompi/mca/osc/rdma/osc_rdma_component.c:564
#4  0x00002aaabe51cf60 in ompi_osc_rdma_component_select (win=0x77f640, 
    base=0x7fffffffcba0, size=8, disp_unit=1, 
    comm=0x6023a0 <ompi_mpi_comm_world>, info=0x6020a0 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffcbac)
    at ompi/mca/osc/rdma/osc_rdma_component.c:1248
#5  0x00002aaaaadfe17c in ompi_osc_base_select (win=0x77f640, 
    base=0x7fffffffcba0, size=8, disp_unit=1, 
    comm=0x6023a0 <ompi_mpi_comm_world>, info=0x6020a0 <ompi_mpi_info_null>, 
    flavor=2, model=0x7fffffffcbac)
    at ompi/mca/osc/base/osc_base_init.c:74
#6  0x00002aaaaad37df3 in ompi_win_allocate (size=8, disp_unit=1, 
    info=0x6020a0 <ompi_mpi_info_null>, comm=0x6023a0 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffcc70, newwin=0x7fffffffcc78)
    at ompi/win/win.c:277
#7  0x00002aaaaadac600 in PMPI_Win_allocate (size=8, disp_unit=1, 
    info=0x6020a0 <ompi_mpi_info_null>, comm=0x6023a0 <ompi_mpi_comm_world>, 
    baseptr=0x7fffffffcc70, win=0x7fffffffcc78) at pwin_allocate.c:81
#8  0x0000000000400c3d in main ()
(gdb) print openib_module
$1 = (mca_btl_openib_module_t *) 0x739f70
(gdb) print openib_module->device
$2 = (mca_btl_openib_device_t *) 0x0

It seems that the device is not initialized. Just peeking at the code in btl_openib_component.c, it looks like the decision not to allocate a device in init_one_port is not communicated upwards to init_one_device. Shouldn't that decision be made early on (e.g., in btl_openib_component_init), so that the openib btl is never selected if it's not allowed?
