
Internal hwloc shipped with OpenMPI 4.1.7 no longer compatible with SLURM 23.11 cgroup plugin / system hwloc #12470

@NicoMittenzwey

Description

@NicoMittenzwey

System

AlmaLinux 9.3
OpenMPI 4.1.7 out of HPCX 2.18.0
Nvidia Infiniband NDR
Slurm 23.11

Issue

We are running Slurm 23.11 on AlmaLinux 9.3 with TaskPlugin=task/affinity,task/cgroup and OpenMPI 4.1.7 from Mellanox / Nvidia HPC-X 2.18.0. When a job is started with fewer than the maximum number of processes per node and --ntasks-per-node is NOT set, OpenMPI 4.1.7 crashes because it tries to bind processes to cores that are not available to it:

Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        gpu004
  Application name:  ./hpcx
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "2,114"
  Location:          rtc_hwloc.c:382
--------------------------------------------------------------------------
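The underlying failure, asking the kernel to bind a process to CPUs outside the set its cgroup allows, can be illustrated with a small stdlib-only Python sketch. This is not OpenMPI code; `try_bind` is a hypothetical helper that mimics what hwloc_set_cpubind ultimately does via sched_setaffinity():

```python
import os

def try_bind(requested_cpus):
    """Attempt to pin the current process to requested_cpus.

    Mimics hwloc_set_cpubind at the syscall level: sched_setaffinity()
    fails when none of the requested CPUs are usable by the process
    (e.g. they lie outside its cgroup cpuset).
    """
    allowed = os.sched_getaffinity(0)  # CPUs this process may actually use
    try:
        os.sched_setaffinity(0, requested_cpus)
        return f"bound to {sorted(requested_cpus)}"
    except (OSError, ValueError):
        outside = sorted(set(requested_cpus) - allowed)
        return f"bind failed: CPUs {outside} are not in the allowed set"

if __name__ == "__main__":
    # Binding to the CPUs we already have succeeds.
    print(try_bind(os.sched_getaffinity(0)))
    # A CPU id the machine almost certainly lacks fails, just as binding
    # to cores outside the Slurm cgroup does in the error above.
    print(try_bind({99999}))
```

In the reported crash, OpenMPI's internal hwloc computed the bitmap "2,114" from the full machine topology, but the task/cgroup plugin had restricted the job to a smaller cpuset, so the bind was rejected.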

Workaround

Recompiling OpenMPI and forcing it to use the system hwloc resolves this issue (you may first need dnf install hwloc-devel):

./configure [...] --with-hwloc=/usr/ && make && make install
