AlmaLinux 9.3
OpenMPI 4.1.7 from HPC-X 2.18.0
NVIDIA InfiniBand NDR
Slurm 23.11
Issue
We are running Slurm 23.11 on AlmaLinux 9.3 with TaskPlugin=task/affinity,task/cgroup and OpenMPI 4.1.7 from Mellanox / Nvidia HPC-X 2.18.0. When a job is started with fewer than the maximum number of processes per node and --ntasks-per-node is NOT defined, OpenMPI 4.1.7 crashes because it tries to bind processes to cores that are not available to it:
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.
Local host: gpu004
Application name: ./hpcx
Error message: hwloc_set_cpubind returned "Error" for bitmap "2,114"
Location: rtc_hwloc.c:382
--------------------------------------------------------------------------
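To confirm that the CPUs named in the error really are outside the job's allowed set, one option is to compare them with the affinity mask the kernel reports for the job step. The following is a minimal sketch, not part of the original report: cpu_in_list is a hypothetical helper, the CPU numbers 2 and 114 are taken from the bitmap in the error above, and it assumes that bitmap lists OS CPU indices and that /proc/self/status is available (Linux).

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if CPU $1 is contained in a Linux CPU list
# such as "0-3,112-115" (the format of Cpus_allowed_list), 1 otherwise.
cpu_in_list() {
    local cpu=$1 list=$2 range lo hi
    local IFS=','
    for range in $list; do
        lo=${range%-*}   # part before the dash (or the whole entry if no dash)
        hi=${range#*-}   # part after the dash (or the whole entry if no dash)
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# Report the CPUs this process may run on and check the two CPUs from the
# error message. Assumption: the bitmap "2,114" uses OS CPU numbering.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "allowed CPUs: ${allowed}"
    for cpu in 2 114; do
        if cpu_in_list "$cpu" "$allowed"; then
            echo "CPU ${cpu}: allowed"
        else
            echo "CPU ${cpu}: NOT allowed"
        fi
    done
fi
```

Run it through srun inside the same allocation as the failing job, so that it inherits the cgroup and affinity settings Slurm applies to the job step.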
Workaround
Recompiling OpenMPI and forcing it to use the system hwloc resolves this issue (you might need to run dnf install hwloc-devel first):
./configure [...] --with-hwloc=/usr/ && make && make install
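After rebuilding, it can be worth checking which hwloc component the new installation reports. This is a small sketch under assumptions: show_ompi_hwloc is a hypothetical wrapper, and it relies on the rebuilt ompi_info being the first one on PATH. For an Open MPI 4.1.x build against the system library, the hwloc component is expected to be listed as external rather than as a bundled, versioned component.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the hwloc-related lines from ompi_info,
# or fail with a message if ompi_info is not on PATH.
show_ompi_hwloc() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found in PATH" >&2
        return 1
    fi
    ompi_info | grep -i hwloc
}

# Do not abort the calling script if nothing is found.
show_ompi_hwloc || true
```

If the output still names a bundled hwloc component, the environment is probably still picking up the HPC-X build instead of the recompiled one.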