
PMIX ERROR: PMIX_ERR_BAD_PARAM when using MPI_THREAD_MULTIPLE with the PML/UCX #12833

@vanman-nguyen

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • pmix-5.0.3, cloned from this git repository and built with these options: ../configure --prefix=<prefix> --with-ucx=<path-to-ucx-1.17> --with-libevent=/usr --with-hwloc=/usr
  • openmpi-5.0.5 from the official tarball, built with:
../configure \
    --prefix=${HOME}/SCRATCHDIR/tools/openmpi-5.0.5_commu \
    --with-pmix=${HOME}/SCRATCHDIR/tools/pmix-git \
    --with-prrte=${HOME}/SCRATCHDIR/tools/prrte-3.0.6 \
    --with-ucx=${HOME}/SCRATCHDIR/tools/ucx-1.17 \
    --with-ucx-libdir=${HOME}/SCRATCHDIR/tools/ucx-1.17/lib \
    --with-hcoll=$(pkg-config --variable prefix hcoll) \
    --with-portals4=no \
    --enable-mpi1-compatibility \
    --enable-mpirun-prefix-by-default \
    --with-libnl=no \
    --enable-wrapper-rpath=no \
    --enable-wrapper-runpath=no \
    --with-cma \
    --with-libevent=/usr \
    --with-hwloc=/usr \
    --with-knem=$(pkg-config --variable=prefix knem) \
    --with-lustre=no \
    --enable-debug \
    --enable-mca-dso=btl-uct,common-ucx,sshmem-ucx,spml-ucx,atomic-ucx,pml-ucx,osc-ucx,coll-ucc,coll-hcoll,btl-portals4,mtl-portals4,coll-portals4,osc-portals4,btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator CFLAGS='-DNDEBUG -O3 -g -m64' CXXFLAGS='-DNDEBUG -O3 -g -m64' FCFLAGS='-O3 -g -m64' CC='gcc' CXX='g++' FC='gfortran'

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was built from the 5.0.5 tarball; PMIx was built from a git clone checked out at the v5.0.3 tag.

Please describe the system on which you are running

  • Operating system/version: RHEL 8.8 (x86_64)
  • Computer hardware: AMD EPYC 7402
  • Network type: InfiniBand

Details of the problem

Hello,
We have noticed an unexpected crash in an MPI code that uses MPI_THREAD_MULTIPLE, more specifically around MPI_Reduce and MPI_Scatter.
I have included a reproducer here:
reproducer.txt

It can simply be compiled with mpiCC reproducer.cpp -o reproducer, and run with srun: srun --exclusive -N 4 -n 4 --cpus-per-task 4 -p<partition> --resv-ports -K -l ./reproducer.
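For reference, here is a minimal sketch of what the reproducer does (the attached reproducer.txt is the authoritative version; the thread count, iteration count, and buffer sizes below are placeholders):

// Sketch of the reproducer: MPI_THREAD_MULTIPLE plus several threads per rank
// issuing MPI_Reduce / MPI_Scatter, each thread on its own duplicated communicator.
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nthreads = 4;    // matches --cpus-per-task 4, illustrative only
    const int count    = 1024; // arbitrary message size

    // One duplicated communicator per thread so collectives do not collide.
    std::vector<MPI_Comm> comms(nthreads);
    for (int t = 0; t < nthreads; ++t)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&, t]() {
            std::vector<int> sendbuf(count * size, rank);
            std::vector<int> recvbuf(count, 0);
            for (int iter = 0; iter < 100; ++iter) {
                MPI_Reduce(sendbuf.data(), recvbuf.data(), count, MPI_INT,
                           MPI_SUM, 0, comms[t]);
                MPI_Scatter(sendbuf.data(), count, MPI_INT,
                            recvbuf.data(), count, MPI_INT, 0, comms[t]);
            }
        });
    }
    for (auto &th : threads) th.join();

    for (int t = 0; t < nthreads; ++t)
        MPI_Comm_free(&comms[t]);
    MPI_Finalize();
    return 0;
}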
This code doesn't crash every time, but it fails fairly often after several runs; in my observations it takes a dozen runs or so. This is the backtrace:

PMIX ERROR: PMIX_ERR_BAD_PARAM in file ../../../../src/mca/bfrops/base/bfrop_base_copy.c at line 43
PMIX ERROR: PMIX_ERR_BAD_PARAM in file ../../src/client/pmix_client_get.c at line 477
./../src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7fb45e2d7cf0]
[ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fb45df4eacf]
[ 2] /lib64/libc.so.6(abort+0x127)[0x7fb45df21ea5]
[ 3] /lib64/libc.so.6(+0x21d79)[0x7fb45df21d79]
[ 4] /lib64/libc.so.6(+0x47426)[0x7fb45df47426]
[ 5] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(+0x16b7b1)[0x7fb45d5cb7b1]
[ 6] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(+0x6c9c4)[0x7fb45d4cc9c4]
[ 7] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(PMIx_Get+0xc1c)[0x7fb45d4cf93e]
[ 8] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/openmpi/mca_pml_ucx.so(+0x3871)[0x7fb45a2e0871]
[ 9] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0x1de)[0x7fb45a2e3dae]
[10] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(ompi_coll_base_scatter_intra_linear_nb+0x4d0)[0x7fb45ef1d330]
[11] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(ompi_coll_tuned_scatter_intra_dec_fixed+0xac)[0x7fb45ef6bbbc]
[12] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(mca_coll_han_scatter_intra+0x53f)[0x7fb45ef935cf]
[13] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(MPI_Scatter+0x19b)[0x7fb45eef03bb]
[14] /scratch/nguyenvm/benchs/test-rt4u/input/thread_safe/ktest_mpi/2.0/./reproducer[0x401154]
[15] /scratch/nguyenvm/benchs/test-rt4u/input/thread_safe/ktest_mpi/2.0/./reproducer[0x4011e8]
[16] /lib64/libpthread.so.0(+0x81ca)[0x7fb45e2cd1ca]
[17] /lib64/libc.so.6(clone+0x43)[0x7fb45df39e73]
*** End of error message ***

It looks like a regression relative to pmix-4.2.9: I tried building pmix-4.2.9 and linking it to openmpi-5.0.5, and found no issue with this code in over 400 runs. The same holds for openmpi-4.1.6 with pmix-4.2.9. However, as soon as I linked openmpi-4.1.6 against pmix-5.0.3, I reproduced this error again after a couple of runs. The code also runs fine with pml/ob1.

The crash probably comes from an unsuccessful OPAL_MODEX_RECV in pml/ucx, but I haven't noticed any inconsistent values on the MPI side.
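For illustration, the modex exchange behind OPAL_MODEX_RECV boils down to a PMIx put/commit/fence/get sequence, and the PMIx_Get visible in the backtrace is issued from pml/ucx when a rank first talks to a peer. The sketch below is a standalone PMIx client (not the actual pml/ucx code; the key name "my.test.key" is made up) that mimics that pattern with several threads calling PMIx_Get concurrently:

// Illustrative PMIx client mimicking the modex: publish a per-rank blob,
// fence, then have several threads concurrently fetch the peers' blobs.
#include <pmix.h>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    pmix_proc_t myproc;
    if (PMIX_SUCCESS != PMIx_Init(&myproc, nullptr, 0)) {
        std::fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    // Publish a per-rank blob, as OPAL_MODEX_SEND does with the UCX worker address.
    char blob[64];
    std::snprintf(blob, sizeof(blob), "worker-address-of-rank-%u", myproc.rank);
    pmix_value_t val;
    PMIX_VALUE_LOAD(&val, blob, PMIX_STRING);
    PMIx_Put(PMIX_GLOBAL, "my.test.key", &val);
    PMIX_VALUE_DESTRUCT(&val);
    PMIx_Commit();

    // Fence so every rank's data is available for retrieval.
    pmix_proc_t wildcard;
    PMIX_PROC_LOAD(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    PMIx_Fence(&wildcard, 1, nullptr, 0);

    // Query the job size so we know which peers to look up.
    pmix_value_t *sz = nullptr;
    uint32_t nprocs = 1;
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_JOB_SIZE, nullptr, 0, &sz)) {
        nprocs = sz->data.uint32;
        PMIX_VALUE_RELEASE(sz);
    }

    // Several threads concurrently fetch peer blobs, roughly what happens when
    // multiple threads hit MPI_Reduce/MPI_Scatter under MPI_THREAD_MULTIPLE.
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&]() {
            for (uint32_t r = 0; r < nprocs; ++r) {
                pmix_proc_t peer;
                PMIX_PROC_LOAD(&peer, myproc.nspace, r);
                pmix_value_t *v = nullptr;
                pmix_status_t rc = PMIx_Get(&peer, "my.test.key", nullptr, 0, &v);
                if (PMIX_SUCCESS == rc) {
                    PMIX_VALUE_RELEASE(v);
                } else {
                    std::fprintf(stderr, "PMIx_Get failed for rank %u: %d\n", r, rc);
                }
            }
        });
    }
    for (auto &th : threads) th.join();

    PMIx_Finalize(nullptr, 0);
    return 0;
}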

This is most likely an issue on the PMIx side. A ticket has been opened there as well, but @rhc54 suggested I open one here too, since they do not have the hardware to reproduce this crash.
