Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
- pmix-5.0.3, cloned from this git repository and built with these options:
../configure --prefix=<prefix> --with-ucx=<path-to-ucx-1.17> --with-libevent=/usr --with-hwloc=/usr
- openmpi-5.0.5 from the official tarball, built with:
../configure \
--prefix=${HOME}/SCRATCHDIR/tools/openmpi-5.0.5_commu \
--with-pmix=${HOME}/SCRATCHDIR/tools/pmix-git \
--with-prrte=${HOME}/SCRATCHDIR/tools/prrte-3.0.6 \
--with-ucx=${HOME}/SCRATCHDIR/tools/ucx-1.17 \
--with-ucx-libdir=${HOME}/SCRATCHDIR/tools/ucx-1.17/lib \
--with-hcoll=$(pkg-config --variable prefix hcoll) \
--with-portals4=no \
--enable-mpi1-compatibility \
--enable-mpirun-prefix-by-default \
--with-libnl=no \
--enable-wrapper-rpath=no \
--enable-wrapper-runpath=no \
--with-cma \
--with-libevent=/usr \
--with-hwloc=/usr \
--with-knem=$(pkg-config --variable=prefix knem) \
--with-lustre=no \
--enable-debug \
--enable-mca-dso=btl-uct,common-ucx,sshmem-ucx,spml-ucx,atomic-ucx,pml-ucx,osc-ucx,coll-ucc,coll-hcoll,btl-portals4,mtl-portals4,coll-portals4,osc-portals4,btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator \
CFLAGS='-DNDEBUG -O3 -g -m64' CXXFLAGS='-DNDEBUG -O3 -g -m64' FCFLAGS='-O3 -g -m64' CC='gcc' CXX='g++' FC='gfortran'

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was built from the 5.0.5 tarball; PMIx was built from a git clone checked out at the v5.0.3 tag.
Please describe the system on which you are running
- Operating system/version: x86_64 RHEL 8.8
- Computer hardware: AMD EPYC 7402
- Network type: InfiniBand
Details of the problem
Hello,
We have noticed an unexpected crash with an MPI code that uses MPI_THREAD_MULTIPLE, more specifically around MPI_Reduce and MPI_Scatter.
I have included a reproducer here:
reproducer.txt
It can simply be compiled with mpiCC reproducer.cpp -o reproducer, and run with srun: srun --exclusive -N 4 -n 4 --cpus-per-task 4 -p<partition> --resv-ports -K -l ./reproducer.
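For context, here is a minimal hypothetical sketch of the kind of reproducer described above (the attached reproducer.txt may differ in its details): each rank spawns a few threads under MPI_THREAD_MULTIPLE, each thread owning its own duplicated communicator and looping over MPI_Reduce and MPI_Scatter.

```cpp
// Hypothetical sketch only -- not the attached reproducer.txt.
// Each rank spawns NUM_THREADS threads; every thread gets its own duplicated
// communicator and repeatedly calls MPI_Reduce and MPI_Scatter.
#include <mpi.h>
#include <thread>
#include <vector>

constexpr int NUM_THREADS = 4;   // matches --cpus-per-task 4 in the srun line
constexpr int COUNT = 1024;      // illustrative message size
constexpr int ITERS = 1000;      // illustrative iteration count

static void worker(MPI_Comm comm, int rank)
{
    int nprocs = 0;
    MPI_Comm_size(comm, &nprocs);
    std::vector<int> sendbuf(COUNT, rank), recvbuf(COUNT, 0);
    std::vector<int> scatter_src(static_cast<size_t>(COUNT) * nprocs, rank);

    for (int i = 0; i < ITERS; ++i) {
        MPI_Reduce(sendbuf.data(), recvbuf.data(), COUNT, MPI_INT, MPI_SUM, 0, comm);
        MPI_Scatter(scatter_src.data(), COUNT, MPI_INT,
                    recvbuf.data(), COUNT, MPI_INT, 0, comm);
    }
}

int main(int argc, char **argv)
{
    int provided = 0, rank = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Duplicate the communicators before spawning threads so that only the
    // collectives themselves run concurrently.
    std::vector<MPI_Comm> comms(NUM_THREADS);
    for (auto &c : comms) {
        MPI_Comm_dup(MPI_COMM_WORLD, &c);
    }

    std::vector<std::thread> threads;
    for (int t = 0; t < NUM_THREADS; ++t) {
        threads.emplace_back(worker, comms[t], rank);
    }
    for (auto &th : threads) {
        th.join();
    }
    for (auto &c : comms) {
        MPI_Comm_free(&c);
    }
    MPI_Finalize();
    return 0;
}
```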
The reproducer does not crash on every run, but it typically fails after several attempts; in my observations it takes a dozen runs or so. This is the backtrace:
PMIX ERROR: PMIX_ERR_BAD_PARAM in file ../../../../src/mca/bfrops/base/bfrop_base_copy.c at line 43
PMIX ERROR: PMIX_ERR_BAD_PARAM in file ../../src/client/pmix_client_get.c at line 477
./../src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7fb45e2d7cf0]
[ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fb45df4eacf]
[ 2] /lib64/libc.so.6(abort+0x127)[0x7fb45df21ea5]
[ 3] /lib64/libc.so.6(+0x21d79)[0x7fb45df21d79]
[ 4] /lib64/libc.so.6(+0x47426)[0x7fb45df47426]
[ 5] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(+0x16b7b1)[0x7fb45d5cb7b1]
[ 6] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(+0x6c9c4)[0x7fb45d4cc9c4]
[ 7] /home_nfs/nguyenvm/SCRATCHDIR/tools/pmix-git/lib/libpmix.so.2(PMIx_Get+0xc1c)[0x7fb45d4cf93e]
[ 8] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/openmpi/mca_pml_ucx.so(+0x3871)[0x7fb45a2e0871]
[ 9] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0x1de)[0x7fb45a2e3dae]
[10] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(ompi_coll_base_scatter_intra_linear_nb+0x4d0)[0x7fb45ef1d330]
[11] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(ompi_coll_tuned_scatter_intra_dec_fixed+0xac)[0x7fb45ef6bbbc]
[12] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(mca_coll_han_scatter_intra+0x53f)[0x7fb45ef935cf]
[13] /home_nfs/nguyenvm/SCRATCHDIR/tools/openmpi-5.0.5_commu/lib/libmpi.so.40(MPI_Scatter+0x19b)[0x7fb45eef03bb]
[14] /scratch/nguyenvm/benchs/test-rt4u/input/thread_safe/ktest_mpi/2.0/./reproducer[0x401154]
[15] /scratch/nguyenvm/benchs/test-rt4u/input/thread_safe/ktest_mpi/2.0/./reproducer[0x4011e8]
[16] /lib64/libpthread.so.0(+0x81ca)[0x7fb45e2cd1ca]
[17] /lib64/libc.so.6(clone+0x43)[0x7fb45df39e73]
*** End of error message ***
It looks like this is a regression relative to pmix-4.2.9: I tried building pmix-4.2.9 and linking it to openmpi-5.0.5, and found no issue with this code in more than 400 runs. The same holds for openmpi-4.1.6 with pmix-4.2.9. However, as soon as I linked openmpi-4.1.6 against pmix-5.0.3, I reproduced the error again after a couple of runs. The code also runs fine with pml/ob1.
The crash probably comes from an unsuccessful OPAL_MODEX_RECV in pml/ucx, but I haven't noticed any inconsistent values on the MPI side.
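Based on the backtrace (PMIx_Get called out of mca_pml_ucx_isend), the failure seems to occur when pml/ucx fetches a peer's UCX worker address from the modex during endpoint creation. The following is a hypothetical, standalone sketch of the concurrent PMIx_Get pattern that could be involved; the key string is a placeholder, this is not the actual Open MPI code path, and the program would need to be launched under a PMIx-enabled runtime (e.g. prterun or srun) for PMIx_Init to succeed.

```cpp
// Hypothetical sketch: two threads calling PMIx_Get() concurrently for the
// same peer, analogous to two first-sends racing through pml/ucx endpoint
// setup under MPI_THREAD_MULTIPLE. Not the actual OMPI/PMIx internals.
#include <pmix.h>
#include <thread>

static pmix_proc_t peer;

static void fetch_modex()
{
    pmix_value_t *val = nullptr;
    // "pml.ucx" is a placeholder key; Open MPI derives the real modex key
    // from the pml component name and version.
    pmix_status_t rc = PMIx_Get(&peer, "pml.ucx", nullptr, 0, &val);
    if (rc == PMIX_SUCCESS && val != nullptr) {
        PMIX_VALUE_RELEASE(val);
    }
}

int main()
{
    pmix_proc_t myproc;
    if (PMIx_Init(&myproc, nullptr, 0) != PMIX_SUCCESS) {
        return 1;   // needs a PMIx server (prterun/srun) to initialize
    }
    PMIX_PROC_LOAD(&peer, myproc.nspace, 0);   // query rank 0 of our namespace

    // The failed assertion in pmix_list_item_destruct (refcount != 0) suggests
    // that concurrent lookups like these race on a shared, refcounted object
    // inside the PMIx client.
    std::thread t1(fetch_modex), t2(fetch_modex);
    t1.join();
    t2.join();

    PMIx_Finalize(nullptr, 0);
    return 0;
}
```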
This is most likely an issue on the PMIx side. A ticket has been opened there as well, but @rhc54 suggested I open one here too, since they do not have the hardware to reproduce this crash.