
trouble running openmpi+pmix in rootless podman-hpc container #12146

Open

lastephey opened this issue Dec 4, 2023 · 56 comments

@lastephey
Background information

Dear OpenMPI developers,

I'm going to describe an issue that @hppritcha kindly helped me troubleshoot during SC a few weeks ago. It may actually be a Slurm issue rather than an OpenMPI issue, but I wanted to share the information I have with you.

Background- we are trying to get OpenMPI support working with podman-hpc, a relatively new container runtime at NERSC. By default podman-hpc runs in rootless mode, where the user inside the container appears to be root. Our current methodology is to use either srun or mpirun to launch the MPI/PMI wireup outside the container, and then have this connect to an OpenMPI installed with PMI support inside the container. Although I've done tests both with and without Slurm support, I'll focus on the Slurm case here to keep things simple. However, I am happy to provide more information about the mpirun launch if you would like it.

The situation- we are able to make this setup work using PMI2, but it is not currently working using PMIx. However, we have observed that when we run the container as the user (i.e. using --userns=keep-id) rather than in rootless mode, our OpenMPI+PMIx test does succeed. Since running with --userns=keep-id can be substantially slower, we would really like to enable containers running as root.

               PMI2       PMIx
Runs as root   succeeds   fails
Runs as user   succeeds   succeeds

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

We are running with OpenMPI version 4.1.6 as suggested by @hppritcha. For this test we have built a single container image with OpenMPI built with both PMI2 and PMIx. We toggle between PMI2 and PMIx support in our tests.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

We developed a Containerfile recipe for this build:

FROM ubuntu:jammy
WORKDIR /opt

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && apt-get install -y \
        build-essential \
        ca-certificates \
        automake \
        autoconf \
        wget \
        python3-dev \
        python3-pip \
        libpmi2-0-dev \
        libpmix-dev \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

ARG openmpi_version=4.1.6

RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$openmpi_version.tar.gz \
    && tar xf openmpi-$openmpi_version.tar.gz \
    && cd openmpi-$openmpi_version \
    && CFLAGS=-I/usr/include/slurm ./configure \
       --prefix=/opt/openmpi --with-slurm \
       --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
       --with-pmix=external --with-pmix=/usr/lib/x86_64-linux-gnu/pmix2 \
    && make -j 32 \
    && make install \
    && cd .. \
    && rm -rf openmpi-$openmpi_version.tar.gz openmpi-$openmpi_version

RUN /sbin/ldconfig

ENV PATH=/opt/openmpi/bin:$PATH

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

  • Operating system/version: SLES
  • Computer hardware: NERSC Perlmutter, AMD CPUs
  • Network type: Running TCP on Slingshot

Details of the problem

We are running the same hello world mpi4py test with both OpenMPI PMI2 and OpenMPI PMIx. Additionally, we are running the same test both with and without --userns=keep-id. All tests succeed except the PMIx test in default rootless mode (i.e. without --userns=keep-id). We are using a PMI2 and PMIx helper module.

Running pmi2 test succeeds:

stephey@nid200050:/pscratch/sd/s/stephey/containerfiles/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 openmpi:pmix python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid200050.
Hello, World! I am process 1 of 2 on nid200051.

Running the pmix + userns=keep-id test succeeds:

stephey@nid200052:/pscratch/sd/s/stephey/containerfiles/openmpi> export PMIX_MCA_psec=native
stephey@nid200052:/pscratch/sd/s/stephey/containerfiles/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
[nid200052:1125877] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81
[nid200053:2173502] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81
Hello, World! I am process 0 of 2 on nid200052.
Hello, World! I am process 1 of 2 on nid200053.
stephey@nid200052:/pscratch/sd/s/stephey/containerfiles/openmpi> 

Running the pmix test in default rootless mode fails:

stephey@nid200247:/pscratch/sd/s/stephey/containerfiles/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
srun: Job 18952480 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=18952480.2
[nid200247:2144171] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nid200247:2144171] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[nid200248:2177237] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nid200248:2177237] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nid200247: task 0: Exited with exit code 1
srun: Terminating StepId=18952480.2
srun: error: nid200248: task 1: Exited with exit code 1
stephey@nid200247:/pscratch/sd/s/stephey/containerfiles/openmpi> 

It's not clear to me if this is a Slurm issue or an OpenMPI/PMIx issue. I haven't been able to make this work with mpirun either, but I left out those details here since this issue is already quite long. @hppritcha suggested we may want to file an issue with Slurm support, but I wanted to share this with you all first before we go down that route.

Thanks very much for your help,
Laurie Stephey

cc @rgayatri23

@rhc54
Contributor

rhc54 commented Dec 4, 2023

Running the pmix + userns=keep-id test succeeds:

I would consider this error message:

[nid200052:1125877] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81
[nid200053:2173502] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 81

to indicate a failure. A simple "hello" might work, but something is wrong.

Running pmix test fails
[nid200247:2144171] OPAL ERROR: Unreachable in file ext3x_client.c at line 111

I notice you didn't include export PMIX_MCA_psec=native for this test - true? I believe Slurm wants PMIx to use the munge security support, so you may be hitting a conflict there.

I haven't been able to make this work with mpirun either

Not surprising - I'd guess that the problem lies in getting the PMIx socket connect across the container boundary. Note that we do have others who run containers that have root as the user, even under Slurm, so we know that it can be done. I don't know the reason for this particular error.

One thing I find curious:

--with-pmix=/usr/lib/x86_64-linux-gnu/pmix2

What is this pmix2?

@lastephey
Author

Hi @rhc54,

Thanks for your quick reply.

"I would consider this error message to indicate a failure."

Noted, that makes sense.

"I notice you didn't include export PMIX_MCA_psec=native for this test - true?"

For the test with the failure, export PMIX_MCA_psec=native was in fact set; I'm sorry, I should have shown that setting. Without it, I see munge errors as you mentioned.

"What is this pmix2"

This is the pmix version that I found that comes as an Ubuntu jammy package. I don't really understand why they call the directory pmix2.

I'm glad you have seen cases where running this as root should work. Do you have any advice for how I should go about troubleshooting this?

@rhc54
Contributor

rhc54 commented Dec 4, 2023

This is the pmix version that I found that comes as an Ubuntu jammy package. I don't really understand why they call the directory pmix2.

I immediately get suspicious when I see that this is PMIx "4.1.2-2ubuntu1" - it sounds like they have modified the release. I would strongly advise against using any software that has been modified by the packager. I'd suggest downloading a copy of 4.2.7 and building it locally, or just use the copy that is embedded in OMPI.

I especially recommend that due to the "UNSUPPORTED TYPE" error. Something is broken between your Slurm and OMPI PMIx connections. I'd start by trying to understand what that might be.

Do you know what version of PMIx your Slurm is using? If you download and build OMPI outside the container, are you able to srun an OMPI app?

@lastephey
Author

Thanks @rhc54.

Got it, thanks for that advice about packaging. I'll work on building and testing with my own PMIx.

We do have a version of OMPI on Perlmutter that @rgayatri23 and @hppritcha built. We also have pmix v4 support in Slurm:

stephey@perlmutter:login40:/pscratch/sd/s/stephey/containerfiles/openmpi> srun --mpi=list
MPI plugin types are...
	cray_shasta
	none
	pmi2
	pmix
specific pmix plugin versions available: pmix_v4
stephey@perlmutter:login40:/pscratch/sd/s/stephey/containerfiles/openmpi> 

Testing with slurm+openmpi+mpi4py outside a container, I do actually see a similar UNSUPPORTED TYPE error.

stephey@nid200185:~/openmpi> srun --mpi=pmix -n 2 python -m mpi4py.bench helloworld
[nid200186:2078490] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200185:2074824] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200185:2074824] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
[nid200186:2078490] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
Hello, World! I am process 0 of 2 on nid200185.
Hello, World! I am process 1 of 2 on nid200186.
stephey@nid200185:~/openmpi> 

For comparison, I also see a similar OUT-OF-RESOURCE error when I force mpirun to launch across 2 nodes, but notably no UNSUPPORTED TYPE error.

stephey@nid200185:~/openmpi> mpirun -N 2 python -m mpi4py.bench helloworld
[nid200185:2077382] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200185:2077382] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello, World! I am process 0 of 4 on nid200185.
Hello, World! I am process 1 of 4 on nid200185.
Hello, World! I am process 2 of 4 on nid200186.
Hello, World! I am process 3 of 4 on nid200186.

Do you think this could mean there's some issue in our slurm and/or pmix installation?

@rhc54
Contributor

rhc54 commented Dec 4, 2023

Yeah, something isn't right. Let me do a little digging tonight to see what those errors might mean. We need to get you running cleanly outside the container before introducing the container into the mix.

Just to be sure:

  • the mpirun is the one from the OMPI you built?
  • the app is linked against the same OMPI?
  • what version of OMPI is it?
  • what version of PMIx is being used by that OMPI installation?

@lastephey
Author

Thanks @rhc54. I agree, that sounds like a good course of action.

Yes, the mpirun comes from the openmpi module on Perlmutter. I didn't build it myself, but I think Rahul and Howard did.

Here's the top part of ompi_info:

stephey@nid200185:~/openmpi> ompi_info
                 Package: Open MPI ncicd@login01 Distribution
                Open MPI: 5.0.0rc12
  Open MPI repo revision: v5.0.0rc12
   Open MPI release date: May 19, 2023
                 MPI API: 3.1.0
            Ident string: 5.0.0rc12
                  Prefix: /global/common/software/nersc/pe/gnu/openmpi/5.0.0rc12
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: ncicd
           Configured on: Wed Jul 26 23:10:21 UTC 2023
          Configure host: login01
  Configure command line: 'CC=cc' 'FC=ftn' 'CXX=CC'
                          'CFLAGS=--cray-bypass-pkgconfig'
                          'CXXFLAGS=--cray-bypass-pkgconfig'
                          'FCFLAGS=--cray-bypass-pkgconfig'
                          'LDFLAGS=--cray-bypass-pkgconfig'
                          '--enable-orterun-prefix-by-default'
                          '--prefix=/global/common/software/nersc/pe/gnu/openmpi/5.0.0rc12'
                          '--with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7'
                          '--with-cuda-libdir=/usr/lib64' '--with-ucx=no'
                          '--with-verbs=no' '--enable-mpi-java'
                          '--with-libfabric=/opt/cray/libfabric/1.15.2.0'
                Built by: ncicd
                Built on: Wed 26 Jul 2023 11:14:39 PM UTC

Here's the mpirun:

stephey@nid200185:~/openmpi> type mpirun
mpirun is hashed (/global/common/software/nersc/pe/gnu/openmpi/5.0.0rc12/bin/mpirun)

Yes, I built the mpi4py package on top of this openmpi:

stephey@nid200185:~/openmpi> python -c "from mpi4py import MPI;print(MPI.Get_library_version())"
Open MPI v5.0.0rc12, package: Open MPI ncicd@login01 Distribution, ident: 5.0.0rc12, repo rev: v5.0.0rc12, May 19, 2023
stephey@nid200185:~/openmpi> 

Since I don't see pmix mentioned in ompi_info, is it safe to assume that it's using whatever shipped with 5.0.0rc12? I'm sorry, I don't know how to determine the version it's using.

@rhc54
Contributor

rhc54 commented Dec 5, 2023

is it safe to assume that it's using whatever shipped with 5.0.0rc12?

Yes - the configure line shows (because it doesn't explicitly include a --with-pmix in it) that you are using the included PMIx. So that's a good start. We know there are bugs in the OMPI version, including in mpirun, but that wouldn't explain this problem.

Sorry to keep nagging with questions: what happens if you run a C "hello" version? In other words, take the python and mpi4py out of the equation?

@lastephey
Author

No problem, I appreciate your help.
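For reference, ompi_hello is just the standard MPI hello world, roughly along these lines (a sketch rather than the exact source, compiled with the mpicc from the OpenMPI install under test):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}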

Sure, here's srun:

stephey@nid200018:/pscratch/sd/s/stephey/test_mpi> srun --mpi=pmix -n 2 ./ompi_hello 
[nid200018:821449] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
[nid200019:1324764] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
[nid200019:1324764] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200018:821449] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello world from processor nid200018, rank 0 out of 2 processors
Hello world from processor nid200019, rank 1 out of 2 processors
stephey@nid200018:/pscratch/sd/s/stephey/test_mpi> 

and mpirun:

stephey@nid200018:/pscratch/sd/s/stephey/test_mpi> mpirun -N 2 -np 4 ./ompi_hello 
[nid200018:821321] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200018:821321] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello world from processor nid200018, rank 1 out of 4 processors
Hello world from processor nid200018, rank 0 out of 4 processors
Hello world from processor nid200019, rank 3 out of 4 processors
Hello world from processor nid200019, rank 2 out of 4 processors

@rgayatri23

Hi @lastephey and @rhc54, thanks for looking into the issue.
FYI - We also have the 5.0.0 release version on Perlmutter (non-container) and even that gives the same error message that's shown above.

@rhc54
Contributor

rhc54 commented Dec 5, 2023

Can you please provide the configuration line for that 5.0.0 release? FWIW: that error message ordinarily indicates that there is a disconnect between the PMIx code in mpirun vs the PMIx code in the app. I'm going to try and build it here to see if I can reproduce it, but given we aren't hearing this in general, I suspect this has something to do with your environment.

@ggouaillardet
Contributor

What is the PMIx version used by SLURM on the host?

Note that your configure command line in the Containerfile contains

--with-pmix=external --with-pmix=/usr/lib/x86_64-linux-gnu/pmix2

and these are conflicting directives.

What if you build the container with --with-pmix=internal instead?
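For the container build, that would mean dropping the two conflicting --with-pmix flags from the Containerfile and configuring roughly like this (a sketch based on the recipe above, not a tested line):

CFLAGS=-I/usr/include/slurm ./configure \
    --prefix=/opt/openmpi --with-slurm \
    --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
    --with-pmix=internal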

@rgayatri23

That's essentially what we have done in our non-container, bare-metal installation. Here is the configure command:

./configure CC=cc FC=ftn CXX=CC CFLAGS="--cray-bypass-pkgconfig" CXXFLAGS="--cray-bypass-pkgconfig" FCFLAGS="--cray-bypass-pkgconfig" LDFLAGS="--cray-bypass-pkgconfig" --enable-orterun-prefix-by-default --prefix=${ompi_install_dir} --with-cuda=$CUDA_HOME --with-cuda-libdir=/usr/lib64 --with-ucx=no --with-verbs=no --enable-mpi-java --with-ofi --enable-mpi1-compatibility --with-pmix=internal --disable-sphinx

@ggouaillardet
Contributor

Thanks, what are the SLURM and PMIx versions running on the host (running SLES if I understand correctly)?
Are these provided by the distro or were they rebuilt?
Last but not least, what does

srun --mpi=list

return?

@rhc54
Contributor

rhc54 commented Dec 5, 2023

As expected, I cannot reproduce this problem. It is undoubtedly due to an issue of PMIx version confusion, most likely being caused by some kind of LD_LIBRARY_PATH setting. I suspect OMPI is being built against one version, but another version is getting picked up at runtime.

The problem is caused by confusion over whether or not the messaging buffer between the PMIx server and client has been packed in "debug mode" (i.e., where it contains explicit information on the data type being packed at each step) vs "non-debug mode" (where it doesn't contain the data type info). This causes the unpacking procedure to mistake actual data as being the data type, and things go haywire. In this case, the error message is caused by the unpack function interpreting a value as being the length of a string, and that value is enormous (because it isn't really the packed string length).

We have handshake code that can detect which mode the other side's messaging buffer is in, but something in your environment is causing that to fail. If you do a srun --mpi=pmix -n 1 env | grep PMI, you should see something like:

PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_FULLY_DESC

in the output.

@lastephey
Author

To answer your question @ggouaillardet,

stephey@perlmutter:login40:/pscratch/sd/s/stephey/containerfiles/openmpi> srun --mpi=list
MPI plugin types are...
	cray_shasta
	none
	pmi2
	pmix
specific pmix plugin versions available: pmix_v4
stephey@perlmutter:login40:/pscratch/sd/s/stephey/containerfiles/openmpi> 

@lastephey
Author

@rhc54 thanks for looking into this and trying to reproduce.

stephey@nid200017:~> srun --mpi=pmix -n 1 env | grep PMI
PMI_SHARED_SECRET=8292616328418320780
SLURM_PMIXP_ABORT_AGENT_PORT=46879
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,1))
PMIX_NAMESPACE=slurm.pmix.18982677.0
PMIX_RANK=0
PMIX_SERVER_URI41=pmix-server.1841867;tcp4://127.0.0.1:51435
PMIX_SERVER_URI4=pmix-server.1841867;tcp4://127.0.0.1:51435
PMIX_SERVER_URI3=pmix-server.1841867;tcp4://127.0.0.1:51435
PMIX_SERVER_URI2=pmix-server.1841867;tcp4://127.0.0.1:51435
PMIX_SERVER_URI21=pmix-server.1841867;tcp4://127.0.0.1:51435
PMIX_SECURITY_MODE=munge,native
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.18982677.0/
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.18982677.0//pmix_dstor_ds21_1841867
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.18982677.0//pmix_dstor_ds12_1841867
PMIX_HOSTNAME=nid200017
PMIX_VERSION=4.2.3
stephey@nid200017:~> 

We do see the PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC as you mentioned.

If I understood you correctly, we'll need to track down the pmix version difference as you mentioned. Is there an easy way to determine either which pmix version ompi was built with, or which pmix version is currently in use?

@rgayatri23

To add to @lastephey's question: for the configure option --with-pmix=internal, is there a way to specify which version to build internally so that it matches what we have on the system?

@rhc54
Contributor

rhc54 commented Dec 5, 2023

internal means whatever version OMPI included in its code distribution. For the v4.1 series, that would be PMIx v3.2.x. For OMPI v5, that would be PMIx v4.2.x.

If you look at the output from your last run, you'll see that the server puts its version in the environment of the proc:

PMIX_VERSION=4.2.3

I can post a little program that will get the client's version and print it out.
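Something along these lines would do it (a minimal sketch; build it against the PMIx headers/library you want to check, e.g. cc pmix_version.c -lpmix with the appropriate -I/-L flags):

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    /* Reports the version of the PMIx library the client is actually
       linked against at runtime. */
    printf("client PMIx version: %s\n", PMIx_Get_version());
    return 0;
}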

PMIx guarantees interoperability, so the difference there isn't the issue. The problem is that the client thinks the server is using one buffer type, when it is actually using the other. The question is: "why"?

You might try running mpirun -n 1 env | grep PMI and see what buffer type mpirun is using. That should involve the same PMIx lib used to build OMPI, and the default buffer type is set at time of configure. It will also tell you what PMIx version mpirun is using - which should be the same as an app built against OMPI would use.

But we do still handshake to deal with potential buffer type differences at runtime, so it is puzzling. Kind of fishing in the dark right now to see if something pops up.

I suppose if you can/want to grant me access to the machine, I can poke at it a bit for you. Up to you - I honestly don't know how many more question/answer rounds it will take to try and make sense of this.

@lastephey
Author

Here's the output from mpirun:

stephey@perlmutter:login02:~/nersc-official-images/nersc/openmpi/4.1.6-tcp> mpirun -n 1 env | grep PMI
PMIX_NAMESPACE=prterun-login02-2260974@1
PMIX_RANK=0
PMIX_SERVER_URI41=prterun-login02-2260974@0.0;tcp4://10.249.0.178:45399
PMIX_SERVER_URI4=prterun-login02-2260974@0.0;tcp4://10.249.0.178:45399
PMIX_SERVER_URI3=prterun-login02-2260974@0.0;tcp4://10.249.0.178:45399
PMIX_SERVER_URI2=prterun-login02-2260974@0.0;tcp4://10.249.0.178:45399
PMIX_SERVER_URI21=prterun-login02-2260974@0.0;tcp4://10.249.0.178:45399
PMIX_SECURITY_MODE=munge,native
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SERVER_TMPDIR=/tmp/prte.login02.75313/dvm.2260974
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_DSTORE_21_BASE_PATH=/tmp/prte.login02.75313/dvm.2260974/pmix_dstor_ds21_2260974
PMIX_DSTORE_ESH_BASE_PATH=/tmp/prte.login02.75313/dvm.2260974/pmix_dstor_ds12_2260974
PMIX_HOSTNAME=login02
PMIX_VERSION=4.2.3

@lastephey
Author

And just in case it's useful, here's the mpirun output inside a test container:

stephey@nid200020:/pscratch/sd/s/stephey/containerfiles/openmpi> mpirun -np 1 podman-hpc run --rm --openmpi-pmix openmpi:pmix /bin/bash -c "env | grep PMI"
PMIX_HOSTNAME=nid200020
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=prterun-nid200020-84066@0.0;tcp4://10.100.104.240:58825
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/tmp/prte.nid200020.75313/dvm.84066/pmix_dstor_ds21_84066
PMIX_VERSION=4.2.3
ENABLE_OPENMPI_PMIX=1
PMI_SHARED_SECRET=9474821694400732881
PMIX_RANK=0
PMIX_SERVER_URI2=prterun-nid200020-84066@0.0;tcp4://10.100.104.240:58825
PMIX_SERVER_URI3=prterun-nid200020-84066@0.0;tcp4://10.100.104.240:58825
PMIX_SERVER_URI4=prterun-nid200020-84066@0.0;tcp4://10.100.104.240:58825
PMIX_SERVER_URI21=prterun-nid200020-84066@0.0;tcp4://10.100.104.240:58825
PMIX_DSTORE_ESH_BASE_PATH=/tmp/prte.nid200020.75313/dvm.84066/pmix_dstor_ds12_84066
PMIX_SERVER_TMPDIR=/tmp/prte.nid200020.75313/dvm.84066
PMIX_NAMESPACE=prterun-nid200020-84066@1
stephey@nid200020:/pscratch/sd/s/stephey/containerfiles/openmpi> 

@rhc54
Contributor

rhc54 commented Dec 5, 2023

Hmmm...well, that all looks okay. Just for grins, let's try pushing PMIX_MCA_gds=hash into the environment and then let mpirun start an OMPI "hello". I don't think it will make any difference, but I know we saw something really weird from a user that runs in a qemu VM, so might as well cross this one off the list if we can.

@rhc54
Contributor

rhc54 commented Dec 5, 2023

Actually, I have to eat my words. The PMIx version shown above when executing mpirun is incorrect. The minimum PMIx version for OMPI v5's mpirun is v4.2.6, and that is what shipped with OMPI. You are showing v4.2.3, which I believe is the system PMIx?

So mpirun must have built against the internal PMIx v4.2.6, or else it would have error'd out during configure. However, it must have been given a path to v4.2.3 at runtime, and unfortunately picked that version up.

I honestly have no idea how mpirun will behave in that scenario - might be okay, might not. I'll probably add a runtime check for PMIx version level to avoid such crossovers in the future. Meantime, if you can ensure that mpirun gets pointed at the PMIx it was built against, that would be a good first step.

@lastephey
Author

Hi @rhc54,

I see, I think that makes sense. We'll work on that. Yes, I believe 4.2.3 is the system pmix.

Perlmutter has been down on and off since yesterday afternoon, so it might take a bit before we can do more testing.

I did test with PMIX_MCA_gds=hash shortly before it went down and I didn't notice any difference in behavior.

@lastephey
Author

@rgayatri23 and I did some more testing today; I'll try to summarize what we did.

First, I should clarify something that had both Rahul and me confused: his 5.0.0 build used --with-pmix=internal, and his 5.0.0rc12 build used --with-pmix=external. All of the tests I showed earlier were done with the 5.0.0rc12 version with external PMIx. I'll show some tests now with the internal PMIx version.

So as to whether OpenMPI was built with one PMIx but is picking up another, I am not sure. I tested with Rahul's 5.0.0 build, which uses its own internal PMIx located at /global/common/software/nersc/pe/gnu/openmpi/5.0.0/lib/libpmix.so.2.9.3.

I checked and my test application was linked to this PMIx.

stephey@nid200061:/pscratch/sd/s/stephey/test_mpi> ldd ompi_hello | grep "pmix"
	/global/common/software/nersc/pe/gnu/openmpi/5.0.0/lib/libpmix.so.2.9.3 (0x00007f10c501a000)

I also used LD_PRELOAD to force it to use this PMIx, but that didn't change the error message we reported earlier. I wonder if it is in fact using the OpenMPI PMIx, but there is a mismatch with the PMIx that Slurm is using, which appears to be the system copy:

stephey@perlmutter:login34:/usr> find . -name "libpmix*"
./lib64/libpmix.la
./lib64/libpmix.so
./lib64/libpmix.so.2
./lib64/libpmix.so.2.7.0

Here's with srun + openmpi 5 internal pmix:

stephey@nid200245:/pscratch/sd/s/stephey/test_mpi> LD_PRELOAD=/global/common/software/nersc/pe/gnu/openmpi/5.0.0/lib/libpmix.so.2.9.3 srun --mpi=pmix -n 2 ./ompi_hello
srun: Job 19067036 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=19067036.2
[nid200245:810490] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200245:810490] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200246:259699] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200246:259699] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200246:259690] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200245:810481] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200245:810481] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
[nid200246:259690] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 104
Hello world from processor nid200245, rank 0 out of 2 processors
Hello world from processor nid200246, rank 1 out of 2 processors
stephey@nid200245:/pscratch/sd/s/stephey/test_mpi> 

Here's with mpirun 5.0 + internal pmix:

stephey@nid200245:/pscratch/sd/s/stephey/test_mpi> mpirun -N 2 -np 4 ./ompi_hello
Hello world from processor nid200245, rank 0 out of 4 processors
Hello world from processor nid200245, rank 1 out of 4 processors
Hello world from processor nid200246, rank 2 out of 4 processors
Hello world from processor nid200246, rank 3 out of 4 processors

Given that the mpirun tests are "clean" (i.e. don't have any of the warnings we showed earlier), do you think we can infer anything? Maybe building with --with-pmix=internal and launching with mpirun rather than srun is a more robust solution?

To contrast, here's mpirun from 5.0rc12 (i.e. --with-pmix=external)

stephey@nid200252:/pscratch/sd/s/stephey/test_mpi> mpirun -N 2 -np 4 ./ompi_hello 
[nid200252:803657] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200252:803657] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello world from processor nid200252, rank 1 out of 4 processors
Hello world from processor nid200252, rank 0 out of 4 processors
Hello world from processor nid200253, rank 3 out of 4 processors
Hello world from processor nid200253, rank 2 out of 4 processors
stephey@nid200252:/pscratch/sd/s/stephey/test_mpi> 

@rhc54
Contributor

rhc54 commented Dec 8, 2023

First, I should clarify something that had both Rahul and me confused: his 5.0.0 build used --with-pmix=internal, and his 5.0.0rc12 build used --with-pmix=external. All of the tests I showed earlier were done with the 5.0.0rc12 version with external PMIx.

Let's please drop the rc12 build - it's old, there were many changes made before official release, etc. Can we just focus on a real official release?

I also used LD_PRELOAD to force it to use this pmix

Please do not use LD_PRELOAD in front of the srun command - you are now forcing Slurm to use a version that it wasn't built with and may not support.

Here's with mpirun 5.0 + internal pmix...Given that the mpirun tests are "clean"...

This shows we now have identified a working combination. We can now step forward with this combination.

Let's ask what happens if you simply srun (unmodified, no preloads - just your vanilla system installed Slurm) an ompi_hello built against the above version - does that work?

@lastephey
Author

Sure, here are some tests using 5.0.0.

stephey@nid200035:/pscratch/sd/s/stephey/test_mpi> type mpirun
mpirun is hashed (/global/common/software/nersc/pe/gnu/openmpi/5.0.0/bin/mpirun)
stephey@nid200035:/pscratch/sd/s/stephey/test_mpi> 

Testing with mpirun looks clean:

stephey@nid200035:/pscratch/sd/s/stephey/test_mpi> mpirun -N 2 -np 4 ./ompi_hello 
Hello world from processor nid200035, rank 0 out of 4 processors
Hello world from processor nid200035, rank 1 out of 4 processors
Hello world from processor nid200036, rank 3 out of 4 processors
Hello world from processor nid200036, rank 2 out of 4 processors

Testing with srun shows the same issue we reported earlier:

stephey@nid200035:/pscratch/sd/s/stephey/test_mpi> srun --mpi=pmix -n 2 ./ompi_hello 
[nid200036:308364] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200036:308364] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200035:235305] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200035:235305] PMIX ERROR: PMIX_ERROR in file client/pmix_client_topology.c at line 352
[nid200035:235296] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105
[nid200036:308354] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105
[nid200035:235296] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid200036:308354] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105
Hello world from processor nid200036, rank 1 out of 2 processors
Hello world from processor nid200035, rank 0 out of 2 processors

I think slurm is using /usr/lib64/libpmix.so.2.7.0 and OpenMPI 5.0.0 is using /global/common/software/nersc/pe/gnu/openmpi/5.0.0/lib/libpmix.so.2.9.3.

@rhc54
Contributor

rhc54 commented Dec 12, 2023

I'm afraid I simply cannot reproduce those results using PMIx v4.2.3 for the server and v4.2.6 for the client. I'm also unable to reproduce it when the client uses PMIx v5.0 or head of the master branch. Everything works just fine.

That said, I do see a code path that might get to that point when running under Slurm (which probably does not provide the apps with their binding info). I'll try to explore that next.

@rhc54
Contributor

rhc54 commented Dec 12, 2023

@wenduwan @hppritcha I believe the problem here is that the compute_dev_distances function in common_ofi.c is calling PMIx_Compute_distances with a garbage cpuset. I'm not entirely sure why/how that would be happening, but I cannot reproduce it locally - so this may be something to do with the user's system.

I'm afraid I cannot debug it further as I can't reproduce it on any machine available to me. Can someone perhaps trace down the OPAL code to see where the thread goes wrong?

@rhc54
Contributor

rhc54 commented Dec 12, 2023

I think I may have tracked this down to a "free" that wasn't followed by setting the free'd field to NULL, thus leaving a garbage dangling address.

@lastephey If I give you a diff for OMPI v5.0.0, would you folks be able to apply it and recompile so you can test it?

@rgayatri23

@rhc54 , Yes we can compile and test it if you can give us a patch.

@wenduwan
Contributor

wenduwan commented Dec 12, 2023

Hmmm I was NOT able to reproduce the issue on AWS but I can see it is a latent bug.

IIRC Ralph pushed a fix to https://github.com/openpmix/openpmix/commits/master. @rgayatri23 could you give it a try?

@rhc54
Contributor

rhc54 commented Dec 12, 2023

The following patch:

diff --git a/src/hwloc/pmix_hwloc.c b/src/hwloc/pmix_hwloc.c
index b485036d..40d5b40e 100644
--- a/src/hwloc/pmix_hwloc.c
+++ b/src/hwloc/pmix_hwloc.c
@@ -1016,6 +1016,7 @@ pmix_status_t pmix_hwloc_get_cpuset(pmix_cpuset_t *cpuset, pmix_bind_envelope_t
     }
     if (0 != rc) {
         hwloc_bitmap_free(cpuset->bitmap);
+        cpuset->bitmap = NULL;
         return PMIX_ERR_NOT_FOUND;
     }
     if (NULL == cpuset->source) {

needs to be applied to the 3rd-party/openpmix directory.
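If the diff is saved to a file (say pmix_hwloc.diff, the name is just an example), it can be applied with something like:

cd openmpi-5.0.0/3rd-party/openpmix
patch -p1 < /path/to/pmix_hwloc.diff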

@rgayatri23

Thanks @rhc54.
Can you confirm whether we should apply this patch to 5.0.0 or the upstream main branch?

@rhc54
Contributor

rhc54 commented Dec 12, 2023

Either one should be fine - that diff was from 5.0.0

@lastephey
Author

Thanks very much @rhc54 for trying to reproduce and giving us a possible patch!

@rgayatri23

Ok, sorry, I was confused - I did not read the entire message. The patch is for openpmix. So you want me to build PMIx separately and point the Open MPI build to it?
So you want me to build pmix separately and point Open-MPI build to it?

@rgayatri23

Ignore my previous message. Just realized what you meant.

@rgayatri23

It looks like the issue persists even with the patch

rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld> srun -n4 --mpi=pmix ./mpihello.ex
[nid200472:383798] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19263286.1/shared_mem_cuda_pool.nid200472 could be created.
[nid200472:383798] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200473:373446] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19263286.1/shared_mem_cuda_pool.nid200473 could be created.
[nid200473:373446] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[nid200473:373437] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
Lrank from MPI = 0Hello from processor nid200472, rank = 0 out of 4 processors

************************************************************************************
Lrank from MPI = 1Hello from processor nid200472, rank = 1 out of 4 processors

************************************************************************************
Lrank from MPI = 2Hello from processor nid200473, rank = 2 out of 4 processors

************************************************************************************
Lrank from MPI = 3Hello from processor nid200473, rank = 3 out of 4 processors

************************************************************************************
rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld>

Works fine with mpirun

rgayatri@nid200472:/pscratch/sd/r/rgayatri/HelloWorld> mpirun -np 4 ./mpihello.ex
Lrank from MPI = 0Hello from processor nid200472, rank = 0 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 1 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 2 out of 4 processors
Lrank from MPI = 0Hello from processor nid200472, rank = 3 out of 4 processors

@rhc54
Contributor

rhc54 commented Dec 12, 2023

No, that's a different error output from elsewhere in the code. Looks to me like you hit an error trying to create a shared memory backing file, which then falls into a bunch of other problems.

Afraid I don't know anything about the CUDA support to know where shared_mem_cuda_pool is trying to be created. My best guess is that mpirun sets us up with a TMPDIR that we can use, while srun is pointing us at a privileged directory that we cannot access. 🤷‍♂️

@rgayatri23

That's the first part, the CUDA error (which I think I know how to resolve). But the second part still shows the following error:
[nid200472:383789] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268

Or was the patch meant to resolve the following error?
[nid200036:308354] UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105

There are multiple errors, so it's a bit confusing. Sorry about that.

@hppritcha
Member

may want to try running with

export PMIX_MCA_gds=hash

and see if the warning messages change.

@rgayatri23

may want to try running with

export PMIX_MCA_gds=hash

and see if the warning messages change.

Thanks @hppritcha, this solved one of the issues (UNPACK-PMIX-VALUE: UNSUPPORTED TYPE 105).
For the other PMIX error, there is an internal issue we are tracking which pointed me in the direction that if I set PMIX_DEBUG, the issue seems to go away. The value of PMIX_DEBUG does not seem to matter in this case.

rgayatri@nid200484:/pscratch/sd/r/rgayatri/HelloWorld> PMIX_DEBUG=hello srun -N1 --ntasks-per-node=2 --mpi=pmix --gpus-per-task=1 --gpu-bind=none ./mpihello.ex
[nid200484:1228192] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19266816.3/shared_mem_cuda_pool.nid200484 could be created.
[nid200484:1228192] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
Lrank from MPI = 0Hello from processor nid200484, rank = 0 out of 2 processors

************************************************************************************
Lrank from MPI = 1Hello from processor nid200484, rank = 1 out of 2 processors

************************************************************************************

So now we only need to understand the cuda issue. My tricks did not work on resolving it.

@rgayatri23

I think issue #11831 is trying to address a situation similar to what I am observing with the CUDA issue.

@hppritcha
Member

How much space is in /tmp?

If you could rebuild Open MPI with --enable-debug set and run with

export OMPI_MCA_shmem_base_verbose=100

we might get more info about why the creation of the shared memory file is not succeeding?

@rgayatri23

Here is the relevant information that I saw with debug enabled:

[nid200305:346908] shmem: mmap: shmem_ds_resetting
[nid200305:346908] shmem: mmap: backing store base directory: /tmp/spmix_appdir_73349_19269759.0/shared_mem_cuda_pool.nid200305
[nid200305:346908] WARNING: opal_path_df failure!
[nid200305:346908] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_73349_19269759.0/shared_mem_cuda_pool.nid200305 could be created.
[nid200305:346908] shmem: mmap: shmem_ds_resetting
[nid200305:346908] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[nid200305:346911] shmem: mmap: shmem_ds_resetting
[nid200305:346911] shmem: mmap: backing store base directory: /dev/shm/sm_segment.nid200305.73349.4fb90000.3
[nid200305:346911] shmem: mmap: create successful (id: 59, size: 16777216, name: /dev/shm/sm_segment.nid200305.73349.4fb90000.3)
[nid200305:346911] shmem: mmap: attach successful (id: 59, size: 16777216, name: /dev/shm/sm_segment.nid200305.73349.4fb90000.3)
[nid200305:346909] shmem: mmap: shmem_ds_resetting

/tmp is completely empty so I am not sure what the issue is.

@hppritcha
Member

opal_path_df is trying to stat /tmp/spmix_appdir_73349_19269759.0/ and getting an error.

@rhc54
Contributor

rhc54 commented Dec 12, 2023

if I set PMIX_DEBUG the issue seems to go away

That makes no sense - that envar just controls debugging output. It has no influence over the pack/unpack system. I suspect all it did was tell the code "don't tell me about errors". The value of the envar is used to set the verbosity level - passing nothing but a string (e.g., "hello") just means that strtoul returns 0, which turns off the output.

What the error message is saying is that we were unable to complete the allgather of connection information across the procs. So your simple "hello" might work, but a real application will almost certainly fail. I suspect it has something to do with the problems in setting up the backing store as that directory/file name is one of the things we pass.

@hppritcha
Member

I'm getting a little lost here and want to see if I can reproduce on Perlmutter.
Which version of Open MPI are you now trying to use, and which version of PMIx?
Also, are you using the same config options for Open MPI and PMIx given in the initial comment? If not, what are they now?

@rgayatri23

rgayatri23 commented Dec 12, 2023

This is my configure line, using --with-pmix=internal:

 ./configure CC=cc FC=ftn CXX=CC CFLAGS=--cray-bypass-pkgconfig CXXFLAGS=--cray-bypass-pkgconfig FCFLAGS=--cray-bypass-pkgconfig LDFLAGS=--cray-bypass-pkgconfig --enable-orterun-prefix-by-default  --prefix=/pscratch/sd/r/rgayatri/software/gnu/openmpi/5.0.0 --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7 --with-cuda-libdir=/usr/lib64 --with-ucx=no --with-verbs=no --enable-mpi-java --with-ofi --enable-mpi1-compatibility --with-pmix=internal --disable-sphinx --enable-debug

@ggouaillardet
Contributor

ggouaillardet commented Dec 13, 2023

My best bet would be to manually instrument opal_path_df() in opal/util/path.c to figure out what is going wrong here (see the sketch after this list):

  • does the /tmp/spmix_appdir_73349_19269759.0 directory exist?
  • does the /tmp/spmix_appdir_73349_19269759.0/shared_mem_cuda_pool.nid200305 exist?
  • does statfs() or statvfs() fail and why?
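For example, a quick standalone check (a sketch, not the actual OPAL code) run inside the job against the reported path could narrow that down:

/* Sketch: mimic the checks opal_path_df() relies on for a given path.
   Usage: ./check_df /tmp/spmix_appdir_73349_19269759.0 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) != 0) {
        printf("stat(%s) failed: %s\n", argv[1], strerror(errno));
    } else {
        printf("stat(%s) ok, mode %o\n", argv[1], (unsigned)(st.st_mode & 07777));
    }

    struct statvfs vfs;
    if (statvfs(argv[1], &vfs) != 0) {
        printf("statvfs(%s) failed: %s\n", argv[1], strerror(errno));
    } else {
        printf("statvfs(%s) ok, free space %llu bytes\n", argv[1],
               (unsigned long long)vfs.f_bavail * vfs.f_frsize);
    }
    return 0;
}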

@hppritcha
Member

are these runs being done on a GPU partition or CPU?

I built OMPI 4.1.6 without --enable-debug and with the internal PMIx, and when running on the CPU partition I see this:

pp@nid004909:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 16 -N 2 ./connectivity_c
[nid006594:1638790] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid004909:1830494] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Connectivity test on 16 processes PASSED.

@hppritcha
Member

On the GPU side I see this:

hpp@nid001944:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 16 -N 2 ./connectivity_c
[nid001944:2157258] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[nid001944:2157258] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[nid001944:2157258] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[nid001945:1798015] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Connectivity test on 16 processes PASSED.

I recall we poked around with this previously and determined that the PIDs here were coming from the transient slurmstepd that subsequently exec'd the real application.

@rhc54
Contributor

rhc54 commented Dec 14, 2023

You might check to ensure that you have the same Slurm running on all the nodes, and that each slurmd is in fact getting the same PMIx lib. If that error is coming from the slurmd, then I'll bet you that the slurmd on another node is picking up a debug PMIx lib, while this slurmd is using a non-debug one.

We handle such cross-over between slurmd and application proc - but not between slurmd's. They must be using the same library, including same debug setting.

@hppritcha
Member

Hmm... I suspect that for this NERSC system the same Slurm is indeed running on all the nodes.
I tried some simple PMIx-only tests and they don't emit these error messages.

I also noticed NERSC has several SLURM_PMIX environment variables set by default. Any idea why these are set? (A question for the NERSC Slurm specialists.)

I did some more careful testing of 4.1.6, making absolutely sure I was using the PMIx that NERSC used for building the PMIx plugin. I stopped using CUDA and used the --disable-mca-dso config option to pull libpmix.so into the executable (the two don't mix). See below:

hpp@nid001497:~/ompi/examples> ((v4.1.6))ldd hello_c
	linux-vdso.so.1 (0x00007ffecbd25000)
	libxpmem.so.0 => /opt/cray/xpmem/2.6.2-2.5_2.33__gd067c3f.shasta/lib64/libxpmem.so.0 (0x00007fdbca6f5000)
	libmpi.so.40 => /global/homes/h/hpp/ompi/install_v416_epmix/lib/libmpi.so.40 (0x00007fdbca3ed000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdbca3a6000)

...
	libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x00007f2c87200000)
	libpmix.so.2 => /usr/lib64/libpmix.so.2 (0x00007f2c86c00000)
	libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f2c86bf5000)
...
	libc.so.6 => /lib64/libc.so.6 (0x00007fdbca1ad000)

To be more specific, I was using the pmix stuff from the NERSC PMIX RPM - pmix-4.2.3-2_nersc.x86_64
And here's the output from the pmix_info from that RPM:

hpp@login13:~/ompi/examples> ((v4.1.6))which pmix_info
/usr/bin/pmix_info
hpp@login13:~/ompi/examples> ((v4.1.6))rpm -qf /usr/bin/pmix_info
pmix-4.2.3-2_nersc.x86_64
hpp@login13:~/ompi/examples> ((v4.1.6))pmix_info
                 Package: PMIx root@runner-vzcz-kjx-project-87-concurrent-0
                          Distribution
                    PMIX: 4.2.3
      PMIX repo revision: gitc5661387
       PMIX release date: Feb 07, 2023
           PMIX Standard: 4.2
       PMIX Standard ABI: Stable (0.0), Provisional (0.0)
                  Prefix: /usr
 Configured architecture: pmix.arch
          Configure host: runner-vzcz-kjx-project-87-concurrent-0
           Configured by: root
           Configured on: Sun Sep 10 00:33:52 UTC 2023
          Configure host: runner-vzcz-kjx-project-87-concurrent-0
  Configure command line: '--host=x86_64-suse-linux-gnu'
                          '--build=x86_64-suse-linux-gnu' '--program-prefix='
                          '--disable-dependency-tracking' '--prefix=/usr'
                          '--exec-prefix=/usr' '--bindir=/usr/bin'
                          '--sbindir=/usr/sbin' '--sysconfdir=/etc'
                          '--datadir=/usr/share' '--includedir=/usr/include'
                          '--libdir=/usr/lib64' '--libexecdir=/usr/lib'
                          '--localstatedir=/var' '--sharedstatedir=/var/lib'
                          '--mandir=/usr/share/man'
                          '--infodir=/usr/share/info'
                          '--disable-dependency-tracking'
                Built by: 
                Built on: Sun Sep 10 00:36:09 UTC 2023
              Built host: runner-vzcz-kjx-project-87-concurrent-0
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: "7" "." "5" "." "0"
  Internal debug support: no
              dl support: yes
     Symbol vis. support: yes
          Manpages built: yes
              MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
           MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA pfexec: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.3)
                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.3)
        MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v4.2.3)
        MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA prm: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: none (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: munge (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component
                          v4.2.3)
              MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
               MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v4.2.3)

I also noticed with my hello world program that the PMIX ERROR messages being emitted by what seem to be the slurmstepd daemons are not deterministic:

hpp@nid001497:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 4 -N 2 ./hello_c
Hello, world, I am 1 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 0 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
[nid002424:601313] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello, world, I am 3 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 2 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
hpp@nid001497:~/ompi/examples> ((v4.1.6))env | grep PMIX
PMIXHOME=/
SLURM_PMIX_DIRECT_CONN_UCX=false
SLURM_PMIX_DIRECT_CONN=false
hpp@nid001497:~/ompi/examples> ((v4.1.6))unset SLURM_PMIX_DIRECT_CONN
hpp@nid001497:~/ompi/examples> ((v4.1.6))unset SLURM_PMIX_DIRECT_CONN_UCX
hpp@nid001497:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 4 -N 2 ./hello_c
Hello, world, I am 1 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 0 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 2 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 3 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
hpp@nid001497:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 4 -N 2 ./hello_c
Hello, world, I am 2 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 3 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 0 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 1 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
hpp@nid001497:~/ompi/examples> ((v4.1.6))srun --mpi=pmix -n 4 -N 2 ./hello_c
[nid001497:46252] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid002424:602238] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello, world, I am 0 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 1 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 2 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)
Hello, world, I am 3 of 4, (Open MPI v4.1.6rc4, package: Open MPI hpp@login13 Distribution, ident: 4.1.6rc4, repo rev: v4.1.6, Unreleased developer copy, 125)

I think we have at least two problems here, possibly three.

  • First is the original problem and the reason for opening this issue.
  • The PMIX error messages are a different problem and one that I really think needs to be reported to SchedMD to see if they can fix this in their pmix plugin.
  • Then there's the TMPDIR problem with the smcuda BTL. I'm not observing that with my CUDA-enabled builds, so I can only suggest you do what @ggouaillardet suggested and add debug statements into Open MPI.

@rgayatri23

So NERSC updated PMIx to 4.2.7 in the recent maintenance and I built OpenMPI/5.0.0 using the newer PMIx.
This resolved the PMIX errors we were observing before, specifically this one:
[nid001944:2157258] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268

@jsquyres
Member

jsquyres commented Jan 8, 2024

I'm just catching up on this -- has the discussion moved from v4.1.x to v5.0.x? I'm guessing that if there are fixes that are needed on the run-time side of things, it will be significantly easier to get them in a v5.0.x-related release (e.g., for PMIx and/or PRRTE).
