Multiple active srun invocations fail in an allocation on PSM system #5008

@lee218llnl

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v2.1.3 and v3.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

GitHub releases, configured with:
./configure --with-slurm --with-psm --prefix=$PWD/install --with-slurm --with-pmi

Please describe the system on which you are running

  • Operating system/version: RHEL7
  • Computer hardware: x86_64
  • Network type: PSM

Details of the problem

When running multiple srun invocations within an allocation on a PSM system, I noticed that the first invocation runs my application successfully, but a second srun launched while the first is still running fails with these errors:

catalyst285: Unable to allocate shared memory for intra-node messaging.
catalyst285: Delete stale shared memory files in /dev/shm

After doing some debugging, I noticed that both srun invocations were trying to create a file with the same name in /dev/shm, i.e.:

bash-4.2$ ls -l /dev/shm/psm*
-rwx------ 1 lee218 lee218 6352896 Apr 4 08:12 /dev/shm/psm_shm.1a000000-ed6e-0000-1a00-00001a000000

On PSM2 systems, the same Open MPI installation generates unique hashes:

bash-4.2$ ls -l /dev/shm/psm2_shm.5264903ffffff00bd0*
-rw------- 1 lee218 lee218 4337664 Apr 4 08:12 /dev/shm/psm2_shm.5264903ffffff00bd03010
-rw------- 1 lee218 lee218 4337664 Apr 4 08:12 /dev/shm/psm2_shm.5264903ffffff00bd04010

I tracked this issue down to https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/s2/pmix_s2.c#L233. On PSM systems, the jobid and stepid are used to generate the hash (I believe some other mechanism is used on PSM2 systems). The problem is that the attempt to add the stepid to the jobid always ends up with a stepid of 0. When run under srun, pmix_kvs_name is of the form "jobid.jobstep". The first strtoul leaves str pointing at the "." in that string, so calling strtoul(str,...) always returns 0 because the "." character is not a digit. I believe the correct fix is to start parsing at the following character (i.e., call strtoul(str+1,...)). The same problem probably exists in https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/s1/pmix_s1.c#L226.
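To make the strtoul behavior concrete, here is a minimal standalone sketch. It is not the Open MPI source: the function names pack_jobid_buggy and pack_jobid_fixed are hypothetical, and the layout (jobid in the upper 16 bits, stepid in the lower 16 bits) is an assumption chosen only to show how two job steps end up with the same identifier.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the Open MPI code. Assumed layout: jobid in the
 * upper 16 bits, stepid in the lower 16 bits.
 * Buggy variant: the stepid is parsed starting at the '.' itself, so
 * strtoul() finds no digits and returns 0 for every job step. */
uint32_t pack_jobid_buggy(const char *kvs_name)
{
    char *end = NULL;
    uint32_t jobid = (uint32_t)strtoul(kvs_name, &end, 10);
    uint32_t packed = (jobid << 16) & 0xffff0000u;

    if (end != NULL) {
        uint32_t stepid = (uint32_t)strtoul(end, NULL, 10); /* end -> "." */
        packed |= stepid & 0x0000ffffu;
    }
    return packed;
}

/* Proposed variant: skip the '.' separator before parsing the stepid. */
uint32_t pack_jobid_fixed(const char *kvs_name)
{
    char *end = NULL;
    uint32_t jobid = (uint32_t)strtoul(kvs_name, &end, 10);
    uint32_t packed = (jobid << 16) & 0xffff0000u;

    if (end != NULL && *end == '.') {
        uint32_t stepid = (uint32_t)strtoul(end + 1, NULL, 10);
        packed |= stepid & 0x0000ffffu;
    }
    return packed;
}
```

With inputs such as "1234.0" and "1234.1" (made-up values), the buggy variant returns the same identifier for both steps, while the fixed variant returns distinct ones. That would be consistent with two simultaneous srun steps deriving the same /dev/shm file name.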

I will also note that when running on the PSM system using orterun instead of srun, a unique hash is generated for multiple orterun invocations:

bash-4.2$ ls -l /dev/shm/psm*
-rwx------ 1 lee218 lee218 12566528 Apr 4 08:28 /dev/shm/psm_shm.452df885-6aec-13fc-52f9-2dc5ff846503
-rwx------ 1 lee218 lee218 12566528 Apr 4 08:28 /dev/shm/psm_shm.ba118e7f-7761-16d9-8a95-6f1e3de17946
