Description
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v2.1.3 and v3.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
github releases
./configure --with-slurm --with-psm --prefix=$PWD/install --with-slurm --with-pmi
Please describe the system on which you are running
- Operating system/version: RHEL7
- Computer hardware: x86_64
- Network type: PSM
Details of the problem
When running multiple srun invocations within an allocation on a PSM system, I noticed that the first invocation would run my application OK, but trying to simultaneously run another job would fail with these errors:
catalyst285: Unable to allocate shared memory for intra-node messaging.
catalyst285: Delete stale shared memory files in /dev/shm
After doing some debugging, I noticed that both srun invocations were trying to create a file with the same name in /dev/shm, i.e.:
bash-4.2$ ls -l /dev/shm/psm*
-rwx------ 1 lee218 lee218 6352896 Apr 4 08:12 /dev/shm/psm_shm.1a000000-ed6e-0000-1a00-00001a000000
On PSM2 systems, the same Open MPI installation generates unique hashes:
bash-4.2$ ls -l /dev/shm/psm2_shm.5264903ffffff00bd0*
-rw------- 1 lee218 lee218 4337664 Apr 4 08:12 /dev/shm/psm2_shm.5264903ffffff00bd03010
-rw------- 1 lee218 lee218 4337664 Apr 4 08:12 /dev/shm/psm2_shm.5264903ffffff00bd04010
I tracked this issue down to https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/s2/pmix_s2.c#L233. On PSM systems, the jobid and stepid are used to generate the hash (I believe some other mechanism is used on PSM2 systems). The problem is that the attempt to add the stepid to the jobid always adds 0. When run under srun, pmix_kvs_name has the form "jobid.jobstep". The first strtoul call parses the jobid and leaves str pointing at the "." in that string, so the second call, strtoul(str,...), always returns 0 because "." is not a digit. I believe the correct thing to do is to start parsing at the following character, i.e., call strtoul(str+1,...). The same problem probably exists in https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/s1/pmix_s1.c#L226.
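To illustrate the strtoul behavior described above, here is a minimal standalone sketch. It is not copied from pmix_s2.c: the helper names stepid_buggy and stepid_fixed are hypothetical, and the input is assumed to be a "jobid.jobstep" string as srun provides it.

```c
#include <stdlib.h>

/* Hypothetical helper: parses the step ID the way the issue describes,
 * i.e. the second strtoul starts at the '.' left behind by the first. */
static unsigned long stepid_buggy(const char *kvs_name)
{
    char *str;
    (void)strtoul(kvs_name, &str, 10);  /* consumes the jobid; str -> "." */
    return strtoul(str, NULL, 10);      /* '.' is not a digit, so this is 0 */
}

/* Hypothetical helper: the proposed fix, skipping the '.' separator
 * before parsing the step ID. */
static unsigned long stepid_fixed(const char *kvs_name)
{
    char *str;
    (void)strtoul(kvs_name, &str, 10);  /* consumes the jobid; str -> "." */
    if (*str != '.') {
        return 0;                       /* no step component present */
    }
    return strtoul(str + 1, NULL, 10);  /* parse the digits after the '.' */
}
```

With an input such as "12345.7", stepid_buggy returns 0 for every job step, while stepid_fixed returns 7, so two steps in the same allocation would no longer contribute the same value to the hash.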
I will also note that when running on the PSM system using orterun instead of srun, a unique hash is generated for multiple orterun invocations:
bash-4.2$ ls -l /dev/shm/psm*
-rwx------ 1 lee218 lee218 12566528 Apr 4 08:28 /dev/shm/psm_shm.452df885-6aec-13fc-52f9-2dc5ff846503
-rwx------ 1 lee218 lee218 12566528 Apr 4 08:28 /dev/shm/psm_shm.ba118e7f-7761-16d9-8a95-6f1e3de17946