
Unable to allocate shared memory for intra-node messaging. Delete stale shared memory files in /dev/shm. #6727

Closed
connorourke opened this issue May 31, 2019 · 13 comments

@connorourke

connorourke commented May 31, 2019

Background information

Running a simple Python program that spawns a Fortran executable fails.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

(Open MPI) 4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from source tarball:

  Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/home/c/cor22/scratch/PPFIT_DEVEL/HSE_2.0_DEVEL/PPFIT_ENV/openmpi.4.0.1'
                          '--with-wrapper-ldflags=-m64'
                          '--with-wrapper-fcflags=-m64'
                          '--with-wrapper-cxxflags=-m64'
                          '--with-wrapper-cflags=-m64' '--with-psm=/usr'
                          '--with-psm-libdir=/usr/lib64'
                          '--enable-contrib-no-build=vt' '--with-pic'
                          '--enable-shared'
                          '--with-cuda=/apps/nvidia/toolkit/current'
                          '--with-slurm=/cm/shared/apps/slurm/17.11.7'

Please describe the system on which you are running

  • Operating system/version:
    Scientific Linux release 6.9 (Carbon)
  • Computer hardware:
    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  • Network type:
    Intel true scale infiniband

Details of the problem

Running the following:

test.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pin the spawned processes to the same host as the parent process.
mpi_info = MPI.Info.Create()
mpi_info.Set("host", MPI.Get_processor_name())

# Spawn two copies of the Fortran executable, synchronise, then disconnect.
executable = "./hello"
commspawn = MPI.COMM_SELF.Spawn(executable, args="--oversubscribe", maxprocs=2, info=mpi_info)
commspawn.Barrier()
commspawn.Disconnect()

print("rank", rank)

hello.f90:

   program hello
   include 'mpif.h'
   integer :: rank, size, ierror, tag, status(MPI_STATUS_SIZE)
   integer :: mpi_comm_parent, length, dum_err
   character(MPI_MAX_ERROR_STRING) :: message

   call MPI_INIT(ierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   print *, 'node', rank, ': Hello world'

   ! Reconnect to the parent intercommunicator created by MPI_Comm_spawn,
   ! then synchronise and disconnect from it.
   call MPI_COMM_GET_PARENT(mpi_comm_parent, ierror)

   if (mpi_comm_parent .ne. MPI_COMM_NULL) then
       call MPI_BARRIER(mpi_comm_parent, ierror)
       call MPI_COMM_DISCONNECT(mpi_comm_parent, ierror)
   end if

   if (ierror .ne. MPI_SUCCESS) then
       write(6,*) 'Error detected in disconnect_from_parent(). Exiting'
       call MPI_ERROR_STRING(ierror, message, length, dum_err)
       write(6,*) message
       call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
   end if

   call MPI_FINALIZE(ierror)
   end

as

mpirun --mca btl_openib_allow_ib 1 --oversubscribe --hostfile slurm.hosts -np 4  python3 ./test.py

Fails with the following error:

itd-ngpu-02: Unable to allocate shared memory for intra-node messaging.
itd-ngpu-02: Delete stale shared memory files in /dev/shm.
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

After execution the following files appear in /dev/shm:

/dev/shm/psm_shm.96c0d5f5-ae86-9515-0beb-0fe6ad698210

But there is certainly space:

df /dev/shm
Filesystem     1K-blocks  Used Available Use% Mounted on
none            66124848  1588  66123260   1% /dev/shm
@connorourke
Author

Same thing with 3fd5c84 checked out from the repo.

@rhc54
Contributor

rhc54 commented May 31, 2019

Your listing shows a psm shmem file in /dev/shm - I suspect it may be the cause of the conflict. Try removing it.
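For reference, a minimal cleanup sketch in Python, assuming the stale segments follow the psm_shm.* naming shown above and that no running job still has them mapped:

import glob
import os

# Remove leftover PSM shared-memory segments from /dev/shm.
# Assumes the psm_shm.* naming seen above and that no running job
# is still using these files.
for path in glob.glob("/dev/shm/psm_shm.*"):
    try:
        os.remove(path)
        print("removed", path)
    except OSError as exc:
        print("could not remove", path, ":", exc)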

@connorourke
Author

I have tried removing these files; it makes no difference. They are produced during the run and don't get cleaned up when the run fails. The df /dev/shm output was taken after the run, but the /dev/shm directory was empty before it started.

@connorourke
Author

Same thing happens when running the spawn tests in the mpi4py test suite.

@hppritcha hppritcha self-assigned this Jun 17, 2019
@hppritcha
Member

Could you check which version of PSM2 is installed on your system? Run the following command and post the output to this issue:

rpm -qa | grep -i psm2

@heasterday
Contributor

heasterday commented Jul 2, 2019

I'm not currently able to reproduce this issue. @connorourke, if you can provide your PSM2 version, that will allow me to better replicate your setup and perhaps reproduce the issue.

@connorourke
Author

Hi @heasterday, sorry for the delay. I'm on holiday, back in a couple of weeks - I'll get on it then. 👍

@heasterday
Contributor

@connorourke By any chance have you had the time to look at this?

@connorourke
Author

connorourke commented Aug 30, 2019

@heasterday - apologies for the delay.

rpm -qa | grep -i psm2 produces no output.

I built psm myself, and it's version 3.3.

@heasterday
Contributor

@connorourke No worries. Good catch: psm is what we were interested in, not psm2.

I got access to a system with psm-compatible cards and built up a stack similar to yours: psm 3.3, Open MPI 4.0.1, and Python 3. So far I still haven't been able to reproduce what you are seeing, although the use of --oversubscribe and a hostfile complicated things. I see you built with Slurm support; are you running under an allocation? If so, you shouldn't need the hostfile. In the interest of simplifying the reproducer, could you run without the --oversubscribe and --hostfile options? Given your example, you would need to request an allocation of at least 8 processes. I'm curious whether this changes your outcome.
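For example, inside a Slurm allocation of at least 8 tasks, the simplified invocation might look like this (keeping the remaining options from the original command, with the same test.py):

mpirun --mca btl_openib_allow_ib 1 -np 4 python3 ./test.py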

@connorourke
Author

@heasterday the reason I had --oversubscribe in there was that I was running these tests on an interactive node, but the same thing happens when it is removed (along with the hostfile) and the job is submitted to the queue with Slurm.

I can get this test code to run fine using v3.1.4 in a conda env, but then I was running into this problem: #6710. I was trying to find a setup that didn't fail like #6710 when I came across the issue you are looking at now.

I have since refactored the actual code these test snippets were written for so that it spawns a single instance of the executable and tears it down from within, rather than spawning multiple instances and closing them down each time. This works fine, so if no one else hits this error I am happy for you to close this issue.
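A rough sketch of that spawn-once pattern in mpi4py (the "./worker" name and the task loop are placeholders, not the actual refactored code):

from mpi4py import MPI

# Spawn one long-lived worker instead of spawning and disconnecting
# for every task. "./worker" and the task loop are hypothetical.
worker = MPI.COMM_SELF.Spawn("./worker", maxprocs=1)

for task in range(10):
    # ... exchange work and results with the spawned process over `worker` ...
    pass

# Tear down once, at the end of the run.
worker.Barrier()
worker.Disconnect()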

@heasterday
Contributor

heasterday commented Sep 17, 2019

Happy to hear you were able to find a workaround. @hppritcha, sounds like this can be closed.


It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
