
Unable to allocate shared memory for intra-node messaging. Delete stale shared memory files in /dev/shm. #6727

Closed
connorourke opened this issue May 31, 2019 · 13 comments

@connorourke

connorourke commented May 31, 2019

Background information

Running a simple Python program that spawns a Fortran executable fails.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

(Open MPI) 4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from source tarball:

  Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/home/c/cor22/scratch/PPFIT_DEVEL/HSE_2.0_DEVEL/PPFIT_ENV/openmpi.4.0.1'
                          '--with-wrapper-ldflags=-m64'
                          '--with-wrapper-fcflags=-m64'
                          '--with-wrapper-cxxflags=-m64'
                          '--with-wrapper-cflags=-m64' '--with-psm=/usr'
                          '--with-psm-libdir=/usr/lib64'
                          '--enable-contrib-no-build=vt' '--with-pic'
                          '--enable-shared'
                          '--with-cuda=/apps/nvidia/toolkit/current'
                          '--with-slurm=/cm/shared/apps/slurm/17.11.7'

Please describe the system on which you are running

  • Operating system/version:
    Scientific Linux release 6.9 (Carbon)
  • Computer hardware:
    Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  • Network type:
    Intel true scale infiniband

Details of the problem

Running the following:

test.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pin the spawned processes to the same host as the parent process.
mpi_info = MPI.Info.Create()
mpi_info.Set("host", MPI.Get_processor_name())

# Spawn two copies of the Fortran executable, synchronise, then disconnect.
executable = "./hello"
commspawn = MPI.COMM_SELF.Spawn(executable, args="--oversubscribe", maxprocs=2, info=mpi_info)
commspawn.Barrier()
commspawn.Disconnect()

print("rank", rank)

hello.f90:

   program hello
   include 'mpif.h'
   integer :: rank, size, ierror, tag, status(MPI_STATUS_SIZE)
   integer :: mpi_comm_parent, length, dum_err
   character(MPI_MAX_ERROR_STRING) :: message

   call MPI_INIT(ierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   print *, 'node', rank, ': Hello world'

   ! Reconnect to the parent intercommunicator created by MPI_Comm_spawn,
   ! then synchronise and disconnect from it.
   call MPI_COMM_GET_PARENT(mpi_comm_parent, ierror)

   if (mpi_comm_parent .ne. MPI_COMM_NULL) then
       call MPI_BARRIER(mpi_comm_parent, ierror)
       call MPI_COMM_DISCONNECT(mpi_comm_parent, ierror)
   end if

   if (ierror .ne. MPI_SUCCESS) then
       write(6,*) 'Error detected in disconnect_from_parent(). Exiting'
       call MPI_ERROR_STRING(ierror, message, length, dum_err)
       write(6,*) message
       call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
   end if

   call MPI_FINALIZE(ierror)
   end

as

mpirun --mca btl_openib_allow_ib 1 --oversubscribe --hostfile slurm.hosts -np 4  python3 ./test.py

Fails with the following error:

itd-ngpu-02: Unable to allocate shared memory for intra-node messaging.
itd-ngpu-02: Delete stale shared memory files in /dev/shm.
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

After execution the following files appear in /dev/shm:

/dev/shm/psm_shm.96c0d5f5-ae86-9515-0beb-0fe6ad698210

But there is certainly space:

df /dev/shm
Filesystem     1K-blocks  Used Available Use% Mounted on
none            66124848  1588  66123260   1% /dev/shm
@connorourke
Author

Same thing with 3fd5c84 checked out from the repo.

@rhc54
Contributor

rhc54 commented May 31, 2019

Your listing shows a psm shmem file in /dev/shm - I suspect it may be the cause of the conflict. Try removing it.
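For reference, a minimal cleanup sketch in Python, assuming the stale segments follow the psm_shm.* naming shown above and that no running job still has them mapped:

import glob
import os

# Remove leftover PSM shared-memory segments from /dev/shm.
# Assumes the psm_shm.* naming seen above and that no running job
# is still using these files.
for path in glob.glob("/dev/shm/psm_shm.*"):
    try:
        os.remove(path)
        print("removed", path)
    except OSError as exc:
        print("could not remove", path, ":", exc)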

@connorourke
Author

I have tried removing these files; it makes no difference. They are produced during the run and don't get cleaned up when the run fails. The df /dev/shm output was taken after the run, but the /dev/shm directory was empty before it started.

@connorourke
Author

Same thing happens when running the spawn tests in the mpi4py test suite.

@hppritcha hppritcha self-assigned this Jun 17, 2019
@hppritcha
Member

Could you check which version of PSM2 is installed on your system? Run the following command and post the output to this issue:

rpm -qa | grep -i psm2

@heasterday
Contributor

heasterday commented Jul 2, 2019

I'm not currently able to reproduce this issue. @connorourke, if you can provide your PSM2 version, that will allow me to better replicate your setup and perhaps reproduce the issue.

@connorourke
Author

Hi @heasterday, sorry for the delay. I'm on holiday, back in a couple of weeks - I'll get on it then. 👍

@heasterday
Contributor

@connorourke By any chance have you had the time to look at this?

@connorourke
Author

connorourke commented Aug 30, 2019

@heasterday - apologies for the delay.

rpm -qa | grep -i psm2 produces no output.

I built psm myself, and it's version 3.3.

@heasterday
Contributor

@connorourke No worries. Good catch: psm is what we were interested in, not psm2.

I got access to a system with psm-compatible cards and built up a stack similar to yours: psm 3.3, Open MPI 4.0.1, and Python 3. So far I still haven't been able to reproduce what you are seeing, although the use of --oversubscribe and a hostfile complicated things. I see you built with Slurm support; are you running under an allocation? If so, you shouldn't need the hostfile. In the interest of simplifying the reproducer, could you run without the --oversubscribe and --hostfile options? Given your example, you would need to request an allocation of at least 8 processes. I'm curious whether this changes your outcome.
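For example, inside a Slurm allocation of at least 8 tasks, the simplified invocation might look like this (keeping the remaining options from the original command, with the same test.py):

mpirun --mca btl_openib_allow_ib 1 -np 4 python3 ./test.py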

@connorourke
Author

@heasterday the reason I had --oversubscribe in there was that I was running these tests on an interactive node, but the same thing happens when it is removed (along with the hostfile) and the job is submitted to the queue with Slurm.

I can get this test code to run fine using v3.1.4 in a conda env, but then I was running into this problem: #6710. I was trying to find a setup that didn't fail like #6710 when I came across the issue you are looking at now.

I have since refactored the actual code these test snippets were written for so that it spawns a single instance of the executable and tears it down from within, rather than spawning multiple instances and closing them down each time. This works fine, so if no one else hits this error I am happy for you to close this issue.
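A rough sketch of that spawn-once pattern in mpi4py (the "./worker" name and the task loop are placeholders, not the actual refactored code):

from mpi4py import MPI

# Spawn one long-lived worker instead of spawning and disconnecting
# for every task. "./worker" and the task loop are hypothetical.
worker = MPI.COMM_SELF.Spawn("./worker", maxprocs=1)

for task in range(10):
    # ... exchange work and results with the spawned process over `worker` ...
    pass

# Tear down once, at the end of the run.
worker.Barrier()
worker.Disconnect()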

@heasterday
Contributor

heasterday commented Sep 17, 2019

Happy to hear you were able to find a workaround. @hppritcha, sounds like this can be closed.


It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
