Unable to allocate shared memory for intra-node messaging. Delete stale shared memory files in /dev/shm. #6727
Comments
Same thing with 3fd5c84 checked out from the repo.
Your listing shows a psm shmem file in /dev/shm; I suspect it may be the cause of the conflict. Try removing it.
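For example (the glob pattern is an assumption based on the file described above; check the listing first):

```shell
# Remove stale PSM shared-memory file(s); the psm* pattern is an
# assumption based on the file name mentioned in this thread.
rm -f /dev/shm/psm*
```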
I have tried removing these files; it makes no difference. They are produced during the run and don't get cleaned up when the run fails.
Same thing happens when running the spawn tests in the mpi4py test suite.
Could you check which version of PSM2 is installed on your system? Run the following command and post the output to this issue:
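A plausible form of that check on an RPM-based system such as Scientific Linux (the exact command is an assumption):

```shell
# List installed PSM/PSM2 packages; assumes an RPM-based distribution.
rpm -qa | grep -i psm
```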
I'm not able to reproduce this issue currently. @connorourke, if you can provide your psm2 version, that would allow me to better replicate your setup and perhaps reproduce the issue.
Hi @heasterday, sorry for the delay. I'm on holiday, back in a couple of weeks; I'll get on it then. 👍
@connorourke By any chance have you had the time to look at this?
@heasterday, apologies for the delay.
I built …
@connorourke No worries. Good catch: psm is what we were interested in, not psm2. I got access to a system with PSM-compatible cards and built up a stack similar to yours: psm 3.3, ompi 4.0.1, and Python 3. So far I still haven't been able to reproduce what you are seeing, although the use of --oversubscribe and a host file complicated things. I see you build with Slurm support; are you running under an allocation? If so, you shouldn't need the hostfile. In the interest of simplifying the reproducer, could you run without the --oversubscribe and --hostfile options? Given your example, you would need to request an allocation of at least 8 processes. I'm curious whether this changes your outcome.
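A sketch of that simplified invocation, assuming a Slurm allocation (the task count of 8 comes from the example in this thread; the rest is illustrative):

```shell
# Request at least 8 tasks, then launch without --oversubscribe/--hostfile.
salloc -n 8
mpirun -np 1 python test.py
```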
@heasterday the reason I had the … I can get this test code to run fine using v3.1.4 in a conda env, but was running into this problem: #6710. I was trying to find a setup that didn't fail like #6710 when I came across the current issue you are looking at. I have since refactored the actual code these pieces of test code were written for so that it spawns a single instance of the executable and tears down from within, rather than spawning multiple instances and closing down each time. This works fine, so I am happy for this issue to be closed if no one else hits the error.
Happy to hear you were able to find a workaround. @hppritcha, sounds like this can be closed.
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Background information
Running a simple Python program that spawns a Fortran executable fails.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
(Open MPI) 4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from source tarball:
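Presumably along these lines, given the Slurm support mentioned elsewhere in the thread (the install prefix and any other flags are assumptions):

```shell
# Illustrative build; --with-slurm is inferred from the thread,
# the install prefix is an assumption.
./configure --prefix=$HOME/openmpi-4.0.1 --with-slurm
make -j all install
```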
Please describe the system on which you are running
Scientific Linux release 6.9 (Carbon)
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Intel true scale infiniband
Details of the problem
Running the following:
test.py:
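A minimal sketch of the reproducer described in this thread, assuming mpi4py and a compiled ./hello binary (the file names, loop count, and process count are assumptions):

```python
# Hypothetical reconstruction of the reproducer: repeatedly spawn the
# Fortran executable and disconnect, as described in the thread.
from mpi4py import MPI

for _ in range(4):
    # 1 parent + 7 children = the 8 processes mentioned in the thread
    comm = MPI.COMM_SELF.Spawn("./hello", maxprocs=7)
    comm.Disconnect()
```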
hello.f90:
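A minimal sketch of a spawned Fortran child (the actual program may have differed):

```fortran
! Hypothetical reconstruction: a spawned child that greets and
! disconnects from its parent communicator.
program hello
  use mpi
  implicit none
  integer :: ierr, rank, parent

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_get_parent(parent, ierr)
  print *, 'Hello from spawned rank', rank
  if (parent /= MPI_COMM_NULL) call MPI_Comm_disconnect(parent, ierr)
  call MPI_Finalize(ierr)
end program hello
```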
invoked as:
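Presumably something like the following, given the --oversubscribe and hostfile usage discussed earlier in the thread (the hostfile name and process count are assumptions):

```shell
mpirun --oversubscribe --hostfile hostfile -np 1 python test.py
```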
Fails with the following error:
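Judging from the issue title, the error was the PSM shared-memory failure:

```
Unable to allocate shared memory for intra-node messaging.
Delete stale shared memory files in /dev/shm.
```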
After execution the following files appear in /dev/shm:

But there is certainly space:
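For reference, that state can be inspected with standard tools:

```shell
# Show shared-memory files and free space in /dev/shm
ls -l /dev/shm
df -h /dev/shm
```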