What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v3.1.3, v3.1.4, v4.0.1, and v4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI v3.1.4, v4.0.1, and v4.0.2 were installed from their respective source tarballs;
v3.1.3 came with the PGI 19.10 Compilers & Tools.
Please describe the system on which you are running
Operating system/version: Ubuntu 18.04.3 LTS
Computer hardware: 2 x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Network type:
Details of the problem
When I start the program below with:
mpirun -mca shmem posix -np 8 test
and interrupt it (CTRL-C) after it has allocated its shared memory segments, several files are left behind in /dev/shm:
ls -ltr /dev/shm
total 576
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0000
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0001
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0002
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0003
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0004
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0005
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0006
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0007
Repeating this adds more and more files of this kind to /dev/shm, until there are 128 of them.
After that, the program no longer runs at all and exits with:
[guppy01:14529] shmem: posix: file name search - max attempts exceeded.cannot continue with posix.
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
The following program can be used to reproduce this behaviour:
program test
  use iso_c_binding, only: c_ptr
  use mpi_f08, only: MPI_ADDRESS_KIND, &
                     MPI_COMM_WORLD, &
                     MPI_INFO_NULL, &
                     MPI_Win, &
                     MPI_Init, &
                     MPI_Finalize, &
                     MPI_Sizeof, &
                     MPI_Win_allocate_shared, &
                     MPI_Win_free
  implicit none
  type(MPI_Win) :: shmem_win
  type(c_ptr) :: shmem_ptr
  integer(kind=MPI_ADDRESS_KIND) :: segmentsize = 10
  integer :: sizeoftype
  integer :: ierr
  real :: array(10)
  call MPI_Init(ierr)
  ! Size in bytes of one element of array
  call MPI_Sizeof(array, sizeoftype, ierr)
  ! Each rank contributes segmentsize elements to the shared window
  call MPI_Win_allocate_shared(segmentsize*sizeoftype, sizeoftype, MPI_INFO_NULL, &
                               MPI_COMM_WORLD, shmem_ptr, shmem_win, ierr)
  ! Interrupt with CTRL-C during this sleep to leave the segments behind
  call sleep(10)
  call MPI_Win_free(shmem_win, ierr)
  call MPI_Finalize(ierr)
end program test
In Open MPI v4.0.1 and v4.0.2, using "mmap" (mpirun -mca shmem mmap ...) instead of "posix" solves this problem, but unfortunately these Open MPI versions suffer from another shmem-related problem (see issue #7393) that prevents me from using them.
With Open MPI v3.1.3 and v3.1.4, using "mmap" only partly solves the problem: after the interrupt, /dev/shm/vader_segment.* files remain behind, but this does not lead to the failure described above ("opal_shmem_base_select failed").
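As a stopgap, the leftover open_mpi.NNNN files can be counted and removed by hand before the 128-file limit is reached. The following is only a rough sketch of such a cleanup: the function names are made up for illustration and are not part of Open MPI, the file name pattern is taken from the ls output above, and it assumes a find that supports -maxdepth and -delete (as on Ubuntu). Removing the files is presumably only safe while none of your own Open MPI jobs are running on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions (not part of Open MPI).
# Both take an optional directory argument and default to /dev/shm.

# Print how many open_mpi.NNNN files owned by the current user exist.
count_stale_segments() {
    local dir="${1:-/dev/shm}"
    find "$dir" -maxdepth 1 -type f -user "$(id -u)" \
        -name 'open_mpi.[0-9][0-9][0-9][0-9]' | wc -l | tr -d ' '
}

# Delete those files. Assumption: no Open MPI job of this user is
# still running on this node, otherwise live segments would be removed.
remove_stale_segments() {
    local dir="${1:-/dev/shm}"
    find "$dir" -maxdepth 1 -type f -user "$(id -u)" \
        -name 'open_mpi.[0-9][0-9][0-9][0-9]' -delete
}
```

After sourcing the file, count_stale_segments shows how close /dev/shm is to the limit, and remove_stale_segments clears the leftovers. Other files, such as vader_segment.*, are not touched.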