shmem related files left in /dev/shm after interrupt #7394

Open
manomars opened this issue Feb 13, 2020 · 0 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.3, v3.1.4, v4.0.1, and v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI v3.1.4, v4.0.1, and v4.0.2 were installed from their respective source tarballs,
v3.1.3 came with the PGI-19.10 Compilers&Tools.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04.3 LTS
  • Computer hardware: 2 x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  • Network type:

Details of the problem

When I start the program below with:

mpirun -mca shmem posix -np 8 test 

and interrupt it (Ctrl-C) after it has allocated its shared memory segments, several files are left behind in /dev/shm:

ls -ltr /dev/shm
total 576
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0000
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0001
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0002
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0003
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0004
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0005
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0006
-rw------- 1 mmars users 4194312 Feb 13 12:31 open_mpi.0007

Each repetition leaves more files of this kind in /dev/shm, until there are 128 of them.
After that the program no longer runs at all and exits with:

[guppy01:14529] shmem: posix: file name search - max attempts exceeded.cannot continue with posix.
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
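As a stopgap while the leak itself is unfixed, the leftover files can be counted before they reach the 128-file limit. The following is only a minimal sketch: the function names `list_stale_segments` and `count_stale_segments` are hypothetical, and it assumes the leftover files always match the `open_mpi.*` pattern seen in the listing above and that GNU `find` is available (as on Ubuntu).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.
# Assumption: leaked posix shmem files are named open_mpi.NNNN, as in the
# /dev/shm listing above, and belong to the user who ran mpirun.

# Print the leftover segment files owned by the current user.
# $1 is the directory to scan (defaults to /dev/shm).
list_stale_segments() {
    dir="${1:-/dev/shm}"
    find "$dir" -maxdepth 1 -type f -name 'open_mpi.*' -user "$(id -un)"
}

# Print how many such files exist in the given directory.
count_stale_segments() {
    list_stale_segments "$1" | wc -l | tr -d ' '
}
```

After sourcing the file, `count_stale_segments` prints the number of leftover files. Running `list_stale_segments | xargs rm -f --` would remove them, but that is only safe when none of your own Open MPI jobs are still running on the node, since a live job's segments match the same pattern.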

The following program can be used to reproduce this behaviour:

    program test

    use iso_c_binding, only: c_ptr
    use mpi_f08, only : MPI_ADDRESS_KIND, &
                        MPI_COMM_WORLD, &
                        MPI_INFO_NULL, &
                        MPI_Win, &
                        MPI_Init, &
                        MPI_Finalize, &
                        MPI_Sizeof, &
                        MPI_Win_allocate_shared, &
                        MPI_Win_free

    implicit none

    type(MPI_Win)  :: shmem_win
    type (c_ptr)   :: shmem_ptr
    integer(kind=MPI_ADDRESS_KIND) :: segmentsize = 10
    integer :: sizeoftype
    integer :: ierr

    real :: array(10)

    call MPI_Init(ierr)

    ! Size in bytes of one element of "array"
    call MPI_Sizeof(array, sizeoftype, ierr)

    ! Each rank allocates a shared memory segment of 10 reals
    call MPI_Win_allocate_shared(segmentsize*sizeoftype, sizeoftype, MPI_INFO_NULL, MPI_COMM_WORLD, shmem_ptr, shmem_win, ierr)

    ! Interrupt the program (Ctrl-C) during this sleep to reproduce the leak
    call sleep(10)
    call MPI_Win_free(shmem_win, ierr)
    call MPI_Finalize(ierr)

    end program

In Open MPI v4.0.1 and v4.0.2, using "mmap" (mpirun -mca shmem mmap ...) instead of "posix" solves this problem, but unfortunately these Open MPI versions suffer from another shmem-related problem (see issue #7393) that prevents me from using them.

With Open MPI v3.1.3 and v3.1.4, using "mmap" only partly solves the problem: after the interrupt, /dev/shm/vader_segment.* files are left behind, but they do not lead to the failure described above ("opal_shmem_base_select failed").
