A system call failed during shared memory initialization ... #7393

Open
manomars opened this issue Feb 13, 2020 · 22 comments

@manomars

manomars commented Feb 13, 2020

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.1 and v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from the source tarball (both with Intel Parallel Studio 2020.0.088 and with GCC-7.4.0).

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04.3 LTS
  • Computer hardware: 2 x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  • Network type:

Details of the problem

When I split the comm_world communicator into two groups (comm_shmem) and try to allocate shmem segments on the latter by means of MPI_win_allocate, I get the following error message:

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  guppy01
  System call: unlink(2) /dev/shm/osc_rdma.guppy01.fd690001.4
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

I used the following program:

    PROGRAM test

    USE mpi_f08

    TYPE(MPI_comm)  :: comm_world, comm_shmem
    TYPE(MPI_group) :: group_world,group_shmem

    TYPE(MPI_win)   :: win
    TYPE(c_ptr)     :: baseptr

    INTEGER(KIND=MPI_ADDRESS_KIND) :: winsize

    INTEGER, ALLOCATABLE :: group(:)

    INTEGER :: nrank,irank,nrank_shmem,irank_shmem,nshmem
    INTEGER :: i,n,sizeoftype
    INTEGER :: ierror

    CALL MPI_init( ierror )

    comm_world = MPI_comm_world

    CALL MPI_comm_rank( comm_world, irank, ierror )
    CALL MPI_comm_size( comm_world, nrank, ierror )

    WRITE(*,'(a,i4,2x,a,i4)') 'nrank:',nrank,'irank:',irank

    ALLOCATE(group(0:nrank-1))

    nshmem=4

    n=0
    DO i=0,nrank-1
       IF (i/nshmem == irank/nshmem) THEN
          group(n)=i
          n=n+1
       ENDIF
    ENDDO

    CALL MPI_comm_group( comm_world, group_world, ierror )
    CALL MPI_group_incl( group_world, n, group, group_shmem, ierror )
    CALL MPI_comm_create( comm_world, group_shmem, comm_shmem, ierror )

    DEALLOCATE(group)

    CALL MPI_comm_rank( comm_shmem, irank_shmem, ierror )
    CALL MPI_comm_size( comm_shmem, nrank_shmem, ierror )

    WRITE(*,'(a,i4,2x,a,i4)') 'irank:',irank,'irank_shmem:',irank_shmem

    CALL MPI_sizeof( i, sizeoftype, ierror )
    winsize=10*sizeoftype

    CALL MPI_win_allocate( winsize, sizeoftype, MPI_INFO_NULL, comm_shmem, baseptr, win, ierror )

    CALL MPI_win_free( win, ierror )

    CALL MPI_finalize( ierror )

    END PROGRAM

and ran it with 8 ranks:

mpirun -mca shmem mmap -np 8 test

Switching to "posix" (mpirun -mca shmem posix ...) gets rid of this error but has problems of its own for which I'll submit a separate issue.

@janjust
Contributor

janjust commented Jul 9, 2020

@devreal @artpol84 (for visibility).

@manomars

I tried the test and I'm able to reproduce it with osc rdma, but osc ucx works fine.
I ran it under strace quickly; it looks like an mmap/munmap failure... I think.

tomislavj@hpchead /hpc/mtr_scrap/users/tomislavj/debug/ompi_shmem_issue
$ mpirun -np 8 -mca osc rdma ./a.out
nrank:   8  irank:   0
irank:   0  irank_shmem:   0
nrank:   8  irank:   2
irank:   2  irank_shmem:   2
nrank:   8  irank:   3
irank:   3  irank_shmem:   3
nrank:   8  irank:   4
irank:   4  irank_shmem:   0
nrank:   8  irank:   5
irank:   5  irank_shmem:   1
nrank:   8  irank:   6
irank:   6  irank_shmem:   2
nrank:   8  irank:   7
irank:   7  irank_shmem:   3
nrank:   8  irank:   1
irank:   1  irank_shmem:   1
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  jazz23.swx.labs.mlnx
  System call: unlink(2) /dev/shm/osc_rdma.jazz23.2ac10001.4
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
tomislavj@hpchead /hpc/mtr_scrap/users/tomislavj/debug/ompi_shmem_issue
$ mpirun -np 8 -mca osc ucx ./a.out
nrank:   8  irank:   0
irank:   0  irank_shmem:   0
nrank:   8  irank:   1
irank:   1  irank_shmem:   1
nrank:   8  irank:   2
irank:   2  irank_shmem:   2
nrank:   8  irank:   3
irank:   3  irank_shmem:   3
nrank:   8  irank:   4
irank:   4  irank_shmem:   0
nrank:   8  irank:   5
irank:   5  irank_shmem:   1
nrank:   8  irank:   6
irank:   6  irank_shmem:   2
nrank:   8  irank:   7
irank:   7  irank_shmem:   3

@marcpaterno

I am observing a similar failure. I am using Open MPI 4.0.5 on macOS Catalina (version 10.15.7 (19H2)); Open MPI was installed through Homebrew. My machine has 8 real cores and 32 GB of RAM; the program never uses more than about 6 GB. The output I get is attached below. The first line of the output is the correct result, printed at the end of the program.

 pandana on work via <python> v3.9.0 (work-venv) ❯ mpirun --mca orte_base_help_aggregate 0 -np 8 python Demos/candidate_selection.py ../computing-model-benchmarks/nova-data/fourth/big/160_subruns.h5caf.h5 evtseq
Selected  18452.0  events from  1.091441e+19   POT.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  mac-135594
  System call: unlink(2) /var/folders/wz/vk7vs2qj4lg3kg2cs62mw8jh0000gp/T//ompi.mac-135594.502/pid.25484/1/vader_segment.mac-135594.2cf60001.2
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  mac-135594
  System call: unlink(2) /var/folders/wz/vk7vs2qj4lg3kg2cs62mw8jh0000gp/T//ompi.mac-135594.502/pid.25484/1/vader_segment.mac-135594.2cf60001.7
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  mac-135594
  System call: unlink(2) /var/folders/wz/vk7vs2qj4lg3kg2cs62mw8jh0000gp/T//ompi.mac-135594.502/pid.25484/1/vader_segment.mac-135594.2cf60001.5
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  mac-135594
  System call: unlink(2) /var/folders/wz/vk7vs2qj4lg3kg2cs62mw8jh0000gp/T//ompi.mac-135594.502/pid.25484/1/vader_segment.mac-135594.2cf60001.6
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

@jsquyres
Member

These seem to be two different unlink problems:

  1. The first one is in OSC RDMA.
  2. The second one is in vader, and could be related to v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup #6550 -- although that was supposedly fixed a long time ago.

Others have seen the 2nd one (in vader) since v4.0.2, but it's been very difficult to reproduce and isolate.

@aivazis

aivazis commented Jul 18, 2021

Any progress on this? It happens for me very repeatably, so if there is any guidance on what to instrument to shed some light on this, I'll happily do the work. I see the problem on both macOS Catalina and Big Sur, with a MacPorts installation across multiple versions since 4.0.1, compiled with multiple versions of both GCC and Clang.

@ggouaillardet
Contributor

Try export TMPDIR=/tmp

Truncation can occur on OSX with the default TMPDIR

@aivazis

aivazis commented Sep 28, 2021

Do you think it's worth exploring the cause of this failure and a more permanent solution? I guess it's probably acceptable to tell users to set their TMPDIR, but it might be good to know why.

@jsquyres
Member

We do have https://www.open-mpi.org/faq/?category=osx#startup-errors-with-open-mpi-2.0.x, but it looks like the verbiage on it is a bit out of date. The underlying macOS issue is the same, however.

@jsquyres
Member

More specifically, it looks like we had a specific check for this back in the 2.0.x/2.1.x timeframe (i.e., we emitted a very specific error message that helped users work around that issue). But apparently that very specific message has either gotten lost or isn't functioning properly in Open MPI v4.0.x/v4.1.x.

A little history here: Open MPI's underlying run-time system has slowly been evolving into its own project. For example, the PMIx project evolved directly from a good chunk of what used to be part of Open MPI itself (i.e., Open MPI's run-time system). Ever since PMIx split off into its own project, Open MPI has distributed an embedded copy of the PMIx source code. In this way, 99% of Open MPI users aren't even aware of the code split.

In the upcoming Open MPI v5.0.x, basically the rest of Open MPI's run-time system is splitting off into a project called PRTE. As such, Open MPI v5.0.x will carry embedded copies of both PMIx and PRTE.

All this is to say that the error (i.e., either the lack of or the malfunctioning of the specific macOS TMPDIR error message) is almost certainly in PMIx: https://github.com/openpmix/openpmix. The error should be fixed over there and then back-ported to the embedded copies in Open MPI v4.0.x, 4.1.x, and the upcoming 5.0.x.

@rhc54
Contributor

rhc54 commented Sep 28, 2021

PMIx used to do this check because we were using Unix domain sockets back in those days:

    // If the above set temporary directory name plus the pmix-PID string
    // plus the '/' separator are too long, just fail, so the caller
    // may provide the user with a proper help... *Cough*, *Cough* OSX...
    if ((strlen(tdir) + strlen(pmix_pid) + 1) > sizeof(myaddress.sun_path)-1) {
        free(pmix_pid);
        /* we don't have show-help in this version, so pretty-print something
         * the hard way */
        fprintf(stderr, "PMIx has detected a temporary directory name that results\n");
        fprintf(stderr, "in a path that is too long for the Unix domain socket:\n\n");
        fprintf(stderr, "    Temp dir: %s\n\n", tdir);
        fprintf(stderr, "Try setting your TMPDIR environmental variable to point to\n");
        fprintf(stderr, "something shorter in length\n");
        return PMIX_ERR_SILENT; // return a silent error so our host knows we printed a message
    }

The check was removed once we moved away from that method (switching back to TCP). We should probably discuss where a more permanent check should live - it's a shared memory problem (which is in OMPI, not PMIx), so I'm leery of putting something in PMIx that assumes how long a shmem backing filename might be.
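
To make the idea concrete, here is a minimal sketch (plain C, not actual Open MPI code) of what such a check on the OMPI side might look like, modeled on the PMIx message quoted above. The helper name backing_path_too_long, the 104-byte limit, and the example paths are all illustrative assumptions; the real limit would be whatever buffer size the shmem component actually uses for its backing-file path.

    /* Illustrative sketch only -- not actual Open MPI code.
     * A helper like this could run just before the shmem backing file is
     * created: it checks whether "<tmpdir>/<segment name>" fits in the
     * buffer the shmem code uses, and prints a TMPDIR hint if it does not. */
    #include <stdio.h>
    #include <string.h>

    static int backing_path_too_long(const char *tmpdir, const char *segname,
                                     size_t max_path_len)
    {
        /* +1 for the '/' separator, +1 for the trailing NUL */
        if (strlen(tmpdir) + 1 + strlen(segname) + 1 > max_path_len) {
            fprintf(stderr,
                    "A temporary directory name has been detected that results\n"
                    "in a shared memory backing file path that is too long:\n\n"
                    "    Temp dir: %s\n\n"
                    "Try setting your TMPDIR environment variable to point to\n"
                    "something shorter in length (e.g. export TMPDIR=/tmp)\n",
                    tmpdir);
            return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* Example values taken from this thread; 104 is just a placeholder
         * for whatever the real buffer limit would be. */
        const char *tmpdir  = "/var/folders/zy/_pjxfbbs4wsf4_f_0dg4bwr40000gn/T//ompi.cygnus.501/pid.64173/1";
        const char *segname = "vader_segment.cygnus.501.bc7b0001.6";
        return backing_path_too_long(tmpdir, segname, 104);
    }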

@jsquyres
Member

Ah, ok, that sounds totally reasonable (that the check should be in Open MPI, not PMIx). That makes things simpler, too.

@marcpaterno @aivazis Is the case where the problem occurs the same as, or similar to, the originally-cited problem on this issue?

@aivazis

aivazis commented Sep 28, 2021

My case pops up when multiple MPI jobs are launched at the same time by the same user: I have a test suite, run in parallel, that exercises my Python bindings. What I think is happening is that the Open MPI cleanup code tries to remove the temporary files it created as children of a temporary directory. It seems that the name of this directory is seeded with my uid instead of the process id, and the first job that terminates destroys the temporary directory the other instances rely on. Fixing my problem could be as simple as tweaking the algorithm that names the temporary directory.

I'll try to verify this and post a screenshot from a session.

@aivazis

aivazis commented Sep 29, 2021

Took another look. I was partly wrong: the temporary path contains both my uid and the pid of the running process. The error I get mentions the path that couldn't be unlinked:

System call: unlink(2) /var/folders/zy/_pjxfbbs4wsf4_f_0dg4bwr40000gn/T//ompi.cygnus.501/pid.64173/1/vader_segment.cygnus.501.bc7b0001.6

And partly correct: the cleanup code appears to remove a directory a few levels up. As I watch the filesystem, the directory at

/var/folders/zy/_pjxfbbs4wsf4_f_0dg4bwr40000gn/T//ompi.cygnus.501

disappears, and the other MPI instances start crashing.

@rhc54
Contributor

rhc54 commented Sep 29, 2021

So there are two very different problems being discussed here - which is fine, I just wanted to be clear. The first problem has to do with the length of the TMPDIR path on Mac OSX, which has become insanely long. This is something we can rather easily detect and warn you about - it should be done in OMPI, where we know that the combination of the long TMPDIR path and the shmem backing filename will cause a problem.

The second problem is the one mentioned by @aivazis. This is caused by a race condition - mpirun A is cleaning up the session directory tree while another one (B) is trying to set it up. If you hit things right, A will remove the top-level directory in the session just as B attempts to create a file inside it.

We use only the uid in the top-level directory so that sys admins have an easier time cleaning up should someone have a bunch of unclean terminations. They just look for the directory with that uid in it and rm -rf it. We previously included the pid in the top-level name, but that could lead to 1000s of directories in the tmp dir and make it harder for the sys admins to clean up.

There are two solutions to the problem. First, we could provide an option telling mpirun to add the pid to the top-level directory name. This would allow those with the use-case described by @aivazis to avoid the problem while preserving the sys admin's request for simplicity.

The other solution is to use PRRTE, which would also allow your test suite to complete faster (far less time starting/stopping each job). What you would do is have your test setup start the prte DVM, and then use prun --personality ompi (instead of mpirun) to run your jobs. Because the DVM is persistent, we don't tear down the top-level directory between jobs, and that avoids this problem.

Frankly, it is one of the primary use-cases for PRRTE. You can learn more about it here and find the code here
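
As a rough sketch of that workflow (assuming the prte, prun, and pterm executables from a PRRTE installation are on the PATH; exact option spellings can vary between PRRTE versions), a test suite could do something like:

    prte --daemonize                        # start the persistent DVM once
    prun --personality ompi -n 8 ./a.out    # launch each test job against the running DVM
    prun --personality ompi -n 8 ./b.out    # ...more jobs, no per-job mpirun startup/teardown
    pterm                                   # shut the DVM down when the suite is finished

Each prun reuses the already-running DVM, so the top-level session directory is never torn down between jobs.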

@jsquyres
Member

jsquyres commented Oct 1, 2021

@rhc54 This issue was opened against Open MPI v4.0.x. Does PRTE work with the Open MPI v4.0.x and v4.1.x series?

@rhc54
Contributor

rhc54 commented Oct 1, 2021

Sure - so long as PRRTE is built against PMIx v4.1 or above, it will support any OMPI version starting with the 2.x series.

@rhc54
Contributor

rhc54 commented Oct 9, 2021

There are two solutions to the problem. First, we could provide an option telling mpirun to add the pid to the top-level directory name. This would allow those with the use-case described by @aivazis to avoid the problem while preserving the sys admin's request for simplicity.

@aivazis I have created this option - it will become available in OMPI v5. You just need to add --prtemca add_pid_to_session_dirname 1 to your cmd line, or put add_pid_to_session_dirname = 1 in the PRRTE default MCA param file for your installation.
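
For example (assuming an mpirun from the OMPI v5 / PRRTE-based series), a launch like the ones earlier in this thread would become:

    mpirun --prtemca add_pid_to_session_dirname 1 -np 8 ./a.out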

@BrushXue

BrushXue commented Nov 1, 2021

So there are two very different problems being discussed here - which is fine, I just wanted to be clear. The first problem has to do with the length of the TMPDIR path on Mac OSX, which has become insanely long. This is something we can rather easily detect and warn you about - it should be done in OMPI, where we know that the combination of the long TMPDIR path and the shmem backing filename will cause a problem.

Any fixes for this problem? I've been seeing these error messages on my Mac for a long time:

A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  dyn161054.res.lehigh.edu
  System call: unlink(2) /var/folders/cb/ms6tghdd59s8wvt0l9xpkn500000gn/T//ompi.dyn161054.501/pid.41956/1/vader_segment.dyn161054.501.21ab0001.5
  Error:       No such file or directory (errno 2)

@rhc54
Contributor

rhc54 commented Nov 1, 2021

You mean other than the one already outlined above - i.e., create a $HOME/tmp directory and set the TMPDIR envar to point at it? No, that is the fix.

@yyuan-luo

yyuan-luo commented Nov 1, 2021

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, num_proc;
        const int ROOT = 0;
        double pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // pi only initialized in root
        if (rank == ROOT) pi = 3.1416;
        // If you don't want to use if/else, consider:
        // pi = 3.141516 * (!rank);

        MPI_Bcast(&pi, 1, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);

        printf("[%d] pi = %lf\n", rank, pi);

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();

        return 0;
    }

If the code goes like this, the problem appears.

But if I use this instead of the if statement, there is no problem at all.

pi = 3.141516 * (!rank);

I am completely new to Open MPI; I encountered this problem on my M1 Mac with the newest Monterey.

@BrushXue

BrushXue commented Nov 1, 2021

You mean other than the one already outlined above - i.e., create a $HOME/tmp directory and set the TMPDIR envar to point at it? No, that is the fix.

If I do that, it may screw up other programs. Is there any possible fix from the Open MPI side?

@rhc54
Contributor

rhc54 commented Nov 1, 2021

In one case, pi is initialized everywhere - in the other, it isn't. The two cases are not equivalent, so I expect the problem is in your code.

@flores-o

flores-o commented Mar 7, 2023

This fixed it for me (Mac with M1 chip)

export TMPDIR=/tmp
