
osc/sm fence failed unlink syscall/segfault #5363

@PeterGottesman

An issue is showing up on Cisco MTT in the ibm/win_allocate_two_shared test, resulting in the following error:

```
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  mpi031
  System call: unlink(2) /dev/shm/osc_sm.mpi031.25190001.5
  Error:       No such file or directory (errno 2)
```

I can reproduce the issue, but I am having trouble tracking it down in a debugger.

In one case, the issue appeared as a segfault in the osc/sm fence:

```
(gdb) bt
#0  0x00002aaaaacf51f2 in ompi_osc_sm_fence (assert=0, win=0x740960) at osc_sm_active_target.c:103
#1  0x00002aaaaabaed7f in PMPI_Win_fence (assert=0, win=0x740960) at pwin_fence.c:60
#2  0x0000000000400d1e in main (argc=1, argv=0x7fffffffce18) at win_allocate_two_shared.c:47
(gdb) l
98              (ompi_osc_sm_module_t*) win->w_osc_module;
99
100         /* ensure all memory operations have completed */
101         opal_atomic_mb();
102
103         if (module->global_state->use_barrier_for_fence) {
104             return module->comm->c_coll->coll_barrier(module->comm,
105                                                      module->comm->c_coll->coll_barrier_module);
106         } else {
107             module->my_sense = !module->my_sense;
(gdb) p module
$1 = (ompi_osc_sm_module_t *) 0x740ed0
(gdb) p module->global_state
$2 = (ompi_osc_sm_global_state_t *) 0x0
```

This failure mode seems to be related to #5262, though I am not convinced this is a single problem; I may be seeing two different issues here.

@hjelmn Do you think this might be related to #5262? Could it also be related to the older kernel this is running on (2.6.32-431.20.3)?

MTT: https://mtt.open-mpi.org/index.php?do_redir=2645
