Closed
Description
An issue is showing up on Cisco MTT in the ibm/win_allocate_two_shared test, resulting in the following error:
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: mpi031
System call: unlink(2) /dev/shm/osc_sm.mpi031.25190001.5
Error: No such file or directory (errno 2)
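For reference, `unlink(2)` reports `ENOENT` (errno 2) whenever the path no longer exists at the time of the call. The sketch below is not Open MPI code and does not claim to be the cause here. It only shows one way to get the same errno: removing a backing file twice. The function name `second_unlink_errno` and the scratch path are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file at `path`, unlink it twice,
 * and return the errno reported by the second unlink.
 * Returns 0 if the second unlink unexpectedly succeeds and -1 if the
 * setup (create or first unlink) fails. */
int second_unlink_errno(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        return -1;
    }
    close(fd);

    /* First removal: the file exists, so this is expected to succeed. */
    if (unlink(path) != 0) {
        return -1;
    }

    /* Second removal: the name is already gone, so unlink fails. */
    if (unlink(path) == 0) {
        return 0;
    }
    return errno;
}
```

Calling `second_unlink_errno("/tmp/unlink_demo.tmp")` on a writable `/tmp` should return `ENOENT`, the same errno as in the message above. Whether the segment file here is removed twice, or was never created on this code path, is still an open question.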
I can reproduce the issue, but I am having trouble tracking it down in a debugger.
In one case the issue appeared with a segfault in osc/sm fence:
(gdb) bt
#0 0x00002aaaaacf51f2 in ompi_osc_sm_fence (assert=0, win=0x740960) at osc_sm_active_target.c:103
#1 0x00002aaaaabaed7f in PMPI_Win_fence (assert=0, win=0x740960) at pwin_fence.c:60
#2 0x0000000000400d1e in main (argc=1, argv=0x7fffffffce18) at win_allocate_two_shared.c:47
(gdb) l
98 (ompi_osc_sm_module_t*) win->w_osc_module;
99
100 /* ensure all memory operations have completed */
101 opal_atomic_mb();
102
103 if (module->global_state->use_barrier_for_fence) {
104 return module->comm->c_coll->coll_barrier(module->comm,
105 module->comm->c_coll->coll_barrier_module);
106 } else {
107 module->my_sense = !module->my_sense;
(gdb) p module
$1 = (ompi_osc_sm_module_t *) 0x740ed0
(gdb) p module->global_state
$2 = (ompi_osc_sm_global_state_t *) 0x0
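The gdb session shows that `module->global_state` is NULL when it is dereferenced at line 103. As a debugging aid, a guard along these lines could turn the segfault into a reportable error. This is a minimal sketch using stand-in types (`demo_state_t`, `demo_module_t`, `demo_fence`), all hypothetical. It is not the osc/sm source and not a proposed fix for the underlying problem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the structures seen in the backtrace. */
typedef struct {
    bool use_barrier_for_fence;
} demo_state_t;

typedef struct {
    demo_state_t *global_state;  /* NULL in the failing run */
    int my_sense;
} demo_module_t;

/* Returns 0 on success and -1 if the shared state was never set up. */
int demo_fence(demo_module_t *module)
{
    if (module == NULL || module->global_state == NULL) {
        /* Report instead of crashing so the failing setup path can be traced. */
        fprintf(stderr, "fence: shared state is not initialized\n");
        return -1;
    }

    if (!module->global_state->use_barrier_for_fence) {
        /* Mirrors the sense flip shown in the listing above. */
        module->my_sense = !module->my_sense;
    }
    return 0;
}
```

A guard like this only hides the symptom. The open question is still why the shared segment was never mapped, or was already torn down, for this window.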
This failure mode seems to be related to #5262, though I suspect I may be seeing two different issues here.
@hjelmn Do you think this might be related to #5262? Maybe it is also related to the older kernel this is running on (2.6.32-431.20.3)?