cannot terminate MPI application #290

My MPI application seems unable to terminate if any of its ranks fails.  From [here](https://github.com/chainer/chainermn/issues/236) I learned about a workaround that helps, but I still face the same problem if the `mpi4py` code runs one fork down in the process chain.

Reproducer `t.py`:
```py
#!/usr/bin/env python3

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e


sys.excepthook = global_except_hook


def func():
    import time
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank  = world.rank

    print(' = ', rank)

    if rank == 0:
        print(' - ', rank)
        raise RuntimeError('oops')

    while True:
        print(' + ', rank)
        time.sleep(1)
        world.Barrier()


if __name__ == '__main__':
    func()
```

Running `mpiexec -n 2 /tmp/t.py` works as expected:
```sh
$ mpiexec -n 2 /tmp/t.py
 =  0
 -  0
 =  1
 +  1

*****************************************************
Uncaught exception was detected on rank 0. 
Traceback (most recent call last):
  File "/tmp/t.py", line 54, in <module>
    func()
  File "/tmp/t.py", line 45, in func
    raise RuntimeError('oops')
RuntimeError: oops
*****************************************************

Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$ 
```

but running the same script through an intermediate shell under `mpiexec` produces the same output and then hangs indefinitely:
```sh
$ mpiexec -n 2 sh -c '/tmp/t.py'
 =  1
 +  1
 =  0
 -  0

*****************************************************
Uncaught exception was detected on rank 0. 
Traceback (most recent call last):
  File "/tmp/t.py", line 54, in <module>
    func()
  File "/tmp/t.py", line 45, in func
    raise RuntimeError('oops')
RuntimeError: oops
*****************************************************

Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
```

`strace` shows that the `mpiexec` process is hanging in a `select` call:
```sh
$ s strace -f -p 12772
[sudo] password for merzky: 
strace: Process 12772 attached with 5 threads
[pid 12786] select(16, [13 15], NULL, NULL, {tv_sec=3544, tv_usec=141885} <unfinished ...>
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=0, tv_usec=178624} <unfinished ...>
[pid 12783] select(12, [10 11], NULL, NULL, {tv_sec=3544, tv_usec=107010} <unfinished ...>
[pid 12784] epoll_wait(16,  <unfinished ...>
[pid 12772] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12785] <... select resumed> )      = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
...
```

The remaining rank loops on a poll:
```
xxxxx   12790     1 99 12:57 pts/12   00:02:07 python3 /tmp/t.py

$ s strace -f -p 12790
strace: Process 12790 attached with 3 threads
strace: [ Process PID=12790 runs in x32 mode. ]
[pid 12794] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12793] epoll_wait(7, strace: [ Process PID=12790 runs in 64 bit mode. ]
 <unfinished ...>
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
...
```

I would appreciate any advice on how to handle this.  The actual software stack is much more involved, so it will not be possible to remove the intermediate process layer.
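
One direction that may be worth testing: in the `ps` line above, the surviving rank appears to have parent PID 1, which suggests the intermediate `sh` was killed while its child kept running. The following is a minimal sketch of an intermediate layer that relays termination signals to its child. It assumes the launcher tears ranks down with a catchable signal such as `SIGTERM`, which I have not verified, and `run_forwarding` is a hypothetical helper, not part of `mpi4py` or Open MPI.

```python
import signal
import subprocess
import sys


def run_forwarding(cmd):
    """Run cmd as a child, relay SIGTERM/SIGINT to it, return its exit code.

    Hypothetical helper: a stand-in for the intermediate process layer.
    SIGKILL cannot be caught, so it cannot be relayed this way.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Pass the signal on so the child does not outlive this wrapper.
        if child.poll() is None:
            child.send_signal(signum)

    # Must be called from the main thread; handlers are process-wide.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)

    # wait() resumes after a handler has run and returns the child's
    # exit status (negative if the child was killed by a signal).
    return child.wait()
```

If this were saved as a wrapper script that calls `sys.exit(run_forwarding(sys.argv[1:]))`, it could stand in for `sh -c` in the hanging command. Whether that is enough to let `mpiexec` finish is an open question.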
