cannot terminate MPI application #290

@andre-merzky

My MPI application seems unable to terminate if any of its ranks fails. From here I learned about a workaround which helps, but I still face the same problem if the mpi4py code is one fork down in the process chain.

Reproducer t.py:

#!/usr/bin/env python3

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e


sys.excepthook = global_except_hook


def func():
    import time
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank  = world.rank

    print(' = ', rank)

    if rank == 0:
        print(' - ', rank)
        raise RuntimeError('oops')

    while True:
        print(' + ', rank)
        time.sleep(1)
        world.Barrier()


if __name__ == '__main__':
    func()

Running mpiexec -n 2 /tmp/t.py works as expected:

$ mpiexec -n 2 /tmp/t.py
 =  0
 -  0
 =  1
 +  1

*****************************************************
Uncaught exception was detected on rank 0. 
Traceback (most recent call last):
  File "/tmp/t.py", line 54, in <module>
    func()
  File "/tmp/t.py", line 45, in func
    raise RuntimeError('oops')
RuntimeError: oops
*****************************************************

Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$ 

However, running the same script through a subshell under mpiexec produces the same output and then hangs indefinitely:

$ mpiexec -n 2 sh -c '/tmp/t.py'
 =  1
 +  1
 =  0
 -  0

*****************************************************
Uncaught exception was detected on rank 0. 
Traceback (most recent call last):
  File "/tmp/t.py", line 54, in <module>
    func()
  File "/tmp/t.py", line 45, in func
    raise RuntimeError('oops')
RuntimeError: oops
*****************************************************

Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

strace shows that the mpiexec process is waiting in select calls:

$ s strace -f -p 12772
[sudo] password for merzky: 
strace: Process 12772 attached with 5 threads
[pid 12786] select(16, [13 15], NULL, NULL, {tv_sec=3544, tv_usec=141885} <unfinished ...>
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=0, tv_usec=178624} <unfinished ...>
[pid 12783] select(12, [10 11], NULL, NULL, {tv_sec=3544, tv_usec=107010} <unfinished ...>
[pid 12784] epoll_wait(16,  <unfinished ...>
[pid 12772] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12785] <... select resumed> )      = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
...

The remaining rank loops on a poll:

xxxxx   12790     1 99 12:57 pts/12   00:02:07 python3 /tmp/t.py

$ s strace -f -p 12790
strace: Process 12790 attached with 3 threads
strace: [ Process PID=12790 runs in x32 mode. ]
[pid 12794] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12793] epoll_wait(7, strace: [ Process PID=12790 runs in 64 bit mode. ]
 <unfinished ...>
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
...
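
The ps line above lists parent PID 1 for the surviving rank, which would be consistent with the intermediate sh having gone away while its child was left behind. The following is a minimal sketch of that general POSIX behavior only: it does not use MPI, it is not a claim about how Open MPI delivers the abort, and the function name grandchild_survives_parent_kill is made up for this illustration. It starts an intermediate process that itself starts a child, kills only the intermediate process, and checks whether the child is still there.

```python
#!/usr/bin/env python3
# Illustration only (no MPI involved): killing an intermediate process
# does not by itself terminate the child that process started.

import os
import signal
import subprocess
import sys
import time

# Code run by the intermediate process: start a sleeping child,
# report the child's PID on stdout, then idle.
INTERMEDIATE = (
    "import subprocess, sys, time\n"
    "p = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])\n"
    "print(p.pid, flush=True)\n"
    "time.sleep(30)\n"
)


def pid_exists(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)          # signal 0: existence check, nothing is sent
        return True
    except ProcessLookupError:
        return False


def grandchild_survives_parent_kill():
    """Hypothetical helper: kill the middle process, check the grandchild."""
    mid = subprocess.Popen([sys.executable, '-c', INTERMEDIATE],
                           stdout=subprocess.PIPE, text=True)
    grandchild = int(mid.stdout.readline())
    mid.kill()                   # SIGKILL goes to the intermediate process only
    mid.wait()
    mid.stdout.close()
    time.sleep(0.2)              # give the system a moment to settle
    survived = pid_exists(grandchild)
    if survived:
        os.kill(grandchild, signal.SIGKILL)   # clean up the orphan
    return survived


if __name__ == '__main__':
    print('grandchild survived:', grandchild_survives_parent_kill())
```

If the hang reported here has the same cause, any fix would have to make sure the termination signal reaches the process that actually runs the mpi4py code, not just the process that mpiexec started directly.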

I would appreciate any advice on how to handle this. The actual software stack is much more involved, and it will not be possible to remove the intermediate process layer.
