My MPI application seems to be unable to terminate if any of its ranks fails. From here I learned about a workaround which helps, but I still face the same problem if the mpi4py code is one fork down in the process chain.
Reproducer t.py:
#!/usr/bin/env python3

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e

sys.excepthook = global_except_hook


def func():
    import time
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank  = world.rank

    print(' = ', rank)

    if rank == 0:
        print(' - ', rank)
        raise RuntimeError('oops')

    while True:
        print(' + ', rank)
        time.sleep(1)
        world.Barrier()


if __name__ == '__main__':
    func()
Running mpiexec -n 2 /tmp/t.py directly works as expected:
$ mpiexec -n 2 /tmp/t.py
= 0
- 0
= 1
+ 1
*****************************************************
Uncaught exception was detected on rank 0.
Traceback (most recent call last):
File "/tmp/t.py", line 54, in <module>
func()
File "/tmp/t.py", line 45, in func
raise RuntimeError('oops')
RuntimeError: oops
*****************************************************
Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$
but running the same script in a subshell under mpiexec produces the same output and then hangs indefinitely:
$ mpiexec -n 2 sh -c '/tmp/t.py'
= 1
+ 1
= 0
- 0
*****************************************************
Uncaught exception was detected on rank 0.
Traceback (most recent call last):
File "/tmp/t.py", line 54, in <module>
func()
File "/tmp/t.py", line 45, in func
raise RuntimeError('oops')
RuntimeError: oops
*****************************************************
Calling MPI_Abort() to shut down MPI processes...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
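The only difference between the two runs is the intermediate sh sitting between mpiexec and python3. For what it's worth, that extra layer can be confirmed from inside the script with something like the following (a Linux-only sketch reading /proc, not part of the reproducer):

#!/usr/bin/env python3
# Print who actually forked this interpreter.  Under
#   mpiexec -n 2 sh -c '/tmp/t.py'
# the parent is expected to be sh rather than a process started by mpiexec itself.
import os

def parent_cmdline(ppid):
    with open('/proc/{}/cmdline'.format(ppid), 'rb') as f:
        return f.read().replace(b'\0', b' ').decode(errors='replace').strip()

ppid = os.getppid()
print('pid={} ppid={} parent={!r}'.format(os.getpid(), ppid, parent_cmdline(ppid)))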
Strace shows that the mpiexec process is hanging on a select call:
$ s strace -f -p 12772
[sudo] password for merzky:
strace: Process 12772 attached with 5 threads
[pid 12786] select(16, [13 15], NULL, NULL, {tv_sec=3544, tv_usec=141885} <unfinished ...>
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=0, tv_usec=178624} <unfinished ...>
[pid 12783] select(12, [10 11], NULL, NULL, {tv_sec=3544, tv_usec=107010} <unfinished ...>
[pid 12784] epoll_wait(16, <unfinished ...>
[pid 12772] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12785] <... select resumed> ) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
[pid 12785] select(24, [22 23], NULL, NULL, {tv_sec=2, tv_usec=0}) = 0 (Timeout)
...
The remaining rank loops on a poll:
xxxxx 12790 1 99 12:57 pts/12 00:02:07 python3 /tmp/t.py
$ s strace -f -p 12790
strace: Process 12790 attached with 3 threads
strace: [ Process PID=12790 runs in x32 mode. ]
[pid 12794] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 12793] epoll_wait(7, strace: [ Process PID=12790 runs in 64 bit mode. ]
<unfinished ...>
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 12790] poll([{fd=5, events=POLLIN}, {fd=18, events=POLLIN}], 2, 0) = 0 (Timeout)
...
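In case it helps while debugging, the surviving rank's Python-level location (rather than the syscall loop above) can be dumped on demand by registering faulthandler on a signal; a small sketch, assuming a POSIX platform:

# Add near the top of t.py; then `kill -USR1 <pid>` makes the process print
# the tracebacks of all its threads to stderr without stopping it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)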
I would appreciate any advice on how to handle this. The actual software stack is much more involved, and removing the intermediate process layer will not be possible.
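For completeness, one variant of the hook I could try would arm a watchdog before calling Abort(), so that the failing rank itself always dies even if MPI_Abort() blocks. This is only a sketch and it clearly does not make mpiexec reap the surviving rank in the sh -c case:

# Hypothetical hardening of global_except_hook: force-exit this rank a few
# seconds after the exception, regardless of what Abort() does.
import os
import threading

def arm_exit_watchdog(timeout=5.0):
    timer = threading.Timer(timeout, lambda: os._exit(1))  # skips Python cleanup
    timer.daemon = True
    timer.start()

# ... call arm_exit_watchdog() in the finally: block, just before
# mpi4py.MPI.COMM_WORLD.Abort(1).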