The killall in the after_each_exec of the MPI Details section only runs on the node where mpirun was invoked (duh). It does not spread to all the other nodes where MPI was running.
Need to figure out how to make that go across all nodes.
The OMPI MPI Install module creates a Perl script called
mtt_ompi_cleanup.pl in the $bindir of each OMPI installation that it
creates. This script searches the process list for orteds and kills
them, except for its own parent orted (I thought this was particularly
clever ;-) ). This allows us to launch the script via orterun itself,
and therefore use whatever native launching mechanism ORTE is using
(e.g., slurm, pbs, rsh/ssh, etc.).
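
As a rough illustration of the "kill every orted except my own parent" idea, here is a minimal Python sketch. It is not the actual mtt_ompi_cleanup.pl (which is Perl and is not shown in this thread); the function names `parse_ps` and `orteds_to_kill` are hypothetical, and the sketch assumes the process list comes from something like `ps -e -o pid=,comm=`.

```python
import os


def parse_ps(text):
    """Parse 'PID COMMAND' lines into a list of (pid, command_name) tuples."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        # Some platforms report a full path; keep only the executable name.
        procs.append((int(parts[0]), os.path.basename(parts[1].strip())))
    return procs


def orteds_to_kill(procs, my_ppid):
    """Return PIDs of all orted processes except this script's own parent."""
    return [pid for pid, name in procs if name == "orted" and pid != my_ppid]


if __name__ == "__main__":
    # Made-up process list for illustration only; nothing is killed here.
    sample = "100 orted\n101 bash\n102 /opt/ompi/bin/orted\n103 mpi_app\n"
    print(orteds_to_kill(parse_ps(sample), my_ppid=102))
```

In a real cleanup the caller would pass `os.getppid()` as `my_ppid` and send a signal to each returned PID. Sparing the parent is what lets the launching orterun finish normally.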
So there's now a new after_each_exec section in the OMPI core template
that essentially runs:
The script will be found because MTT automatically puts OMPI's $bindir
in the path. It'll run on all nodes and kill any orteds that it finds
(except its own parent). Then it will whack any session directories
that it finds (including its own, which is safe), and exit. Since
its parent orted wasn't killed, it'll exit normally and the orterun
from the after_each_exec will complete normally.
The only drawback to this scheme is that we run $np copies of the
script, so we could be running multiple copies of the script on each
node. This is a bit wasteful, but it does work. A further optimization
someday would be to figure out how to run only 1 copy of the script on
each node.
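
One possible way to get down to one working copy per node, offered purely as an untested idea and not something MTT does: have each copy try to create a lock file on a node-local filesystem, and let only the copy that wins do the cleanup. The function names and the lock path are hypothetical.

```python
import os


def try_claim_node(lock_path):
    """Return True if this process created the lock file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_node(lock_path):
    """Remove the lock file so a later cleanup run can claim the node again."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    lock = "/tmp/mtt_cleanup_example.lock"  # hypothetical node-local path
    if try_claim_node(lock):
        try:
            print("this copy does the cleanup")
        finally:
            release_node(lock)
    else:
        print("another copy on this node is already cleaning up")
```

The lock path must live on a filesystem that is local to each node (not a shared NFS directory), otherwise only one node in the whole job would run the cleanup. A copy that starts after the winner has released the lock would simply clean up again, which is redundant but harmless.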