
ompi-core-template.ini: MPI cleanup not run on all nodes #47

Closed
ompiteam opened this issue Sep 13, 2014 · 2 comments

@ompiteam

The killall in the after_each_exec of the MPI Details section only runs on the node where mpirun was invoked (duh). It does not spread to all the other nodes where MPI was running.

We need to figure out how to make the cleanup run across all nodes.

@ompiteam ompiteam added this to the v1.0 milestone Sep 13, 2014
@ompiteam

Imported from trac issue 46. Created by jsquyres on 2006-08-30T10:29:01, last modified: 2006-09-13T09:02:35

@ompiteam

Trac comment by jsquyres on 2006-09-13 09:02:35:

(In [323]) Fixes #46.

The OMPI MPI Install module now creates a Perl script called
mtt_ompi_cleanup.pl in the $bindir of each OMPI installation that it creates.
This script searches the process list for orteds and kills them --
except for its own parent orted (I thought this was particularly
clever ;-) ). This allows us to launch this script via orterun
itself, and therefore use whatever native launching mechanism is used
via ORTE (e.g., slurm, pbs, rsh/ssh, etc.).
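The "kill every orted except my own parent" rule can be sketched as a small selection function. This is an illustration in Python, not the actual mtt_ompi_cleanup.pl (which is Perl and is not shown in this ticket); the `Proc` record, the function name `orteds_to_kill`, and the sample process list are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Proc(NamedTuple):
    """One row of a process listing (hypothetical record for this sketch)."""
    pid: int
    ppid: int
    command: str


def orteds_to_kill(procs: Iterable[Proc], my_parent_pid: int) -> List[int]:
    """Return the PIDs of all orted processes except the caller's own parent."""
    victims = []
    for p in procs:
        argv = p.command.split()
        if not argv:
            continue
        # Compare on the executable's base name so a full path still matches.
        exe = argv[0].rsplit("/", 1)[-1]
        if exe == "orted" and p.pid != my_parent_pid:
            victims.append(p.pid)
    return victims


if __name__ == "__main__":
    sample = [
        Proc(100, 1, "/opt/ompi/bin/orted --daemonize"),  # stale orted
        Proc(200, 1, "orted"),                            # our own parent
        Proc(300, 200, "perl mtt_ompi_cleanup.pl"),       # the cleanup script
        Proc(400, 1, "sshd"),
    ]
    print(orteds_to_kill(sample, my_parent_pid=200))  # prints [100]
```

In a real script the parent PID would come from something like `os.getppid()` and each selected PID would be passed to `os.kill()`; which signal the original script sends is not stated here, so that part is left out.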

So there's now a new after_each_exec section in the OMPI core template
that essentially runs:

{{{
orterun $args -np $MTT_TEST_NP --prefix $MTT_TEST_PREFIX mtt_ompi_cleanup.pl
}}}

The script will be found because OMPI's $bindir is automatically put
in the path by MTT. It'll run on all nodes and kill any orteds that
it finds (except its own). Then it will whack any session directories
that it finds (including its own -- this is safe), and exit. Since
its parent orted wasn't killed, it'll exit normally and the orterun
from the after_each_exec will complete normally.

The only drawback to this scheme is that we run $np copies of the
script, so we could be running multiple copies of the script on each
node. This is a bit wasteful, but it does work. A future optimization
would be to figure out how to run only 1 copy of the script on each
node.
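One possible way to approach that optimization, sketched here purely as an assumption and not as anything this commit does, is to let every copy start but have only the first copy on each node do the work, by atomically creating a lock file in a node-local directory. The function name `claim_node_cleanup` and the lock file name are hypothetical.

```python
import os


def claim_node_cleanup(lock_dir: str, name: str = "mtt_cleanup.lock") -> bool:
    """Return True for exactly one caller per lock_dir, False for all others.

    lock_dir is assumed to be node-local (not on a shared filesystem), so
    that each node elects its own single cleanup process.
    """
    path = os.path.join(lock_dir, name)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only the
        # first process to get here on this node succeeds.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True
```

A copy that gets `False` would simply exit without doing anything. The winner would need to remove the lock file when it finishes, otherwise the next after_each_exec run on that node would find a stale lock and skip the cleanup entirely.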
