Closed
Description
On master, v2.0.x, and v2.x, mpirun hangs if the application launch fails on a non-local node. More specifically, the orteds detect the failed launch (e.g., when you specify a non-existent executable), but that error never seems to reach the errmgr.
For example:
$ salloc -N 1
$ mpirun -np 1 --mca state_base_verbose 100 --mca errmgr_base_verbose 100 this_does_not_exist
...lots of output...
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:252
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:451
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE PENDING APP LAUNCH PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1046
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE PROC [[9286,1],0] STATE RUNNING AT base/plm_base_receive.c:333
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING PROC [[9286,1],0] STATE RUNNING PRI 4
[savbu-usnic-a:30016] [[9286,0],0] state:base:track_procs called for proc [[9286,1],0] state RUNNING
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE RUNNING AT base/state_base_fns.c:488
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE RUNNING PRI 4
[hang]
Outside of a SLURM allocation, mpirun -np 1 --host foo this_does_not_exist hangs in the same way.
But mpirun -np 1 this_does_not_exist, when executing on the local node, does not hang. Instead, it properly displays a show_help message and exits with a non-zero status (this is on 2.0.x):
$ mpirun this_does_not_exist
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).
Node:       savbu-usnic-a
Executable: this_does_not_exist
--------------------------------------------------------------------------
16 total processes failed to start
$ echo $status
134
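The difference between the two cases (a hang versus a prompt non-zero exit) can be checked without watching the terminal. The following is a minimal sketch, not part of Open MPI: classify_run is a hypothetical helper name, and it assumes GNU coreutils timeout is available, which exits with status 124 when it has to kill the command.

```shell
#!/bin/sh
# classify_run SECONDS CMD [ARGS...]   (hypothetical helper, not an Open MPI tool)
# Runs CMD under GNU coreutils `timeout` and prints one of:
#   HANG          - CMD was still running after SECONDS and was terminated
#   EXIT <status> - CMD finished on its own with that exit status
# Caveat: a command that itself exits with status 124 is reported as HANG.
classify_run() {
    limit=$1
    shift
    # Discard the command's output; only its exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG"
    else
        echo "EXIT $rc"
    fi
}
```

For instance, classify_run 60 mpirun -np 1 --host foo this_does_not_exist would be expected to print HANG on an affected build, while the local-node case should print EXIT followed by a non-zero status. The 60-second limit is an arbitrary choice for illustration.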