mpirun of non-existent executable hangs #2233

@jsquyres

On master, v2.0.x, and v2.x, if the application launch fails on a non-local node, mpirun hangs. More specifically, the orteds detect the failed launch (e.g., if you specify a non-existent executable), but that error never seems to make it over to the errmgr.

For example:

$ salloc -N 1
$ mpirun -np 1 --mca state_base_verbose 100 --mca errmgr_base_verbose 100 this_does_not_exist
...lots of output...
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:252
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:451
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE PENDING APP LAUNCH PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1046
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE PROC [[9286,1],0] STATE RUNNING AT base/plm_base_receive.c:333
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING PROC [[9286,1],0] STATE RUNNING PRI 4
[savbu-usnic-a:30016] [[9286,0],0] state:base:track_procs called for proc [[9286,1],0] state RUNNING
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATE JOB [9286,1] STATE RUNNING AT base/state_base_fns.c:488
[savbu-usnic-a:30016] [[9286,0],0] ACTIVATING JOB [9286,1] STATE RUNNING PRI 4
[hang]

Outside of a SLURM allocation, mpirun -np 1 --host foo this_does_not_exist hangs in the same way.

By contrast, mpirun -np 1 this_does_not_exist, when executed on the local node, does not hang. Instead, it displays a show_help message and exits with a non-zero status (this is on 2.0.x):

$ mpirun this_does_not_exist
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       savbu-usnic-a
Executable: this_does_not_exist
--------------------------------------------------------------------------
16 total processes failed to start
$ echo $status
134
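
To tell the two behaviors apart without watching the terminal, the launch can be wrapped in a time limit. The following is only a rough sketch: classify_launch is a hypothetical helper name, it assumes GNU coreutils timeout is installed (which exits with status 124 when the limit expires), and the time limit is an arbitrary choice rather than anything Open MPI defines.

```shell
#!/bin/sh
# classify_launch SECONDS CMD [ARGS...]   (hypothetical helper, not part of Open MPI)
# Runs CMD under a time limit and prints one word describing the outcome.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit expires.
classify_launch() {
    limit=$1
    shift
    # Discard the command's own output; only its exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG"             # command was still running when the limit expired
    elif [ "$rc" -eq 0 ]; then
        echo "OK"               # command exited successfully
    else
        echo "FAILED rc=$rc"    # command exited on its own with an error status
    fi
}
```

For example, classify_launch 30 mpirun -np 1 --host foo this_does_not_exist would be expected to print HANG on an affected build, while classify_launch 30 mpirun -np 1 this_does_not_exist should print a FAILED line, since the local case exits by itself with a non-zero status.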
