Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 20 additions & 14 deletions orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
Original file line number Diff line number Diff line change
Expand Up @@ -443,27 +443,32 @@ static void proc_errors(int fd, short args, void *cbdata)
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(proc)));
/* record the first one to fail */
if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_ABORTED)) {
/* output an error message so the user knows what happened */
orte_show_help("help-errmgr-base.txt", "node-died", true,
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
orte_process_info.nodename,
ORTE_NAME_PRINT(proc),
pptr->node->name);
/* mark the daemon job as failed */
jdata->state = ORTE_JOB_STATE_COMM_FAILED;
/* point to the lowest rank to cause the problem */
orte_set_attribute(&jdata->attributes, ORTE_JOB_ABORTED_PROC, ORTE_ATTR_LOCAL, pptr, OPAL_PTR);
/* retain the object so it doesn't get free'd */
OBJ_RETAIN(pptr);
ORTE_FLAG_SET(jdata, ORTE_JOB_FLAG_ABORTED);
/* update our exit code */
ORTE_UPDATE_EXIT_STATUS(pptr->exit_code);
/* just in case the exit code hadn't been set, do it here - this
* won't override any reported exit code */
ORTE_UPDATE_EXIT_STATUS(ORTE_ERR_COMM_FAILURE);
if (!orte_enable_recovery) {
/* output an error message so the user knows what happened */
orte_show_help("help-errmgr-base.txt", "node-died", true,
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
orte_process_info.nodename,
ORTE_NAME_PRINT(proc),
pptr->node->name);
/* update our exit code */
ORTE_UPDATE_EXIT_STATUS(pptr->exit_code);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why the exit code should be set twice ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Urrr...it isn't? We set the status on the proc itself, and then we set the status for mpirun separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't understand. ORTE_UPDATE_EXIT_STATUS is used twice in this block of code (461 and 464). If I look at the macro definition it is protected by 0 == orte_exit_status, so the second invocation is never successful as the first one will already define orte_exit_status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I assumed (and this might have been an error on my part) that pptr->exit_codemust be non-zero as we are in the error case and that process somehow failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i realized after i thought for a minute that this would be confusing. Actually, we wrote it this way because there were use-cases where we wound up with a zero exit status on the process (e.g., when slurm killed it and we don't get an exit status back). So we do it the second time just to be absolutely certain we return a non-zero status out of mpirun.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I realize after writing the commit that there was a possible path in which having 2 one after the other make sense. But, I agree with you it was a little confusing.

/* just in case the exit code hadn't been set, do it here - this
* won't override any reported exit code */
ORTE_UPDATE_EXIT_STATUS(ORTE_ERR_COMM_FAILURE);
}
}
/* if recovery is enabled, then we are done - otherwise,
* abort the system */
if (!orte_enable_recovery) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is exactly the same as above. merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually isn't quite the same. The prior code block only gets executed if we aren't already aborting the job. We want to protect the call to abort the HNP as that code can be executed even if we aren't already aborting the job. I'll take another look in case I'm missing something and we actually cannot go down the separate paths.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I missed the second closing bracket, so we are not in the same code-block.

default_hnp_abort(jdata);
}
/* abort the system */
default_hnp_abort(jdata);
goto cleanup;
}

Expand Down Expand Up @@ -498,7 +503,8 @@ static void proc_errors(int fd, short args, void *cbdata)
keep_going:
/* if this is a continuously operating job, then there is nothing more
* to do - we let the job continue to run */
if (orte_get_attribute(&jdata->attributes, ORTE_JOB_CONTINUOUS_OP, NULL, OPAL_BOOL)) {
if (orte_get_attribute(&jdata->attributes, ORTE_JOB_CONTINUOUS_OP, NULL, OPAL_BOOL) ||
ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RECOVERABLE)) {
/* always mark the waitpid as having fired */
ORTE_ACTIVATE_PROC_STATE(&pptr->name, ORTE_PROC_STATE_WAITPID_FIRED);
/* if this is a remote proc, we won't hear anything more about it
Expand Down
6 changes: 4 additions & 2 deletions orte/mca/plm/slurm/plm_slurm_module.c
Original file line number Diff line number Diff line change
Expand Up @@ -267,8 +267,10 @@ static void launch_daemons(int fd, short args, void *cbdata)
/* start one orted on each node */
opal_argv_append(&argc, &argv, "--ntasks-per-node=1");

/* alert us if any orteds die during startup */
opal_argv_append(&argc, &argv, "--kill-on-bad-exit");
if (!orte_enable_recovery) {
/* kill the job if any orteds die */
opal_argv_append(&argc, &argv, "--kill-on-bad-exit");
}

/* ensure the orteds are not bound to a single processor,
* just in case the TaskAffinity option is set by default.
Expand Down
2 changes: 1 addition & 1 deletion orte/tools/orte-clean/orte-clean.c
Original file line number Diff line number Diff line change
Expand Up @@ -183,7 +183,7 @@ main(int argc, char *argv[])
free(legacy);

/* and finally get rid of any lingering pmix-related artifacts */
asprintf(&legacy, "rm -f %s/pmix*", orte_process_info.tmpdir_base);
asprintf(&legacy, "rm -rf %s/pmix*", orte_process_info.tmpdir_base);
system(legacy);
free(legacy);

Expand Down