Enable ORTE to continue running when a node fails #3772
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -443,27 +443,32 @@ static void proc_errors(int fd, short args, void *cbdata) | |
| ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(proc))); | ||
| /* record the first one to fail */ | ||
| if (!ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_ABORTED)) { | ||
| /* output an error message so the user knows what happened */ | ||
| orte_show_help("help-errmgr-base.txt", "node-died", true, | ||
| ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), | ||
| orte_process_info.nodename, | ||
| ORTE_NAME_PRINT(proc), | ||
| pptr->node->name); | ||
| /* mark the daemon job as failed */ | ||
| jdata->state = ORTE_JOB_STATE_COMM_FAILED; | ||
| /* point to the lowest rank to cause the problem */ | ||
| orte_set_attribute(&jdata->attributes, ORTE_JOB_ABORTED_PROC, ORTE_ATTR_LOCAL, pptr, OPAL_PTR); | ||
| /* retain the object so it doesn't get free'd */ | ||
| OBJ_RETAIN(pptr); | ||
| ORTE_FLAG_SET(jdata, ORTE_JOB_FLAG_ABORTED); | ||
| /* update our exit code */ | ||
| ORTE_UPDATE_EXIT_STATUS(pptr->exit_code); | ||
| /* just in case the exit code hadn't been set, do it here - this | ||
| * won't override any reported exit code */ | ||
| ORTE_UPDATE_EXIT_STATUS(ORTE_ERR_COMM_FAILURE); | ||
| if (!orte_enable_recovery) { | ||
| /* output an error message so the user knows what happened */ | ||
| orte_show_help("help-errmgr-base.txt", "node-died", true, | ||
| ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), | ||
| orte_process_info.nodename, | ||
| ORTE_NAME_PRINT(proc), | ||
| pptr->node->name); | ||
| /* update our exit code */ | ||
| ORTE_UPDATE_EXIT_STATUS(pptr->exit_code); | ||
| /* just in case the exit code hadn't been set, do it here - this | ||
| * won't override any reported exit code */ | ||
| ORTE_UPDATE_EXIT_STATUS(ORTE_ERR_COMM_FAILURE); | ||
| } | ||
| } | ||
| /* if recovery is enabled, then we are done - otherwise, | ||
| * abort the system */ | ||
| if (!orte_enable_recovery) { | ||
|
Member
This test is exactly the same as above. Merge?
Contributor
Author
It actually isn't quite the same. The prior code block only gets executed if we aren't already aborting the job. We want to protect the call to abort the HNP, as that code can be executed even if we aren't already aborting the job. I'll take another look in case I'm missing something and we actually cannot go down the separate paths.
Member
OK, I missed the second closing bracket, so we are not in the same code block. |
||
| default_hnp_abort(jdata); | ||
| } | ||
| /* abort the system */ | ||
| default_hnp_abort(jdata); | ||
| goto cleanup; | ||
| } | ||
|
|
||
|
|
@@ -498,7 +503,8 @@ static void proc_errors(int fd, short args, void *cbdata) | |
| keep_going: | ||
| /* if this is a continuously operating job, then there is nothing more | ||
| * to do - we let the job continue to run */ | ||
| if (orte_get_attribute(&jdata->attributes, ORTE_JOB_CONTINUOUS_OP, NULL, OPAL_BOOL)) { | ||
| if (orte_get_attribute(&jdata->attributes, ORTE_JOB_CONTINUOUS_OP, NULL, OPAL_BOOL) || | ||
| ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RECOVERABLE)) { | ||
| /* always mark the waitpid as having fired */ | ||
| ORTE_ACTIVATE_PROC_STATE(&pptr->name, ORTE_PROC_STATE_WAITPID_FIRED); | ||
| /* if this is a remote proc, we won't hear anything more about it | ||
|
|
||
Any reason why the exit code should be set twice?
Urrr...it isn't? We set the status on the proc itself, and then we set the status for mpirun separately.
OK, I don't understand. `ORTE_UPDATE_EXIT_STATUS` is used twice in this block of code (461 and 464). If I look at the macro definition, it is protected by `0 == orte_exit_status`, so the second invocation never takes effect, as the first one will already have set `orte_exit_status`.
Here I assumed (and this might have been an error on my part) that `pptr->exit_code` must be non-zero, as we are in the error case and that process somehow failed.
Yeah, I realized after I thought for a minute that this would be confusing. Actually, we wrote it this way because there were use cases where we wound up with a zero exit status on the process (e.g., when Slurm killed it and we didn't get an exit status back). So we do it the second time just to be absolutely certain we return a non-zero status out of mpirun.
OK, I realized after writing the commit that there was a possible path in which having two calls one after the other makes sense. But I agree with you that it was a little confusing.
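For anyone reading this thread later, here is a minimal standalone sketch of the behaviour discussed above. It assumes only what the thread states: that the update is guarded by a `0 == status` check, so the first non-zero value wins. All `sketch_` and `SKETCH_` names are hypothetical stand-ins and not the ORTE definitions, and the error value `-1` is an assumed placeholder.

```c
/* Hypothetical stand-in for a non-zero communication-failure code
 * (assumption: the real ORTE_ERR_COMM_FAILURE is some non-zero value). */
#define SKETCH_ERR_COMM_FAILURE (-1)

/* Hypothetical stand-in for the global exit status reported by mpirun. */
static int sketch_exit_status = 0;

/* First non-zero writer wins: only record a status if none is set yet.
 * This mirrors the "0 == orte_exit_status" guard described in the thread. */
#define SKETCH_UPDATE_EXIT_STATUS(newstatus)      \
    do {                                          \
        if (0 == sketch_exit_status) {            \
            sketch_exit_status = (newstatus);     \
        }                                         \
    } while (0)

/* Models the two back-to-back updates from the diff for one failed proc.
 * Resets the status first so each call is an independent scenario. */
int sketch_report_failure(int proc_exit_code)
{
    sketch_exit_status = 0;
    /* 1st update: use whatever exit code the proc reported (may be 0). */
    SKETCH_UPDATE_EXIT_STATUS(proc_exit_code);
    /* 2nd update: only takes effect if the 1st one left the status at 0. */
    SKETCH_UPDATE_EXIT_STATUS(SKETCH_ERR_COMM_FAILURE);
    return sketch_exit_status;
}
```

With a non-zero proc exit code, the second update is a no-op and the proc's code is returned. With a zero proc exit code (the Slurm case mentioned above), the first update leaves the status at zero and the second one supplies the non-zero fallback.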