Fatal error on terminal job #1370
Conversation
- … failed terminal state
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
- …om condition in fatal error. Add e2e test.
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
I think we should talk about this... I'm not sure whether job = complete == healthy is accidental or intentional... I find it odd.
LGTM! I'm not sure, though, whether we should only look for job failures with `DeadlineExceeded`, or whether there are other failure reasons that would indicate a transient error.
```go
func isJobTerminallyFailed(job *batchv1.Job) (bool, string) {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
```
Does it make sense to check for `c.Reason == "DeadlineExceeded"` as well?
I had that in at first, with special cases for `DeadlineExceeded` and `BackoffLimitExceeded`, but I don't think it makes sense to distinguish here: if the condition is `Failed`, the job won't complete, so we abort. The reason is included in the error message and should show up in the FatalError of the task.
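The approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual KUDO code: the type definitions are simplified stand-ins for the real `k8s.io/api` `batchv1`/`corev1` types, and the exact error-message format is an assumption.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes batchv1/corev1 types
// (illustration only; the real types live in k8s.io/api).
type ConditionType string
type ConditionStatus string

const (
	JobFailed     ConditionType   = "Failed"
	ConditionTrue ConditionStatus = "True"
)

type JobCondition struct {
	Type    ConditionType
	Status  ConditionStatus
	Reason  string
	Message string
}

type Job struct {
	Conditions []JobCondition
}

// isJobTerminallyFailed reports whether the job has a Failed=True
// condition. No special-casing of DeadlineExceeded vs.
// BackoffLimitExceeded: either way the job won't complete, and the
// reason is surfaced in the returned message.
func isJobTerminallyFailed(job *Job) (bool, string) {
	for _, c := range job.Conditions {
		if c.Type == JobFailed && c.Status == ConditionTrue {
			return true, fmt.Sprintf("job failed: %s (%s)", c.Reason, c.Message)
		}
	}
	return false, ""
}

func main() {
	job := &Job{Conditions: []JobCondition{
		{
			Type:    JobFailed,
			Status:  ConditionTrue,
			Reason:  "BackoffLimitExceeded",
			Message: "Job has reached the specified backoff limit",
		},
	}}
	failed, msg := isJobTerminallyFailed(job)
	fmt.Println(failed, msg)
	// → true job failed: BackoffLimitExceeded (Job has reached the specified backoff limit)
}
```

Collapsing all `Failed` reasons into one terminal state keeps the caller simple: it only has to decide "abort or keep waiting", while the reason string still tells the user *why* the job failed.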
Yeah, let's have a talk about this later. I think it makes sense in the case of a Job: it either completes or it doesn't. For something like the …
* Return a fatal engine error if a job never gets healthy but reaches a failed terminal state
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
  Signed-off-by: Thomas Runyon <runyontr@gmail.com>
What this PR does / why we need it:
At the moment, the TaskApply waits for all resources to become healthy. Healthy for a job is defined as `Successful`. But there are jobs that may never reach that state, because they fail and reach their `backoffLimit`. KUDO should acknowledge this and set the task to a `FATAL_ERROR`, so that the user gets feedback and a different job can be executed.
Fixes #1367
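For illustration, here is a hypothetical Job spec that can hit exactly this situation: once its retries are exhausted, the Job gets a `Failed=True` condition and will never become `Successful`, so waiting for health would block forever. The name and image are made up; `backoffLimit` and `activeDeadlineSeconds` are the standard Kubernetes Job fields.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # hypothetical name
spec:
  backoffLimit: 3            # after 3 failed retries the Job gets a Failed=True
                             # condition with reason BackoffLimitExceeded
  activeDeadlineSeconds: 60  # exceeding this deadline yields reason DeadlineExceeded
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "exit 1"]  # always fails, so the Job can never succeed
```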