Fatal error on terminal job #1370
Conversation
- … failed terminal state
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
- …om condition in fatal error. Add e2e test.
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
I think we should talk about this... I'm not sure whether job = complete == healthy is accidental or intentional... I find it odd.
LGTM! I'm not sure, though, whether we should only look for job failures with `DeadlineExceeded`, or whether there are other failure reasons that would indicate a transient error.
```go
func isJobTerminallyFailed(job *batchv1.Job) (bool, string) {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
```
Does it make sense to check for `c.Reason == "DeadlineExceeded"` as well?
I had that in at first, with special cases for `DeadlineExceeded` and `BackoffLimitExceeded`, but I don't think it makes sense to distinguish here: if the condition is `Failed`, the job won't complete, so we abort. The reason is included in the error message and should show up in the FatalError of the task.
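The approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual KUDO code: the type definitions are simplified stand-ins for the real `k8s.io/api` `batchv1`/`corev1` types, and the exact error-message format is an assumption.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes batchv1/corev1 types
// (illustration only; the real types live in k8s.io/api).
type ConditionType string
type ConditionStatus string

const (
	JobFailed     ConditionType   = "Failed"
	ConditionTrue ConditionStatus = "True"
)

type JobCondition struct {
	Type    ConditionType
	Status  ConditionStatus
	Reason  string
	Message string
}

type Job struct {
	Conditions []JobCondition
}

// isJobTerminallyFailed reports whether the job has a Failed=True
// condition. No special-casing of DeadlineExceeded vs.
// BackoffLimitExceeded: either way the job won't complete, and the
// reason is surfaced in the returned message.
func isJobTerminallyFailed(job *Job) (bool, string) {
	for _, c := range job.Conditions {
		if c.Type == JobFailed && c.Status == ConditionTrue {
			return true, fmt.Sprintf("job failed: %s (%s)", c.Reason, c.Message)
		}
	}
	return false, ""
}

func main() {
	job := &Job{Conditions: []JobCondition{
		{
			Type:    JobFailed,
			Status:  ConditionTrue,
			Reason:  "BackoffLimitExceeded",
			Message: "Job has reached the specified backoff limit",
		},
	}}
	failed, msg := isJobTerminallyFailed(job)
	fmt.Println(failed, msg)
	// → true job failed: BackoffLimitExceeded (Job has reached the specified backoff limit)
}
```

Collapsing all `Failed` reasons into one terminal state keeps the caller simple: it only has to decide "abort or keep waiting", while the reason string still tells the user *why* the job failed.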
Yeah, let's have a talk about this later. I think it makes sense in the case of a Job: it either completes or it doesn't. For something like the …
* Return a fatal engine error if a job never gets healthy but reaches a failed terminal state
  Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
  Signed-off-by: Thomas Runyon <runyontr@gmail.com>
What this PR does / why we need it:
At the moment, the TaskApply waits for all resources to become healthy. Healthy for a job is defined as `Successful`. But there are jobs that may never reach that state, because they fail and reach their `backoffLimit`. KUDO should acknowledge this and set the task to a `FATAL_ERROR`, so that the user gets feedback and a different job can be executed.
Fixes #1367
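For illustration, here is a hypothetical Job spec that can hit exactly this situation: once its retries are exhausted, the Job gets a `Failed=True` condition and will never become `Successful`, so waiting for health would block forever. The name and image are made up; `backoffLimit` and `activeDeadlineSeconds` are the standard Kubernetes Job fields.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # hypothetical name
spec:
  backoffLimit: 3            # after 3 failed retries the Job gets a Failed=True
                             # condition with reason BackoffLimitExceeded
  activeDeadlineSeconds: 60  # exceeding this deadline yields reason DeadlineExceeded
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "exit 1"]  # always fails, so the Job can never succeed
```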