
Fatal error on terminal job #1370

Merged: 3 commits merged into master on Mar 3, 2020
Conversation

ANeumann82 (Member):

What this PR does / why we need it:

At the moment, TaskApply waits for all resources to become healthy. "Healthy" for a Job is defined as "Successful", but there are Jobs that may never reach that state because they fail and hit their backoffLimit.

KUDO should acknowledge this and set the task to FATAL_ERROR, so that the user gets feedback and a different job can be executed.

Fixes #1367
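
For context, a rough sketch of the idea (illustrative only, not KUDO's actual code; the jobHealth helper below is made up): a Job with a true Complete condition is healthy, a Job with a true Failed condition (e.g. reason BackoffLimitExceeded or DeadlineExceeded) will never become healthy, and anything else is still in progress.

package health

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobHealth sketches the proposed behaviour: nil means healthy, an ordinary
// error means "keep waiting", and a terminally failed Job is reported with
// its failure reason so the task can be set to FATAL_ERROR.
func jobHealth(job *batchv1.Job) error {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobComplete && c.Status == corev1.ConditionTrue {
			return nil // the Job finished successfully
		}
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			// Terminal state: waiting longer will not make the Job complete.
			return fmt.Errorf("job %q failed terminally: %s: %s", job.Name, c.Reason, c.Message)
		}
	}
	return fmt.Errorf("job %q is still running", job.Name)
}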

Return a fatal engine error if a job never gets healthy but reaches a failed terminal state

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
pkg/engine/health/health.go (outdated review thread, resolved)
…om condition in fatal error

Add e2e test

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
kensipe (Member) commented Mar 3, 2020:

I think we should talk about this... I'm not sure if job = complete == health is accidental or intentional... I find it odd

nfnt (Member) left a comment:

LGTM! Not sure though if we should only look for job failures with "DeadlineExceeded" or if there are other failures that would indicate a transient error.


func isJobTerminallyFailed(job *batchv1.Job) (bool, string) {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
Member (inline review comment):
Does it make sense to check for c.Reason == "DeadlineExceeded" as well?

ANeumann82 (Member, Author):
I had that at first, with special cases for DeadlineExceeded and BackoffLimitExceeded, but I don't think it makes sense to distinguish here: if the condition is Failed, the job won't complete, so we abort. The reason is included in the error message and should show up in the FatalError of the task.
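
To make that concrete, here is a sketch (the completion of isJobTerminallyFailed is a guess at the lines omitted from the excerpt above, and the fatalError type and checkJob caller are invented for illustration, not KUDO's engine API): the condition's Reason and Message travel inside the returned message, so BackoffLimitExceeded and DeadlineExceeded need no special casing to show up in the task's FatalError.

package health

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Plausible completion of the excerpt above; the real function in
// pkg/engine/health/health.go may differ in detail.
func isJobTerminallyFailed(job *batchv1.Job) (bool, string) {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			// The reason (e.g. BackoffLimitExceeded) rides along in the message.
			return true, fmt.Sprintf("%s: %s", c.Reason, c.Message)
		}
	}
	return false, ""
}

// fatalError stands in for however the engine marks non-retryable errors;
// the name is invented for this sketch.
type fatalError struct{ msg string }

func (e fatalError) Error() string { return e.msg }

// checkJob is a hypothetical caller: a terminally failed Job immediately
// becomes a fatal, non-retryable error instead of being waited on forever.
func checkJob(job *batchv1.Job) error {
	if failed, msg := isJobTerminallyFailed(job); failed {
		return fatalError{msg: fmt.Sprintf("job %q has failed: %s", job.Name, msg)}
	}
	return nil // not terminally failed; the normal health check applies
}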

ANeumann82 (Member, Author), quoting kensipe:
I think we should talk about this... I'm not sure if job = complete == health is accidental or intentional... I find it odd

Yeah, let's have a talk about this later. I think it makes sense in the case of a Job - it either completes or it doesn't. For something like the CronJob, which spawns new jobs regularly, it's more complicated.

ANeumann82 merged commit 6871b2f into master on Mar 3, 2020.
ANeumann82 deleted the an/fatal-error-on-terminal-job branch on March 3, 2020, 16:03.
runyontr pushed a commit that referenced this pull request Mar 11, 2020
* Return a fatal engine error if a job never gets healthy but reaches a failed terminal state

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Thomas Runyon <runyontr@gmail.com>
Merging this pull request may close: Health status of Jobs is incomplete (#1367)