Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include a Failed condition in Workloads #1982

Closed
3 tasks
alculquicondor opened this issue Apr 15, 2024 · 15 comments · Fixed by #2026
Closed
3 tasks

Include a Failed condition in Workloads #1982

alculquicondor opened this issue Apr 15, 2024 · 15 comments · Fixed by #2026
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alculquicondor
Copy link
Contributor

alculquicondor commented Apr 15, 2024

What would you like to be added:

A Failed condition when a Workload finishes because the corresponding Job failed, with a reason coming from the job integration.

Just a reason in the existing Finished condition could work, although that could lead to different job integrations choosing different strings for the common idea of "failed".

Why is this needed:

Improved visibility.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 15, 2024
@alculquicondor
Copy link
Contributor Author

I remember I discussed this with someone in the past. Was it you @astefanutti ?

@astefanutti
Copy link
Member

@alculquicondor that is likely me :) I've also touched on this with @mimowo a little while.

For the UI we're building (and generally for other services that could integrate with Kueue) the Workload API provides a convenient abstraction over the different underlying workload types.

There is already the Finished condition that's set when the workload has completed, though most of the time other services / UIs also need to differentiate successful completion from failed one. At the moment, each of these services / UIs has to re-implement the logic / heuristic for each integrations.

It seems that logic could be encapsulated into the framework integrations, and increase the added-value of the Workload API even more.

@alculquicondor
Copy link
Contributor Author

Everyone ok with using a Reason to indicate failure?

Adding an additional condition could also be an option, but the current reason is just a placeholder.

cc @tenzen-y

@tenzen-y
Copy link
Member

Actually, this discussion happened at the KubeCon Paris 2024 with @mimowo, @astefanutti, and me :)
Also, I agree with this idea.

@alculquicondor
Copy link
Contributor Author

/assign @trasc

@trasc
Copy link
Contributor

trasc commented Apr 16, 2024

The reason and message of the finished condition should already provide this, Am I missing something?

@astefanutti
Copy link
Member

The purpose of this is to have the reason be "fixed", like Failed, computed by each framework integration, similarly to how GenericJob.Finished is implemented.

@alculquicondor
Copy link
Contributor Author

In the case of Job, we have reason JobFinished regardless of whether the job finished successfully or not:

Reason: "JobFinished",

So the question is whether we want to change all implementations of Finished to have a common reason for the case of "Failed" or add an additional condition based on a new Failed method in the GenericJob interface.

A separate condition could give us a standard condition plus additional granularity for failure reasons.

@alculquicondor
Copy link
Contributor Author

Still, it might be worth it to start just with a standard Failed reason in the existing Finished condition.

@astefanutti
Copy link
Member

Right, as you said previously, the current reason is just a placeholder, so this can probably be changed.

As to whether introduce a separate condition, the batch Job API has one:

JobFailed JobConditionType = "Failed"

I don't know what drove that design decision, but if that was for "good" reasons, we could be consistent with it.

@tenzen-y
Copy link
Member

Right, as you said previously, the current reason is just a placeholder, so this can probably be changed.

As to whether introduce a separate condition, the batch Job API has one:

JobFailed JobConditionType = "Failed"

I don't know what drove that design decision, but if that was for "good" reasons, we could be consistent with it.

As far as I understand, the Job API was provided before the standardized metav1.Conditions including reason.
So, currently, I'm curious about whether we can use the Reason field as @alculquicondor mentioned above.

@alculquicondor
Copy link
Contributor Author

I think we should just start simple. The current reason JobFinished is not helpful at all.

@astefanutti
Copy link
Member

That sounds good to me.

@mimowo
Copy link
Contributor

mimowo commented Apr 18, 2024

I like this idea of starting simple and using reason in the Finished condition. Instead of the generic reason JobFinished we would have two: Failed and Succeeded.

@vladikkuzn
Copy link
Contributor

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants