Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update wl/job handling for the case, when the job is finished #1383

Merged
merged 2 commits into from Dec 1, 2023

Conversation

achernevskii
Copy link
Contributor

@achernevskii achernevskii commented Nov 30, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Workload is not being recreated, when the job is finished.

  • Job will be stopped if it's not suspended and matching workload is not found. This action doesn't depend on the job.Finished condition anymore.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Nov 30, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2023
Copy link

netlify bot commented Nov 30, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 9eb11c5
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/656a47ba7af24100089af66b

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 30, 2023
@tenzen-y
Copy link
Member

@achernevskii Why do we need this change? I couldn't see what is bug.

@achernevskii
Copy link
Contributor Author

@achernevskii Why do we need this change? I couldn't see what is bug.

It's not a bug. The behavior has changed since v0.4.x in those two PRs:
https://github.com/kubernetes-sigs/kueue/pull/1072/files#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR431
https://github.com/kubernetes-sigs/kueue/pull/1033/files#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR186
And now workload is being recreated after the job is Finished.

After this discussion I've decided to go back to the original reconciler logic:
#1373 (comment)

if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, "No matching Workload"); err != nil {
return nil, fmt.Errorf("stopping job with no matching workload: %w", err)
}
if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, "No matching Workload"); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only call stop if it's not already finished. Otherwise it's just a waste of API calls.

Is there a case where we need this for pod groups? It doesn't look like it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothetically, a job could be finished but unsuspended. I don't think it's possible for any of the existing integrations, but it could be possible from the GeneticJob interface perspective.

If we're trying to minimize the number of API calls in this case, shouldn't we call IsSuspended here instead of Finished?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsSuspended is just about the suspend field. It would still return false for the job if it's finished but the suspend field is false. That's why we were still checking for IsFinished as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Finished==true will always mean that no pods related to the job are active?

For the current reconciler, workload will be recreated. So if a job is finished, but not suspended, and new workload is not admitted, the job will be stopped:
https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobframework/reconciler.go#L367

With this implementation we have to be sure that the job is stopped when we find out that workload is not found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Finished==true will always mean that no pods related to the job are active?

IIUC, Finished==true means that the job has succeeded or failed regardless of being suspended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Finished==true will always mean that no pods related to the job are active?

It should, unless the job implemented IsFinished wrong.

@tenzen-y
Copy link
Member

tenzen-y commented Dec 1, 2023

@achernevskii Why do we need this change? I couldn't see what is bug.

It's not a bug. The behavior has changed since v0.4.x in those two PRs: https://github.com/kubernetes-sigs/kueue/pull/1072/files#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR431 https://github.com/kubernetes-sigs/kueue/pull/1033/files#diff-552ab657667de70900b502e2617bcb405fbf4203ff08b2a35d7c87f2d637a75bR186 And now workload is being recreated after the job is Finished.

After this discussion I've decided to go back to the original reconciler logic: #1373 (comment)

That makes sense. Thanks for the clarification.

/remove-kind-bug
/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Dec 1, 2023
@tenzen-y
Copy link
Member

tenzen-y commented Dec 1, 2023

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Dec 1, 2023
Comment on lines 269 to 267
if err != nil {
log.Error(err, "Updating workload status")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 this might have escaped previous reviews.

If the Workload update fails, we should return the error, so it can be reconciled again (except for NotFound errors).

Another thing that escaped:

Finalizing the job should happen after we update the Workload. Otherwise, the job and the Workload (which is a dependent object) could potentially be deleted before it was transitioned to Finished. This could be fine, but it's better to give complete information before removing the Workload, in case a user is watching for the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the job finalization/wl status update order: #1177

pkg/controller/jobs/job/job_controller_test.go Outdated Show resolved Hide resolved
@@ -1408,6 +1408,47 @@ func TestReconciler(t *testing.T) {
},
workloadCmpOpts: defaultWorkloadCmpOpts,
},
"if a pod group is finished and the wl is deleted, new workload shouldn't be created": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add another one (if it isn't there)?

If a pod group is not finished (for example, it has a non-terminated gated Pod), a workload should be recreated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're indirectly checking for this scenario here:
https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/pod/pod_controller_test.go#L1280

But I've added a test with gated pod anyway.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 1, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2023
* Workload is not being recreated, when the job is
  finished.

* If related workload is not found, the job will be stopped
  only if it's not finished.
if wl != nil && !apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) {
err := workload.UpdateStatus(ctx, r.client, wl, condition.Type, condition.Status, condition.Reason, condition.Message, constants.JobControllerName)
if err != nil && !apierrors.IsNotFound(err) {
log.Error(err, "Updating workload status")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this line, as returned errors are logged as well.

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achernevskii, alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e01b5e0a14ad32dc03cc44e76754e9a23505bec4

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2023
@k8s-ci-robot k8s-ci-robot merged commit dbca421 into kubernetes-sigs:main Dec 1, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Dec 1, 2023
@achernevskii achernevskii deleted the fix/finished_job_handling branch December 4, 2023 05:28
@alculquicondor
Copy link
Contributor

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2024
@alculquicondor
Copy link
Contributor

/remove-kind cleanup

@k8s-ci-robot k8s-ci-robot removed the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Jan 11, 2024
@alculquicondor
Copy link
Contributor

/release-note-edit

Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants