fix: remove finalizer if workload finished #1454

Closed

Conversation

woehrl01
Contributor

@woehrl01 woehrl01 commented Dec 14, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Removes the finalizer from a workload once the workload has finished, for cases where the finalizer was not removed as part of job deletion (e.g. because of a controller restart).

Which issue(s) this PR fixes:

Fixes #1450

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 14, 2023

netlify bot commented Dec 14, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 81b8ab1
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/657c723a5a40fe0008805ad5

@k8s-ci-robot
Contributor

Welcome @woehrl01!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 14, 2023
@k8s-ci-robot
Contributor

Hi @woehrl01. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

linux-foundation-easycla bot commented Dec 14, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 14, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 14, 2023
@woehrl01 woehrl01 changed the title cleanup finalizer workload fix: remove finalizer if workload finished Dec 14, 2023
@woehrl01 woehrl01 marked this pull request as ready for review December 14, 2023 07:58
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 14, 2023
@alculquicondor
Contributor

cc @achernevskii
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 14, 2023
@alculquicondor
Contributor

Please add unit tests

@alculquicondor
Contributor

alculquicondor commented Dec 14, 2023

Actually
/hold

Can you debug why this logic didn't remove the finalizer in this case?

return ctrl.Result{}, r.removeFinalizer(ctx, wl)

UPDATE: #1450 (comment)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
Comment on lines 294 to 299
func (r *WorkloadReconciler) removeFinalizer(ctx context.Context, wl *kueue.Workload) error {
	if controllerutil.RemoveFinalizer(wl, kueue.ResourceInUseFinalizerName) {
		return r.client.Update(ctx, wl)
	}
	return nil
}
Contributor

There's a similar method in pkg/controller/jobframework/reconciler.go.

func (r *JobReconciler) removeFinalizer(ctx context.Context, wl *kueue.Workload) error {
	if controllerutil.RemoveFinalizer(wl, kueue.ResourceInUseFinalizerName) {
		return r.client.Update(ctx, wl)
	}
	return nil
}

We could convert that method into a util function and use it in both places.

Contributor

I would hold off on this, as I believe the fix should be in the reconciler only: #1450 (comment)

Contributor Author

Alright, I just pushed a change that removes the finalizer when stopJob fails with a not-found error. Any other error, including a failure to remove the finalizer, is returned from the reconcile.
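
The behaviour described in this comment could be sketched roughly as follows. This is a stand-alone illustration using only the standard library: `errNotFound`, `workload`, `removeFinalizer`, and `finalizeAfterStop` are hypothetical stand-ins, not the PR's actual code or Kueue's types. A real controller would more likely check the error with `apierrors.IsNotFound` and persist the change through its client.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes "not found" API error.
var errNotFound = errors.New("object not found")

// workload is a simplified stand-in that only tracks finalizers.
type workload struct {
	Finalizers []string
}

// removeFinalizer drops name from wl.Finalizers and reports whether anything changed.
func removeFinalizer(wl *workload, name string) bool {
	kept := wl.Finalizers[:0]
	removed := false
	for _, f := range wl.Finalizers {
		if f == name {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	wl.Finalizers = kept
	return removed
}

// finalizeAfterStop calls stopJob and removes the finalizer only when the job
// is already gone. Every other error, including a failed update after the
// finalizer was removed, is returned so that the reconcile is retried.
func finalizeAfterStop(wl *workload, name string, stopJob func() error, update func(*workload) error) error {
	err := stopJob()
	switch {
	case err == nil:
		// The job was stopped normally; nothing to clean up in this sketch.
		return nil
	case errors.Is(err, errNotFound):
		if removeFinalizer(wl, name) {
			return update(wl)
		}
		return nil
	default:
		return fmt.Errorf("stopping job: %w", err)
	}
}
```

Passing `stopJob` and `update` as functions is only a convenience for the sketch; it keeps the example free of any client dependency.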

Contributor

Hmm, is the not-found error the case you faced in production?
If removing the finalizer somehow fails, I don't think the next reconcile would be able to reach this piece of code, would it?
Maybe a better solution is closer to what you had initially. But we shouldn't try to remove finalizers from two places, so doing it only in the workload controller is probably better.
Whatever you do, add a unit test.

Contributor

Sorry for the back and forth, but I think I have a better understanding of the problem now

Contributor Author

Yes, the problem I faced is that the Pod is already deleted, but the finalizer is still on the workload resource.

Alright, let me think of an alternative implementation. In the meantime, if you have any ideas, please let me know.

Contributor

To summarize, I think the solution should be:

  • Do finalizer removal from Workload controller
  • Do NOT do any workload finalizer removal from the job reconciler, for better separation of concerns and to keep the controllers from stumbling over each other.
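
As a rough illustration of the first point, the check in the workload controller might look something like the sketch below. It uses simplified stand-in types rather than Kueue's API: `workload`, `condition`, `inUseFinalizer`, `finishedCondition`, and `reconcileFinalizer` are all assumed names chosen for the example.

```go
package main

// Hypothetical names used only for this sketch.
const (
	inUseFinalizer    = "example.com/resource-in-use"
	finishedCondition = "Finished"
)

// condition is a simplified status condition.
type condition struct {
	Type   string
	Status bool
}

// workload is a simplified stand-in for the Workload object.
type workload struct {
	Finalizers []string
	Conditions []condition
}

// isFinished reports whether the workload carries a true finished condition.
func isFinished(wl *workload) bool {
	for _, c := range wl.Conditions {
		if c.Type == finishedCondition && c.Status {
			return true
		}
	}
	return false
}

// reconcileFinalizer is the single place where the finalizer is removed.
// It returns true when the object was modified and needs to be persisted.
func reconcileFinalizer(wl *workload) bool {
	if !isFinished(wl) {
		return false
	}
	for i, f := range wl.Finalizers {
		if f == inUseFinalizer {
			wl.Finalizers = append(wl.Finalizers[:i], wl.Finalizers[i+1:]...)
			return true
		}
	}
	return false
}
```

Because the removal depends only on the workload's own state, it would also be retried on a later reconcile after a controller restart, which is the situation the PR description mentions.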

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 14, 2023
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 14, 2023
@woehrl01 woehrl01 force-pushed the cleanup_finalizer_workload branch 2 times, most recently from 4063476 to ee2a5f7 Compare December 14, 2023 21:23
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: woehrl01
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 15, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2023
@k8s-ci-robot
Contributor

k8s-ci-robot commented Dec 15, 2023

@woehrl01: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-verify-main | 81b8ab1 | link | true | /test pull-kueue-verify-main |
| pull-kueue-test-unit-main | 81b8ab1 | link | true | /test pull-kueue-test-unit-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@woehrl01
Contributor Author

@alculquicondor I'm encountering issues with the final unit tests. Although the logic in the changes seems correct, the jobs need an additional reconciliation loop to be marked as completed, setting the 'JobCompleted' condition on the workload to true. Do you have suggestions for an elegant solution to implement this?

Contributor

@alculquicondor alculquicondor left a comment

In general, in unit tests you should just test the effect of one reconcile loop.

If something requires more than one reconcile loop, you can consider them 2 different test cases. And if the interaction could be complex enough, it's worth considering an integration test.
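
A minimal sketch of that advice, assuming a hypothetical `state` type and `reconcileOnce` function (neither is Kueue code): each table entry exercises exactly one reconcile call, so the two-step flow "job completes, then the finalizer is removed" becomes two separate cases.

```go
package main

import "testing"

// state is a hypothetical, simplified view of a job and its workload.
type state struct {
	JobCompleted     bool // the job has finished running
	WorkloadFinished bool // the workload carries the finished condition
	HasFinalizer     bool // the workload still has its finalizer
}

// reconcileOnce performs at most one transition per call, mimicking the
// effect of a single reconcile loop.
func reconcileOnce(s state) state {
	switch {
	case s.JobCompleted && !s.WorkloadFinished:
		s.WorkloadFinished = true
	case s.WorkloadFinished && s.HasFinalizer:
		s.HasFinalizer = false
	}
	return s
}

// TestReconcileOnce checks the result of a single reconcile per case.
func TestReconcileOnce(t *testing.T) {
	cases := map[string]struct{ in, want state }{
		"completed job marks workload finished": {
			in:   state{JobCompleted: true, HasFinalizer: true},
			want: state{JobCompleted: true, WorkloadFinished: true, HasFinalizer: true},
		},
		"finished workload loses its finalizer": {
			in:   state{JobCompleted: true, WorkloadFinished: true, HasFinalizer: true},
			want: state{JobCompleted: true, WorkloadFinished: true},
		},
		"running job is left untouched": {
			in:   state{HasFinalizer: true},
			want: state{HasFinalizer: true},
		},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := reconcileOnce(tc.in); got != tc.want {
				t.Errorf("reconcileOnce(%+v) = %+v, want %+v", tc.in, got, tc.want)
			}
		})
	}
}
```

The second case starts from the state the first case ends in, which is how a multi-loop interaction can be covered without running more than one reconcile inside a single unit test.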

@@ -0,0 +1,80 @@
package util
Contributor

Why do we need an additional mockClient? Prefer to use the fake client provided by controller-runtime

github.com/prometheus/common v0.44.0 // indirect
github.com/prometheus/procfs v0.11.1 // indirect
github.com/sirupsen/logrus v1.9.3 // indirect
github.com/spf13/cobra v1.7.0 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/stoewer/go-strcase v1.3.0 // indirect
github.com/stretchr/objx v0.5.0 // indirect
github.com/stretchr/testify v1.8.4 // indirect

@woehrl01
Contributor Author

@alculquicondor Because of the upcoming season I have to abandon this PR for a few weeks. Feel free to pick it up in the meantime. I see that there are already some PRs depending on this, and I don't want to introduce a blocker. Thanks!

@woehrl01 woehrl01 marked this pull request as draft December 21, 2023 05:08
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 21, 2023
@woehrl01 woehrl01 closed this Dec 28, 2023
@woehrl01
Contributor Author

Superseded by #1523

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

High churn cluster with pod only causing stuck queue and overcommitment
4 participants