Fix: remove workload finalizer if workload is finished #1523

Merged

Conversation

achernevskii
Contributor

@achernevskii achernevskii commented Dec 27, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Workload will be finalized if it has a Finished condition and no OwnerReferences.

Which issue(s) this PR fixes:

Fixes #1450

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Remove finalizer from Workloads that are orphaned (have no owners).

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Dec 27, 2023

netlify bot commented Dec 27, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit a5181ea
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65a051d464cb780008a71433

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 27, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 27, 2023
@tenzen-y tenzen-y mentioned this pull request Jan 4, 2024
22 tasks
@@ -133,8 +134,8 @@ func (r *WorkloadReconciler) Reconcile(ctx context.Context, req ctrl.Request) (c
ctx = ctrl.LoggerInto(ctx, log)
log.V(2).Info("Reconciling Workload")

-	if apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) {
-		return ctrl.Result{}, nil
+	if apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadFinished) && len(wl.ObjectMeta.OwnerReferences) == 0 {
Member

Why do we need to check if the workload has ownerReference?
Will Argo remove ownerReference from the workload?

Contributor Author

If the workload has an owner reference, the owner job may still need to be stopped:

err := r.stopJob(ctx, job, wl, StopReasonWorkloadDeleted, "Workload is deleted")

If we don't check for owner references here, we introduce a race condition between the workload controller and the job controller.

The workload is finalized here only for the uncommon scenario in which it has no owner references, so no reconcile for the respective job can be queued.

Contributor

I think it should look more like this:

if len(owners) == 0 && deletionTimestamp != nil {
  finalize()
}
if finished {
  return
}

Note that workloads might not have owners in two cases:

Member

@achernevskii Thank you for the clarifications!
As @alculquicondor mentioned, there are some legitimate situations in which the workload doesn't have
an ownerReference. So, I think that we should apply Aldo's suggestion.

Contributor Author

Changed the behavior in f0690dc

Type: "Finished",
Status: "True",
}).
Obj(),
},
Member

Should we add a case for the workload with ownerReference?

Contributor

done

Obj(),
).
Admitted(true).
Delete().
Member

Should we verify in

if the workload has deletionTimestamp instead of adding such a deletionTimestamp?

kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

func RemoveWorkloadFinalizer(ctx context.Context, c client.Client, wl *kueue.Workload) error {
Member

Should we move this function to the workload (pkg/workload) package?

Contributor Author

Done in f0690dc

@tenzen-y
Member

tenzen-y commented Jan 4, 2024

@achernevskii Also, can you fix the CI error?

@alculquicondor
Contributor

/cc

@alculquicondor
Contributor

/assign @trasc

Member

@tenzen-y tenzen-y left a comment

Overall lgtm

Status: "True",
}).
OwnerReference(batchv1.SchemeGroupVersion.String(), "Job", "job", "test-uid", true, true).
Delete().
Member

Wouldn't comparing the deletionTimestamp cause a flaky test?
So, can we verify that the workload has a deletionTimestamp in Line 427 instead of comparing the values?

Contributor

Or you can use EquateApproxTime, or an AcyclicTransformer that only considers whether the values are nil.

Member

We can probably use EquateApproxTime.

Contributor

let's just provide the desired time to the wrapper

Comment on lines 229 to 230
// Delete sets a deletion timestamp for the pod object
func (w *WorkloadWrapper) Delete() *WorkloadWrapper {
Member

Suggested change
-// Delete sets a deletion timestamp for the pod object
-func (w *WorkloadWrapper) Delete() *WorkloadWrapper {
+// DeletionTimestamp sets a deletion timestamp for the pod object
+func (w *WorkloadWrapper) DeletionTimestamp() *WorkloadWrapper {

Contributor

done

Comment on lines 231 to 232
t := metav1.NewTime(time.Now()).Rfc3339Copy()
w.Workload.DeletionTimestamp = &t
Member

Suggested change
-t := metav1.NewTime(time.Now()).Rfc3339Copy()
-w.Workload.DeletionTimestamp = &t
+w.Workload.DeletionTimestamp = ptr.To(metav1.NewTime(time.Now()).Rfc3339Copy())

Contributor

What is this Rfc3339Copy doing? truncating to second?

Contributor Author

What is this Rfc3339Copy doing? truncating to second?

Correct, had to add it because of kubernetes/kubernetes#81026

}

if err == nil && wl != nil {
err = r.removeFinalizer(ctx, wl)
Contributor

why change this? how will this workload be finalized?

Contributor

reverted

@trasc trasc force-pushed the fix/workload_finalizer_removal branch from 160e42f to acbed0c Compare January 11, 2024 14:42
Contributor

@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 1f5b261905a2aafebc4e6108ac5fc65f9e5d10dc

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2024
@alculquicondor
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2024
@alculquicondor
Contributor

Please squash, so that cherry-picking is less cumbersome.

Member

@tenzen-y tenzen-y left a comment

/lgtm
/approve

Please squash, so that cherry-picking is less cumbersome.

@alculquicondor Isn't the default merge method squash?

https://github.com/kubernetes/test-infra/blob/c96b01283ef705e2a3c1f0b836b8ffeef0223c13/config/prow/config.yaml#L783

@tenzen-y
Member

/cherry-pick release-0.5

@k8s-infra-cherrypick-robot

@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Contributor

It is, but the cherry-pick bot still goes by the original commits. Not sure if the individual commits could run into conflicts where the squashed commit wouldn't

@tenzen-y
Member

It is, but the cherry-pick bot still goes by the original commits. Not sure if the individual commits could run into conflicts where the squashed commit wouldn't

I see. I'm fine with squashing manually.

@trasc trasc force-pushed the fix/workload_finalizer_removal branch from acbed0c to a5181ea Compare January 11, 2024 20:38
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
@trasc
Contributor

trasc commented Jan 11, 2024

squashed, but somehow it detected code changes due to rebase. Needs LGTM.

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2024
@trasc
Contributor

trasc commented Jan 11, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 352be45fd720ee5036953f279568a014bda84ee7

@k8s-ci-robot k8s-ci-robot merged commit 336a3cb into kubernetes-sigs:main Jan 11, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Jan 11, 2024
@k8s-infra-cherrypick-robot

@tenzen-y: #1523 failed to apply on top of branch "release-0.5":

Applying: fix: remove finalizer if workload finished
Using index info to reconstruct a base tree...
M	pkg/controller/core/workload_controller.go
M	pkg/controller/jobframework/reconciler.go
M	pkg/controller/jobs/pod/pod_controller_test.go
M	pkg/util/testing/wrappers.go
M	pkg/workload/workload.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/workload/workload.go
Auto-merging pkg/util/testing/wrappers.go
CONFLICT (content): Merge conflict in pkg/util/testing/wrappers.go
Auto-merging pkg/controller/jobs/pod/pod_controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/pod/pod_controller_test.go
Auto-merging pkg/controller/jobframework/reconciler.go
CONFLICT (content): Merge conflict in pkg/controller/jobframework/reconciler.go
Auto-merging pkg/controller/core/workload_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 fix: remove finalizer if workload finished
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Contributor

@tenzen-y do you think it's that important to cherry-pick this?

It's a bit of an edge case.

@tenzen-y
Member

@tenzen-y do you think it's that important to cherry-pick this?

It's a bit of an edge case.

@alculquicondor I think that cherry-picking might be better since we have a chance to release a patch version (v0.5.2).
If we don't have any plan to release a patch version, I'm OK either way.

Are you concerned about the number of conflicts?

@alculquicondor
Contributor

@trasc can you try to get a manual cherry-pick?

trasc pushed a commit to epam/kubernetes-kueue that referenced this pull request Jan 15, 2024
Co-authored-by: Lukas Wöhrl <lukas.woehrl@plentymarkets.com>
@trasc
Contributor

trasc commented Jan 15, 2024

@trasc can you try to get a manual cherry-pick?

done #1583

k8s-ci-robot pushed a commit that referenced this pull request Jan 15, 2024
Co-authored-by: Aleksei Chernevskii <aleksei_chernevskii@epam.com>
Co-authored-by: Lukas Wöhrl <lukas.woehrl@plentymarkets.com>
@tenzen-y
Member

/release-note-edit

Fix a bug in the pod integration where the pod's finalizer is not removed even though the pod has finished.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 15, 2024
@alculquicondor
Contributor

/release-note-edit

Remove finalizer from Workloads that are orphaned (have no owners).

@trasc trasc deleted the fix/workload_finalizer_removal branch March 12, 2024 08:55
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

High churn cluster with pod only causing stuck queue and overcommitment
7 participants