
Fix grace period override used for immediate evictions in eviction manager #119570

Closed

Conversation

claassen

@claassen claassen commented Jul 25, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Currently, when evicting pods due to disk or memory pressure, or when a pod's storage use has exceeded the configured ephemeral-storage resource limit, we use a grace period override of 0 during eviction. Due to the logic here, this actually ends up allowing the pod's full configured grace period rather than performing an immediate eviction:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod_workers.go#L1000-L1002

To actually perform an immediate eviction, we should instead use a grace period override of 1 in these cases.

Which issue(s) this PR fixes:

Fixes #115819

Special notes for your reviewer:

Note that the linked issue also mentions a similar problem when force deleting pods via --grace-period=0 --force; this PR makes no attempt to address that issue. There are some notes saying the --force option uses a grace period of 0 for backwards compatibility, though I am not sure of the reason. Pod deletion using --now (or the equivalent --grace-period=1) behaves as expected, which illustrates that the correct grace period override for immediate eviction is 1 rather than 0.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 25, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 25, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 25, 2023
@k8s-ci-robot
Contributor

Welcome @claassen!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 25, 2023
@k8s-ci-robot
Contributor

Hi @claassen. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: claassen
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 25, 2023
@claassen claassen marked this pull request as ready for review July 25, 2023 20:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2023
@@ -294,7 +295,7 @@ func TestDiskPressureNodeFs_VerifyPodStatus(t *testing.T) {
 wantPodStatus: v1.PodStatus{
 	Phase:  v1.PodFailed,
 	Reason: "Evicted",
-	Message: "The node was low on resource: ephemeral-storage. Threshold quantity: 2Gi, available: 1536Mi. ",
+	Message: "The node was low on resource: ephemeral-storage. Threshold quantity: 2Gi, available: 1536Mi. Container above-requests was using 700Mi, request is 100Mi, has larger consumption of ephemeral-storage. ",
Author
The messages here changed because previously the test helper was not setting the container name on the ContainerStats objects, so the eviction message wasn't picking up the container's resource usage.

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jul 26, 2023
@bart0sh
Contributor

bart0sh commented Aug 2, 2023

/triage accepted
/priority important-soon
/assign @bobbypage @SergeyKanzhelev

@rphillips
Member

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
/test pull-crio-cgroupv1-node-e2e-eviction

@k8s-ci-robot
Contributor

@claassen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction 12a8836 link false /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
pull-crio-cgroupv1-node-e2e-eviction 12a8836 link false /test pull-crio-cgroupv1-node-e2e-eviction

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@tuibeovince

Can't we fix this for all cases if we handle it in pod_workers.go? Something like:

diff --git a/pkg/kubelet/pod_workers.go b/pkg/kubelet/pod_workers.go
index 20e8b493a8f..b90ccad3e5a 100644
--- a/pkg/kubelet/pod_workers.go
+++ b/pkg/kubelet/pod_workers.go
@@ -981,10 +981,12 @@ func calculateEffectiveGracePeriod(status *podSyncStatus, pod *v1.Pod, options *
        // enforce the restriction that a grace period can only decrease and track whatever our value is,
        // then ensure a calculated value is passed down to lower levels
        gracePeriod := status.gracePeriod
+       overriden := false
        // this value is bedrock truth - the apiserver owns telling us this value calculated by apiserver
        if override := pod.DeletionGracePeriodSeconds; override != nil {
                if gracePeriod == 0 || *override < gracePeriod {
                        gracePeriod = *override
+                       overriden = true
                }
        }
        // we allow other parts of the kubelet (namely eviction) to request this pod be terminated faster
@@ -992,12 +994,13 @@ func calculateEffectiveGracePeriod(status *podSyncStatus, pod *v1.Pod, options *
                if override := options.PodTerminationGracePeriodSecondsOverride; override != nil {
                        if gracePeriod == 0 || *override < gracePeriod {
                                gracePeriod = *override
+                               overriden = true
                        }
                }
        }
        // make a best effort to default this value to the pod's desired intent, in the event
        // the kubelet provided no requested value (graceful termination?)
-       if gracePeriod == 0 && pod.Spec.TerminationGracePeriodSeconds != nil {
+       if !overriden && gracePeriod == 0 && pod.Spec.TerminationGracePeriodSeconds != nil {
                gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
        }
        // no matter what, we always supply a grace period of 1

I agree with the perspective that fixing the problem on the pod_workers.go side may be enough to cover all the cases. But since there is a code block that checks for any grace period value < 1 and raises it to 1:

    // no matter what, we always supply a grace period of 1

keeping this PR's fix of assigning a grace period of 1 during evictions skips that condition check and is a less complicated implementation. In my opinion, there might be merit in applying the fix in both eviction_manager.go and pod_workers.go. What do you think?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zcahana

zcahana commented Dec 12, 2023

In my opinion, there might be merit in applying the fix in both eviction_manager.go and pod_workers.go. What do you think?

Applying the fix to pod_workers.go will have the added benefit of solving this for force deleted pods, which too are stopped with their full terminationGracePeriodSeconds instead of "immediately" (see #108741, as well as this comment in the original issue solved by this PR). While not explicitly in-scope for this PR, this could catch 2 (related) birds.

@tuibeovince

Applying the fix to pod_workers.go will have the added benefit of solving this for force deleted pods, which too are stopped with their full terminationGracePeriodSeconds instead of "immediately" (see #108741, as well as this comment in the original issue solved by this PR). While not explicitly in-scope for this PR, this could catch 2 (related) birds.

I share the same sentiments; a fix in pod_workers.go may well be sufficient. Still, as stated in the issue this PR solves, the convention around "immediate" deletion, the documentation, and so on must also be addressed with equal priority.

Sorry, but I am still unfamiliar with the process of redefining conventions (I would love to know).

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2024
@tuibeovince

There is a PR that I am also looking into that addresses the problem entirely on the pod_workers.go end (#120451) This might be a point of interest or a place for further discussion about this fix.

@tuibeovince

@Seaiii Thank you. I will look into your PR as well.

@Seaiii

Seaiii commented Jan 12, 2024

@Seaiii Thank you. I will look into your PR as well.

Oh, sorry, I misunderstood: you were talking about forcing the first deletion to use a grace period of 0. The issue I mentioned earlier no longer allows a first-time deletion with a value of 0, so I closed it.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

SIG Node PR Triage automation moved this from Needs Approver to Done Feb 11, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Feb 13, 2024
SIG Node PR Triage automation moved this from Done to Triage Feb 13, 2024
@k8s-ci-robot
Contributor

@rphillips: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bart0sh
Contributor

bart0sh commented Feb 13, 2024

/remove-lifecycle rotten

@claassen please, rebase the PR, thanks.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 13, 2024
@bart0sh bart0sh moved this from Triage to Waiting on Author in SIG Node PR Triage Feb 13, 2024
@olyazavr
Contributor

👋 We've encountered this bug in our setup, and would be interested in seeing this merged in

@olyazavr
Contributor

I've rebased this PR here: #124063

@bart0sh
Contributor

bart0sh commented Mar 29, 2024

/close
in favor of #124063

@k8s-ci-robot
Contributor

@bart0sh: Closed this PR.

In response to this:

/close
in favor of #124063

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node PR Triage automation moved this from Waiting on Author to Done Mar 29, 2024
Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

kubelet: Evicted and force deleted pods get their full termination grace period when they should not