re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

olyazavr · 2024-03-27T19:12:30Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When kubelet detects that it's under resource pressure, it first attempts to do soft evictions, until the hard eviction threshold is reached. When a pod is soft-evicted, it respects the configured max pod grace period seconds, and until the pod has shut down, kubelet will not attempt to soft OR hard evict another pod, even if the hard eviction threshold is reached.

As a result, one pod taking a long time to shut down can cause kubelet to run out of resources. From this comment and this comment this behavior seems to be by design

In our case, we saw one soft eviction take 7 hours to complete, and meanwhile, resources usage kept climbing without any automation trying to save the node. Had other pods gotten soft evicted while this pod shut down, this would not be an issue. Manual intervention prevented it from reaching hard-eviction thresholds, but had that not happened, this would have entirely exhausted the node with no automated action.

This fixes that by:

No longer waiting on soft evictions to complete. There is a semaphore to prevent concurrent soft evictions, but the eviction loop will resume after one soft eviction starts, so as to catch when we reach the hard eviction threshold
If we reach the hard eviction threshold, evict the pod that is the worst offender- this is can be a pod that has been previously soft evicted but is taking a while to shut down

This also changes the ordering of thresholds. Previously, it was ordering thresholds by prioritizing memory first, and thresholds with no resource to reclaim last, and now it also considers whether or not the threshold is a hard threshold. This is needed so that when something has resource pressure, it will first consider the hard threshold - because we want to immediately hard evict once we've reached that point.

Which issue(s) this PR fixes:

Fixes #123872

Special notes for your reviewer:

Does this PR introduce a user-facing change?

If a node is under resource pressure, and it reaches the hard eviction threshold, it can now re-evict a pod that was previously soft evicted if it is taking too long to shut down gracefully

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

linux-foundation-easycla · 2024-03-27T19:12:33Z

The committers listed above are authorized under a signed CLA.

✅ login: olyazavr / name: Olga Shestopalova (f5170b9, 1cbc3d5, 1426e5a, 8bb7897, 957424b, 63406d8, 41a37b1, 0570f93, 1b9af62, 6a65e03)

k8s-ci-robot · 2024-03-27T19:12:34Z

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Mar 27 14:09:58 UTC 2024.

k8s-ci-robot · 2024-03-27T19:12:39Z

Hi @olyazavr. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ffromani

/ok-to-test

@kannon92 howdy! IIRC you have experience with eviction code, could you PTAL and/or suggest reviewers? thanks a ton!

ffromani · 2024-03-28T14:11:31Z

pkg/kubelet/eviction/eviction_manager.go

@@ -102,6 +102,8 @@ type managerImpl struct {
 	thresholdsLastUpdated time.Time
 	// whether can support local storage capacity isolation
 	localStorageCapacityIsolation bool
+	// podsEvicted is the set of pods (format: namespace/pod: bool) that we have already tried to evict
+	podsEvicted map[string]bool


should we track pods by their UID or by their namespace/name pair?
In addition, do we want to use a set (k8s libs: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/sets) ?

kannon92 · 2024-03-28T14:19:21Z

This is a pretty big change on the design of eviction manager.

To start,

In our case, we saw one soft eviction take 7 hours to complete, and meanwhile, resources usage kept climbing without any automation trying to save the node.

Why would eviction takes 7 hours? It seems the grace period for a soft eviction is set this high.

What kind of pressure did you experience?

kannon92 · 2024-03-28T14:21:07Z

/remove-kind bug
/kind feature

olyazavr · 2024-03-28T14:30:19Z

Why would eviction takes 7 hours? It seems the grace period for a soft eviction is set this high.

Our grace period is set this high intentionally, we have some prestop hooks that take a long time to run, and unless it's an emergency, we would like the prestop hooks to run to completion.

What kind of pressure did you experience?

Disk pressure

kannon92 · 2024-03-28T14:40:29Z

@derekwaynecarr could you weigh in here?

kannon92 · 2024-03-28T15:18:24Z

cc @haircommander @smarterclayton @mrunalp @rphillips

iholder101 · 2024-05-12T11:24:02Z

pkg/kubelet/eviction/eviction_manager.go

+				klog.InfoS("Eviction manager: soft eviction already in progress, will not soft evict another pod")
+				return nil, nil
+			}
+			m.softEvictionSemaphore <- true


If a pod is currently being evicted, i.e. the semaphore is not empty, this will be blocking, which I think is not the desired behavior. For example, it can lead to all of the controller's threads to get stuck here hence when a hard eviction threshold will be met the controller wouldn't be able to address it. Do you agree?

Is there a reason to use a semaphore here?
Perhaps we can use a simple lock instead, and if we see that a soft eviction threshold is taking place and hard thresholds aren't met - just return.

Does that make sense? Am I missing something?

Oh yeah, a simple lock does sound better, let me do that

olyazavr · 2024-05-13T12:51:48Z

I had to make a change to the ordering of thresholds. Previously, it was ordering thresholds by prioritizing memory first, and thresholds with no resource to reclaim last, and now it also considers whether or not the threshold is a hard threshold. This is needed so that when something has resource pressure, it will first consider the hard threshold - because we want to immediately hard evict once we've reached that point.

olyazavr · 2024-05-13T13:49:07Z

/retest

iholder101 · 2024-05-15T11:34:16Z

pkg/kubelet/eviction/eviction_manager_test.go

+func (m *mockPodKiller) killPodNow(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {
+	_ = m.killPodNowLongShutdown(pod, evict, gracePeriodOverride, lock, statusFn)
+	if lock != nil {
+		lock.Unlock()
 	}
 	return nil
 }

-// killPodNowNoSemaphoreRelease records the pod that was killed, and does not release the semaphore, simulating a long pod shutdown
-func (m *mockPodKiller) killPodNowNoSemaphoreRelease(pod *v1.Pod, evict bool, gracePeriodOverride *int64, semaphore chan bool, statusFn func(*v1.PodStatus)) error {
+// killPodNowLongShutdown records the pod that was killed, and does not unlock the lock, simulating a long pod shutdown
+func (m *mockPodKiller) killPodNowLongShutdown(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {


In the code itself, eviction.NewManager() is always being called with killPodFunc: killPodNowAsync(klet.podWorkers, kubeDeps.Recorder), so I wonder if it's necessary.

I think that in the test you should provide the same function (i.e. killPodNowAsync) as an argument for the manager instead.

WDYT? Am I missing something?

Hmmm, so that will mean re-working the tests so that they don't have this mockPodKiller and instead have mocked out podworkers that record the same info (what was evicted, grace period, etc). Is that what you mean?

PodWorkers are in the kubelet package, which causes a circular dependency. I can refactor all of this, but that would be a huge change

pkg/kubelet/pod_workers.go

iholder101 · 2024-05-15T12:00:52Z

I had to make a change to the ordering of thresholds. Previously, it was ordering thresholds by prioritizing memory first, and thresholds with no resource to reclaim last, and now it also considers whether or not the threshold is a hard threshold. This is needed so that when something has resource pressure, it will first consider the hard threshold - because we want to immediately hard evict once we've reached that point.

Ehum.
TBH I'm not entirely sure that's desired. Perhaps memory is so essential that we always want to address it first?

I tend to think that this is somewhat out of scope for this PR. Can we leave this discussion to a follow up? (please cc me there!)

olyazavr · 2024-05-15T13:31:37Z

TBH I'm not entirely sure that's desired. Perhaps memory is so essential that we always want to address it first?

The memory ordering is still preserved. The only difference is that on top of the existing ordering, it will also then prefer thresholds with a hard eviction grace period (0). I can add more testing here

This is necessary because as I found out when running the test I wrote- it's nondeteministic otherwise.

Say you have a threshold A for having 1Gi memory left with grace period 5m, and then threshold B for having 500Mi memory left with grace period 0. We start out having 1Gi memory left, so threshold A would be triggered, and a pod would be soft-evicted. Then, if the pod is taking a while to evict and meanwhile we're down to < 500Mi memory, it'll sometimes select threshold A (and not do anything since it is a soft threshold and we already have a pod evicting) OR it'll select threshold B (the desired outcome- it will hard evict that pod)

olyazavr · 2024-05-15T17:28:13Z

/retest

Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

k8s-ci-robot · 2024-05-15T23:53:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: olyazavr
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

olyazavr · 2024-05-16T12:51:50Z

/retest

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Mar 27, 2024

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 27, 2024

k8s-ci-robot requested review from krmayankk and odinuge March 27, 2024 19:12

olyazavr force-pushed the soft-evict-timeout branch from a3e7fac to 6b3cb26 Compare March 27, 2024 19:12

ffromani reviewed Mar 28, 2024

View reviewed changes

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 28, 2024

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 28, 2024

olyazavr force-pushed the soft-evict-timeout branch from 5b52b3b to 471d9b4 Compare March 28, 2024 14:30

iholder101 reviewed May 12, 2024

View reviewed changes

bart0sh mentioned this pull request May 15, 2024

The second time the pod is deleted the grace period does not take effect #113883

Open

iholder101 reviewed May 15, 2024

View reviewed changes

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2024

olyazavr force-pushed the soft-evict-timeout branch from 3d46d62 to e1c2f2a Compare May 15, 2024 15:15

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 15, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2024

olyazavr added 9 commits May 15, 2024 19:42

if a pod is taking too long to evict, move on to others

1cbc3d5

Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>

use uids instead of namespace/name, use actual set

f5170b9

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

fix logging

63406d8

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

create helper manager impl function for tests

0570f93

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

pass in args to newManagerImpl

1426e5a

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

kill func nonblocking, use semaphore to not allow concurrent soft evicts

957424b

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

use a lock instead of a channel semaphore

6a65e03

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

order thresholds by whether or not it is a hard threshold

8bb7897

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

add test, clean up

41a37b1

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

olyazavr force-pushed the soft-evict-timeout branch from e1c2f2a to 41a37b1 Compare May 15, 2024 23:51

k8s-ci-robot added area/code-generation sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 15, 2024

undo accidental add

1b9af62

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

olyazavr commented Mar 27, 2024 •

edited

linux-foundation-easycla bot commented Mar 27, 2024 •

edited

k8s-ci-robot commented Mar 27, 2024

k8s-ci-robot commented Mar 27, 2024

ffromani left a comment

ffromani Mar 28, 2024

kannon92 commented Mar 28, 2024

kannon92 commented Mar 28, 2024

olyazavr commented Mar 28, 2024

kannon92 commented Mar 28, 2024

kannon92 commented Mar 28, 2024

iholder101 May 12, 2024

olyazavr May 12, 2024

olyazavr commented May 13, 2024

olyazavr commented May 13, 2024

iholder101 May 15, 2024

olyazavr May 15, 2024

olyazavr May 15, 2024 •

edited

iholder101 commented May 15, 2024

olyazavr commented May 15, 2024

olyazavr commented May 15, 2024

k8s-ci-robot commented May 15, 2024

olyazavr commented May 16, 2024

re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

Are you sure you want to change the base?

re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

Conversation

olyazavr commented Mar 27, 2024 • edited

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

linux-foundation-easycla bot commented Mar 27, 2024 • edited

k8s-ci-robot commented Mar 27, 2024

k8s-ci-robot commented Mar 27, 2024

ffromani left a comment

Choose a reason for hiding this comment

ffromani Mar 28, 2024

Choose a reason for hiding this comment

kannon92 commented Mar 28, 2024

kannon92 commented Mar 28, 2024

olyazavr commented Mar 28, 2024

kannon92 commented Mar 28, 2024

kannon92 commented Mar 28, 2024

iholder101 May 12, 2024

Choose a reason for hiding this comment

olyazavr May 12, 2024

Choose a reason for hiding this comment

olyazavr commented May 13, 2024

olyazavr commented May 13, 2024

iholder101 May 15, 2024

Choose a reason for hiding this comment

olyazavr May 15, 2024

Choose a reason for hiding this comment

olyazavr May 15, 2024 • edited

Choose a reason for hiding this comment

iholder101 commented May 15, 2024

olyazavr commented May 15, 2024

olyazavr commented May 15, 2024

k8s-ci-robot commented May 15, 2024

olyazavr commented May 16, 2024

olyazavr commented Mar 27, 2024 •

edited

linux-foundation-easycla bot commented Mar 27, 2024 •

edited

olyazavr May 15, 2024 •

edited