
re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081

Open
wants to merge 10 commits into base: master

Conversation

olyazavr
Contributor

@olyazavr olyazavr commented Mar 27, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When kubelet detects that it is under resource pressure, it first attempts soft evictions, until the hard eviction threshold is reached. When a pod is soft-evicted, kubelet respects the configured max pod grace period, and until that pod has shut down, kubelet will not attempt to soft- or hard-evict another pod, even if the hard eviction threshold is reached.

As a result, a single pod taking a long time to shut down can cause kubelet to run out of resources. Judging from this comment and this comment, this behavior seems to be by design.

In our case, we saw one soft eviction take 7 hours to complete; meanwhile, resource usage kept climbing without any automation trying to save the node. Had other pods been soft-evicted while this pod shut down, this would not have been an issue. Manual intervention prevented the node from reaching its hard-eviction thresholds, but had that not happened, the node would have been entirely exhausted with no automated action.
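
For context, a minimal sketch of the kind of configuration that produces this situation, using the KubeletConfiguration type from k8s.io/kubelet/config/v1beta1 (the thresholds and grace periods here are illustrative, not our exact values):

```go
package example

import (
	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

// longGraceEvictionConfig sketches a node where a soft eviction can run
// for hours: once nodefs.available drops below 15%, one pod is evicted
// with up to 8h of graceful shutdown, and under the current behavior
// kubelet evicts nothing else until that pod exits, even after crossing
// the 5% hard threshold.
func longGraceEvictionConfig() kubeletconfig.KubeletConfiguration {
	return kubeletconfig.KubeletConfiguration{
		EvictionSoft:              map[string]string{"nodefs.available": "15%"},
		EvictionSoftGracePeriod:   map[string]string{"nodefs.available": "2m"},
		EvictionHard:              map[string]string{"nodefs.available": "5%"},
		EvictionMaxPodGracePeriod: 28800, // 8h; honored for soft evictions
	}
}
```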

This fixes that by:

  1. No longer waiting on soft evictions to complete. There is a semaphore to prevent concurrent soft evictions, but the eviction loop resumes after one soft eviction starts, so that it can catch the moment the hard eviction threshold is reached (sketched below).
  2. If we reach the hard eviction threshold, evict the pod that is the worst offender; this can be a pod that was previously soft-evicted but is taking a while to shut down.
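
A self-contained sketch of the intended control flow, with toy types (names and helpers are illustrative, not the exact code in this PR):

```go
package main

import (
	"fmt"
	"time"
)

type pod struct{ name string }

type manager struct {
	softEvictionSemaphore chan bool // capacity 1: at most one soft eviction at a time
	maxPodGracePeriod     time.Duration
}

func (m *manager) evict(p pod, grace time.Duration) {
	fmt.Printf("evicting %s with grace period %s\n", p.name, grace)
}

// synchronize is one pass of the eviction loop for the worst-offending pod:
// a hard threshold evicts immediately (even a pod already mid soft-eviction),
// while a soft threshold starts at most one eviction without blocking the loop.
func (m *manager) synchronize(worst pod, hardThresholdMet bool) {
	if hardThresholdMet {
		// Re-evict with zero grace, even if worst was previously
		// soft-evicted and is taking a long time to shut down.
		m.evict(worst, 0)
		return
	}
	select {
	case m.softEvictionSemaphore <- true:
		go func() {
			defer func() { <-m.softEvictionSemaphore }()
			m.evict(worst, m.maxPodGracePeriod)
		}()
	default:
		// A soft eviction is already in progress; return so the next
		// pass can still notice a crossed hard threshold.
	}
}

func main() {
	m := &manager{softEvictionSemaphore: make(chan bool, 1), maxPodGracePeriod: 5 * time.Minute}
	m.synchronize(pod{name: "slow-pod"}, false) // soft path: starts in background
	m.synchronize(pod{name: "slow-pod"}, true)  // hard path: immediate re-evict
	time.Sleep(50 * time.Millisecond)           // let the background eviction log
}
```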

This also changes the ordering of thresholds. Previously, thresholds were ordered by prioritizing memory first and thresholds with no resource to reclaim last; now the ordering also considers whether the threshold is a hard threshold. This is needed so that, when something is under resource pressure, the hard threshold is considered first, because we want to hard-evict immediately once we've reached that point.
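
A minimal sketch of the adjusted ordering, again with toy types (the real comparator in pkg/kubelet/eviction also handles the no-resource-to-reclaim case; this only shows the hard-first tiebreak layered onto memory-first):

```go
package eviction

import (
	"sort"
	"time"
)

type threshold struct {
	signal      string        // e.g. "memory.available", "nodefs.available"
	gracePeriod time.Duration // 0 means a hard threshold
}

// sortThresholds keeps the existing memory-first preference, and within
// the same preference class puts hard thresholds (grace period 0) ahead
// of soft ones, so a crossed hard threshold is acted on first.
func sortThresholds(ts []threshold) {
	sort.SliceStable(ts, func(i, j int) bool {
		im := ts[i].signal == "memory.available"
		jm := ts[j].signal == "memory.available"
		if im != jm {
			return im // memory first, as before
		}
		return ts[i].gracePeriod < ts[j].gracePeriod // hard (0) before soft
	})
}
```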

Which issue(s) this PR fixes:

Fixes #123872

Special notes for your reviewer:

Does this PR introduce a user-facing change?

If a node is under resource pressure and reaches the hard eviction threshold, kubelet can now re-evict a pod that was previously soft-evicted if it is taking too long to shut down gracefully

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Mar 27, 2024

linux-foundation-easycla bot commented Mar 27, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Mar 27 14:09:58 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 27, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @olyazavr. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 27, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 27, 2024
Contributor

@ffromani ffromani left a comment


/ok-to-test

@kannon92 howdy! IIRC you have experience with eviction code, could you PTAL and/or suggest reviewers? thanks a ton!

```diff
@@ -102,6 +102,8 @@ type managerImpl struct {
 	thresholdsLastUpdated time.Time
 	// whether can support local storage capacity isolation
 	localStorageCapacityIsolation bool
+	// podsEvicted is the set of pods (format: namespace/pod: bool) that we have already tried to evict
+	podsEvicted map[string]bool
```
Contributor


should we track pods by their UID or by their namespace/name pair?
In addition, do we want to use a set (k8s libs: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/sets)?
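
For reference, a small sketch of what UID-keyed tracking with the apimachinery sets package could look like (podsEvicted kept from the diff above; the helper names are hypothetical):

```go
package eviction

import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
)

// podsEvicted tracks pods we have already tried to evict, keyed by UID so
// that a recreated pod with the same namespace/name is not mistaken for
// the old one.
var podsEvicted = sets.New[types.UID]()

func markEvicted(uid types.UID)         { podsEvicted.Insert(uid) }
func alreadyEvicted(uid types.UID) bool { return podsEvicted.Has(uid) }
```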

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 28, 2024
@kannon92
Contributor

This is a pretty big change to the design of the eviction manager.

To start,

> In our case, we saw one soft eviction take 7 hours to complete; meanwhile, resource usage kept climbing without any automation trying to save the node.

Why would an eviction take 7 hours? It seems the grace period for soft evictions is set this high.

What kind of pressure did you experience?

@kannon92
Contributor

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 28, 2024
@olyazavr
Contributor Author

> Why would an eviction take 7 hours? It seems the grace period for soft evictions is set this high.

Our grace period is intentionally set this high: we have some preStop hooks that take a long time to run, and unless it's an emergency, we would like the preStop hooks to run to completion.

> What kind of pressure did you experience?

Disk pressure

@kannon92
Contributor

@derekwaynecarr could you weigh in here?

@kannon92
Contributor

```go
		klog.InfoS("Eviction manager: soft eviction already in progress, will not soft evict another pod")
		return nil, nil
	}
	m.softEvictionSemaphore <- true
```
Contributor


If a pod is currently being evicted, i.e. the semaphore is not empty, this will block, which I think is not the desired behavior. For example, it can lead to all of the controller's threads getting stuck here, so that when a hard eviction threshold is met the controller won't be able to address it. Do you agree?

Is there a reason to use a semaphore here?
Perhaps we can use a simple lock instead, and if we see that a soft eviction is already taking place and hard thresholds aren't met, just return.

Does that make sense? Am I missing something?
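
A sketch of the lock-based shape this suggests, using sync.Mutex.TryLock (available since Go 1.18); the names are illustrative, not the PR's actual code:

```go
package eviction

import "sync"

type manager struct {
	softEvictionInProgress sync.Mutex
}

// maybeSoftEvict starts evict in the background unless another soft
// eviction is still shutting its pod down, in which case it returns
// immediately instead of blocking the eviction loop.
func (m *manager) maybeSoftEvict(evict func()) {
	if !m.softEvictionInProgress.TryLock() {
		return // soft eviction already in progress; let the loop continue
	}
	go func() {
		defer m.softEvictionInProgress.Unlock()
		evict()
	}()
}
```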

Contributor Author


Oh yeah, a simple lock does sound better, let me do that

@olyazavr
Contributor Author

I had to make a change to the ordering of thresholds. Previously, thresholds were ordered by prioritizing memory first and thresholds with no resource to reclaim last; now the ordering also considers whether the threshold is a hard threshold. This is needed so that, when something is under resource pressure, the hard threshold is considered first, because we want to hard-evict immediately once we've reached that point.

@olyazavr
Contributor Author

/retest

Comment on lines 63 to 72

```diff
 func (m *mockPodKiller) killPodNow(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {
 	_ = m.killPodNowLongShutdown(pod, evict, gracePeriodOverride, lock, statusFn)
 	if lock != nil {
 		lock.Unlock()
 	}
 	return nil
 }

-// killPodNowNoSemaphoreRelease records the pod that was killed, and does not release the semaphore, simulating a long pod shutdown
-func (m *mockPodKiller) killPodNowNoSemaphoreRelease(pod *v1.Pod, evict bool, gracePeriodOverride *int64, semaphore chan bool, statusFn func(*v1.PodStatus)) error {
+// killPodNowLongShutdown records the pod that was killed, and does not unlock the lock, simulating a long pod shutdown
+func (m *mockPodKiller) killPodNowLongShutdown(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {
```
Contributor


In the code itself, eviction.NewManager() is always called with killPodFunc: killPodNowAsync(klet.podWorkers, kubeDeps.Recorder), so I wonder whether this is necessary.

I think that in the test you should provide the same function (i.e. killPodNowAsync) as an argument for the manager instead.

WDYT? Am I missing something?

Contributor Author


Hmmm, so that would mean reworking the tests so that they don't have this mockPodKiller, and instead have mocked-out pod workers that record the same info (what was evicted, grace period, etc.). Is that what you mean?

Contributor Author

@olyazavr olyazavr May 15, 2024


PodWorkers are in the kubelet package, which causes a circular dependency. I can refactor all of this, but that would be a huge change

pkg/kubelet/pod_workers.go (comment thread resolved)
@iholder101
Contributor

> I had to make a change to the ordering of thresholds. Previously, thresholds were ordered by prioritizing memory first and thresholds with no resource to reclaim last; now the ordering also considers whether the threshold is a hard threshold. This is needed so that, when something is under resource pressure, the hard threshold is considered first, because we want to hard-evict immediately once we've reached that point.

Ehum.
TBH I'm not entirely sure that's desired. Perhaps memory is so essential that we always want to address it first?

I tend to think this is somewhat out of scope for this PR. Can we leave this discussion to a follow-up? (please cc me there!)

@olyazavr
Contributor Author

> TBH I'm not entirely sure that's desired. Perhaps memory is so essential that we always want to address it first?

The memory ordering is still preserved. The only difference is that, on top of the existing ordering, it will also prefer thresholds with a hard eviction grace period (0). I can add more testing here.

This is necessary because, as I found out when running the test I wrote, it's nondeterministic otherwise.

Say you have threshold A for having 1Gi of memory left with a grace period of 5m, and threshold B for having 500Mi of memory left with a grace period of 0. We start out with 1Gi of memory left, so threshold A is triggered and a pod is soft-evicted. Then, if that pod takes a while to evict and meanwhile we drop below 500Mi of memory, the manager will sometimes select threshold A (and do nothing, since it is a soft threshold and we already have a pod evicting) and sometimes select threshold B (the desired outcome: it will hard-evict that pod).
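
Reusing the toy threshold type and sortThresholds from the comparator sketch earlier in this thread, the scenario becomes deterministic:

```go
func demoOrdering() {
	// A fires at 1Gi left (soft, 5m grace); B fires at 500Mi left (hard).
	// Below 500Mi both are met; hard-first ordering guarantees B is
	// considered before A, so the manager hard-evicts instead of skipping.
	thresholds := []threshold{
		{signal: "memory.available", gracePeriod: 5 * time.Minute}, // A
		{signal: "memory.available", gracePeriod: 0},               // B
	}
	sortThresholds(thresholds)
	// thresholds[0] is now B (grace period 0), deterministically.
}
```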

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 15, 2024
@olyazavr
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2024
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
@k8s-ci-robot k8s-ci-robot added area/code-generation sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 15, 2024
Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: olyazavr
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@olyazavr
Copy link
Contributor Author

/retest

Labels
area/code-generation area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
SIG Node PR Triage
Needs Reviewer
Development

Successfully merging this pull request may close these issues.

Soft eviction of pods with long grace periods blocks hard evictions when under resource pressure
7 participants