re-evict pod if hard eviction threshold reached while pod is soft-evicting #124081
base: master
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Mar 27 14:09:58 UTC 2024.
Hi @olyazavr. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from a3e7fac to 6b3cb26
/ok-to-test
@kannon92 howdy! IIRC you have experience with eviction code, could you PTAL and/or suggest reviewers? thanks a ton!
@@ -102,6 +102,8 @@ type managerImpl struct {
 	thresholdsLastUpdated time.Time
 	// whether can support local storage capacity isolation
 	localStorageCapacityIsolation bool
+	// podsEvicted is the set of pods (format: namespace/pod: bool) that we have already tried to evict
+	podsEvicted map[string]bool
should we track pods by their UID or by their namespace/name pair?
In addition, do we want to use a set (k8s libs: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/sets)?
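As a loose illustration of that suggestion (not code from this PR), tracking pods by UID with the generic Set type from k8s.io/apimachinery/pkg/util/sets might look like this; the tracker type and method names are invented for the example:

```go
package eviction

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
)

// evictionTracker remembers pods we have already tried to evict, keyed by UID
// so that a recreated pod with the same namespace/name is not confused with
// the old one.
type evictionTracker struct {
	podsEvicted sets.Set[types.UID]
}

func newEvictionTracker() *evictionTracker {
	return &evictionTracker{podsEvicted: sets.New[types.UID]()}
}

// markEvicted records that an eviction has been initiated for this pod.
func (t *evictionTracker) markEvicted(pod *v1.Pod) {
	t.podsEvicted.Insert(pod.UID)
}

// alreadyEvicted reports whether an eviction was already initiated for this pod.
func (t *evictionTracker) alreadyEvicted(pod *v1.Pod) bool {
	return t.podsEvicted.Has(pod.UID)
}
```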
This is a pretty big change to the design of the eviction manager. To start, why would an eviction take 7 hours? It seems the grace period for a soft eviction is set this high. What kind of pressure did you experience?
/remove-kind bug
Our grace period is set this high intentionally: we have some preStop hooks that take a long time to run, and unless it's an emergency, we would like the preStop hooks to run to completion.
Disk pressure.
Force-pushed from 5b52b3b to 471d9b4
@derekwaynecarr could you weigh in here?
	klog.InfoS("Eviction manager: soft eviction already in progress, will not soft evict another pod")
	return nil, nil
}
m.softEvictionSemaphore <- true
If a pod is currently being evicted, i.e. the semaphore is not empty, this will block, which I think is not the desired behavior. For example, it can lead to all of the controller's threads getting stuck here, so when a hard eviction threshold is met the controller wouldn't be able to address it. Do you agree?
Is there a reason to use a semaphore here?
Perhaps we can use a simple lock instead, and if we see that a soft eviction is taking place and hard thresholds aren't met, just return.
Does that make sense? Am I missing something?
Oh yeah, a simple lock does sound better, let me do that
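A rough sketch of the non-blocking alternative being discussed, assuming a plain sync.Mutex guarding an "in progress" flag; the type and method names are made up for illustration and are not the PR's actual code:

```go
package eviction

import "sync"

// softEvictionState guards the "soft eviction in progress" flag with a mutex
// instead of a semaphore, so callers never block on it.
type softEvictionState struct {
	mu         sync.Mutex
	inProgress bool
}

// tryStart reports whether a new soft eviction may begin; it returns false
// (without blocking) if one is already in flight.
func (s *softEvictionState) tryStart() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.inProgress {
		return false
	}
	s.inProgress = true
	return true
}

// finish is called once the evicted pod has actually shut down.
func (s *softEvictionState) finish() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inProgress = false
}
```

With something like this, the synchronize loop could simply skip starting another soft eviction when tryStart returns false, while hard evictions proceed regardless of the flag.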
I had to make a change to the ordering of thresholds. Previously, thresholds were ordered by prioritizing memory first and putting thresholds with no resource to reclaim last; now the ordering also considers whether the threshold is a hard threshold. This is needed so that when something is under resource pressure, the hard threshold is considered first, because we want to immediately hard evict once we've reached that point.
/retest
func (m *mockPodKiller) killPodNow(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {
	_ = m.killPodNowLongShutdown(pod, evict, gracePeriodOverride, lock, statusFn)
	if lock != nil {
		lock.Unlock()
	}
	return nil
}

-// killPodNowNoSemaphoreRelease records the pod that was killed, and does not release the semaphore, simulating a long pod shutdown
-func (m *mockPodKiller) killPodNowNoSemaphoreRelease(pod *v1.Pod, evict bool, gracePeriodOverride *int64, semaphore chan bool, statusFn func(*v1.PodStatus)) error {
+// killPodNowLongShutdown records the pod that was killed, and does not unlock the lock, simulating a long pod shutdown
+func (m *mockPodKiller) killPodNowLongShutdown(pod *v1.Pod, evict bool, gracePeriodOverride *int64, lock *sync.Mutex, statusFn func(*v1.PodStatus)) error {
In the code itself, eviction.NewManager() is always being called with killPodFunc: killPodNowAsync(klet.podWorkers, kubeDeps.Recorder), so I wonder if it's necessary. I think that in the test you should provide the same function (i.e. killPodNowAsync) as an argument for the manager instead. WDYT? Am I missing something?
Hmmm, so that will mean reworking the tests so that they don't have this mockPodKiller and instead have mocked-out podWorkers that record the same info (what was evicted, grace period, etc.). Is that what you mean?
PodWorkers are in the kubelet package, which causes a circular dependency. I can refactor all of this, but that would be a huge change
Ehum. I tend to think that this is somewhat out of scope for this PR. Can we leave this discussion to a follow-up? (please cc me there!)
The memory ordering is still preserved. The only difference is that on top of the existing ordering, it will also prefer thresholds with a hard eviction grace period (0). I can add more testing here.

This is necessary because, as I found out when running the test I wrote, it's nondeterministic otherwise. Say you have a threshold A for having 1Gi memory left with grace period 5m, and a threshold B for having 500Mi memory left with grace period 0. We start out having 1Gi memory left, so threshold A is triggered and a pod is soft-evicted. Then, if the pod is taking a while to evict and meanwhile we're down to < 500Mi memory, it'll sometimes select threshold A (and not do anything, since it is a soft threshold and we already have a pod evicting) OR it'll select threshold B (the desired outcome: it will hard evict that pod).
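To make that ordering concrete, here is a small self-contained sketch of such a comparator; the threshold struct and function names are invented for illustration, and the exact tie-breaking in the real kubelet code may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// threshold is a simplified stand-in for the kubelet's eviction threshold.
type threshold struct {
	name              string
	resourceToReclaim string // "" if there is nothing to reclaim
	gracePeriodSecs   int64  // 0 means a hard threshold
}

// orderThresholds keeps the existing priorities (memory first, nothing to
// reclaim last) and, within those, prefers hard thresholds (grace period 0)
// over soft ones, so a met hard threshold is acted on immediately.
func orderThresholds(ts []threshold) {
	sort.SliceStable(ts, func(i, j int) bool {
		a, b := ts[i], ts[j]
		// memory-backed thresholds come before everything else
		if (a.resourceToReclaim == "memory") != (b.resourceToReclaim == "memory") {
			return a.resourceToReclaim == "memory"
		}
		// thresholds with nothing to reclaim go last
		if (a.resourceToReclaim == "") != (b.resourceToReclaim == "") {
			return b.resourceToReclaim == ""
		}
		// otherwise, hard thresholds (grace period 0) come first
		return a.gracePeriodSecs == 0 && b.gracePeriodSecs != 0
	})
}

func main() {
	ts := []threshold{
		{name: "A: memory.available < 1Gi, soft (5m)", resourceToReclaim: "memory", gracePeriodSecs: 300},
		{name: "B: memory.available < 500Mi, hard", resourceToReclaim: "memory", gracePeriodSecs: 0},
	}
	orderThresholds(ts)
	fmt.Println(ts[0].name) // B is considered first once both thresholds are met
}
```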
Force-pushed from 3d46d62 to e1c2f2a
/retest
Force-pushed from e1c2f2a to 41a37b1
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: olyazavr. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
What type of PR is this?
/kind bug
What this PR does / why we need it:
When kubelet detects that it's under resource pressure, it first attempts soft evictions until the hard eviction threshold is reached. When a pod is soft-evicted, kubelet respects the configured max pod grace period (in seconds), and until that pod has shut down, kubelet will not attempt to soft or hard evict another pod, even if the hard eviction threshold is reached.
As a result, one pod taking a long time to shut down can cause kubelet to run out of resources. From this comment and this comment, this behavior seems to be by design.
In our case, we saw one soft eviction take 7 hours to complete, and meanwhile, resource usage kept climbing without any automation trying to save the node. Had other pods been soft evicted while this pod shut down, this would not have been an issue. Manual intervention prevented it from reaching hard-eviction thresholds, but had that not happened, the node would have been entirely exhausted with no automated action.
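For reference, the knobs involved live in the KubeletConfiguration; an illustrative configuration (values are examples, not ours) with a soft threshold, a hard threshold, and a long max pod grace period looks roughly like this:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# soft threshold: evict using the pod's grace period, capped below
evictionSoft:
  nodefs.available: "15%"
evictionSoftGracePeriod:
  nodefs.available: "90s"
# cap on the grace period granted during a soft eviction (seconds);
# a large value here is what lets a single soft eviction run for hours
evictionMaxPodGracePeriod: 25200
# hard threshold: evict immediately, with no grace period
evictionHard:
  nodefs.available: "5%"
```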
This fixes that by re-evicting the pod (with the hard eviction grace period) if a hard eviction threshold is reached while the pod is still soft-evicting.
This also changes the ordering of thresholds. Previously, thresholds were ordered by prioritizing memory first and putting thresholds with no resource to reclaim last; now the ordering also considers whether the threshold is a hard threshold. This is needed so that when something is under resource pressure, the hard threshold is considered first, because we want to immediately hard evict once we've reached that point.
Which issue(s) this PR fixes:
Fixes #123872
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None