
Added EvictSystemCriticalPods flag to descheduler #523

Merged 1 commit into kubernetes-sigs:master on Mar 30, 2021

Conversation

@RyanDevlin (Contributor):

Fixes #378

The evictSystemCriticalPods flag disables priority checking by the descheduler. When this flag is true, pods of any priority are evicted, including system pods like kube-dns. DaemonSet pods, mirror pods, static pods, and pods without owner references are still not evicted when this flag is true. If thresholdPriority or thresholdPriorityClassName is set and evictSystemCriticalPods is true, threshold priority filtering is disabled.
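As a sketch of how the flag might be set (field placement assumed from the v1alpha1 DeschedulerPolicy; the strategy and values are illustrative, not from this PR):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# When true, priority checking is disabled and even system-critical
# pods (e.g. kube-dns) become eviction candidates.
evictSystemCriticalPods: true
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        # Illustrative value: evict pods older than one day.
        maxPodLifeTimeSeconds: 86400
```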

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2021
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 10, 2021
@k8s-ci-robot:

Hi @RyanDevlin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 10, 2021
@RyanDevlin (Author):

/assign @damemi

@damemi (Contributor) left a comment:

/ok-to-test
looks all good to me, just squash those extra commits :)

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 10, 2021
Review threads (resolved): pkg/utils/pod.go, pkg/descheduler/evictions/evictions.go, pkg/api/types.go
@ingvagabund (Contributor):

#523 (comment)

@RyanDevlin force-pushed the evict-critical branch 2 times, most recently from 3ce4aa8 to 44ade23 on March 11, 2021 at 14:37
sort.Strings(initialPodNames)
t.Logf("Existing pods: %v", initialPodNames)

t.Logf("set the strategy to delete pods of any priority from %v namespace", namespace)
A contributor left a review comment:
I'd prefer to delete a specific pod instead of deleting all pods in kube-system. E.g. create a fresh pod in kube-system with system critical class only which gets deleted.

@RyanDevlin (Author) replied:
Because almost all the pods in kube-system are either DaemonSet pods or static pods, the only pods deleted are the DNS pods. I had some trouble writing the test at first because I tried to create my own system-critical pods, but the API server returns an error when you try to create a pod with a priority greater than 1 billion (1000000000). That's why, to test the ability to evict pods with priority greater than 1000000000, I settled on deleting the DNS pods in kube-system.

If you scroll to the right on the line below you can see I'm only looking for the kube-dns pods to delete.

400    podList, err := clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labels.SelectorFromSet(map[string]string{"k8s-app": "kube-dns"}).String()})

Looking at line 409, I do see now that the log text should be changed. Other than that, do you think the test is okay? Or should I still try to create some sort of test pod?

Review threads (resolved): test/e2e/e2e_test.go (3 threads)
@RyanDevlin (Author):

@ingvagabund I've adjusted my e2e tests to your specifications. All of my tests now reside in the testEvictSystemCritical function. I found I'm able to mock system critical pods with the system-node-critical priority class. The priority on this class is the highest possible, so it should be sufficient for the tests.

I've also spent a lot of time testing and watching the pods be created by the e2e tests, and I've come up with a more elegant solution to the pending pods issue than using time.sleep(). By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending. This should solve the problem of having to rely on syncing the scheduling of all pending pods with the one shot call to runPodLifetimeStrategy in the tests.
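A minimal sketch of such a mock system-critical pod (the name and image here are illustrative, not taken from the PR):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mock-system-critical
  namespace: kube-system
spec:
  # system-node-critical is the highest built-in priority class
  # (priority 2000001000), above the 1-billion cap that applies to
  # user-defined priority classes.
  priorityClassName: system-node-critical
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.2
```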

@damemi (Contributor) commented Mar 22, 2021:

By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending. This should solve the problem of having to rely on syncing the scheduling of all pending pods with the one shot call to runPodLifetimeStrategy in the tests.

Would it be easier to just poll the pods' statuses rather than running the strategy repeatedly? I'm thinking that would use fewer resources.

@ingvagabund (Contributor):

By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending.

This may actually hide problems, since the descheduler runs each strategy only once per interval (e.g. every hour). We cannot assume running runPodLifetimeStrategy repeatedly is idempotent.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2021
@ingvagabund (Contributor):

@RyanDevlin can you integrate 2b286d9 in your PR and see if it helps to avoid running runPodLifetimeStrategy multiple times?

@RyanDevlin (Author):

@ingvagabund Not a problem! That looks like a good solution. I'm hoping to complete that work today.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2021
@RyanDevlin RyanDevlin changed the title Added EvictSystemCriticalPods flag to descheduler [WIP] Added EvictSystemCriticalPods flag to descheduler Mar 30, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2021
@RyanDevlin (Author):

@ingvagabund @damemi Thank you both for your suggestions. I've added @ingvagabund's elegant polling solution from 2b286d9 into my PR. Apologies in advance for the wall of text, but I wanted to provide some context to these changes.

I spent a lot more time debugging these e2e tests. The solution from 2b286d9 didn't work reliably for me. After further testing I discovered that every once in a while a single RC pod would hang around: 1/1 READY, with a state of Terminating. This happened randomly but consistently, every 4 or so runs of the full test suite. I believe the cause was related to the default terminationGracePeriodSeconds of 30 seconds on the RC pods, although I admit I can't figure out why that caused this issue. The wait.PollImmediate loop from 2b286d9 has a timeout of 60 seconds, so it's a bit confusing how this was occurring.

I began to address this by setting the gracePeriod parameter of the RcByNameContainer() function to 15 seconds. Before my change, all calls to RcByNameContainer() passed nil for the gracePeriod parameter. It took a while, but eventually I noticed that my gracePeriod setting wasn't being applied to the pods, and each pod still had a terminationGracePeriodSeconds of 30 seconds.

This was when I discovered that the RcByNameContainer() function already had a zeroGracePeriod default built into it. This variable would be set to zero if the caller of RcByNameContainer() sets gracePeriod to nil. This functionality was created way back in #31, but was never properly hooked up to the creation of the podSpec. So I went ahead and reset all my gracePeriod parameters to nil and instead added this zero default functionality to the podSpec in MakePodSpec().

So currently the terminationGracePeriodSeconds for all pods created in these tests is zero. If a future test needs to override this, it can simply pass something other than nil as the gracePeriod parameter to RcByNameContainer().

@RyanDevlin RyanDevlin changed the title [WIP] Added EvictSystemCriticalPods flag to descheduler Added EvictSystemCriticalPods flag to descheduler Mar 30, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2021
@ingvagabund (Contributor):

/lgtm

@RyanDevlin thanks for the thorough investigation. The PR looks great!!!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2021
@RyanDevlin (Author):

@ingvagabund Thanks for all the help! This was my first PR, I learned a lot.

@damemi (Contributor) left a comment:
/approve
Thanks @RyanDevlin !

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, RyanDevlin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2021
@k8s-ci-robot k8s-ci-robot merged commit a2746d0 into kubernetes-sigs:master Mar 30, 2021
@RyanDevlin RyanDevlin deleted the evict-critical branch March 30, 2021 17:50
seanmalloy added a commit to KohlsTechnology/descheduler that referenced this pull request Apr 1, 2021
Ran "make gen" using Go 1.16.1. Some changes were merged, but "make gen"
was not run. This fixes the problem.

See below PR for reference:
kubernetes-sigs#523
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Added EvictSystemCriticalPods flag to descheduler
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Ran "make gen" using Go 1.16.1. Some changes were merged, but "make gen"
was not run. This fixes the problem.

See below PR for reference:
kubernetes-sigs#523
Successfully merging this pull request may close: Permit descheduling of critical pods (#378).

5 participants