
Added EvictSystemCriticalPods flag to descheduler #523

Merged 1 commit into kubernetes-sigs:master on Mar 30, 2021

Conversation

@RyanDevlin (Contributor):

Fixes #378

The evictSystemCriticalPods flag disables priority checking by the descheduler. When this flag is true, pods of any priority are evicted, including system pods like kube-dns. DaemonSet pods, mirror pods, static pods, and pods without owner references are still not evicted when this flag is true. If thresholdPriority or thresholdPriorityClassName is set and evictSystemCriticalPods is true, threshold priority filtering is disabled.
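As a sketch of how the flag might be set (field placement assumed from the v1alpha1 DeschedulerPolicy; the strategy and values are illustrative, not from this PR):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# When true, priority checking is disabled and even system-critical
# pods (e.g. kube-dns) become eviction candidates.
evictSystemCriticalPods: true
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        # Illustrative value: evict pods older than one day.
        maxPodLifeTimeSeconds: 86400
```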

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2021
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 10, 2021
@k8s-ci-robot:

Hi @RyanDevlin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 10, 2021
@RyanDevlin (Author):

/assign @damemi

@damemi (Contributor) left a comment:

/ok-to-test
looks all good to me, just squash those extra commits :)

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 10, 2021
Review threads (resolved): pkg/utils/pod.go, pkg/descheduler/evictions/evictions.go, pkg/api/types.go
@ingvagabund (Contributor):

#523 (comment)

@RyanDevlin force-pushed the evict-critical branch 2 times, most recently from 3ce4aa8 to 44ade23 on March 11, 2021 at 14:37
sort.Strings(initialPodNames)
t.Logf("Existing pods: %v", initialPodNames)

t.Logf("set the strategy to delete pods of any priority from %v namespace", namespace)
A contributor left a review comment:
I'd prefer to delete a specific pod instead of deleting all pods in kube-system. E.g. create a fresh pod in kube-system with system critical class only which gets deleted.

@RyanDevlin (Author) replied:
Because almost all the pods in kube-system are either DaemonSet pods or static pods, the only pods deleted are the DNS pods. I had some trouble writing the test at first because I tried to create my own system-critical pods, but the API server returns an error when you try to create a pod with a priority greater than 1 billion (1000000000). That's why, to test the ability to evict pods with priority greater than 1000000000, I settled on deleting the DNS pods in kube-system.

If you scroll to the right on the line below you can see I'm only looking for the kube-dns pods to delete.

400    podList, err := clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labels.SelectorFromSet(map[string]string{"k8s-app": "kube-dns"}).String()})

Looking at line 409, I do see now that the log text should be changed. Other than that, do you think the test is okay? Or should I still try to create some sort of test pod?

Review threads (resolved): test/e2e/e2e_test.go (3 threads)
@RyanDevlin (Author):

@ingvagabund I've adjusted my e2e tests to your specifications. All of my tests now reside in the testEvictSystemCritical function. I found I'm able to mock system critical pods with the system-node-critical priority class. The priority on this class is the highest possible, so it should be sufficient for the tests.

I've also spent a lot of time testing and watching the pods be created by the e2e tests, and I've come up with a more elegant solution to the pending pods issue than using time.sleep(). By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending. This should solve the problem of having to rely on syncing the scheduling of all pending pods with the one shot call to runPodLifetimeStrategy in the tests.
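A minimal sketch of such a mock system-critical pod (the name and image here are illustrative, not taken from the PR):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mock-system-critical
  namespace: kube-system
spec:
  # system-node-critical is the highest built-in priority class
  # (priority 2000001000), above the 1-billion cap that applies to
  # user-defined priority classes.
  priorityClassName: system-node-critical
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.2
```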

@damemi (Contributor) commented Mar 22, 2021:

By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending. This should solve the problem of having to rely on syncing the scheduling of all pending pods with the one shot call to runPodLifetimeStrategy in the tests.

Would it be easier to just poll the pods' statuses rather than running the strategy repeatedly? I'm thinking that would use fewer resources.

@ingvagabund (Contributor):

By calling runPodLifetimeStrategy inside the polling loop on the tests, I run the descheduler strategy repeatedly until the pods are no longer pending.

This may actually hide problems, since the descheduler runs each strategy only once per interval (e.g. every hour). We cannot assume running runPodLifetimeStrategy repeatedly is idempotent.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2021
@ingvagabund (Contributor):

@RyanDevlin can you integrate 2b286d9 in your PR and see if it helps to avoid running runPodLifetimeStrategy multiple times?

@RyanDevlin (Author):

@ingvagabund Not a problem! That looks like a good solution. I'm hoping to complete that work today.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2021
@RyanDevlin RyanDevlin changed the title Added EvictSystemCriticalPods flag to descheduler [WIP] Added EvictSystemCriticalPods flag to descheduler Mar 30, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2021
@RyanDevlin (Author):

@ingvagabund @damemi Thank you both for your suggestions. I've added @ingvagabund's elegant polling solution from 2b286d9 into my PR. Apologies in advance for the wall of text, but I wanted to provide some context to these changes.

I spent a lot more time debugging these e2e tests. The solution from 2b286d9 didn't work reliably for me. After further testing I discovered that every once in a while a single RC pod would hang around: 1/1 READY, with a state of Terminating. This happened randomly but consistently, every 4 or so runs of the full test suite. I believe the cause was related to the default terminationGracePeriodSeconds of 30 seconds on the RC pods, although I admit I can't figure out why that caused this issue. The wait.PollImmediate loop from 2b286d9 has a timeout of 60 seconds, so it's a bit confusing how this was occurring.

I began to address this by setting the gracePeriod parameter of the RcByNameContainer() function to 15 seconds. Before my change, all calls to RcByNameContainer() passed nil for the gracePeriod parameter. It took a while, but eventually I noticed that my gracePeriod setting wasn't being applied to the pods, and each pod still had a terminationGracePeriodSeconds of 30 seconds.

This was when I discovered that the RcByNameContainer() function already had a zeroGracePeriod default built into it. This variable would be set to zero if the caller of RcByNameContainer() sets gracePeriod to nil. This functionality was created way back in #31, but was never properly hooked up to the creation of the podSpec. So I went ahead and reset all my gracePeriod parameters to nil and instead added this zero default functionality to the podSpec in MakePodSpec().

So currently the terminationGracePeriodSeconds for all pods created in these tests is zero. If a future test needs to override this, it can simply pass something other than nil as the gracePeriod parameter to RcByNameContainer().

@RyanDevlin RyanDevlin changed the title [WIP] Added EvictSystemCriticalPods flag to descheduler Added EvictSystemCriticalPods flag to descheduler Mar 30, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2021
@ingvagabund (Contributor):

/lgtm

@RyanDevlin thanks for the thorough investigation. The PR looks great!!!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2021
@RyanDevlin (Author):

@ingvagabund Thanks for all the help! This was my first PR, I learned a lot.

@damemi (Contributor) left a comment:
/approve
Thanks @RyanDevlin !

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, RyanDevlin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2021
@k8s-ci-robot k8s-ci-robot merged commit a2746d0 into kubernetes-sigs:master Mar 30, 2021
@RyanDevlin RyanDevlin deleted the evict-critical branch March 30, 2021 17:50
seanmalloy added a commit to KohlsTechnology/descheduler that referenced this pull request Apr 1, 2021
Ran "make gen" using Go 1.16.1. Some changes were merged, but "make gen"
was not run. This fixes the problem.

See below PR for reference:
kubernetes-sigs#523
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Added EvictSystemCriticalPods flag to descheduler
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Ran "make gen" using Go 1.16.1. Some changes were merged, but "make gen"
was not run. This fixes the problem.

See below PR for reference:
kubernetes-sigs#523
Successfully merging this pull request may close: Permit descheduling of critical pods (#378).

5 participants