
Guaranteed admissions and evictions of "critical pods" and "static pods" #40573

Closed
7 of 12 tasks
vishh opened this issue Jan 27, 2017 · 18 comments
Assignees: @thockin, @matchstick

Labels:
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • milestone/removed
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@vishh (Contributor) commented Jan 27, 2017

Forked from #22212

TL;DR: work items, in order of priority:

The rest of the work items are not strictly necessary for v1.6 but are included for the sake of completeness.

  • Kube-proxy is moved to the Guaranteed QoS class (@thockin @matchstick FYI)
  • Kube-proxy is moved to a DaemonSet. We need to introduce node labels to avoid scheduling kube-proxy onto nodes that are already running it via static pods; see the sketch after this list. (@thockin @matchstick FYI)
  • Document that using static pods alongside regular apiserver pods on a node is discouraged.
  • Evict all critical pods. This cannot be done until kube-proxy moves to a DaemonSet.
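
A minimal sketch of what a kube-proxy DaemonSet with such a node-label guard and Guaranteed resources could look like (the label name, image, and resource values are assumptions for illustration, not the actual kube-up manifest):

```yaml
# Hedged sketch only; the node label, image, and resource values are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      labels:
        k8s-app: kube-proxy
    spec:
      hostNetwork: true
      nodeSelector:
        # Assumed label: only schedule onto nodes that are NOT already
        # running kube-proxy as a static pod.
        kube-proxy-ds-ready: "true"
      containers:
      - name: kube-proxy
        image: registry.k8s.io/kube-proxy:v1.6.0   # placeholder image/tag
        command:
        - /usr/local/bin/kube-proxy
        - --kubeconfig=/var/lib/kube-proxy/kubeconfig
        resources:
          # Equal requests and limits place the pod in the Guaranteed QoS class.
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi
```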

If we care about supporting static pods safely alongside regular apiserver pods, possibly for bootstrapping, the next feature matters. Without it, static pods that were evicted earlier will be restarted after the kubelet restarts and might cause other running pods to be evicted due to resource constraints.

  • Kubelet switches to running only in "local" or "remote" mode

Kubernetes lacks the notion of "priority" among pods. A pod-level annotation was introduced through which the rescheduler will make room for, and successfully place, a pod marked as "critical" onto a node by evicting some existing pods.
This feature, however, does not work for pods that are placed directly on nodes, such as pods managed by a DaemonSet and static pods.
Typically, pods managed through DaemonSets and static manifest files are of high priority in the system. Given that the rescheduler will not help pods from these sources, there needs to be an alternative means to guarantee admission of such critical pods.

A viable solution is to have the kubelet preempt existing pods to make room for a critical pod when required.

Note that the critical-pod annotation can easily be abused, so, as discussed in #38322 (comment), it will be available only as an opt-in solution and restricted to a single namespace. Any user who intends to manage critical pods will have to place them in that special namespace (kube-system).
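
For illustration, a minimal sketch of a pod that opts into this behavior (pod and container names and the image are placeholders; per the release notes quoted further down, the kubelet must also run with the ExperimentalCriticalPodAnnotation feature gate enabled):

```yaml
# Illustrative sketch only; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon
  namespace: kube-system           # the annotation is honored only in this namespace
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""   # value assumed to be empty
spec:
  containers:
  - name: addon
    image: registry.k8s.io/pause:3.9   # placeholder image
```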

Until preemption is available in the kubelet, though, it is not safe to evict "critical" pods, for a couple of reasons:

  1. Static pods do not get restarted once they get evicted (more on this below)
  2. New instances of pods placed directly on the node (DaemonSet) may not be admitted due to lack of compute resources.

For these reasons, we will avoid evicting "Critical" pods.
There is a possibility of "critical" pods having memory or storage leaks that lead to unexpected usage of memory or disk. For this reason, it is safer to at least restart "critical" pods instead of evicting them. However, restarting pods in the kubelet is currently difficult to implement: it requires tearing down all containers, volumes, pod-level cgroups, etc., and re-creating them.

Given that we are implementing "preemption" soon, we will temporarily just ignore "critical" pods when a node is under resource pressure.

Now onto static pods. Once a static pod is evicted, it will not be restarted unless the kubelet itself restarts. When the kubelet restarts, it simply admits all static pods, even ones that were previously evicted (cc @yujuhong).
As a result, a static pod that was evicted earlier and is not expected to be running will be admitted, and a regular pod that was already running on the node might be evicted from it.
This behavior is highly undesirable. The kubelet could look at the "mirror pod" for a static pod and decide not to re-create previously evicted static pods, but that is only possible if the kubelet has access to the apiserver before it processes static pods.
Rather than attempting to solve this problem, we will try to avoid using static pods as much as possible in production.
In addition, we need to update the kubelet to operate in only one of two modes, "local" or "remote". Once the kubelet switches to "remote" apiserver mode, it will stop accepting any "local" static (or http) pods. If the kubelet is explicitly switched back to "local" mode, it will evict all "remote"/unknown pods and only manage "local" pods. Upon switching from "local" to "remote", the kubelet will create mirror pod objects for "local" pods and then transfer control of those pods to the apiserver.
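
For context, a static pod is defined by a manifest file that the kubelet reads from its local manifest directory (configured via the kubelet's --pod-manifest-path flag); the kubelet then creates a "mirror pod" object on the apiserver so the pod is visible there. A minimal sketch, with placeholder path, names, and image:

```yaml
# Saved on the node, e.g. as /etc/kubernetes/manifests/my-static-pod.yaml
# (path, names, and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: my-static-pod
  namespace: kube-system
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```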

@dashpole ^^

FYI: @dchen1107 @thockin @davidopp @erictune @derekwaynecarr @smarterclayton

@vishh vishh added priority/P1 sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 27, 2017
@vishh vishh added this to the v1.6 milestone Jan 27, 2017
@thockin (Member) commented Jan 27, 2017 via email

@davidopp (Member) commented:
That's a lot of work for something that will probably change substantially in 1.7. Are you sure all of it is necessary to solve the problem?

@vishh (Contributor, Author) commented Jan 28, 2017

@thockin

> I do not recall discussing this part. If we have preemption, why do we need this?

Preemption, to begin with, will only work for "critical" pods. Static pods being restarted unintentionally can result in disruption for regular pods, since the node suddenly becomes overcommitted.

> With preemption we do not need this (and I think this is v.difficult to define anyway)

Startup order will not be an issue with preemption.

> [ ] Kubelet switches to running only in "local" or "remote" mode
> Not sure we need this

As mentioned above, if we want the system to be reliable even in the presence of static pods, we need this. However, if we choose to ignore the issues with static pods, possibly by just documenting them, we can avoid implementing this behavior.

> As discussed offline, we might revisit all of this later, once we have a proper design for a more-than-binary priority scheme.

The issue with static pods will not be solved by the priority scheme. We "incorrectly" restart evicted or failed static pods.

@thockin (Member) commented Jan 28, 2017 via email

@vishh (Contributor, Author) commented Jan 28, 2017

@davidopp @thockin I made some updates to the issue. PTAL. I hope the updates make it obvious why and when the changes mentioned above are necessary.

k8s-github-robot pushed a commit that referenced this issue Feb 4, 2017
Automatic merge from submit-queue

Optionally avoid evicting critical pods in kubelet

For #40573

```release-note
When the feature gate "ExperimentalCriticalPodAnnotation" is set, the kubelet will avoid evicting pods in the "kube-system" namespace that contain a special annotation - `scheduler.alpha.kubernetes.io/critical-pod`
This feature should be used in conjunction with the rescheduler to guarantee availability for critical system pods - https://kubernetes.io/docs/admin/rescheduler/
```
@thockin (Member) commented Feb 6, 2017

Vish,

I still have no recollection of "local" vs "remote"?

@vishh (Contributor, Author) commented Feb 6, 2017

@thockin It's an idea that you also proposed: handing over control of static pods to the apiserver once the kubelet successfully connects to an apiserver.

@vishh (Contributor, Author) commented Feb 6, 2017

There is still one more issue that has not been captured.
System OOMs can lead to crash-loop backoff, which results in long downtimes for critical system pods. System OOMs cannot be reliably avoided using the OOM score adjust (oom_score_adj) setting unless we disable system OOM kills altogether for critical pods; however, that is dangerous because critical pods can also have memory leaks.

I'm inclined towards not meddling with OOM score adjust, which is known to be unreliable, and instead reducing the max backoff for crash-looping critical pods to a low value (1s maybe?).

Until we have NodeAllocatable rolled out, critical pods might be OOM-killed even if they do not exceed their requested memory. To protect critical system pods from being OOM-killed, they should ideally be in the Guaranteed QoS class.
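
For reference, a pod lands in the Guaranteed QoS class only when every container sets CPU and memory limits equal to its requests; a minimal sketch (names, image, and values are placeholders):

```yaml
# Illustrative sketch only; names, image, and resource values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon
  namespace: kube-system
spec:
  containers:
  - name: addon
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
      limits:              # limits == requests for every container => Guaranteed QoS
        cpu: 100m
        memory: 200Mi
```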

k8s-github-robot pushed a commit that referenced this issue Feb 13, 2017
Automatic merge from submit-queue

Flag gate critical pods annotation support in kubelet

```release-note
If ExperimentalCriticalPodAnnotation=True flag gate is set, kubelet will ensure that pods with `scheduler.alpha.kubernetes.io/critical-pod` annotation will be admitted even under resource pressure, will not be evicted, and are reasonably protected from system OOMs.
```

For #40573
k8s-github-robot pushed a commit that referenced this issue Feb 13, 2017
Automatic merge from submit-queue

Make fluentd a critical pod

For #40573
Based on #40655 (comment)

```release-note
If `experimentalCriticalPodAnnotation` feature gate is set to true, fluentd pods will not be evicted by the kubelet.
```
k8s-github-robot pushed a commit that referenced this issue Feb 25, 2017
Automatic merge from submit-queue

Protect kubeproxy deployed via kube-up from system OOMs

This change is necessary until it can be moved to Guaranteed QoS Class.

For #40573
k8s-github-robot pushed a commit that referenced this issue Feb 26, 2017
Automatic merge from submit-queue

Admit critical pods under resource pressure

And evict critical pods that are not static.

Depends on #40952.

For #40573
@vishh vishh assigned thockin and matchstick and unassigned dashpole Mar 9, 2017
@thockin (Member) commented Apr 10, 2017

@mdelio We should get these last 2 added to the backlog. I threw them in the spreadsheet.

@davidopp (Member) commented:
IIRC @dchen1107 has said she doesn't think kube-proxy should be in Guaranteed QoS because it is hard to set a good limit for its memory consumption.

@thockin (Member) commented Apr 10, 2017 via email

@vishh (Contributor, Author) commented Apr 10, 2017

As an example, Heapster is in the Guaranteed QoS class; it runs with a nanny that periodically resizes it based on cluster size. More details on the nanny program are here.

As thockin@ suggested, if kube-proxy can be scaled proportionally to the number of services in the cluster, it can run in the Guaranteed class.

@spiffxp (Member) commented Jun 19, 2017

/remove-priority P1
/priority important-soon
(I'm not actually sure this is the right priority, just trying to remove the old priority/PN labels)

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/P1 labels Jun 19, 2017
@k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Removed

@matchstick @thockin @vishh

Important:
This issue was missing labels required for the v1.8 milestone for more than 7 days:

kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature].

Removing it from the milestone.

Additional instructions are available here. The commands available for adding these labels are documented here.

@k8s-github-robot k8s-github-robot removed this from the v1.8 milestone Sep 9, 2017
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2018
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2018
@fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
