
Guaranteed admissions and evictions of "critical pods" and "static pods" #40573

Closed
7 of 12 tasks
vishh opened this issue Jan 27, 2017 · 18 comments
Assignees: @thockin, @matchstick

Labels:
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • milestone/removed
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@vishh (Contributor) commented Jan 27, 2017

Forked from #22212

TL;DR: work items, in order of priority:

The rest of the work items are not strictly necessary for v1.6 but are included for the sake of completeness.

  • Kube-proxy is moved to the Guaranteed QoS class (@thockin @matchstick FYI)
  • Kube-proxy is moved to a DaemonSet. We need to introduce node labels to avoid scheduling kube-proxy onto nodes that are already running it via static pods; see the sketch after this list. (@thockin @matchstick FYI)
  • Document that using static pods alongside regular apiserver pods on a node is discouraged.
  • Evict all critical pods. This cannot be done until kube-proxy moves to a DaemonSet.
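
A minimal sketch of what a kube-proxy DaemonSet with such a node-label guard and Guaranteed resources could look like (the label name, image, and resource values are assumptions for illustration, not the actual kube-up manifest):

```yaml
# Hedged sketch only; the node label, image, and resource values are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      labels:
        k8s-app: kube-proxy
    spec:
      hostNetwork: true
      nodeSelector:
        # Assumed label: only schedule onto nodes that are NOT already
        # running kube-proxy as a static pod.
        kube-proxy-ds-ready: "true"
      containers:
      - name: kube-proxy
        image: registry.k8s.io/kube-proxy:v1.6.0   # placeholder image/tag
        command:
        - /usr/local/bin/kube-proxy
        - --kubeconfig=/var/lib/kube-proxy/kubeconfig
        resources:
          # Equal requests and limits place the pod in the Guaranteed QoS class.
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi
```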

If we care about supporting static pods safely alongside regular apiserver pods, possibly for bootstrapping, the next feature matters. Without it, static pods that were evicted earlier will be restarted after the kubelet restarts and might cause other running pods to be evicted due to resource constraints.

  • Kubelet switches to running only in "local" or "remote" mode

Kubernetes lacks the notion of "priority" among pods. A pod-level annotation was introduced through which the rescheduler will make room for, and successfully place, a pod marked as "critical" onto a node by evicting some existing pods.
This feature, however, does not work for pods that are placed directly on nodes, such as pods managed by a DaemonSet and static pods.
Typically, pods managed through DaemonSets and static manifest files are of high priority in the system. Given that the rescheduler will not help pods from these sources, there needs to be an alternative means to guarantee admission of such critical pods.

A viable solution is to have the kubelet preempt existing pods to make room for a critical pod when required.

Note that the critical-pod annotation can easily be abused, so, as discussed in #38322 (comment), it will be available only as an opt-in solution and restricted to a single namespace. Any user who intends to manage critical pods will have to place them in that special namespace (kube-system).
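
For illustration, a minimal sketch of a pod that opts into this behavior (pod and container names and the image are placeholders; per the release notes quoted further down, the kubelet must also run with the ExperimentalCriticalPodAnnotation feature gate enabled):

```yaml
# Illustrative sketch only; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon
  namespace: kube-system           # the annotation is honored only in this namespace
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""   # value assumed to be empty
spec:
  containers:
  - name: addon
    image: registry.k8s.io/pause:3.9   # placeholder image
```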

Until preemption is available in the kubelet, though, it is not safe to evict "critical" pods, for a couple of reasons:

  1. Static pods do not get restarted once they get evicted (more on this below)
  2. New instances of pods placed directly on the node (DaemonSet) may not be admitted due to lack of compute resources.

For these reasons, we will avoid evicting "Critical" pods.
There is a possibility of "critical" pods having memory or storage leaks that lead to unexpected usage of memory or disk. For this reason, it is safer to at least restart "critical" pods instead of evicting them. However, restarting pods in the kubelet is currently difficult to implement: it requires tearing down all containers, volumes, pod-level cgroups, etc., and re-creating them.

Given that we are implementing "preemption" soon, we will temporarily just ignore "critical" pods when a node is under resource pressure.

Now onto static pods. Once a static pod is evicted, it will not be restarted unless the kubelet itself restarts. When the kubelet restarts, it simply admits all static pods, even ones that were previously evicted (cc @yujuhong).
As a result, a static pod that was evicted earlier and is not expected to be running will be admitted, and a regular pod that was already running on the node might be evicted from it.
This behavior is highly undesirable. The kubelet could look at the "mirror pod" for a static pod and decide not to re-create previously evicted static pods, but that is only possible if the kubelet has access to the apiserver before it processes static pods.
Rather than attempting to solve this problem, we will try to avoid using static pods as much as possible in production.
In addition, we need to update the kubelet to operate in only one of two modes, "local" or "remote". Once the kubelet switches to "remote" apiserver mode, it will stop accepting any "local" static (or http) pods. If the kubelet is explicitly switched back to "local" mode, it will evict all "remote"/unknown pods and only manage "local" pods. Upon switching from "local" to "remote", the kubelet will create mirror pod objects for "local" pods and then transfer control of those pods to the apiserver.
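
For context, a static pod is defined by a manifest file that the kubelet reads from its local manifest directory (configured via the kubelet's --pod-manifest-path flag); the kubelet then creates a "mirror pod" object on the apiserver so the pod is visible there. A minimal sketch, with placeholder path, names, and image:

```yaml
# Saved on the node, e.g. as /etc/kubernetes/manifests/my-static-pod.yaml
# (path, names, and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: my-static-pod
  namespace: kube-system
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```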

@dashpole ^^

FYI: @dchen1107 @thockin @davidopp @erictune @derekwaynecarr @smarterclayton

@vishh vishh added priority/P1 sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 27, 2017
@vishh vishh added this to the v1.6 milestone Jan 27, 2017
@thockin (Member) commented Jan 27, 2017 via email

@davidopp (Member) commented:
That's a lot of work for something that will probably change substantially in 1.7. Are you sure all of it is necessary to solve the problem?

@vishh (Contributor, Author) commented Jan 28, 2017

@thockin

> I do not recall discussing this part. If we have preemption, why do we need this?

Preemption, to begin with, will only work for "critical" pods. Static pods being restarted unintentionally can result in disruption for regular pods, since the node suddenly becomes overcommitted.

> With preemption we do not need this (and I think this is v.difficult to define anyway)

Startup order will not be an issue with preemption.

> [ ] Kubelet switches to running only in "local" or "remote" mode
> Not sure we need this

As mentioned above, if we want the system to be reliable even in the presence of static pods, we need this. However, if we choose to ignore the issues with static pods, possibly by just documenting them, we can avoid implementing this behavior.

> As discussed offline, we might revisit all of this later, once we have a proper design for a more-than-binary priority scheme.

The issue with static pods will not be solved by the priority scheme. We "incorrectly" restart evicted or failed static pods.

@thockin (Member) commented Jan 28, 2017 via email

@vishh (Contributor, Author) commented Jan 28, 2017

@davidopp @thockin I made some updates to the issue. PTAL. I hope the updates make it obvious why and when the changes mentioned above are necessary.

k8s-github-robot pushed a commit that referenced this issue Feb 4, 2017
Automatic merge from submit-queue

Optionally avoid evicting critical pods in kubelet

For #40573

```release-note
When the feature gate "ExperimentalCriticalPodAnnotation" is set, the kubelet will avoid evicting pods in the "kube-system" namespace that contain a special annotation - `scheduler.alpha.kubernetes.io/critical-pod`
This feature should be used in conjunction with the rescheduler to guarantee availability for critical system pods - https://kubernetes.io/docs/admin/rescheduler/
```
@thockin (Member) commented Feb 6, 2017

Vish,

I still have no recollection of "local" vs "remote"?

@vishh (Contributor, Author) commented Feb 6, 2017

@thockin It's an idea that you also proposed: handing over control of static pods to the apiserver once the kubelet successfully connects to an apiserver.

@vishh (Contributor, Author) commented Feb 6, 2017

There is still one more issue that has not been captured.
System OOMs can lead to crash-loop backoff, which results in long downtimes for critical system pods. System OOMs cannot be reliably avoided using the OOM score adjust (oom_score_adj) setting unless we disable system OOM kills altogether for critical pods; however, that is dangerous because critical pods can also have memory leaks.

I'm inclined towards not meddling with OOM score adjust, which is known to be unreliable, and instead reducing the max backoff for crash-looping critical pods to a low value (1s maybe?).

Until we have NodeAllocatable rolled out, critical pods might be OOM-killed even if they do not exceed their requested memory. To protect critical system pods from being OOM-killed, they should ideally be in the Guaranteed QoS class.
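
For reference, a pod lands in the Guaranteed QoS class only when every container sets CPU and memory limits equal to its requests; a minimal sketch (names, image, and values are placeholders):

```yaml
# Illustrative sketch only; names, image, and resource values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon
  namespace: kube-system
spec:
  containers:
  - name: addon
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
      limits:              # limits == requests for every container => Guaranteed QoS
        cpu: 100m
        memory: 200Mi
```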

k8s-github-robot pushed a commit that referenced this issue Feb 13, 2017
Automatic merge from submit-queue

Flag gate critical pods annotation support in kubelet

```release-note
If ExperimentalCriticalPodAnnotation=True flag gate is set, kubelet will ensure that pods with `scheduler.alpha.kubernetes.io/critical-pod` annotation will be admitted even under resource pressure, will not be evicted, and are reasonably protected from system OOMs.
```

For #40573
k8s-github-robot pushed a commit that referenced this issue Feb 13, 2017
Automatic merge from submit-queue

Make fluentd a critical pod

For #40573
Based on #40655 (comment)

```release-note
If `experimentalCriticalPodAnnotation` feature gate is set to true, fluentd pods will not be evicted by the kubelet.
```
k8s-github-robot pushed a commit that referenced this issue Feb 25, 2017
Automatic merge from submit-queue

Protect kubeproxy deployed via kube-up from system OOMs

This change is necessary until it can be moved to Guaranteed QoS Class.

For #40573
k8s-github-robot pushed a commit that referenced this issue Feb 26, 2017
Automatic merge from submit-queue

Admit critical pods under resource pressure

And evict critical pods that are not static.

Depends on #40952.

For #40573
@vishh vishh assigned thockin and matchstick and unassigned dashpole Mar 9, 2017
@thockin (Member) commented Apr 10, 2017

@mdelio We should get these last 2 added to the backlog. I threw them in the spreadsheet.

@davidopp (Member) commented:
IIRC @dchen1107 has said she doesn't think kube-proxy should be in Guaranteed QoS because it is hard to set a good limit for its memory consumption.

@thockin (Member) commented Apr 10, 2017 via email

@vishh (Contributor, Author) commented Apr 10, 2017

As an example, Heapster is in the Guaranteed QoS class; it runs with a nanny that periodically resizes it based on cluster size. More details on the nanny program are here.

As thockin@ suggested, if kube-proxy can be scaled proportionally to the number of services in the cluster, it can run in the Guaranteed class.

@spiffxp (Member) commented Jun 19, 2017

/remove-priority P1
/priority important-soon
(I'm not actually sure this is the right priority, just trying to remove the old priority/PN labels)

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/P1 labels Jun 19, 2017
@k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Removed

@matchstick @thockin @vishh

Important:
This issue was missing labels required for the v1.8 milestone for more than 7 days:

kind: Must specify exactly one of [kind/bug, kind/cleanup, kind/feature].

Removing it from the milestone.

Additional instructions are available here. The commands available for adding these labels are documented here.

@k8s-github-robot k8s-github-robot removed this from the v1.8 milestone Sep 9, 2017
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2018
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2018
@fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
