
Memory pressure shouldn't hard evict static pods #38322

Closed
bprashanth opened this issue Dec 7, 2016 · 47 comments · Fixed by #39114
@bprashanth (Contributor)

If a node runs out of memory, we kill something. Then, if there is memory pressure, we don't admit best-effort pods (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L110). Maybe we can admit static pods, even if they're best effort?

@kubernetes/sig-node if we can't differentiate static from non-static, maybe we can use the scheduler.alpha.kubernetes.io/critical-pod annotation?

bprashanth added the sig/node label on Dec 7, 2016
@MrHohn (Member) commented Dec 15, 2016

This issue is now actually causing a customer outage. We should bump it up to P0.

@thockin @bowei

@euank (Contributor) commented Dec 15, 2016

Is there any reason any static pods are best effort?

Shouldn't, ideally, all static pods be guaranteed? Rather than including a "scheduler.alpha.kubernetes.io/critical-pod" annotation, I'd rather modify the pod resource requests/limits to make them guaranteed.

@euank (Contributor) commented Dec 15, 2016

Two possible solutions (among others, I'm sure) and why I'd prefer guaranteed:

Make static pods 'special'

Having static pods skip all admittance checks is implemented in #38795.

I don't like this because of the special-casing, security implications, and because I think there's probably a more correct solution.

I think this solution opens up the chance for static pods to thrash if the node's in a messy state.

Make static pods guaranteed

I think that making all static pods guaranteed might be a better solution. It results in a lower total node-allocatable for users, but in reality users shouldn't have been allocating those resources anyway, because the static pods needed them.

The hardest part of this solution is picking the request/limit numbers for each static pod correctly, but honestly we need to do that anyway.
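
For context, "guaranteed" here means every container sets CPU and memory limits and any requests it sets equal those limits. A minimal sketch of that check, assuming core v1 types (helper and package names are illustrative, not the actual qos package code):

```go
package kubeletutil

import v1 "k8s.io/api/core/v1"

// isGuaranteed sketches the Guaranteed QoS condition: every container has
// CPU and memory limits, and any requests it sets equal those limits.
func isGuaranteed(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		for _, name := range []v1.ResourceName{v1.ResourceCPU, v1.ResourceMemory} {
			limit, hasLimit := c.Resources.Limits[name]
			if !hasLimit {
				return false // missing a limit: at best burstable
			}
			if request, hasRequest := c.Resources.Requests[name]; hasRequest && request.Cmp(limit) != 0 {
				return false // request != limit: burstable, not guaranteed
			}
		}
	}
	return true
}
```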

@bprashanth (Contributor, Author)

I think the issue to discuss big picture solutions is #22212

The suggested one-liner fix can go right after https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L110:

if notBestEffort || criticalPod {
 return lifecycle.PodAdmitResult{Admit: true}
}

where criticalPod is set based on

annotations:
  node.alpha.kubernetes.io/critical-pod: ''
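
A minimal sketch of how criticalPod could be derived from that annotation (constant, helper, and package names are illustrative, not the actual kubelet helpers):

```go
package kubeletutil

import v1 "k8s.io/api/core/v1"

// Key suggested in the comment above; the key that was ultimately reused
// across the system may differ (see the discussion below).
const criticalPodAnnotation = "node.alpha.kubernetes.io/critical-pod"

// isCriticalPod reports whether the pod carries the critical-pod annotation.
func isCriticalPod(pod *v1.Pod) bool {
	_, ok := pod.Annotations[criticalPodAnnotation]
	return ok
}
```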

@euank (Contributor) commented Dec 15, 2016

@bprashanth I'm not really happy with such an annotation though. The basically one-line change in #38795 is now "don't apply memory-admit failures to static pods", but I only see that as a temporary workaround until the proper solutions to this land.

euank added a commit to euank/kubernetes that referenced this issue Dec 15, 2016
Fixes kubernetes#38322

This assumes that all static pods are important to the point that
eviction denials do not apply to them.
@bprashanth (Contributor, Author)

Yeah, I'm fine doing that too. It's basically the recommended fix, but looking at the source instead of an annotation. I think we'll need the annotation though.

To recap various other bugs, there are actually multiple issues here:

  1. Admit all static pods, unless there's either a node condition or some other "hard" conflict (eg: hostport)
  2. Set a lower oom score on (some/all) static pods
  3. If a static pod fails admission because of node condition, keep retrying

Your PR fixes 1.

We still need something like an annotation to fix 2 in the short term (obviously fixable with preemption priority). We could use the existing podSource again, but that means something like fluentd running as burstable would also get a low oom score, which may not be the best tradeoff, because we know that thing is a pig.

3 involves treating the file source differently from the apiserver source in many more places. Right now, failing an admission check marks a pod Failed, and the kubelet assumes that some other controller will recreate the pod and the scheduler will re-assign it to a healthy node. The problem, of course, is that the kubelet is the controller for static pods, so it really needs to continuously reconcile static pods that have failed admission checks and process them as new creates instead of terminated pods.
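
A rough sketch of what that reconcile behavior for point 3 could look like, purely for illustration; the real kubelet sync loop is structured differently and all names here are hypothetical:

```go
package kubeletutil

import v1 "k8s.io/api/core/v1"

// retryFailedStaticPods illustrates point 3: because the kubelet itself is
// the "controller" for file-source pods, static pods that failed an earlier
// admission check are re-submitted as new creates once they would now pass,
// instead of being left in the Failed phase.
func retryFailedStaticPods(staticPods []*v1.Pod, admit func(*v1.Pod) bool, create func(*v1.Pod)) {
	for _, pod := range staticPods {
		if pod.Status.Phase == v1.PodFailed && admit(pod) {
			create(pod) // hypothetical: process as a fresh create
		}
	}
}
```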

@dchen1107 (Member)

@bprashanth I like your idea of using the scheduler.alpha.kubernetes.io/critical-pod annotation here, because:

  1. It is more aligned with our long-term fix (Preemption priority / scheme #22212), which is to have a generic scheme describing how critical and important a pod is.

  2. Once the annotation is specified for a pod, the local controller modules (admission at the Kubelet, the eviction manager, etc.) can all follow the same set of rules when making decisions. The Kubelet can also give critical pods a lower oom score so they are less likely to be killed.

  3. More importantly, later when we enhance the scheduler, rescheduler, daemonset_controller, etc., they can all follow the same rules, avoiding the ping-pong issue caused by conflicting protocols (the Kubelet keeps rejecting some pods, while the remote controller / scheduler keeps putting them back).

  4. This proposal also decouples static pods from the importance / criticalness of a pod. Not all static pods should be treated the same. For example, fluentd is a static pod today, but it has a known memory leak issue. Are we planning to treat the fluentd pod the same as kube-proxy? Are we saying that when a node has a resource starvation issue, losing some log messages is as bad as losing node availability?

@euank I don't like treating static pods as special, because only the kubelet knows those pods are file-based. There is no API field indicating it, so we would very likely run into the issues described above in 3) and 4).

@davidopp (Member)

+1, using the critical pod annotation is a good idea and (I think) takes pressure off of us to design a full solution for #22212 right away.

@davidopp (Member)

Just to be clear, though -- the "critical pod" annotation was never intended for this purpose. It is intended to indicate which pods rescheduler should work to get scheduled. Rescheduler only operates on pods that go through the normal default scheduler -- but static pods (and DaemonSet pods) never go through the scheduler. So, it may be confusing to use it here, and instead maybe we should use a different annotation. Reusing the annotation here won't break anything but it seems likely to cause confusion. Can you use a new annotation?

@bprashanth (Contributor, Author)

I had suggested a new annotation above (note the "node" prefix). Dawn's argument for reusing the existing one is that it future-proofs the pod template (e.g. if it's moved to a deployment or daemonset, it'll still work). I'm currently going to keep it the same.

@davidopp (Member)

I can understand the future-proofing argument, but realistically #22212 is almost certainly going to change (hopefully eliminate) that annotation anyway, so the future is really only until we do #22212.

@euank (Contributor) commented Dec 15, 2016

Thanks for explaining, @dchen1107; having a temporary annotation as a stop-gap for a proper solution makes more sense to me now.

@fgrzadkowski (Contributor)

The issue can also happen to other pods, like fluentd or other logging agents. We should apply a similar "fix" as in #38836 to fluentd.

@crassirostris Can you please patch the fluentd configuration so it is treated as critical? I think it should be patched into 1.5 together with #38836.

@crassirostris

@fgrzadkowski Sure

k8s-github-robot pushed a commit that referenced this issue Dec 22, 2016
Automatic merge from submit-queue (batch tested with PRs 39114, 36004)

assign -998 as the oom_score_adj for critical pods (e.g. kube-proxy)

I also validated this with a test cluster: freshly built cluster, kill the kube-proxy pod, etc.

```
root      2660  2643  0 Dec21 ?        00:00:00 /bin/sh -c kube-proxy --master=https://104.198.79.64 --kubeconfig=/var/lib/kube-proxy/kubeconfig  --cluster-cidr=10.180.0.0/14 --resource-container="" --v=4   1>>/var/log/kube-proxy.log 2>&1
root      2667  2660  0 Dec21 ?        00:03:14 kube-proxy --master=https://104.198.79.64 --kubeconfig=/var/lib/kube-proxy/kubeconfig --cluster-cidr=10.180.0.0/14 --resource-container= --v=4
# cat /proc/2660/oom_score_adj 
-998
# cat /proc/2667/oom_score_adj 
-998
```

In this PR, I also include a small fix for an import cycle issue. The right fix would remove the dependency on the qos package from pkg/apis/componentconfig/v1alpha1. But since we plan to cherry-pick this PR to both 1.5 and (possibly) 1.4, I want to touch the source as little as possible.

Partial fix: #38322
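
Roughly, the change gives critical pods the same oom_score_adj as guaranteed pods, so the kernel OOM killer targets them last. A hedged sketch of that selection (the constant values match the output above; the function shape is illustrative, not the exact policy package code):

```go
package kubeletutil

const (
	guaranteedOOMScoreAdj = -998 // shared by guaranteed and critical pods
	bestEffortOOMScoreAdj = 1000 // first to be OOM killed by the kernel
)

// oomScoreAdjFor sketches the selection: critical pods get the guaranteed
// score, everything else keeps its QoS-derived score (burstable scaling by
// memory request is omitted here for brevity).
func oomScoreAdjFor(critical, guaranteed, bestEffort bool) int {
	switch {
	case critical, guaranteed:
		return guaranteedOOMScoreAdj
	case bestEffort:
		return bestEffortOOMScoreAdj
	default:
		return 2 // lowest burstable adjustment in the real policy
	}
}
```
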
@vishh (Contributor) commented Jan 12, 2017

@davidopp 5 in #38322 (comment) is necessary for admitting static pods, for example. Let's discuss the eviction semantics in a different issue, though, as you suggested.

@thockin (Member) commented Jan 12, 2017 via email

@dchen1107 (Member) commented Jan 17, 2017

I didn't like the short-term solution initially, and tried to prioritize the proper fix through #22212 for more than a year. But we just have too much to carry as a team, and some things failed to be properly prioritized. In the end, I accepted the short-term fixes for two reasons:

  • The users are suffering. kube-proxy and fluentd pods were evicted without restart. kube-dns cannot be admitted even though it is very critical for overall health. @thockin
  • The long-term proper fix (Preemption priority / scheme #22212) is prioritized as P0 for the 1.6 release, so we can retire these short-term hacks very soon. @davidopp

Meanwhile, I did some risk analysis on the three PRs being cherry-picked into the release branch a couple of weeks ago, through a private email thread. I've copied and pasted it here:

From dawnchen@ to the customers:

The cherrypick pr(s) include the following fixes:

  • pr/38836: Kubelet admits critical pods, including kube-proxy, even under memory pressure:
    With pr/39059 below included, kube-proxy should already be covered since, as a static pod, it won't be evicted at all. But this PR helps other critical kube-system pods (for example kube-dns) get admitted. The risk of this fix is that when the node is already overcommitted on memory, it won't reject any critical pod, which could make the overcommit situation worse and affect the running processes (daemons and other workloads, etc.). But the risk should be bounded, since the PR only admits critical pods under memory pressure, not under the out-of-disk condition. If the starvation gets worse, the kubelet eviction manager will be invoked to recover from the situation.

One suggestion from me: please do not use the critical-pod annotation for your own pods; leave it only for kube-system pods for now.

  • pr/39114: Make critical pods have the same oom score as guaranteed pods.
    This shouldn't carry any risk; if anything, we should make it even better in the long term.

  • pr/39059: Stop evicting static pods.
    This one theoretically implies a bigger risk if the static pods are the ones causing the resource starvation, especially disk starvation. But the risk is bounded too, since a GKE node by default has only two static pods: kube-proxy and fluentd. Neither uses much disk space, and evicting them doesn't help with an out-of-disk situation anyway. Fluentd itself has a known memory leak issue, but since the 1.4 release we have made it a guaranteed pod with a resource limit; once memory usage hits the limit, the kernel OOM-kills the container itself. Based on this risk analysis, I rejected pr/38914 as the patch, even though it is more aligned with our long-term solution, due to its complexity.

The suggestion I have for this: please do not introduce new static pods on your nodes.

Obviously all of the above are short-term workarounds, and the long-term solution should be addressed through issue/22212. issues/39124 covers two issues: 1) one small part of issue/22212, and 2) a disk management issue. We have prioritized both as P0 for Q1. I hope this answers your questions and concerns; please let me know if you have more.

@vishh (Contributor) commented Jan 18, 2017

@thockin @davidopp @dchen1107 @erictune @janetkuo and I discussed this issue further today. The following is a summary:

The primary problems we addressed are as follows:

  1. Static system pods (kube-proxy & fluentd on certain deployments) get evicted and are not scheduled back on nodes.
  2. Static system pods are not "guaranteed" to run on a node and can fail feasibility checks.
  3. The existing alpha API for specifying pod priority using pod-level annotations causes security issues.

In the short term,

  1. Continue using "pod level annotations" for priority.
  2. Feature gate the "Critical Pod Annotation" across the system, tie it to a configurable namespace (or just to kube-system), and disable it by default. GKE will enable this feature.
  3. Static pods can be marked as "critical"
  4. Kubelet will not evict "critical pods" and instead restart them.
  5. Kubelet will guarantee that static pod manifests are processed before API pods are processed. This is to ensure that static pods can fit on a node.
  6. A static pod or a daemonset that gets updated might not be accepted by the node due to resource constraints, even if they are critical.

We decided to have only two levels of priority (Critical & QoS) instead of three (Static, Critical and QoS) for evictions. This is to align with the long term plan.

This short term solution will solve problems 1 & 3 mentioned above.
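
A rough sketch of the gating behavior in point 2 of the short-term plan above; the annotation key, namespace, and function shape are assumptions for illustration, not the final implementation:

```go
package kubeletutil

import v1 "k8s.io/api/core/v1"

// honorCriticalAnnotation sketches the proposed gating: the annotation only
// grants special node-local treatment when the feature gate is enabled and
// the pod lives in the privileged namespace (e.g. kube-system).
func honorCriticalAnnotation(pod *v1.Pod, gateEnabled bool, privilegedNamespace string) bool {
	if !gateEnabled || pod.Namespace != privilegedNamespace {
		return false
	}
	_, ok := pod.Annotations["scheduler.alpha.kubernetes.io/critical-pod"]
	return ok
}
```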

Now for the long term,

  1. A preemption & priority scheme will be designed that will allow us to deprecate the "Critical" pod-level annotation. @davidopp is working on this. This solves problem 3. ETA: Design in v1.6, Alpha (or beta?) in v1.7
  2. As part of this design, the Kubelet's role in preemptions will be finalized, based on which static pods can be admitted onto a fully committed node. Solves problem 2. ETA
  3. Critical static pods will use this new priority scheme to guarantee their availability. Solves problem 1. ETA: TBD (based on availability of the feature)
  4. [Closely related] Existing static pods (kube-proxy & fluentd) will switch to using DaemonSets once high-priority DaemonSet pods are "guaranteed" to run on a node. As of now, DaemonSet pods bypass the scheduler and can get "rejected" by the kubelet.

@derekwaynecarr @liggitt @smarterclayton does this satisfy your needs?

@smarterclayton (Contributor) commented Jan 19, 2017 via email

@derekwaynecarr (Member)

@vishh -- for the near-term plan, does a critical pod get a higher oom_score, or do we just let it be restarted as normal? For 1.6, I assume we are OK with critical pods being constrained on their memory usage relative to their QoS tier?

@vishh (Contributor) commented Jan 19, 2017

@derekwaynecarr I was thinking of not making "Critical" a new QoS class, but instead using it just for preemption and eviction decisions. So oom_score will not be impacted. A Critical pod can still get OOM-killed by the kernel, but will not be evicted. I think that is a good design, because even critical pods can have memory leaks.
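
A minimal sketch of the eviction-side behavior described above: critical pods are never chosen as eviction victims (they get restarted instead), while the kernel OOM killer remains the backstop. Names are illustrative:

```go
package kubeletutil

import v1 "k8s.io/api/core/v1"

// evictionCandidates sketches the proposal: critical pods are filtered out
// of the kubelet's eviction ranking entirely, so they are restarted rather
// than evicted; the kernel OOM killer remains the backstop for memory leaks.
func evictionCandidates(pods []*v1.Pod, isCritical func(*v1.Pod) bool) []*v1.Pod {
	var candidates []*v1.Pod
	for _, pod := range pods {
		if isCritical(pod) {
			continue
		}
		candidates = append(candidates, pod)
	}
	return candidates
}
```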

@dchen1107 (Member)

@vishh Thanks for summarizing the discussion.

@smarterclayton @derekwaynecarr @liggitt #38322 (comment) is aligned with the node team roadmap: https://docs.google.com/a/google.com/spreadsheets/d/1-hADEbGEUrW04QP4bVk7xf1CWuqbFU6ulMAH1jqZQrU/edit?usp=sharing (row 25). This is P0 for both the sig-scheduling and sig-node teams.

@liggitt (Member) commented Jan 19, 2017

Feature gate the "Critical Pod Annotation" across the system, tie it to a configurable namespace (or just to kube-system), and disable it by default. GKE will enable this feature.

that addresses my security concerns, thanks

@davidopp (Member) commented Jan 20, 2017

Feature gate the "Critical Pod Annotation" across the system, tie it to a configurable namespace (or just to kube-system), and disable it by default.

Can you clarify what "feature gating" means here? Does it mean critical pod annotation cannot be added when the feature is disabled, or that it can always be added but will be ignored for the purposes of the eviction policy when the feature is disabled? The second is fine but the first is not because rescheduler uses it. I assume you mean the second but I just want to be sure.

@derekwaynecarr (Member)

@davidopp -- I assume the feature gate does not restrict the presence of the annotation, but is instead just a kubelet flag that says whether the annotation's presence grants special node-local decisions. The annotation is free to be used by the rescheduler, assuming the rescheduler is enabled (which also appears to default to false).

@derekwaynecarr (Member)

@davidopp -- ignore my previous comment, I see it's on by default. BTW, I see the rescheduler also restricts the annotation to being meaningful only in a single namespace (which is good). Anyway, I assume it's the second option from your original comment.

@vishh (Contributor) commented Jan 20, 2017 via email

@derekwaynecarr (Member)

@vishh alpha features should not be on by default.

@derekwaynecarr (Member)

@vishh - to elaborate: in code, the default must be off. In kube-up deployments, the default can change, and I have no opinion there.

@vishh (Contributor) commented Jan 20, 2017 via email

@davidopp (Member)

Right -- it's fine to disable the eviction behavior of the annotation by default. Just don't prevent people from setting the annotation by default, since it is used by a completely unrelated subsystem (the "rescheduler").

@vishh (Contributor) commented Jan 22, 2017 via email

@smarterclayton (Contributor) commented Jan 23, 2017 via email

@davidopp (Member)

So we are saying that it's OK if upgrading to 1.6 breaks this feature for everyone who is currently relying on it to get their DNS, Heapster, etc. scheduled today on GCE, AWS, Azure, on-prem, etc., unless they take the additional action of re-enabling it?

@davidopp (Member)

cc/ @piosz @crassirostris

@erictune (Member)

It sounds like it would be okay if the behavior is flag-controllable and off by default in head, but enabled on GKE and in kube-up.sh. Not a great long-term solution, but it could bridge the gap until another solution is ready.

@dchen1107 (Member) commented Jan 23, 2017

I am confused here.

I thought we had a design meeting and reached agreement on the following action items:

  1. vishnuk@ finishes the feature gate for the "Critical Pod Annotation" across the system, and cherry-picks it to both 1.4 and 1.5. The feature gate should be off by default, but GKE can carry the burden of re-enabling it. As @erictune suggested above, we can also enable this by default through kube-up.sh, so that no regression is introduced for 1.5 users.

  2. dashpole@ (or vishnuk@) works on resolving the rest of the short-term issues for 1.6, as documented by Vish at #38322 (comment):

    • Kubelet will not evict "critical pods" and instead restart them.
    • Kubelet will guarantee that static pod manifests are processed before API pods are processed. This is to ensure that static pods can fit on a node. We talked about some corner cases:
      • A static pod or a daemonset that gets updated might not be accepted by the node due to resource constraints, even if they are critical.
      • Updates or additions of static pods after the initial node startup are not "guaranteed" acceptance by the Kubelet; this is a known issue / corner case.
  3. davidopp@ is working on the long-term design (Preemption priority / scheme #22212), and sig-node, sig-scheduling and sig-apps will be involved with the design since it affects the Kubelet, scheduler, rescheduler, daemonset-controller, etc.

Are we changing the timeline or reverting some of the decisions here?

@thockin (Member) commented Jan 24, 2017

Dawn, the part you're missing (I think) is that we should also make the critical annotation disabled, or at least disableable, in the rescheduler by default.

@davidopp (Member)

I think there are a lot of people who run Kubernetes without GKE or kube-up.sh. But I guess we can say the rescheduler is alpha, so it's OK if upgrading disables it... (The documentation doesn't say anything about it being alpha or beta or GA.)

@vishh (Contributor) commented Jan 27, 2017

Let's move this discussion to #40573
