Guaranteed admissions and evictions of "critical pods" and "static pods" #40573
Comments
> There is a possibility of "critical" pods having memory or storage leak bugs that could lead to unexpected usage of memory or disk. For this purpose, it is safer to at least restart "Critical" pods instead of evicting them. However, restarting pods in the kubelet is currently difficult to implement. It requires tearing down all containers, volumes, pod-level cgroups, etc. and re-creating them.
> Given that we are implementing "preemption" soon, we will temporarily just ignore "Critical" pods when a node is under resource pressure.

Without at least restarting a critical pod, it will eventually OOM itself or the system. I guess that's OK, but not great.

> In addition, we need to update the kubelet to operate in only one of two modes, "local" or "remote". Once the kubelet switches to "remote" apiserver mode, it will stop accepting any "local" static (or http) pods. If the kubelet is explicitly switched back to "local" mode, it will evict all "remote"/unknown pods and only manage "local" pods. Upon switching from "local" to "remote", the kubelet will create mirror pod objects for "local" pods and then transfer control of those pods to the apiserver.

I do not recall discussing this part. If we have preemption, why do we need this?
> [ ] Kubelet admits static pods before regular pods

With preemption we do not need this (and I think this is very difficult to define anyway).

> [ ] Kubelet switches to running only in "local" or "remote" mode

Not sure we need this.

As discussed offline, we might revisit all of this later, once we have a proper design for a more-than-binary priority scheme.
That's a lot of work for something that will probably change substantially in 1.7. Are you sure all of it is necessary to solve the problem?
Preemption, to begin with, will only work for "Critical" pods. Static pods being restarted unintentionally can result in disruption for regular pods, since the node suddenly becomes overcommitted.
Startup order will not be an issue with preemption.
As mentioned above, if we want the system to be reliable even in the presence of static pods, we need this. However, if we choose to ignore the issues with static pods, possibly by just documenting them, we can avoid implementing this behavior.
The issue with static pods will not be solved by the priority scheme. We "incorrectly" restart evicted or failed static pods.
>> I do not recall discussing this part. If we have preemption, why do we need this?

> Preemption, to begin with, will only work for "Critical" pods. Static pods being restarted unintentionally can result in disruption for regular pods, since the node suddenly becomes overcommitted.

I'm not sure I follow. If a static pod is marked critical, it should get priority. If it is not marked critical, it's considered for admission like everything else, right?

>> With preemption we do not need this (and I think this is very difficult to define anyway)

> Startup order will not be an issue with preemption.

Right, so we should take it off the plan?

>> [ ] Kubelet switches to running only in "local" or "remote" mode
>> Not sure we need this

> As mentioned above, if we want the system to be reliable even in the presence of static pods, we need this. However, if we choose to ignore the issues with static pods, possibly by just documenting them, we can avoid implementing this behavior.
> The issue with static pods will not be solved by the priority scheme. We "incorrectly" restart evicted or failed static pods.

Why is it incorrect? Static pods are no more important than scheduled pods.
Automatic merge from submit-queue

Optionally avoid evicting critical pods in kubelet

For #40573

```release-note
When the feature gate "ExperimentalCriticalPodAnnotation" is set, Kubelet will avoid evicting pods in the "kube-system" namespace that carry the special annotation `scheduler.alpha.kubernetes.io/critical-pod`.
This feature should be used in conjunction with the rescheduler to guarantee availability for critical system pods - https://kubernetes.io/docs/admin/rescheduler/
```
Vish, I still have no recollection of "local" vs "remote"?
@thockin It's an idea that you also proposed: handing over control of static pods to the apiserver once the kubelet connects successfully with an apiserver.
There is still one more issue that has not been captured. I'm inclined towards not meddling with oom score adjust, which is known to be unreliable, and instead reducing the max backoff for crash-looping critical pods to a low value. Until we have NodeAllocatable rolled out, even if critical pods do not exceed their …
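A minimal sketch of the reduced crash-loop backoff idea floated above. The concrete durations, the doubling policy, and all names here are assumptions for illustration, not values or code from the kubelet:

```go
// Hypothetical sketch: cap the crash-loop backoff lower for critical pods
// so a crash-looping critical pod is restarted quickly instead of waiting
// out the full backoff. Durations are illustrative assumptions.
package main

import (
	"fmt"
	"time"
)

const (
	defaultMaxBackoff  = 5 * time.Minute  // assumed ordinary cap
	criticalMaxBackoff = 10 * time.Second // assumed lower cap for critical pods
)

// nextBackoff doubles the previous delay but caps it lower for critical pods.
func nextBackoff(prev time.Duration, critical bool) time.Duration {
	next := prev * 2
	if next == 0 {
		next = time.Second
	}
	max := defaultMaxBackoff
	if critical {
		max = criticalMaxBackoff
	}
	if next > max {
		next = max
	}
	return next
}

func main() {
	d := time.Duration(0)
	for i := 0; i < 6; i++ {
		d = nextBackoff(d, true)
		fmt.Println(d) // 1s, 2s, 4s, 8s, 10s, 10s
	}
}
```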
Automatic merge from submit-queue

Flag gate critical pods annotation support in kubelet

```release-note
If the ExperimentalCriticalPodAnnotation=True feature gate is set, kubelet will ensure that pods with the `scheduler.alpha.kubernetes.io/critical-pod` annotation are admitted even under resource pressure, are not evicted, and are reasonably protected from system OOMs.
```

For #40573
Automatic merge from submit-queue

Make fluentd a critical pod

For #40573

Based on #40655 (comment)

```release-note
If the `ExperimentalCriticalPodAnnotation` feature gate is set to true, fluentd pods will not be evicted by the kubelet.
```
Automatic merge from submit-queue

Protect kube-proxy deployed via kube-up from system OOMs

This change is necessary until kube-proxy can be moved to the Guaranteed QoS class.

For #40573
@mdelio We should get these last two added to the backlog. I threw them in the spreadsheet.
IIRC @dchen1107 has said she doesn't think kube-proxy should be in Guaranteed QoS because it is hard to set a good limit for its memory consumption.
I think we can do guaranteed if we have a cluster-proportional scaler that governs it.
As an example, Heapster is in the Guaranteed QoS class and it runs a nanny that resizes it periodically based on cluster size. More details on the nanny program here. As thockin@ suggested, if kube-proxy can be scaled in proportion to the number of services in the cluster, kube-proxy can run in the Guaranteed QoS class.
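For reference, a simplified sketch of the Guaranteed QoS condition being discussed: every container must set CPU and memory limits, with any requests equal to those limits. The types here are illustrative assumptions, not the kubelet's actual QoS-classification code, and the resource values are made up:

```go
// Simplified illustration of the Guaranteed QoS rule: all containers set
// cpu and memory limits, and requests (when set) equal those limits.
package main

import "fmt"

type Resources struct {
	Requests map[string]string // e.g. "cpu", "memory"
	Limits   map[string]string
}

type Container struct {
	Name      string
	Resources Resources
}

// isGuaranteed reports whether every container sets cpu and memory limits
// with requests matching those limits (unset requests default to limits).
func isGuaranteed(containers []Container) bool {
	for _, c := range containers {
		for _, res := range []string{"cpu", "memory"} {
			lim, ok := c.Resources.Limits[res]
			if !ok {
				return false
			}
			if req, ok := c.Resources.Requests[res]; ok && req != lim {
				return false
			}
		}
	}
	return true
}

func main() {
	kubeProxy := []Container{{
		Name: "kube-proxy",
		Resources: Resources{
			Requests: map[string]string{"cpu": "100m", "memory": "128Mi"},
			Limits:   map[string]string{"cpu": "100m", "memory": "128Mi"},
		},
	}}
	fmt.Println(isGuaranteed(kubeProxy)) // true: requests == limits for cpu and memory
}
```

The difficulty raised in the previous comments is choosing good limit values for kube-proxy; the nanny / cluster-proportional scaler approach sidesteps that by resizing the limits with cluster size.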
/remove-priority P1
Forked from #22212
Tl;dr; Work items in order of priority:
The rest of the work items are not strictly necessary for v1.6 but are included for the sake of completeness.
Guaranteed QoS class (@thockin @matchstick FYI)

If we care about supporting static pods safely along with regular apiserver pods, possibly for bootstrapping, the next feature would matter. Without the feature mentioned below, static pods that have been evicted earlier will get restarted after the kubelet restarts and might cause other running pods to get booted due to resource constraints.
Kubernetes lacks the notion of "priority" among pods. A pod-level annotation was introduced through which a rescheduler will make room for a pod marked as "Critical" and successfully place it onto a node by evicting some existing pods.
This feature, however, does not work for pods that have been placed directly on nodes. Examples of pods placed directly on nodes include pods managed by a DaemonSet and static pods.
Typically, pods that are managed through DaemonSets and static manifest files are of high priority in the system. Given that the rescheduler will not help pods from these sources, there needs to be an alternative means to guarantee admission of such critical pods.
A viable solution is to have the kubelet preempt existing pods to make room for a critical pod if required (a rough sketch follows).
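This is only an illustration under assumed types and a single-resource (memory) model; the names and the victim-selection policy are not from the actual proposal:

```go
// Hypothetical sketch of kubelet-side preemption at admission time: if a
// critical pod does not fit, pick non-critical pods to evict until enough
// memory is free.
package main

import (
	"fmt"
	"sort"
)

type Pod struct {
	Name      string
	Critical  bool
	MemoryMiB int64
}

// admitWithPreemption returns the pods to evict so the incoming critical pod
// fits within capacity, or an error if it cannot be admitted.
func admitWithPreemption(running []Pod, incoming Pod, capacityMiB int64) ([]Pod, error) {
	var used int64
	for _, p := range running {
		used += p.MemoryMiB
	}
	free := capacityMiB - used
	if free >= incoming.MemoryMiB {
		return nil, nil // fits without preempting anything
	}
	if !incoming.Critical {
		return nil, fmt.Errorf("pod %s rejected: insufficient memory", incoming.Name)
	}
	// Evict the largest non-critical pods first until the critical pod fits.
	var victims, candidates []Pod
	for _, p := range running {
		if !p.Critical {
			candidates = append(candidates, p)
		}
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].MemoryMiB > candidates[j].MemoryMiB })
	for _, p := range candidates {
		if free >= incoming.MemoryMiB {
			break
		}
		victims = append(victims, p)
		free += p.MemoryMiB
	}
	if free < incoming.MemoryMiB {
		return nil, fmt.Errorf("cannot make room for critical pod %s", incoming.Name)
	}
	return victims, nil
}

func main() {
	running := []Pod{{"web", false, 600}, {"batch", false, 300}}
	victims, err := admitWithPreemption(running, Pod{"kube-dns", true, 400}, 1000)
	fmt.Println(victims, err) // evicts "web" to free room for the critical pod
}
```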
Note that the critical pod annotation can easily be abused, so as discussed in #38322 (comment), the critical pod annotation will be available only as an opt-in solution and will be restricted to a single namespace. Any user who intends to manage critical pods will have to place them in that special namespace (`kube-system`).

Until we have preemption available in the kubelet, though, it is not safe to evict "critical" pods, for a couple of reasons:
For these reasons, we will avoid evicting "Critical" pods.
There is a possibility of "critical" pods having memory or storage leak bugs that could lead to unexpected usage of memory or disk. For this purpose, it is safer to at least restart "Critical" pods instead of evicting them. However, restarting pods in the kubelet is currently difficult to implement. It requires tearing down all containers, volumes, pod-level cgroups, etc. and re-creating them.
Given that we are implementing "preemption" soon, we will temporarily just ignore "Critical" pods when a node is under resource pressure (see the sketch below).
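A minimal sketch of that exemption combined with the opt-in rule from the comments above (`kube-system` plus the `scheduler.alpha.kubernetes.io/critical-pod` annotation); the types and function names are assumptions, not the kubelet's real eviction code:

```go
// Hypothetical sketch: treat kube-system pods carrying the critical-pod
// annotation as critical, and drop them from the eviction candidate list
// so they are simply ignored under resource pressure.
package main

import "fmt"

const criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"

type Pod struct {
	Name        string
	Namespace   string
	Annotations map[string]string
}

// isCriticalPod implements the opt-in rule discussed in this issue.
func isCriticalPod(p Pod) bool {
	if p.Namespace != "kube-system" {
		return false
	}
	_, ok := p.Annotations[criticalPodAnnotation]
	return ok
}

// selectEvictionCandidates filters out critical pods before ranking
// eviction victims.
func selectEvictionCandidates(pods []Pod) []Pod {
	var candidates []Pod
	for _, p := range pods {
		if isCriticalPod(p) {
			continue // temporarily exempt until kubelet-side preemption exists
		}
		candidates = append(candidates, p)
	}
	return candidates
}

func main() {
	pods := []Pod{
		{Name: "fluentd", Namespace: "kube-system",
			Annotations: map[string]string{criticalPodAnnotation: ""}},
		{Name: "web", Namespace: "default"},
	}
	fmt.Println(selectEvictionCandidates(pods)) // only "web" remains evictable
}
```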
Now onto static pods. Once a static pod gets evicted, it will not be restarted by the kubelet unless the kubelet itself restarts. When it restarts, the kubelet simply admits all static pods, even ones that have been previously evicted (cc @yujuhong).
As a result, a static pod that was once evicted and is not expected to be running will be admitted, and a regular pod that was already running on the node might be booted from the node.
This behavior is highly undesirable. The kubelet can look at the "mirror pod" for a static pod and decide not to recreate previously evicted static pods, but this is only possible if the kubelet has access to the apiserver before it processes static pods; a rough sketch of such a check follows.
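This assumes kubelet–apiserver connectivity at startup, and the types and phase/reason strings below are assumptions for illustration, not the actual kubelet API:

```go
// Hypothetical sketch: before re-admitting a static pod after a kubelet
// restart, consult its mirror pod and skip pods the apiserver records as
// having been evicted.
package main

import "fmt"

type MirrorPodStatus struct {
	Found  bool
	Phase  string // e.g. "Running", "Failed"
	Reason string // e.g. "Evicted"
}

// shouldReadmitStaticPod returns false when the mirror pod shows the static
// pod was previously evicted, so the kubelet does not restart it.
func shouldReadmitStaticPod(status MirrorPodStatus) bool {
	if !status.Found {
		return true // no record in the apiserver; admit normally
	}
	if status.Phase == "Failed" && status.Reason == "Evicted" {
		return false
	}
	return true
}

func main() {
	evicted := MirrorPodStatus{Found: true, Phase: "Failed", Reason: "Evicted"}
	fresh := MirrorPodStatus{Found: false}
	fmt.Println(shouldReadmitStaticPod(evicted), shouldReadmitStaticPod(fresh)) // false true
}
```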
Instead of attempting to solve this problem, we will try to avoid using static pods as much as possible in production.
In addition, we need to update the kubelet to operate in only one of two modes, "local" or "remote". Once the kubelet switches to "remote" apiserver mode, it will stop accepting any "local" static (or http) pods. If the kubelet is explicitly switched back to "local" mode, it will evict all "remote"/unknown pods and only manage "local" pods. Upon switching from "local" to "remote", the kubelet will create mirror pod objects for "local" pods and then transfer control of those pods to the apiserver.
@dashpole ^^
FYI: @dchen1107 @thockin @davidopp @erictune @derekwaynecarr @smarterclayton