Avoid setting CPU limits for Guaranteed pods #51135

Closed
vishh opened this Issue Aug 22, 2017 · 73 comments

Comments

@vishh
Member

vishh commented Aug 22, 2017

The effect of CPU throttling is non-obvious to users and throws them off when they try out Kubernetes. It also complicates CPU capacity planning for pods. Pods that are carefully placed in the Guaranteed QoS class are typically of higher priority than other pods and are expected to get the best performance. On the other hand, Burstable pods may have to be limited in certain environments to avoid tail latencies for Guaranteed pods.
I propose we avoid setting CPU limits on Guaranteed pods by default and continue to impose CPU limits on Burstable pods. We can provide a config option to ease the rollout in existing clusters. Users may need time to handle this transition, and some of them may choose to use the existing policies.
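
For reference, the QoS class is derived purely from requests and limits. Here is a minimal sketch of the classification rules (a simplified restatement, not the kubelet's actual code):

```go
package main

import "fmt"

// qosClass is a simplified restatement of the QoS rules (not the kubelet's
// actual implementation): Guaranteed requires every container to set
// requests equal to limits for both CPU and memory.
func qosClass(requests, limits map[string]string) string {
	switch {
	case len(requests) == 0 && len(limits) == 0:
		return "BestEffort"
	case limits["cpu"] != "" && limits["memory"] != "" &&
		requests["cpu"] == limits["cpu"] &&
		requests["memory"] == limits["memory"]:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	// Guaranteed: under this proposal the kubelet would skip the CFS quota here.
	fmt.Println(qosClass(
		map[string]string{"cpu": "1", "memory": "1Gi"},
		map[string]string{"cpu": "1", "memory": "1Gi"},
	))

	// Burstable: any CPU limit that is set would still be enforced via quota.
	fmt.Println(qosClass(
		map[string]string{"cpu": "1", "memory": "1Gi"},
		map[string]string{"cpu": "2", "memory": "1Gi"},
	))
}
```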

Closely related are the static CPU pinning policies in the kubelet, which will naturally limit Guaranteed pods to specific cores if they request an integral number of cores. That feature may be combined with turning off CPU limiting for Guaranteed pods. That said, being able to configure each of these features independently will still be helpful.

@derekwaynecarr we have discussed this in the past, but AFAIK, an issue hasn't been filed thus far. @dchen1107 thoughts?

If there is general agreement that this is the intended direction longer term, I can post a patch to make CPU limit policies configurable.

@kubernetes/sig-node-feature-requests

@vishh

Member Author

vishh commented Aug 22, 2017

cc @thockin who had some thoughts on this feature.

@davidopp

Member

davidopp commented Aug 22, 2017

Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

BTW you mentioned that the current behavior throws users off. My naive assumption would be that someone sees the word "limit" and assumes that their pod will be "limited" to using that amount of CPU. So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

@vishh

Member Author

vishh commented Aug 22, 2017

> Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

There will be no CPU Quota limits (throttling) applied to Guaranteed pods. Essentially, their CPU limits will be ignored.

> I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

I think having a config per-node (or groups of nodes) may be better.

> So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

What users expect is that when they set the limit to 1 CPU, they can use 1 CPU continuously and no more than that. In reality, though, with CPU quota applied, a pod can use more than 1 CPU within a CPU scheduling slot, then get descheduled and receive no CPU for the rest of the slot. This means spikes are penalized very heavily, which is not what users expect. CPU quota may still be useful for scavenging pods, i.e. Burstable (Bu) and BestEffort (BE) pods.
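
To make the spike penalty concrete, here is the arithmetic with the default 100ms CFS period (illustrative numbers, not from a real workload):

```go
package main

import "fmt"

func main() {
	// Illustrative numbers, assuming the default 100ms CFS period.
	const (
		periodMs = 100.0 // cpu.cfs_period_us = 100ms
		quotaMs  = 100.0 // a limit of 1 CPU -> 100ms of CPU time per period
		threads  = 4.0   // a brief spike runs hot on 4 cores at once
	)

	// Four busy threads drain the whole period's quota in 25ms of wall time...
	burnMs := quotaMs / threads
	// ...and the cgroup is then throttled for the remaining 75ms of the period.
	throttledMs := periodMs - burnMs

	fmt.Printf("quota exhausted after %.0fms, throttled for %.0fms of every %.0fms period\n",
		burnMs, throttledMs, periodMs)
}
```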

@justinsb

Member

justinsb commented Sep 30, 2017

So I hit this one rather hard. I was setting CPU request == CPU limit so that critical system pods would be in the Guaranteed class and therefore less likely to be evicted. However, I observed that critical pods were being CPU throttled and that this was causing really bad behaviour in our cluster.

Looking at the manifests in cluster/, it seems that the recommended approach is:

  • Set CPU requests, don't set CPU limit (so we will be in burstable QoS)
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Enable the ExperimentalCriticalPodAnnotation feature and set scheduler.alpha.kubernetes.io/critical-pod on the pods, along with a toleration - typically [{"key":"CriticalAddonsOnly", "operator":"Exists"}], but [{"operator":"Exists"}] for something that should run on every node e.g. a networking daemonset.
  • Ideally run the rescheduler, but this is not required because ExperimentalCriticalPodAnnotation should prevent eviction (despite being in burstable class)
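
For concreteness, expressed with the Go API types, I read that as roughly the following (a sketch only; the image name and resource sizes are placeholders, and the ExperimentalCriticalPodAnnotation feature gate still has to be enabled on the cluster):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// CPU request but no CPU limit (Burstable), memory request == limit,
	// the critical-pod annotation, and a broad toleration.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "critical-addon",
			Annotations: map[string]string{
				"scheduler.alpha.kubernetes.io/critical-pod": "",
			},
		},
		Spec: corev1.PodSpec{
			Tolerations: []corev1.Toleration{
				// Use {Key: "CriticalAddonsOnly", Operator: Exists} for addons,
				// or tolerate everything for a per-node networking daemonset.
				{Operator: corev1.TolerationOpExists},
			},
			Containers: []corev1.Container{{
				Name:  "addon",
				Image: "example/addon:latest",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("128Mi"),
					},
					Limits: corev1.ResourceList{
						// Deliberately no CPU limit; memory limit == request.
						corev1.ResourceMemory: resource.MustParse("128Mi"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name)
}
```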

Is that correct?

On the direct topic of this particular issue, if that's correct, I think it would be confusing to have the CPU limits not applied when limit == request. IMO the bigger problem is that a desire to get into the Guaranteed class nudged me to set CPU limits, and that a desire not to give up too many resources across the cluster caused me to set those limits too low.

But what is the duration of a CPU scheduling time unit? (And should we recommend making it smaller to avoid the feast-then-famine scenario?)

@davidopp

Member

davidopp commented Sep 30, 2017

I think what you wrote is actually correct.

All of this should get a lot better once priority is fully implemented, both in the scheduler (alpha in 1.8) and on the kubelet (I assume in alpha in 1.9?). Then all of those steps should be reducible to

  • Set CPU requests, don't set CPU limit
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Set a high priority on your pod

justinsb added a commit to justinsb/kops that referenced this issue Nov 13, 2017

Fix CNI CPU allocations
* Limit each CNI provider to 100m

* Remove CPU limits - they cause serious problems
(kubernetes/kubernetes#51135), but this also
makes the CPU allocation less problematic.

* Bump versions and start introducing the `-kops.1` suffix preemptively.

* Upgrade flannel to 0.9.0 as it fixes a lot.

k8s-merge-robot added a commit to kubernetes/kops that referenced this issue Nov 13, 2017

Merge pull request #3844 from justinsb/fix_cpu_cni
Automatic merge from submit-queue.

Builds on #3843
@spiffxp

Member

spiffxp commented Nov 20, 2017

/remove-priority P2
/priority important-soon

@ConnorDoyle

Member

ConnorDoyle commented Nov 20, 2017

> Closely related are the static CPU pinning policies in the kubelet, which will naturally limit Guaranteed pods to specific cores if they request an integral number of cores. That feature may be combined with turning off CPU limiting for Guaranteed pods.

Adding context: the static CPU manager policy sets affinity for Guaranteed containers with integral CPU requests. Those CPUs are not available for use by any other Kubernetes container. Quota is meaningless for these pinned containers: while they have exclusive access to entire logical CPUs, they also cannot burst beyond their quota, because the amount of CPU time allowed equals full utilization of their assigned CPUs. Disabling quota for these containers should have no effect on performance.
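
A small illustration of that equality (illustrative arithmetic only, assuming the default 100ms period):

```go
package main

import "fmt"

func main() {
	// A container pinned to exclusive CPUs by the static CPU manager policy.
	const (
		periodUs      = 100000 // cpu.cfs_period_us, default 100ms
		exclusiveCPUs = 2      // integral request, e.g. cpu: "2" with request == limit
	)

	quotaUs := exclusiveCPUs * periodUs     // quota derived from the CPU limit
	maxUsableUs := exclusiveCPUs * periodUs // the most the pinned cpuset can consume per period

	fmt.Printf("cfs_quota_us=%d, max usable CPU time on the pinned cpuset=%dus\n",
		quotaUs, maxUsableUs)
	// The two are equal, so the quota can never actually throttle such a
	// container, and dropping it should not change performance.
}
```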

@fejta-bot


fejta-bot commented Feb 18, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@szuecs

Contributor

szuecs commented Mar 15, 2018

We also ran into this issue in production while investigating moving latency-critical applications to Kubernetes.
This is kind of a blocker for production clusters, because you effectively cannot set CPU limits on latency-critical applications. When we dropped the CPU limits from our ingress controller, the p99 latency dropped by a factor of 12-20x. The ingress controller consumes about 30m CPU on average (kubectl top pods), but even setting the limit to 1500m did not reduce the p99 very much; we only saw improvements of 2-3x.

When we dropped the CPU limit, container_cpu_cfs_throttled_seconds_total no longer appears, and the p99 and p999 latencies dropped from 60ms and 100ms to ~5ms.
We found a similar issue earlier in our kube-apiserver deployment.

[screenshot: p99/p999 latency graph before and after removing the CPU limit]

I believe the problem is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/helpers_linux.go#L59, which is hard-coded and cannot be changed.
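
For reference, the conversion around that line looks roughly like this (a simplified sketch; the exact constant and function names in helpers_linux.go may differ, but the hard-coded 100ms period is the point):

```go
package main

import "fmt"

const (
	milliCPUToCPU = 1000
	quotaPeriodUs = 100000 // the hard-coded 100ms period referenced above
	minQuotaUs    = 1000   // quota is clamped to a minimum of 1ms
)

// milliCPUToQuota scales a CPU limit (in milliCPU) over the fixed 100ms
// period to produce cpu.cfs_quota_us.
func milliCPUToQuota(milliCPU int64) (quotaUs, periodUs int64) {
	periodUs = quotaPeriodUs
	if milliCPU == 0 {
		return 0, periodUs
	}
	quotaUs = milliCPU * quotaPeriodUs / milliCPUToCPU
	if quotaUs < minQuotaUs {
		quotaUs = minQuotaUs
	}
	return quotaUs, periodUs
}

func main() {
	q, p := milliCPUToQuota(1500) // e.g. the 1500m limit from the ingress example above
	fmt.Printf("cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n", q, p) // 150000 / 100000
}
```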

@obeattie


obeattie commented Mar 17, 2018

In case it’s interesting, we implemented the ability to customise this period in our fork.

@szuecs

Contributor

szuecs commented Mar 19, 2018

@obeattie Thanks for sharing!

Did you already open a PR?
I would like to add my +1 to getting this upstream.

@obeattie


obeattie commented Mar 19, 2018

@szuecs No, I'm afraid not. I discussed it on Slack with some of the k8s maintainers a while ago, and I think the conclusion was that it wouldn't fly upstream. I don't recall whether this was due to a specific implementation issue or whether, ideologically, it wouldn't be useful to enough users to be a feature of mainline k8s.

We're certainly not opposed to contributing this code, or trying to get this merged in, if others would find this useful.

@szuecs

Contributor

szuecs commented Mar 19, 2018

@spiffxp @justinsb can you clarify this please?
I really think it's a production problem if you cannot change the period, as Monzo / @obeattie have already done in their fork.
It makes CPU limits useless for most microservices: many of them sit in an application's critical path, so teams end up not specifying CPU limits at all, which can cause cluster instability.

@fejta-bot


fejta-bot commented Apr 22, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

szuecs added a commit to szuecs/kubernetes that referenced this issue Aug 1, 2018

fix kubernetes#51135 make CFS quota period configurable: adds a CLI flag and config option to the kubelet to be able to set cpu.cfs_period; defaults to 100ms as before.

It requires enabling the CustomCPUCFSQuotaPeriod feature gate.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>

@bobrik


bobrik commented Aug 21, 2018

There's a bug in CFS throttling that affects well-behaved applications the most; it's not just a CFS period configuration tradeoff. In #67577 I linked the upstream kernel bug, a reproduction, and the possible fix.

Hope that helps.

k8s-merge-robot added a commit that referenced this issue Sep 2, 2018

Merge pull request #63437 from szuecs/fix/51135-set-saneer-default-cpu.cfs_period

Automatic merge from submit-queue (batch tested with PRs 63437, 68081). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

fix #51135 make CFS quota period configurable

**What this PR does / why we need it**:

This PR makes it possible for users to change the CFS quota period from the default 100ms to some other value between 1µs and 1s.
#51135 shows that multiple production users have serious issues running reasonable workloads in Kubernetes. The 100ms CFS quota period adds far too much latency.

**Which issue(s) this PR fixes**:
Fixes #51135 

**Special notes for your reviewer**:
- 5ms is used by user experience #51135 (comment)
- Latency added caused by CFS 100ms is shown at #51135 (comment)
- explanation why we should not disable limits #51135 (comment)
- agreement found at kubecon EU 2018: #51135 (comment)

**Release note**:
```release-note
Adds a kubelet parameter and config option to change CFS quota period from the default 100ms to some other value between 1µs and 1s. This was done to improve response latencies for workloads running in clusters with guaranteed and burstable QoS classes.  
```

JoelSpeed added a commit to pusher/kubernetes that referenced this issue Sep 5, 2018

fix kubernetes#51135 make CFS quota period configurable: adds a CLI flag and config option to the kubelet to be able to set cpu.cfs_period; defaults to 100ms as before.

It requires enabling the CustomCPUCFSQuotaPeriod feature gate.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>

@nrobert13


nrobert13 commented Jan 14, 2019

> There's a bug in CFS throttling that affects well-behaved applications the most; it's not just a CFS period configuration tradeoff. In #67577 I linked the upstream kernel bug, a reproduction, and the possible fix.
>
> Hope that helps.

@bobrik, I came across your gist regarding this CFS issue. I commented on the gist, but I'm not sure whether that sends out any notifications, so I'll post it here as well.

I'm also trying to understand this issue, and I'm not sure how it's possible to run cfs.go with cfs-quota 4000 and see throttling in less than 50% of the iterations. If I understand it right, cfs-quota 4000 is 4ms, so trying to execute the 5ms burn within a 4ms quota should always throttle... what am I missing here?
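
Not the actual gist, but a minimal sketch of the kind of burn-and-measure loop being discussed, to make the question concrete (run it inside a cgroup with cpu.cfs_quota_us=4000 and cpu.cfs_period_us=100000; the work count is a placeholder you would tune to ~5ms of CPU on your machine):

```go
package main

import (
	"fmt"
	"time"
)

// burn does a fixed amount of arithmetic work. The real gist calibrates this
// so that one call costs roughly 5ms of CPU time on the test machine.
func burn(n int) uint64 {
	var x uint64 = 1
	for i := 0; i < n; i++ {
		x = x*2862933555777941757 + 3037000493 // cheap LCG step to keep the CPU busy
	}
	return x
}

func main() {
	const (
		iterations  = 20
		workPerBurn = 5000000 // placeholder: tune so one burn() is ~5ms of CPU
	)

	var sink uint64
	for i := 0; i < iterations; i++ {
		start := time.Now()
		sink += burn(workPerBurn)
		elapsed := time.Since(start)
		// With cfs_quota_us=4000 a ~5ms burn cannot finish within one 100ms
		// period, so elapsed should jump towards ~100ms whenever the quota binds.
		fmt.Printf("iteration %2d took %v\n", i, elapsed)
	}
	fmt.Println("sink:", sink) // keep the work from being optimized away
}
```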
