Avoid setting CPU limits for Guaranteed pods #51135

Closed
vishh opened this issue Aug 22, 2017 · 78 comments · Fixed by #63437
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@vishh
Contributor

vishh commented Aug 22, 2017

The effect of CPU throttling is non-obvious to users and throws them off when they try out Kubernetes. It also complicates CPU capacity planning for pods. Pods that are carefully placed in the Guaranteed QoS class are typically of higher priority than other pods, and these pods are expected to get the best performance. On the other hand, Burstable pods may have to be limited in certain environments to avoid tail latencies for Guaranteed pods.
I propose we avoid setting CPU limits on Guaranteed pods by default and continue to impose CPU limits on Burstable pods. We can provide a config option to ease the rollout in existing clusters. Users may need time to handle this transition, and some of them may choose to use the existing policies.

Closely related are the static CPU pinning policies in the kubelet, which will naturally limit Guaranteed pods to specific cores if they request an integral number of cores. This feature may be combined with turning off CPU limiting for Guaranteed pods. That said, having the ability to configure each of these features independently will still be helpful.

@derekwaynecarr we have discussed this in the past, but AFAIK, an issue hasn't been filed thus far. @dchen1107 thoughts?

If there is a general agreement that this is the intended direction longer term, I can post a patch to make CPU limit policies configurable.

@kubernetes/sig-node-feature-requests

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 22, 2017
@vishh
Contributor Author

vishh commented Aug 22, 2017

cc @thockin who had some thoughts on this feature.

@davidopp
Member

Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

BTW you mentioned that the current behavior throws users off. My naive assumption would be that someone sees the word "limit" and assumes that their pod will be "limited" to using that amount of CPU. So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

@vishh
Contributor Author

vishh commented Aug 22, 2017

Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

There will be no CPU Quota limits (throttling) applied to Guaranteed pods. Essentially, their CPU limits will be ignored.

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

I think having a config per-node (or groups of nodes) may be better.

So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

What users expect is that when they set the limit to 1 CPU, they can use up to 1 CPU continuously and no more than that. In reality, though, with CPU quota applied, within a single CPU scheduling period a pod can burn through more than 1 CPU's worth of time, get descheduled, and then receive no CPU for the rest of the period. This means spikes are penalized very heavily, which is not what users expect. CPU quota may be useful for scavenging pods like Burstable (Bu) and BestEffort (BE) pods.
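
To make the spike penalty concrete, here is a minimal sketch (illustrative numbers only, not kubelet code) of how quota accounting plays out within a single period for a pod limited to 1 CPU:

```go
package main

import "fmt"

func main() {
	// Illustrative values: a limit of 1 CPU maps to 100ms of CPU time
	// (cfs_quota_us) per 100ms period (cfs_period_us, the default).
	const (
		periodMs = 100.0
		quotaMs  = 100.0
		threads  = 4 // the pod briefly runs 4 busy threads (a spike)
	)

	// With 4 runnable threads, quota is consumed 4x faster than wall-clock
	// time, so the whole 100ms allotment is gone after 25ms...
	runnableMs := quotaMs / threads
	// ...and the pod is throttled for the remaining 75ms of the period,
	// even though it averages only 1 CPU over the full period.
	throttledMs := periodMs - runnableMs

	fmt.Printf("runnable %.0fms, throttled %.0fms of each %.0fms period\n",
		runnableMs, throttledMs, periodMs)
}
```

Any request that arrives during the throttled window simply waits for the next period to start, which is where the tail latency comes from.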

@justinsb
Member

So I hit this one rather hard. I was setting CPU request == CPU limits, so that we would be in guaranteed class for critical system pods, so that those pods would be less likely to be evicted. However, I observed that critical pods were being CPU throttled and that this was causing really bad behaviour in our cluster.

Looking at the manifests in cluster/, it seems that the recommended approach is:

  • Set CPU requests, don't set CPU limit (so we will be in burstable QoS)
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Enable the ExperimentalCriticalPodAnnotation feature and set scheduler.alpha.kubernetes.io/critical-pod on the pods, along with a toleration - typically [{"key":"CriticalAddonsOnly", "operator":"Exists"}], but [{"operator":"Exists"}] for something that should run on every node e.g. a networking daemonset.
  • Ideally run the rescheduler, but this is not required because ExperimentalCriticalPodAnnotation should prevent eviction (despite being in burstable class)

Is that correct?
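
For concreteness, a rough sketch of that shape using the Go client types (the annotation and toleration come from the list above; the image and resource quantities are placeholders):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a critical system pod per the list above: CPU request but no CPU
// limit (Burstable QoS), memory limit == request, the critical-pod annotation,
// and a CriticalAddonsOnly toleration. Image and quantities are illustrative.
var criticalPod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name: "critical-addon",
		Annotations: map[string]string{
			"scheduler.alpha.kubernetes.io/critical-pod": "",
		},
	},
	Spec: corev1.PodSpec{
		Tolerations: []corev1.Toleration{
			{Key: "CriticalAddonsOnly", Operator: corev1.TolerationOpExists},
		},
		Containers: []corev1.Container{{
			Name:  "addon",
			Image: "example.com/addon:latest", // placeholder
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
				Limits: corev1.ResourceList{
					// No CPU limit; memory limit == request.
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
			},
		}},
	},
}

func main() { _ = criticalPod }
```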

On the direct topic of this particular issue, if that's correct, I think it would be confusing to have the CPU limits not applied when limit == request. IMO the bigger problem is that a desire to get into guaranteed class nudged me to set CPU limits, and that a desire not to give up too many resources across the cluster caused me to set those limits too low.

But what is the duration of a CPU scheduling time unit? (And should we recommend making it smaller to avoid the feast-then-famine scenario?)

@davidopp
Member

I think what you wrote is actually correct.

All of this should get a lot better once priority is fully implemented, both in the scheduler (alpha in 1.8) and on the kubelet (I assume in alpha in 1.9?). Then all of those steps should be reducible to

  • Set CPU requests, don't set CPU limit
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Set a high priority on your pod

justinsb added a commit to justinsb/kops that referenced this issue Nov 13, 2017
* Limit each CNI provider to 100m

* Remove CPU limits - they cause serious problems
(kubernetes/kubernetes#51135), but this also
makes the CPU allocation less problematic.

* Bump versions and start introducing the `-kops.1` suffix preemptively.

* Upgrade flannel to 0.9.0 as it fixes a lot.
k8s-github-robot pushed a commit to kubernetes/kops that referenced this issue Nov 13, 2017
Automatic merge from submit-queue.

Fix CNI CPU allocations

* Limit each CNI provider to 100m

* Remove CPU limits - they cause serious problems
(kubernetes/kubernetes#51135), but this also
makes the CPU allocation less problematic.

* Bump versions and start introducing the `-kops.1` suffix preemptively.

* Upgrade flannel to 0.9.0 as it fixes a lot.

Builds on #3843
@spiffxp
Member

spiffxp commented Nov 20, 2017

/remove-priority P2
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/P2 labels Nov 20, 2017
@ConnorDoyle
Contributor

Closely related is static CPU pinning policies in kubelet which will naturally limit Guaranteed pods to specific cores if they request integral number of cores. This feature may be combined with turning off CPU limiting for Guaranteed pods.

Adding context: the static CPU manager policy sets affinity for guaranteed integral-CPU-request containers. Those CPUs are not available for use by any other Kubernetes container. Quota is meaningless for these pinned containers. While these containers have exclusive access to the entirety of their assigned logical CPUs, they also cannot burst beyond their quota, because the amount of CPU time allowed is equal to full utilization of their assigned CPUs. Disabling quota for these containers should have no effect on performance.
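
A back-of-the-envelope illustration (illustrative numbers, not CPU manager code) of why the quota can never bind for a pinned container:

```go
package main

import "fmt"

func main() {
	// A Guaranteed container requesting an integral 2 CPUs is pinned to 2
	// exclusive cores by the static CPU manager policy.
	const (
		pinnedCPUs = 2
		periodMs   = 100 // default CFS period
	)

	// Its limit equals its request, so its quota is 2 CPUs' worth of time
	// per period...
	quotaMs := pinnedCPUs * periodMs
	// ...which is also the most CPU time its 2 assigned cores can physically
	// deliver in one period. Quota == maximum possible usage, so CFS
	// throttling can never trigger, and dropping the quota changes nothing.
	maxUsableMs := pinnedCPUs * periodMs

	fmt.Printf("quota=%dms, max usable=%dms per %dms period\n",
		quotaMs, maxUsableMs, periodMs)
}
```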

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2018
@szuecs
Member

szuecs commented Mar 15, 2018

We also ran into this issue in production while starting to investigate moving latency-critical applications to Kubernetes.
This is kind of a blocker for production clusters, because you cannot set CPU limits on latency-critical applications. When we dropped the CPU limits from our ingress controller, we got 12-20x lower p99 latency. The ingress controller consumes about 30m CPU on average (kubectl top pods), but even setting the limit to 1500m did not drop the p99 by much; we only got improvements of 2-3x.

After we dropped the CPU limit, container_cpu_cfs_throttled_seconds_total stopped appearing, and the p99 and p999 latencies dropped from 60ms and 100ms to ~5ms.
We found a similar issue earlier in our kube-apiserver deployment.


I believe the problem is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/helpers_linux.go#L59, which is hard-coded and cannot be changed.
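
For reference, the conversion behind that line looks roughly like the following (a paraphrased sketch; the exact constant and function names may differ between versions). The period is a fixed constant, and only the quota scales with the limit:

```go
package main

import "fmt"

// Paraphrased sketch of the kubelet's millicpu-to-CFS-quota conversion;
// constant and function names are approximate, not the actual source.
const (
	quotaPeriod    = 100000 // cfs_period_us: hard-coded to 100ms
	minQuotaPeriod = 1000   // smallest cfs_quota_us handed to the kernel (1ms)
	milliCPUToCPU  = 1000   // millicores per CPU
)

func milliCPUToQuota(milliCPU int64) (quota, period int64) {
	period = quotaPeriod
	quota = (milliCPU * quotaPeriod) / milliCPUToCPU
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}
	return quota, period
}

func main() {
	// A 1500m limit becomes 150ms of CPU time per fixed 100ms period.
	quota, period := milliCPUToQuota(1500)
	fmt.Printf("cfs_quota_us=%d cfs_period_us=%d\n", quota, period)
}
```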

@obeattie

In case it’s interesting, we implemented the ability to customise this period in our fork.

@szuecs
Member

szuecs commented Mar 19, 2018

@obeattie Thanks for sharing!

Did you open a PR already?
I would like to add my +1 to making this happen upstream.

@obeattie

obeattie commented Mar 19, 2018

@szuecs No, I'm afraid not. I discussed it on Slack with some of the k8s maintainers a while ago, and I think the conclusion was that it wouldn't fly upstream. I don't recall whether this was due to a specific implementation issue or whether, ideologically, it wouldn't be useful to enough users to be a feature of mainline k8s.

We're certainly not opposed to contributing this code, or trying to get this merged in, if others would find this useful.

@szuecs
Member

szuecs commented Mar 19, 2018

@spiffxp @justinsb can you clarify this please?
I really think it's a production problem if you cannot change the period, as Monzo / @obeattie have already done.
It really makes CPU limits useless for most microservices, because many of them sit in an application's critical path; this pushes people to not specify CPU limits at all, which can cause cluster instability.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 22, 2018
JoelSpeed pushed a commit to pusher/kubernetes that referenced this issue Dec 3, 2018
…lag and config option to kubelet to be able to set cpu.cfs_period and defaults to 100ms as before.

It requires to enable feature gate CustomCPUCFSQuotaPeriod.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@nrobert13

There's a bug in CFS throttling that affects well-behaved applications the most; it's not just a CFS period configuration tradeoff. In #67577 I linked the upstream kernel bug, a reproduction, and a possible fix.

Hope that helps.

@bobrik, I came across your gist regarding this CFS issue. I commented on the gist, but I'm not sure it sends out any notifications, so I'll post it here as well.

I'm also trying to understand this issue, and I'm not sure how it is possible to run this cfs.go with cfs-quota 4000 and get throttling in less than 50% of the iterations. If I understand it right, cfs-quota 4000 is 4ms, so it is basically trying to execute a 5ms burn in 4ms and should always throttle... what am I missing here?

milesbxf added a commit to monzo/kubernetes that referenced this issue Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

Squashed commits:

commit 61551b0
Merge: a446c68 de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 16:16:17 2019 +0000

    Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period

    Register nodes with monzo.com/cpu-period resource

commit de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 15:14:58 2019 +0000

    Register nodes with monzo.com/cpu-period resource

    We have a custom pod resource which allows tuning the CPU throttling
    period. Upgrading to 1.9 causes this to break scheduling logic, as the
    scheduler and pod preemption controller takes this resource into account
    when deciding where to place pods, and which pods to pre-empt.

    This patches the kubelet so that it registers its node with a large
    amount of this resource - 10000 * max number of pods (default 110). We
    typically run pods with this set to 5000, so this should be plenty.

commit a446c68
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Tue Jan 29 16:43:03 2019 +0000

    ResourceConfig.CpuPeriod is now uint64, not int64

    Some changes to upstream dependencies between v1.7 and v1.9 mean that
    the CpuPeriod field of ResourceConfig has changed type; unfortunately
    this means the Monzo CFS period patch doesn't compile.

    This won't change behaviour at all - the apiserver already validates
    that `monzo.com/cpu-period` can never be negative. The only edge case is
    if someone sets it to higher than the int64 positive bound (this will
    result in an overflow), but I don't think this is worth mitigating

commit 1ead2d6
Author: Oliver Beattie <oliver@obeattie.com>
Date:   Wed Aug 9 22:57:53 2017 +0100

    [PLAT-713] Allow the CFS period to be tuned for containers
@cyrus-mc

I know this issue has been dormant for a while, but I did run into it and my research led me here. Having spent some time reading through all the comments, I want to try and summarize my understanding and some gaps that I still have.

Take the example of a container with a limit of 10% of a CPU. This results in a cpu.cfs_quota of 10ms when the period is set to 100ms. If this application is scheduled and uses up all 10ms of its allotment early in the period, it will then sit idle for the remaining 90ms of that time slot. For a processor-bound application this probably isn't a big deal, but for an I/O-bound one this would introduce latency.

If, on the other hand, this application uses its 10ms allotment evenly over the period (run 1ms, sleep 9ms, run 1ms, sleep 9ms, ...), this latency wouldn't be seen.

To me it really boils down to expressing things in terms of a percentage of CPU. As we all know, an application can't use a percentage of a CPU; at any instant it is either using a CPU or it isn't (even when you specify 10% CPU, at any point in time the application is actually using a whole CPU). In both examples above, the application used 10% of the CPU over the period.

@szuecs

This is kind of a blocker for production clusters, because you cannot set CPU limits on latency-critical applications. When we dropped the CPU limits from our ingress controller, we got 12-20x lower p99 latency. The ingress controller consumes about 30m CPU on average (kubectl top pods), but even setting the limit to 1500m did not drop the p99 by much; we only got improvements of 2-3x.

Where I am confused is this situation. If the application is truly only using 30m, then setting the limit to 1500m should mean it always gets CPU time during the time slot. How was that 30m value obtained? I would think the observation window that value came from would have to match cpu.cfs_period_us.

After reading through all of this, I am leaning towards those who feel that setting limits is not going to accomplish what most people think, and that the better approach is to set requests adequately, since in the end, during contention, CPU shares are what control which pods get CPU.
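
For completeness, requests map to cpu.shares roughly like this (a paraphrased sketch; names are approximate), and shares only matter relative to other cgroups when there is contention:

```go
package main

import "fmt"

// Paraphrased sketch of how the kubelet maps a CPU request to cpu.shares;
// constant and function names are approximate, not the actual source.
const (
	sharesPerCPU  = 1024 // kernel default cpu.shares for one CPU
	minShares     = 2    // kernel-imposed minimum
	milliCPUToCPU = 1000
)

func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// Shares are purely relative: under contention a container requesting
	// 1500m gets ~50x the CPU time of one requesting 30m, but without a
	// quota neither is ever throttled while the node has idle CPU.
	fmt.Println(milliCPUToShares(30), milliCPUToShares(1500)) // 30 1536
}
```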

@szuecs
Member

szuecs commented Mar 20, 2019

@cyrus-mc I think we may have been hit by a kernel bug, or node overprovisioning changed the scheduling time.
You can now use an alpha feature gate to set a different cfs_period per node and make it smaller than 100ms, which could solve your problem. Use node pools to separate workloads in case you need different cfs_period settings for different workloads.

@AtlanCI

AtlanCI commented May 12, 2021

Is this problem fixed now? I saw that it was fixed by the kernel team, but I didn't find the relevant information.

@paskal
Contributor

paskal commented Jul 21, 2022

I'm not sure CPU limits are in milliseconds in k8s, because in the Linux kernel, despite the "ms" naming, they are in microseconds. I wrote up a bunch of questions I have here; I hope someone can help me clarify it.

@szuecs
Member

szuecs commented Jul 22, 2022

Yes, CPU resource settings are in millicores.
