Avoid setting CPU limits for Guaranteed pods #51135

Closed
vishh opened this issue Aug 22, 2017 · 78 comments · Fixed by #63437
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@vishh
Contributor

vishh commented Aug 22, 2017

The effect of CPU throttling is non-obvious to users and throws them off when they try out Kubernetes. It also complicates CPU capacity planning for pods. Pods that are carefully placed in the Guaranteed QoS class are typically of higher priority than other pods, and these pods are expected to get the best performance. On the other hand, Burstable pods may have to be limited in certain environments to avoid tail latencies for Guaranteed pods.
I propose we avoid setting CPU limits on Guaranteed pods by default and continue to impose CPU limits on Burstable pods. We can provide a config option to ease the rollout in existing clusters. Users may need time to handle this transition, and some of them may choose to use the existing policies.

Closely related are the static CPU pinning policies in the kubelet, which will naturally limit Guaranteed pods to specific cores if they request an integral number of cores. This feature may be combined with turning off CPU limiting for Guaranteed pods. That said, having the ability to configure each of these features independently will still be helpful.

@derekwaynecarr we have discussed this in the past, but AFAIK, an issue hasn't been filed thus far. @dchen1107 thoughts?

If there is a general agreement that this is the intended direction longer term, I can post a patch to make CPU limit policies configurable.

@kubernetes/sig-node-feature-requests

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 22, 2017
@vishh
Contributor Author

vishh commented Aug 22, 2017

cc @thockin who had some thoughts on this feature.

@davidopp
Member

Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

BTW you mentioned that the current behavior throws users off. My naive assumption would be that someone sees the word "limit" and assumes that their pod will be "limited" to using that amount of CPU. So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

@vishh
Contributor Author

vishh commented Aug 22, 2017

Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set Limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the Limit?

There will be no CPU Quota limits (throttling) applied to Guaranteed pods. Essentially, their CPU limits will be ignored.

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

I think having a config per-node (or groups of nodes) may be better.

So I think the current behavior is more intuitive than the alternative. That said I definitely see the argument for making it configurable.

What users expect is that when they set the limit to 1 CPU, they can use up to 1 CPU continuously and no more than that. In reality, though, with CPU quota applied, within a single CPU scheduling period a pod can burn through more than 1 CPU's worth of time, get descheduled, and then receive no CPU for the rest of the period. This means spikes are penalized very heavily, which is not what users expect. CPU quota may be useful for scavenging pods like Burstable (Bu) and BestEffort (BE) pods.
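
To make the spike penalty concrete, here is a minimal sketch (illustrative numbers only, not kubelet code) of how quota accounting plays out within a single period for a pod limited to 1 CPU:

```go
package main

import "fmt"

func main() {
	// Illustrative values: a limit of 1 CPU maps to 100ms of CPU time
	// (cfs_quota_us) per 100ms period (cfs_period_us, the default).
	const (
		periodMs = 100.0
		quotaMs  = 100.0
		threads  = 4 // the pod briefly runs 4 busy threads (a spike)
	)

	// With 4 runnable threads, quota is consumed 4x faster than wall-clock
	// time, so the whole 100ms allotment is gone after 25ms...
	runnableMs := quotaMs / threads
	// ...and the pod is throttled for the remaining 75ms of the period,
	// even though it averages only 1 CPU over the full period.
	throttledMs := periodMs - runnableMs

	fmt.Printf("runnable %.0fms, throttled %.0fms of each %.0fms period\n",
		runnableMs, throttledMs, periodMs)
}
```

Any request that arrives during the throttled window simply waits for the next period to start, which is where the tail latency comes from.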

@justinsb
Member

So I hit this one rather hard. I was setting CPU request == CPU limits, so that we would be in guaranteed class for critical system pods, so that those pods would be less likely to be evicted. However, I observed that critical pods were being CPU throttled and that this was causing really bad behaviour in our cluster.

Looking at the manifests in cluster/, it seems that the recommended approach is:

  • Set CPU requests, don't set CPU limit (so we will be in burstable QoS)
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Enable the ExperimentalCriticalPodAnnotation feature and set scheduler.alpha.kubernetes.io/critical-pod on the pods, along with a toleration - typically [{"key":"CriticalAddonsOnly", "operator":"Exists"}], but [{"operator":"Exists"}] for something that should run on every node e.g. a networking daemonset.
  • Ideally run the rescheduler, but this is not required because ExperimentalCriticalPodAnnotation should prevent eviction (despite being in burstable class)

Is that correct?
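
For concreteness, a rough sketch of that shape using the Go client types (the annotation and toleration come from the list above; the image and resource quantities are placeholders):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a critical system pod per the list above: CPU request but no CPU
// limit (Burstable QoS), memory limit == request, the critical-pod annotation,
// and a CriticalAddonsOnly toleration. Image and quantities are illustrative.
var criticalPod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name: "critical-addon",
		Annotations: map[string]string{
			"scheduler.alpha.kubernetes.io/critical-pod": "",
		},
	},
	Spec: corev1.PodSpec{
		Tolerations: []corev1.Toleration{
			{Key: "CriticalAddonsOnly", Operator: corev1.TolerationOpExists},
		},
		Containers: []corev1.Container{{
			Name:  "addon",
			Image: "example.com/addon:latest", // placeholder
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
				Limits: corev1.ResourceList{
					// No CPU limit; memory limit == request.
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
			},
		}},
	},
}

func main() { _ = criticalPod }
```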

On the direct topic of this particular issue, if that's correct, I think it would be confusing to have the CPU limits not applied when limit == request. IMO the bigger problem is that a desire to get into guaranteed class nudged me to set CPU limits, and that a desire not to give up too many resources across the cluster caused me to set those limits too low.

But what is the duration of a CPU scheduling time unit? (And should we recommend making it smaller to avoid the feast-then-famine scenario?)

@davidopp
Member

I think what you wrote is actually correct.

All of this should get a lot better once priority is fully implemented, both in the scheduler (alpha in 1.8) and on the kubelet (I assume in alpha in 1.9?). Then all of those steps should be reducible to

  • Set CPU requests, don't set CPU limit
  • Still OK to set memory limit == request if we want to protect against a memory leak - we will still be in burstable QoS.
  • Set a high priority on your pod

justinsb added a commit to justinsb/kops that referenced this issue Nov 13, 2017
* Limit each CNI provider to 100m

* Remove CPU limits - they cause serious problems
(kubernetes/kubernetes#51135), but this also
makes the CPU allocation less problematic.

* Bump versions and start introducing the `-kops.1` suffix preemptively.

* Upgrade flannel to 0.9.0 as it fixes a lot.
k8s-github-robot pushed a commit to kubernetes/kops that referenced this issue Nov 13, 2017
Automatic merge from submit-queue.

Fix CNI CPU allocations

* Limit each CNI provider to 100m

* Remove CPU limits - they cause serious problems
(kubernetes/kubernetes#51135), but this also
makes the CPU allocation less problematic.

* Bump versions and start introducing the `-kops.1` suffix preemptively.

* Upgrade flannel to 0.9.0 as it fixes a lot.

Builds on #3843
@spiffxp
Member

spiffxp commented Nov 20, 2017

/remove-priority P2
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/P2 labels Nov 20, 2017
@ConnorDoyle
Contributor

Closely related is static CPU pinning policies in kubelet which will naturally limit Guaranteed pods to specific cores if they request integral number of cores. This feature may be combined with turning off CPU limiting for Guaranteed pods.

Adding context: the static CPU manager policy sets affinity for guaranteed integral-CPU-request containers. Those CPUs are not available for use by any other Kubernetes container. Quota is meaningless for these pinned containers. While these containers have exclusive access to the entirety of their assigned logical CPUs, they also cannot burst beyond their quota, because the amount of CPU time allowed is equal to full utilization of their assigned CPUs. Disabling quota for these containers should have no effect on performance.
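
A back-of-the-envelope illustration (illustrative numbers, not CPU manager code) of why the quota can never bind for a pinned container:

```go
package main

import "fmt"

func main() {
	// A Guaranteed container requesting an integral 2 CPUs is pinned to 2
	// exclusive cores by the static CPU manager policy.
	const (
		pinnedCPUs = 2
		periodMs   = 100 // default CFS period
	)

	// Its limit equals its request, so its quota is 2 CPUs' worth of time
	// per period...
	quotaMs := pinnedCPUs * periodMs
	// ...which is also the most CPU time its 2 assigned cores can physically
	// deliver in one period. Quota == maximum possible usage, so CFS
	// throttling can never trigger, and dropping the quota changes nothing.
	maxUsableMs := pinnedCPUs * periodMs

	fmt.Printf("quota=%dms, max usable=%dms per %dms period\n",
		quotaMs, maxUsableMs, periodMs)
}
```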

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2018
@szuecs
Member

szuecs commented Mar 15, 2018

We also ran into this issue in production while starting to investigate moving latency-critical applications to Kubernetes.
This is kind of a blocker for production clusters, because you cannot set CPU limits on latency-critical applications. When we dropped the CPU limits from our ingress controller, we got 12-20x lower p99 latency. The ingress controller consumes about 30m CPU on average (kubectl top pods), but even setting the limit to 1500m did not drop the p99 by much; we only got improvements of 2-3x.

After we dropped the CPU limit, container_cpu_cfs_throttled_seconds_total stopped appearing, and the p99 and p999 latencies dropped from 60ms and 100ms to ~5ms.
We found a similar issue earlier in our kube-apiserver deployment.


I believe the problem is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/helpers_linux.go#L59, which is hard-coded and cannot be changed.
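
For reference, the conversion behind that line looks roughly like the following (a paraphrased sketch; the exact constant and function names may differ between versions). The period is a fixed constant, and only the quota scales with the limit:

```go
package main

import "fmt"

// Paraphrased sketch of the kubelet's millicpu-to-CFS-quota conversion;
// constant and function names are approximate, not the actual source.
const (
	quotaPeriod    = 100000 // cfs_period_us: hard-coded to 100ms
	minQuotaPeriod = 1000   // smallest cfs_quota_us handed to the kernel (1ms)
	milliCPUToCPU  = 1000   // millicores per CPU
)

func milliCPUToQuota(milliCPU int64) (quota, period int64) {
	period = quotaPeriod
	quota = (milliCPU * quotaPeriod) / milliCPUToCPU
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}
	return quota, period
}

func main() {
	// A 1500m limit becomes 150ms of CPU time per fixed 100ms period.
	quota, period := milliCPUToQuota(1500)
	fmt.Printf("cfs_quota_us=%d cfs_period_us=%d\n", quota, period)
}
```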

@obeattie

In case it’s interesting, we implemented the ability to customise this period in our fork.

@szuecs
Member

szuecs commented Mar 19, 2018

@obeattie Thanks for sharing!

Did you open a PR already?
I would like to add my +1 to making this happen upstream.

@obeattie

obeattie commented Mar 19, 2018

@szuecs No, I'm afraid not. I discussed it on Slack with some of the k8s maintainers a while ago, and I think the conclusion was that it wouldn't fly upstream. I don't recall whether this was due to a specific implementation issue or whether, ideologically, it wouldn't be useful to enough users to be a feature of mainline k8s.

We're certainly not opposed to contributing this code, or trying to get this merged in, if others would find this useful.

@szuecs
Member

szuecs commented Mar 19, 2018

@spiffxp @justinsb can you clarify this please?
I really think it's a production problem if you cannot change the period, as Monzo / @obeattie have already done.
It really makes CPU limits useless for most microservices, because many of them sit in an application's critical path; this pushes people to not specify CPU limits at all, which can cause cluster instability.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 22, 2018
JoelSpeed pushed a commit to pusher/kubernetes that referenced this issue Dec 3, 2018
…lag and config option to kubelet to be able to set cpu.cfs_period and defaults to 100ms as before.

It requires to enable feature gate CustomCPUCFSQuotaPeriod.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@nrobert13

There's a bug in CFS throttling that affects well-behaved applications the most; it's not just a CFS period configuration tradeoff. In #67577 I linked the upstream kernel bug, a reproduction, and a possible fix.

Hope that helps.

@bobrik, I came across your gist regarding this CFS issue. I commented on the gist, but I'm not sure it sends out any notifications, so I'll post it here as well.

I'm also trying to understand this issue, and I'm not sure how it is possible to run this cfs.go with cfs-quota 4000 and get throttling in less than 50% of the iterations. If I understand it right, cfs-quota 4000 is 4ms, so it is basically trying to execute a 5ms burn in 4ms and should always throttle... what am I missing here?

milesbxf added a commit to monzo/kubernetes that referenced this issue Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

Squashed commits:

commit 61551b0
Merge: a446c68 de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 16:16:17 2019 +0000

    Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period

    Register nodes with monzo.com/cpu-period resource

commit de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 15:14:58 2019 +0000

    Register nodes with monzo.com/cpu-period resource

    We have a custom pod resource which allows tuning the CPU throttling
    period. Upgrading to 1.9 causes this to break scheduling logic, as the
    scheduler and pod preemption controller takes this resource into account
    when deciding where to place pods, and which pods to pre-empt.

    This patches the kubelet so that it registers its node with a large
    amount of this resource - 10000 * max number of pods (default 110). We
    typically run pods with this set to 5000, so this should be plenty.

commit a446c68
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Tue Jan 29 16:43:03 2019 +0000

    ResourceConfig.CpuPeriod is now uint64, not int64

    Some changes to upstream dependencies between v1.7 and v1.9 mean that
    the CpuPeriod field of ResourceConfig has changed type; unfortunately
    this means the Monzo CFS period patch doesn't compile.

    This won't change behaviour at all - the apiserver already validates
    that `monzo.com/cpu-period` can never be negative. The only edge case is
    if someone sets it to higher than the int64 positive bound (this will
    result in an overflow), but I don't think this is worth mitigating

commit 1ead2d6
Author: Oliver Beattie <oliver@obeattie.com>
Date:   Wed Aug 9 22:57:53 2017 +0100

    [PLAT-713] Allow the CFS period to be tuned for containers
@cyrus-mc

I know this issue has been dormant for a while, but I did run into it and my research led me here. Having spent some time reading through all the comments, I want to try and summarize my understanding and some gaps that I still have.

Take the example of a container with a limit of 10% of a CPU. This results in a cpu.cfs_quota of 10ms when the period is set to 100ms. If this application is scheduled and uses up all 10ms of its allotment early in the period, it will then sit idle for the remaining 90ms of that time slot. For a processor-bound application this probably isn't a big deal, but for an I/O-bound one this would introduce latency.

If, on the other hand, this application uses its 10ms allotment evenly over the period (run 1ms, sleep 9ms, run 1ms, sleep 9ms, ...), this latency wouldn't be seen.

To me it really boils down to expressing things in terms of a percentage of CPU. As we all know, an application can't use a percentage of a CPU; at any instant it is either using a CPU or it isn't (even when you specify 10% CPU, at any point in time the application is actually using a whole CPU). In both examples above, the application used 10% of the CPU over the period.

@szuecs

This is kind of a blocker for production clusters, because you cannot set CPU limits on latency-critical applications. When we dropped the CPU limits from our ingress controller, we got 12-20x lower p99 latency. The ingress controller consumes about 30m CPU on average (kubectl top pods), but even setting the limit to 1500m did not drop the p99 by much; we only got improvements of 2-3x.

Where I am confused is this situation. If the application is truly only using 30m, then setting the limit to 1500m should mean it always gets CPU time during the time slot. How was that 30m value obtained? I would think the observation window that value came from would have to match cpu.cfs_period_us.

After reading through all of this, I am leaning towards those who feel that setting limits is not going to accomplish what most people think, and that the better approach is to set requests adequately, since in the end, during contention, CPU shares are what control which pods get CPU.
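
For completeness, requests map to cpu.shares roughly like this (a paraphrased sketch; names are approximate), and shares only matter relative to other cgroups when there is contention:

```go
package main

import "fmt"

// Paraphrased sketch of how the kubelet maps a CPU request to cpu.shares;
// constant and function names are approximate, not the actual source.
const (
	sharesPerCPU  = 1024 // kernel default cpu.shares for one CPU
	minShares     = 2    // kernel-imposed minimum
	milliCPUToCPU = 1000
)

func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// Shares are purely relative: under contention a container requesting
	// 1500m gets ~50x the CPU time of one requesting 30m, but without a
	// quota neither is ever throttled while the node has idle CPU.
	fmt.Println(milliCPUToShares(30), milliCPUToShares(1500)) // 30 1536
}
```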

@szuecs
Member

szuecs commented Mar 20, 2019

@cyrus-mc I think we may have been hit by a kernel bug, or node overprovisioning changed the scheduling time.
You can now use an alpha feature gate to set a different cfs_period per node and make it smaller than 100ms, which could solve your problem. Use node pools to separate workloads in case you need different cfs_period settings for different workloads.

@AtlanCI

AtlanCI commented May 12, 2021

Is this problem fixed now? I saw that it was fixed by the kernel team, but I didn't find the relevant information.

@paskal
Contributor

paskal commented Jul 21, 2022

I'm not sure CPU limits are in milliseconds in k8s, because in the Linux kernel, despite the "ms" naming, they are in microseconds. I wrote up a bunch of questions I have here; I hope someone can help me clarify it.

@szuecs
Member

szuecs commented Jul 22, 2022

Yes, CPU resource settings are in millicores.
