Avoid setting CPU limits for Guaranteed pods #51135
Comments
cc @thockin who had some thoughts on this feature.
Can you explain a bit more about what "avoid setting CPU limits on Guaranteed pods by default" means? A user may set a limit, but depending on what other pods are running on the node their pod gets scheduled to, they may or may not get more CPU than indicated by the limit?

I guess I can see the argument for making this behavior configurable, but what granularity were you thinking of? Per (guaranteed) pod or per cluster? This might be something a cluster admin wants to set cluster-wide (though I guess they could use an admission controller to do that if it is specified per pod...)

BTW, you mentioned that the current behavior throws users off. My naive assumption would be that someone sees the word "limit" and assumes that their pod will be "limited" to using that amount of CPU. So I think the current behavior is more intuitive than the alternative. That said, I definitely see the argument for making it configurable.
There will be no CPU Quota limits (throttling) applied to Guaranteed pods. Essentially, their CPU limits will be ignored.
I think having a config per-node (or groups of nodes) may be better.
What users expect is that when they set the limit to 1 CPU, they can use 1 CPU continuously and not more than that. In reality, with CPU quota applied, a pod can use more than 1 CPU within a slot of CPU scheduling time, then get descheduled and receive no CPU for the rest of the slot. This means spikes are penalized very heavily, which is not what users expect. CPU quota may be useful for scavenging pods like Burstable and BestEffort pods.
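As a rough sketch of how this plays out (illustrative Go, not the actual kubelet source; the real conversion lives in `pkg/kubelet/cm/helpers_linux.go`): a millicore limit is turned into a CFS quota within each 100ms period, and a pod that burns its quota early in a period is throttled for the remainder.

```go
package main

import "fmt"

// Illustrative sketch (not Kubernetes source): how a CPU limit in
// millicores maps to a CFS quota under the default 100ms period.
const quotaPeriodUS int64 = 100000 // 100ms, expressed in microseconds

func milliCPUToQuota(milliCPU int64) int64 {
	// 1000 millicores (one full CPU) maps to the whole period.
	return milliCPU * quotaPeriodUS / 1000
}

func main() {
	// A pod limited to 1 CPU gets 100ms of CPU time per 100ms period.
	// If five threads together burn that in the first 20ms, the pod is
	// throttled for the remaining 80ms -- the spike is penalized hard.
	fmt.Println(milliCPUToQuota(1000)) // 100000us of quota per period
	fmt.Println(milliCPUToQuota(100))  // 10000us of quota per period
}
```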
So I hit this one rather hard. I was setting CPU request == CPU limits, so that we would be in guaranteed class for critical system pods, so that those pods would be less likely to be evicted. However, I observed that critical pods were being CPU throttled and that this was causing really bad behaviour in our cluster. Looking at the manifests in
Is that correct? On the direct topic of this particular issue, if that's correct, I think it would be confusing to have the CPU limits not applied when limit == request. IMO the bigger problem is that a desire to get into guaranteed class nudged me to set CPU limits, and that a desire not to give up too much resources across the cluster caused me to set those limits too low. But what is the duration of a CPU scheduling time unit? (And should we recommend making it smaller to avoid the feast-then-famine scenario?)
I think what you wrote is actually correct. All of this should get a lot better once priority is fully implemented, both in the scheduler (alpha in 1.8) and on the kubelet (I assume in alpha in 1.9?). Then all of those steps should be reducible to
Automatic merge from submit-queue. Fix CNI CPU allocations
* Limit each CNI provider to 100m
* Remove CPU limits - they cause serious problems (kubernetes/kubernetes#51135), but this also makes the CPU allocation less problematic.
* Bump versions and start introducing the `-kops.1` suffix preemptively.
* Upgrade flannel to 0.9.0 as it fixes a lot.

Builds on #3843
/remove-priority P2
Adding context: the static CPU manager policy sets affinity for guaranteed integral-CPU-request containers. Those CPUs are not available for use by any other Kubernetes container. Quota is meaningless for these pinned containers. While these containers have access to the entire logical CPU, they also cannot burst beyond their quota, because the amount of CPU time allowed is equal to full utilization of their assigned CPUs. Disabling quota for these containers should have no effect on performance.
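The arithmetic behind that last point can be sketched as follows (illustrative only, assuming the default 100ms period): for a container pinned to N exclusive CPUs, the quota derived from its N-CPU limit equals the maximum CPU time the cpuset itself allows per period, so the quota can never be the binding constraint.

```go
package main

import "fmt"

const periodUS int64 = 100000 // default CFS period, in microseconds

// quotaNeverBinds reports whether the quota derived from an N-CPU
// limit equals the ceiling already imposed by an N-CPU cpuset.
func quotaNeverBinds(cpus int64) bool {
	quotaFromLimit := cpus * 1000 * periodUS / 1000 // N CPUs = N*1000 millicores
	maxFromCpuset := cpus * periodUS                // N CPUs fully busy for one period
	return quotaFromLimit == maxFromCpuset
}

func main() {
	for _, cpus := range []int64{1, 2, 4} {
		fmt.Println(cpus, quotaNeverBinds(cpus)) // true for every CPU count
	}
}
```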
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
We also ran into this issue in production while investigating moving latency-critical applications to Kubernetes. When we dropped the CPU limit, container_cpu_cfs_throttled_seconds_total stopped appearing, and the p99 and p999 latency dropped from 60ms and 100ms to ~5ms. I believe the problem is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/helpers_linux.go#L59 , where the period is hard-coded and cannot be changed.
In case it’s interesting, we implemented the ability to customise this period in our fork.
@obeattie Thanks for sharing! Did you open a PR already?
@szuecs No, I'm afraid not. I discussed it on Slack with some of the k8s maintainers a while ago, and I think the conclusion was that it wouldn't fly upstream. I don't recall whether this was due to a specific implementation issue or that, ideologically, it wouldn't be useful to enough users to be a feature of mainline k8s. We're certainly not opposed to contributing this code, or trying to get it merged in, if others would find it useful.
@spiffxp @justinsb can you clarify this please?
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
…lag and config option to kubelet to be able to set cpu.cfs_period and defaults to 100ms as before. It requires to enable feature gate CustomCPUCFSQuotaPeriod. Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@bobrik, came across your gist regarding this issue in the cfs. Commented on the gist, but not sure if it sends out any notifications, so will post it here as well.
Adds a monzo.com/cpu-period resource, which allows tuning the period of time over which the kernel tracks CPU throttling. In upstream Kubernetes versions pre-1.12, this is not tunable and is hardcoded to the kernel default (100ms). We originally introduced this after seeing long GC pauses clustered around 100ms [1], which was eventually traced to CFS throttling. Essentially, for very latency-sensitive & bursty workloads (like HTTP microservices!) it's recommended to set the CFS quota period lower. We mostly set ours at 5ms across the board. See [2] and [3] for further discussion in the Kubernetes repository. This is fixed in upstream 1.12 via a slightly different path [4]; the period is now tunable via a kubelet CLI flag. This doesn't give us as fine-grained control, but we can still set this and optimise for the vast majority of our workloads.

[1] golang/go#19378 [2] kubernetes#51135 [3] kubernetes#67577 [4] kubernetes#63437

Squashed commits:

commit 61551b0 Merge: a446c68 de2c6cb Author: Miles Bryant <milesbryant@monzo.com> Date: Wed Mar 13 16:16:17 2019 +0000

Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period Register nodes with monzo.com/cpu-period resource

commit de2c6cb Author: Miles Bryant <milesbryant@monzo.com> Date: Wed Mar 13 15:14:58 2019 +0000

Register nodes with monzo.com/cpu-period resource

We have a custom pod resource which allows tuning the CPU throttling period. Upgrading to 1.9 causes this to break scheduling logic, as the scheduler and pod preemption controller take this resource into account when deciding where to place pods, and which pods to pre-empt. This patches the kubelet so that it registers its node with a large amount of this resource - 10000 * max number of pods (default 110). We typically run pods with this set to 5000, so this should be plenty.
commit a446c68 Author: Miles Bryant <milesbryant@monzo.com> Date: Tue Jan 29 16:43:03 2019 +0000

ResourceConfig.CpuPeriod is now uint64, not int64

Some changes to upstream dependencies between v1.7 and v1.9 mean that the CpuPeriod field of ResourceConfig has changed type; unfortunately this means the Monzo CFS period patch doesn't compile. This won't change behaviour at all - the apiserver already validates that `monzo.com/cpu-period` can never be negative. The only edge case is if someone sets it to higher than the int64 positive bound (this will result in an overflow), but I don't think this is worth mitigating.

commit 1ead2d6 Author: Oliver Beattie <oliver@obeattie.com> Date: Wed Aug 9 22:57:53 2017 +0100

[PLAT-713] Allow the CFS period to be tuned for containers
I know this issue has been dormant for a while, but I ran into it and my research led me here. Having spent some time reading through all the comments, I want to try to summarize my understanding and some gaps that I still have.

Using an example of a container with a limit of 10% CPU: this results in a cpu.quota of 10ms when the period is set to 100ms. If this application is scheduled and uses up all 10ms of its allotment at the start of the period, it will then sit idle for the remaining 90ms of that time slot. For a processor-bound application this probably isn't a big deal, but for an I/O-bound one this would introduce latency. If on the other hand the application spreads its 10ms allotment evenly over the period (1ms running, 9ms sleeping, repeated), this latency wouldn't be seen.

To me it really boils down to expressing things in terms of % of CPU. As we all know, an application can't use a % of a CPU; it is either using the CPU or it isn't (even when you specify 10% CPU, at any point in time the application is actually using 1 CPU). In both the examples above, the application used 10% of the CPU during the period.
Where I am confused is this situation: if the application is truly only using 30m, then setting the limit to 1500m would mean it always gets CPU time during the time slot. How was that value obtained? I would think that the observed time period from which that value was derived would have to match the cpu.cfs_period_us. After reading through all of this, I am leaning towards those who feel that setting limits will not accomplish what most people think, and that the better approach is to adequately set requests, since in the end, during contention, shares are what will control who gets CPU.
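The feast-then-famine pattern described in the 10%-limit example can be modeled with a toy calculation (a simplification for illustration, not kernel code):

```go
package main

import "fmt"

// Toy model of the 10%-limit example: quota of 10ms per 100ms period.
// A container that burns its whole quota in one burst at the start of
// the period is throttled for the remainder; one that spreads the same
// 10ms evenly (1ms running, 9ms sleeping) never visibly stalls.
func idleAfterBurst(periodMS, quotaMS float64) float64 {
	return periodMS - quotaMS // time spent throttled after an early burst
}

func main() {
	fmt.Printf("throttled for %.0fms of every 100ms period\n", idleAfterBurst(100, 10))
}
```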
@cyrus-mc I think we could have been hit by a kernel bug, or node over-provisioning changed the scheduling time.
Is this problem fixed now? I saw that it was fixed by the kernel team, but I couldn't find the relevant information.
I'm not sure CPU limits are in milliseconds in k8s, because in the Linux kernel, despite the "ms" naming, they are in microseconds. I wrote up a bunch of questions I have here; I hope someone can help me clarify them.
Yes, CPU resource settings are in millicores; the kubelet converts them into the kernel's microsecond-based CFS quota.
The effect of CPU throttling is non-obvious to users and throws them off when they try out Kubernetes. It also complicates CPU capacity planning for pods. Pods that are carefully placed in the Guaranteed QoS class are typically of higher priority than other pods, and these pods are expected to get the best performance. On the other hand, Burstable pods may have to be limited in certain environments to avoid tail latencies for Guaranteed pods.
I propose we avoid setting CPU limits on Guaranteed pods by default and continue to impose CPU limits on Burstable pods. We can provide a config option to ease the rollout in existing clusters. Users may need time to handle this transition, and some of them may choose to use the existing policies.
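A minimal sketch of the proposed policy (hypothetical code, not a kubelet patch; the QoS-class strings and the helper name are assumptions for illustration): Guaranteed pods would get no CFS quota at all (-1 is the cgroup convention for "unlimited"), while Burstable pods keep a quota derived from their limit.

```go
package main

import "fmt"

const quotaPeriodUS int64 = 100000 // default 100ms CFS period

// cfsQuota is a hypothetical helper illustrating the proposal:
// skip quota entirely for Guaranteed pods, keep it for Burstable.
func cfsQuota(qosClass string, limitMilliCPU int64) int64 {
	if qosClass == "Guaranteed" {
		return -1 // cgroup convention for "no quota" -- no throttling
	}
	return limitMilliCPU * quotaPeriodUS / 1000
}

func main() {
	fmt.Println(cfsQuota("Guaranteed", 1000)) // -1: never throttled
	fmt.Println(cfsQuota("Burstable", 500))   // 50000us of quota per period
}
```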
Closely related are the static CPU pinning policies in the kubelet, which will naturally limit Guaranteed pods to specific cores if they request an integral number of cores. That feature may be combined with turning off CPU limiting for Guaranteed pods. That said, having the ability to configure each of these features independently will still be helpful.
@derekwaynecarr we have discussed this in the past, but AFAIK, an issue hasn't been filed thus far. @dchen1107 thoughts?
If there is a general agreement that this is the intended direction longer term, I can post a patch to make CPU limit policies configurable.
@kubernetes/sig-node-feature-requests