
CFS quotas can lead to unnecessary throttling #67577

Open · bobrik opened this Issue Aug 20, 2018 · 36 comments
@bobrik commented Aug 20, 2018

/kind bug

This is not a bug in Kubernetes per se; it's more of a heads-up.

I've read this great blog post:

From the blog post I learned that Kubernetes uses CFS quotas to enforce CPU limits. Unfortunately, those can lead to unnecessary throttling, especially for well-behaved tenants.
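
For concreteness, here is a minimal sketch of the conversion involved (assuming the default 100ms CFS period; the function name is illustrative rather than the kubelet's actual code): the CPU limit in millicores becomes a quota of CPU microseconds per enforcement period.

```go
// Minimal sketch (illustrative, not the kubelet's actual code) of how a
// CPU limit in millicores maps onto CFS bandwidth settings: the quota is
// microseconds of CPU time allowed per enforcement period.
package main

import "fmt"

const defaultPeriodUs int64 = 100000 // 100ms, the default CFS period

func milliCPUToQuotaUs(milliCPU, periodUs int64) int64 {
	return milliCPU * periodUs / 1000
}

func main() {
	// A limit of 500m yields 50ms of CPU time per 100ms period. Once a
	// container's threads have used that up, they are throttled until the
	// next period starts, even if the node has idle CPUs.
	fmt.Println(milliCPUToQuotaUs(500, defaultPeriodUs)) // 50000 (µs)
}
```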

See this unresolved bug in the Linux kernel that I filed a while back:

There's an open but stalled patch that addresses the issue (I haven't verified whether it works):

cc @ConnorDoyle @balajismaniam

@neolit123 (Member) commented Aug 20, 2018

/sig node
/kind bug

@liggitt (Member) commented Aug 20, 2018

Is this a duplicate of #51135?

@bobrik (Author) commented Aug 21, 2018

It's similar in spirit, but it misses the fact that there's an actual bug in the kernel rather than just a configuration tradeoff in the CFS quota period. I've linked #51135 back here to give people there more context.

@hjacobs commented Aug 22, 2018

As far as I understand, this is another reason to either disable CFS quota (--cpu-cfs-quota=false) or make the quota period configurable (#63437).

I also find this gist (linked from the kernel patch) very useful for gauging the impact: https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
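
For anyone who wants to get a feel for the effect without running the gist itself, here is a rough sketch of the same idea (burst a few milliseconds of work per 100ms iteration, well under the limit, and flag iterations that take much longer than expected; the durations and threshold are arbitrary):

```go
// Rough sketch, not the gist itself: burn ~5ms of CPU every 100ms (about
// 5% of one core) and flag iterations where the burn takes far longer
// than requested -- a hint that the cgroup was throttled anyway.
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// burn busy-loops for roughly d of wall-clock time.
func burn(d time.Duration) time.Duration {
	start := time.Now()
	sum := sha256.Sum256([]byte("burn"))
	for time.Since(start) < d {
		sum = sha256.Sum256(sum[:]) // keep the CPU busy
	}
	_ = sum
	return time.Since(start)
}

func main() {
	for i := 0; i < 100; i++ {
		elapsed := burn(5 * time.Millisecond)
		if elapsed > 20*time.Millisecond {
			fmt.Printf("iteration %d: wanted ~5ms of work, took %v (throttled?)\n", i, elapsed)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```

Run it inside a container whose CPU limit is comfortably above the ~5% average usage; on affected kernels some iterations will still stretch far past 5ms.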

@vishh (Member) commented Aug 23, 2018

@juliantaylor commented Aug 30, 2018

Another issue with quotas is that the kubelet counts hyperthreads as "CPUs". When a cluster becomes so loaded that two threads are scheduled on the same physical core, and the process has a CPU quota, one of the threads only gets a small fraction of the available processing power (it only runs when the sibling thread stalls), yet it still consumes quota as if it had a physical core to itself. So it consumes double the quota it should without doing significantly more work.
This means that on a fully loaded node with hyperthreading enabled, performance is half of what it would be with hyperthreading or quotas disabled.

IMO the kubelet should not count hyperthreads as real CPUs, to avoid this situation.

@vishh (Member) commented Aug 30, 2018

@juliantaylor As I mentioned in #51135, turning off CPU quota might be the best approach for most k8s clusters running trusted workloads.

@prune998 commented Nov 26, 2018

Is this considered a bug?

If some pods are throttled while not actually exhausting their CPU limit, it sounds like a bug to me.

On my cluster, most of the over-quota pods are related to metrics (heapster, metrics-collector, node-exporter...) or operators, which obviously have exactly the kind of workload affected here: do nothing most of the time and wake up to reconcile every once in a while.

The strange thing is that I tried raising the limit from 40m to 100m or 200m, and the processes were still throttled.
I can't see any other metric pointing to a workload that could trigger this throttling.

I've removed the limits on these pods for now... it's getting better, but this really sounds like a bug, and we should come up with a better solution than disabling limits.

@hjacobs commented Nov 27, 2018

@prune998 see @vishh's comment and this gist: the kernel throttles over-aggressively, even when the math says it should not. We (Zalando) decided to disable CFS quota (CPU throttling) in our clusters: https://www.slideshare.net/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload
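
To confirm whether the kernel is actually throttling a given container (rather than inferring it from averaged CPU metrics), the cgroup's cpu.stat file is the most direct signal. A small sketch, assuming a cgroup v1 layout; the exact path depends on your container runtime and cgroup driver:

```go
// Print a cgroup's CPU throttling counters. Assumes a cgroup v1 layout;
// the path below is an example -- substitute the pod/container cgroup you
// actually care about.
package main

import (
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// cpu.stat reports nr_periods (elapsed enforcement periods),
	// nr_throttled (periods in which the group ran out of quota), and
	// throttled_time (total nanoseconds spent throttled).
	fmt.Print(string(data))
}
```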

@prune998 commented Nov 27, 2018

Thanks @hjacobs.
I'm on Google GKE and I see no easy way to disable it, but I keep searching....

@timoreimann (Contributor) commented Nov 27, 2018

@prune998 AFAIK, Google hasn't exposed the necessary knobs yet. We filed a feature request right after the option to disable CFS quota landed upstream, but haven't heard any news since.

@vishh (Member) commented Nov 27, 2018

> I'm on Google GKE and I see no easy way to disable it, but I keep searching....

Can you remove CPU limits from your containers for now?

@mariusgrigoriu commented Feb 11, 2019

According to the CPU manager docs, CFS quota is not supposed to be used to bound the CPU usage of these containers, as their usage is bound by the scheduling domain itself. But we're experiencing CFS throttling anyway.

This makes the static CPU manager mostly pointless: setting a CPU limit to achieve the Guaranteed QoS class negates any benefit due to throttling.

Is it a bug that CFS quotas are set at all for pods with statically assigned CPUs?

@hjacobs commented Feb 12, 2019

For additional context (learned this yesterday): @hrzbrg (MyTaxi) contributed a flag to Kops to disable CPU throttling: kubernetes/kops#5826

@alok87 (Contributor) commented Feb 14, 2019

Could someone please share a summary of the problem here? It's not very clear what the problem is, in what scenarios users are impacted, and what exactly is required to fix it.

Our current understanding is that when we cross the limit we get penalized and throttled. Say we have a CPU quota of 3 cores (i.e. 300ms of CPU time per 100ms period) and our threads consume all of it in the first 5ms; for the remaining 95ms of that period we are throttled and our containers cannot do anything. We have also seen throttling even when no CPU spikes are visible in the CPU usage metrics. We assume that is because usage is measured over windows of seconds while throttling happens at the millisecond level, so spikes average out and are not visible. But the bug mentioned here has left us confused.
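
To make that arithmetic concrete, here is the same scenario worked through (assuming the default 100ms period; the thread count is purely illustrative of how a busy thread pool can burn the whole budget in a few milliseconds of wall time):

```go
// Worked example of CFS bandwidth arithmetic (numbers are illustrative).
package main

import "fmt"

func main() {
	const periodMs = 100.0 // default cpu.cfs_period_us
	const limitCores = 3.0 // CPU limit of 3 cores
	const threads = 60.0   // runnable threads all burning CPU at once

	budgetMs := limitCores * periodMs         // 300ms of CPU time per period
	wallMsToExhaust := budgetMs / threads     // 60 busy threads burn it in 5ms of wall time
	throttledMs := periodMs - wallMsToExhaust // throttled for the remaining 95ms

	fmt.Printf("budget=%.0fms per period, exhausted after %.0fms, throttled for %.0fms\n",
		budgetMs, wallMsToExhaust, throttledMs)
}
```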

A few questions:

  • What happens when the node is at 100% CPU? Is that a special case in which all containers get throttled regardless of their usage?

  • When this happens, do all containers get throttled?

  • What triggers this bug on a node?

  • What is the difference between not setting limits and disabling cpu.cfs_quota?

  • Isn't disabling limits risky when there are many Burstable pods, and one pod could destabilize the node and impact other pods running above their requests?

  • Separately, according to the kernel documentation (https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt), a group may get throttled when:

	a. it fully consumes its own quota within a period
	b. a parent's quota is fully consumed within its period

    What is the "parent" in the container context here? Is it related to this bug?

  • What is needed to fix this? A kernel upgrade?

We recently faced a fairly big outage that looks closely related (if not the root cause): all our pods got stuck in a throttling/restart loop and were not able to scale up. We are digging into the details to find the real issue, and I will open a separate issue explaining our outage in detail.

Any help here is greatly appreciated.

cc @justinsb

@mariusgrigoriu commented Feb 15, 2019

One of our users set a CPU limit and got throttled so hard that their liveness probe timed out, causing an outage of their service.

We're seeing throttling even when pinning containers to a CPU. For example, with a CPU limit of 1 and the container pinned to run only on that one CPU, it should be impossible to exceed the quota in any period, since the quota is exactly the number of CPUs available; yet we see throttling in every case.

I thought I saw it posted somewhere that kernel 4.18 solves the problem. I haven't tested it yet, so it would be nice if someone could confirm.

@clkao (Contributor) commented Feb 23, 2019

torvalds/linux@512ac99 in 4.18 seems to be the relevant patch for this issue.

milesbxf added a commit to monzo/kubernetes that referenced this issue Mar 19, 2019

[PATCH] Allow CFS period to be tuned per-container
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive and bursty workloads (like
HTTP microservices!) it's recommended to set the CFS quota period
lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

milesbxf added a commit to monzo/kubernetes that referenced this issue Mar 19, 2019

Allow CFS period to be tuned per-container
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive and bursty workloads (like
HTTP microservices!) it's recommended to set the CFS quota period
lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

Squashed commits:

commit 61551b0
Merge: a446c68 de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 16:16:17 2019 +0000

    Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period

    Register nodes with monzo.com/cpu-period resource

commit de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 15:14:58 2019 +0000

    Register nodes with monzo.com/cpu-period resource

    We have a custom pod resource which allows tuning the CPU throttling
    period. Upgrading to 1.9 causes this to break scheduling logic, as the
    scheduler and pod preemption controller takes this resource into account
    when deciding where to place pods, and which pods to pre-empt.

    This patches the kubelet so that it registers its node with a large
    amount of this resource - 10000 * max number of pods (default 110). We
    typically run pods with this set to 5000, so this should be plenty.

commit a446c68
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Tue Jan 29 16:43:03 2019 +0000

    ResourceConfig.CpuPeriod is now uint64, not int64

    Some changes to upstream dependencies between v1.7 and v1.9 mean that
    the CpuPeriod field of ResourceConfig has changed type; unfortunately
    this means the Monzo CFS period patch doesn't compile.

    This won't change behaviour at all - the apiserver already validates
    that `monzo.com/cpu-period` can never be negative. The only edge case is
    if someone sets it to higher than the int64 positive bound (this will
    result in an overflow), but I don't think this is worth mitigating

commit 1ead2d6
Author: Oliver Beattie <oliver@obeattie.com>
Date:   Wed Aug 9 22:57:53 2017 +0100

    [PLAT-713] Allow the CFS period to be tuned for containers

@dannyk81 commented Mar 25, 2019

@mariusgrigoriu I seem to be stuck in the same conundrum you described here #67577 (comment).

We observe CPU throttling on Pods in Guaranteed QoS class with CPUManager static policy (which doesn't seem to make any sense).

Removing the limits for these Pods would put them in the Burstable QoS class, which is not what we want, so the only remaining option is to disable CFS CPU quotas system-wide, which is also not something we can safely do, since allowing all Pods unbounded access to CPU capacity can lead to dangerous CPU saturation issues.

@vishh given the above circumstances, what would be the best course of action? It seems like upgrading to a kernel >= 4.18 (which has the CFS CPU accounting fix) and perhaps reducing the CFS quota period?

On a general note, suggesting that people simply remove limits from containers that are being throttled should come with clear warnings:

  1. If these were Pods in the Guaranteed QoS class, with an integer number of cores and the CPUManager static policy in place, they will no longer get dedicated CPU cores, since they will be moved to the Burstable QoS class (requests == limits no longer holds).
  2. These Pods will be unbounded in how much CPU they can consume and can potentially cause quite a bit of damage under certain circumstances.

Your feedback/guidance would be highly appreciated.

@mariusgrigoriu commented Mar 25, 2019

Upgrading the kernel definitely helps, but the behavior of applying a CFS quota still seems out of alignment with what the docs suggest.

@chiluk commented Mar 25, 2019

I've been researching various aspects of this issue for a while now; my research is summarized in my post to LKML:
https://lkml.org/lkml/2019/3/18/706
That said, I have not been able to reproduce the issue as described here on pre-512ac99 kernels. I have, however, seen a performance regression on post-512ac99 kernels, so that fix is not a panacea.

@dannyk81 commented Mar 25, 2019

Thanks @mariusgrigoriu, we are going for the kernel upgrade and hope it will help somewhat. Also check out #70585 - it seems that quotas are indeed set for Guaranteed pods with cpuset (i.e. pinned CPUs), so this looks like a bug to me.

@dannyk81 commented Mar 25, 2019

@chiluk could you elaborate a bit? Do you mean that the patch included in 4.18 (mentioned above in #67577 (comment)) doesn't actually solve the issue?

@chiluk commented Mar 25, 2019

The 512ac99 kernel patch fixes an issue for some people but caused an issue for our configurations. The patch fixed the way time slices are distributed amongst cfs_rq: they now correctly expire, whereas previously they did not.

Java workloads in particular, on high core-count machines, now see high amounts of throttling with low CPU usage because of blocking worker threads. Those threads are assigned time slices of which they only use a small portion, and the remainder is later expired. In the synthetic test I wrote (linked on that thread), we see a performance degradation of roughly 30x. In real-world terms we saw response time degradation of hundreds of milliseconds between the two kernels due to the increased throttling.

@willthames commented Mar 27, 2019

Using a 4.19.30 kernel, I see that pods which I'd hoped would see less throttling are still throttled, and some pods that weren't previously being throttled are now throttled quite severely (kube2iam is somehow reporting more seconds throttled than the instance has been up).

@teralype commented Mar 27, 2019

On CoreOS with kernel 4.19.25-coreos, I see Prometheus triggering the CPUThrottlingHigh alert for almost every single pod in the system.

@dannyk81 commented Mar 27, 2019

@williamsandrew @teralype this seems to reflect @chiluk's findings.

After various internal discussions, we in fact decided to disable CFS quotas entirely (kubelet flag --cpu-cfs-quota=false). This seems to solve all the issues we've been having, for both Burstable and Guaranteed (CPU-pinned or standard) Pods.

There's an excellent deck about this (and a few other topics) here: https://www.slideshare.net/try_except_/ensuring-kubernetes-cost-efficiency-across-many-clusters-devops-gathering-2019

Highly recommended read 👍

@dims (Member) commented Mar 27, 2019

long-term-issue (note to self)

@hjacobs commented Mar 27, 2019

@dannyk81 just for completeness: the linked talk is also available as recorded video: https://www.youtube.com/watch?v=4QyecOoPsGU

@agolomoodysaada commented Apr 2, 2019

@hjacobs, loved the talk! Thanks a lot...
Any idea how to apply this fix on AKS or GKE?
Thanks

@timoreimann (Contributor) commented Apr 2, 2019

@agolomoodysaada we filed a feature request with GKE a while back. Not sure what the status is though, I do not work intensively with GKE anymore.

@agolomoodysaada commented Apr 3, 2019

I reached out to Azure support and they said it won't be available until around August 2019.

@agolomoodysaada commented Apr 5, 2019

[screenshot: CPU throttling graph]

Thought I would share a graph of an application consistently throttled throughout its lifetime.

@chiluk commented Apr 5, 2019

What kernel was this on?

@agolomoodysaada commented Apr 5, 2019

@chiluk "4.15.0-1037-azure"

@chiluk commented Apr 5, 2019

@chiluk commented Apr 11, 2019

I have now posted patches to LKML about this.
https://lkml.org/lkml/2019/4/10/1068

Additional testing would be greatly appreciated.

KIVagant added a commit to KIVagant/charts that referenced this issue Apr 18, 2019

[stable/datadog] Remove default resources requests and limits
Resources should not be limited by default because nobody knows where
the chart will be installed. In particular, it is worth not setting CPU
limits by default, as per kubernetes/kubernetes#67577

KIVagant added a commit to KIVagant/charts that referenced this issue Apr 22, 2019

[incubator/fluentd-cloudwatch] Remove default resources requests and limits

Resources should not be limited by default because nobody knows where
the chart will be installed. In particular, it is worth not setting CPU
limits by default, as per kubernetes/kubernetes#67577

Signed-off-by: Eugene Glotov <kivagant@gmail.com>
