
Disable cpu quota(use only cpuset) for pod Guaranteed #70585

Open
Wenfeng-GAO opened this issue Nov 2, 2018 · 77 comments · May be fixed by #107589
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Wenfeng-GAO
Contributor

Wenfeng-GAO commented Nov 2, 2018

What would you like to be added:

Disable cpu quota when we use cpuset for guaranteed pods.

Why is this needed:

Currently, kubelet adds CPU quota and shares in all cases when cpuCFSQuota is enabled, even when a cpuset will be assigned in the Guaranteed pod case.

However, this combination (cpuset plus quota) can sometimes cause CPU throttling that we don't expect.

My question is: why not add an if statement where we set the cpuQuota config, so that in the Guaranteed case we simply set cpuQuota = -1?

	if m.cpuCFSQuota {
		// if cpuLimit.Amount is nil, then the appropriate default value is returned
		// to allow full usage of cpu resource.
		cpuQuota, cpuPeriod := milliCPUToQuota(cpuLimit.MilliValue())
		lc.Resources.CpuQuota = cpuQuota
		lc.Resources.CpuPeriod = cpuPeriod
	}
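For illustration, here is a minimal sketch of what that conditional might look like, extending the snippet above (the qos helper and the exact wiring are assumptions for the sketch, not an actual patch):

	if m.cpuCFSQuota {
		cpuQuota, cpuPeriod := milliCPUToQuota(cpuLimit.MilliValue())
		// Hypothetical: Guaranteed pods whose containers receive an exclusive
		// cpuset don't need CFS bandwidth control; -1 disables the quota.
		if v1qos.GetPodQOS(pod) == v1.PodQOSGuaranteed {
			cpuQuota = -1
		}
		lc.Resources.CpuQuota = cpuQuota
		lc.Resources.CpuPeriod = cpuPeriod
	}

A real change would also need to confirm the container actually gets an exclusive cpuset (static CPU manager policy, integer CPU request), not just the QoS class.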

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2018
@Wenfeng-GAO
Contributor Author

Wenfeng-GAO commented Nov 2, 2018

/sig kubelet

@Wenfeng-GAO
Contributor Author

Wenfeng-GAO commented Nov 2, 2018

/sig architecture

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2018
@clkao
Contributor

clkao commented Nov 2, 2018

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 2, 2018
@zaa

zaa commented Nov 28, 2018

Hello,

We are using cpu manager in k8s v1.10.9 and experiencing the same issue.

https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/ states:

As Guaranteed pods whose containers fit the requirements for being statically assigned are scheduled to the node, CPUs are removed from the shared pool and placed in the cpuset for the container. CFS quota is not used to bound the CPU usage of these containers as their usage is bound by the scheduling domain itself. In other words, the number of CPUs in the container cpuset is equal to the integer CPU limit specified in the pod spec. This static assignment increases CPU affinity and decreases context switches due to throttling for the CPU-bound workload.

However, the "CFS quota is not used to bound the CPU usage of these containers as their usage is bound by the scheduling domain itself." portion of the document does not hold in our case.

So, the setup looks as follows:

  1. We have cpu manager enabled and running with static policy
  2. I have a pod with two containers
  3. The first container has limits/requests set to 2 whole cpus
  4. The second container has limits/requests set to 0.25 cpu

k8s schedules the pod to a node. When I exec into the first container, I see that its cpuset is defined correctly:

/sys/fs/cgroup/cpu,cpuacct # cat /proc/1/status | grep Cpu
Cpus_allowed:    00000000,00000000,00000000,00000202
Cpus_allowed_list:    1,9

However, for some reason (I presume that this is a bug), the container has CFS quota defined too, which should not be the case. The pod is restricted to the two cpus already and should be able to use 100% of them, without quota:

/sys/fs/cgroup/cpu,cpuacct # cat cpu.cfs_*
100000
200000

When I check cpu stats, I see that the container is severely throttled:

/sys/fs/cgroup/cpu,cpuacct # cat cpu.stat
nr_periods 2686183
nr_throttled 363111
throttled_time 27984636050628

Moreover, when I check cgroup set of the whole pod, I see that k8s put CFS quota to the whole pod too:

root@ip-10-33-99-26:/sys/fs/cgroup/cpu,cpuacct/kubepods/pod62d886b5-ee75-11e8-8f1c-0a4ce21644ea# cat cpu.cfs_*
100000
225000
root@ip-10-33-99-26:/sys/fs/cgroup/cpu,cpuacct/kubepods/pod62d886b5-ee75-11e8-8f1c-0a4ce21644ea# cat fafd0f38288710446b53402716d11449c4b046a09c26266d2725eae2ac7f0352/cpu.cfs_*
100000
200000

My understanding is that if the pod is in the Guaranteed QoS class and requests an integral number of cpus, it should not be subject to CFS quota (as stated in the docs).
So in this case:

  1. The pod should not have CFS quota defined.
  2. The first container should have a cpuset defined, but no CFS quota (as the docs imply).
  3. The second container should have a CFS quota defined as 25000 µs of a 100000 µs period.

@zaa

zaa commented Nov 28, 2018

cc: @ipuustin @ConnorDoyle might be of interest to you

@mariusgrigoriu

mariusgrigoriu commented Feb 14, 2019

Same issue with CPU manager here. We're not realizing any performance benefits due to this issue. In fact we're getting far worse latency due to the CFS throttling bug. Was it intentional to add CFS quota to pods that are CPU pinned? The documentation makes it seem like having a CFS quota on those pods is a bug.

@dannyk81

dannyk81 commented Mar 25, 2019

@Wenfeng-GAO it seems that this should be labelled as bug and not a feature 😄

as mentioned by @mariusgrigoriu and @zaa, CPUManager's documentation suggests that Pods with pinned CPUs should not be bound by CPU quotas; however, it is quite clear that they are, which seems wrong.

We keep getting hit by cpu throttling on Guaranteed QoS Pods with pinned CPUs, which is totally counter-productive 😞

@praseodym
Contributor

praseodym commented Mar 25, 2019

/kind bug
/remove-kind feature

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Mar 25, 2019
@dannyk81

dannyk81 commented Mar 25, 2019

FYI @ConnorDoyle

I came across your comment here #51135 (comment) (from a while back) and it seems that this contradicts what you've mentioned in that post.

Figured you may be interested to weigh in.

@praseodym
Contributor

praseodym commented Mar 25, 2019

In theory, setting a CPU CFS quota for CPU-pinned pods should not affect performance, but in practice it does cause performance degradation due to the (buggy) way CFS quotas are implemented in the Linux kernel.

I think the best fix would be to unset any CPU CFS quota from the pod in the CPU manager when reserved CPUs (CPU sets) are assigned.
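For illustration, a rough fragment of that approach (the GetCPUSet state accessor and how it would be plumbed into the container config are assumptions for this sketch):

	// Sketch only: if the static policy granted this container an exclusive
	// cpuset, drop the CFS quota so the cpuset alone bounds its CPU usage.
	if cset, ok := cpuManagerState.GetCPUSet(string(pod.UID), container.Name); ok && !cset.IsEmpty() {
		lc.Resources.CpuQuota = -1 // no bandwidth limit; the cpuset is the bound
	}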

@dannyk81

dannyk81 commented Mar 25, 2019

Thanks @praseodym!

Totally agree, unsetting the quota for Pods with cpusets seems like the best solution.

Regarding the kernel, it seems that a recent patch (included in 4.18) should solve the issue --> #67577 (comment)

OTOH, this comment --> #67577 (comment) is a bit confusing ¯\_(ツ)_/¯

We are going to upgrade to latest CoreOS (2023.5.0) that comes with Kernel 4.19.25 and see how things look.

@chiluk

chiluk commented Mar 25, 2019

512ac99 may resolve the issue in this thread, and assuming everything is correctly pinned and cgrouped, the issue I discuss in the other bug (caused by 512ac99) should not come into play.

@praseodym
Contributor

praseodym commented Mar 25, 2019

That's good to know!

This issue will still need some work to align the implemented behaviour with the documentation ("CFS quota is not used to bound the CPU usage of these containers as their usage is bound by the scheduling domain itself.") or vice versa.

@dannyk81

dannyk81 commented Mar 25, 2019

Thanks @chiluk, just read your comment on the other thread (#67577 (comment)) and to be honest this seems a bit worrying as we might actually hit this if we go for >4.18 kernels.

We are using 48-core bare-metal machines for our worker servers and also deploy java (and php) workloads of various sizes.

So, although this might help in the case of the Pods with pinned CPUs it would actually make things worse for the rest of the workload?

@blakebarnett

blakebarnett commented Mar 25, 2019

So, although this might help in the case of the Pods with pinned CPUs it would actually make things worse for the rest of the workload?

If you're pinning a container to a CPU with cpu-manager, it should have full use of that CPU, and the rest of the workload should be excluded from it. The rest of the system should behave normally, but CFS quota should be disabled for the cgroups that cpusets have been applied to. That's my understanding anyway.

@dannyk81

dannyk81 commented Mar 25, 2019

@blakebarnett Yes, that's accurate - however it's not the case now, and hopefully will be fixed by #75682

What concerns me is that using the latest kernel (>4.18) may degrade performance for the pods that are not pinned to specific cpusets (the majority) as indicated by @chiluk's tests...

@liggitt
Member

liggitt commented May 4, 2019

/remove-sig architecture

@k8s-ci-robot k8s-ci-robot removed the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 4, 2019
@k8s-ci-robot
Contributor

k8s-ci-robot commented May 4, 2019

@liggitt: Those labels are not set on the issue: sig/

In response to this:

/remove-sig architecture

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot

fejta-bot commented May 23, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2021
@2rs2ts
Contributor

2rs2ts commented Jun 1, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 1, 2021
@eddytruyen

eddytruyen commented Jun 4, 2021

@odinuge I assume there is either a bug in the kernel or the cpu usage of Kubernetes code at the pod level is too high

Interesting! Kubernetes should not run processes in those cgroups by default, so it shouldn't be the problem. Health Checks might however use some quota though. Exec probes run inside the container, and http/rpc probes "consume" resources in the way that the container has to answer. I don't think other stuff should interact with that cgroup...

@odinuge We looked into the possibility that the throttling increase is due to the readiness probes. Yes, it is!
The readiness probes execute as separate processes that show up in the cgroup.procs file of the cassandra container's cgroup, not in the higher-level pod cgroup.

Below is a comparison of the number and percentage of throttled periods for three setups. It shows that the readiness probes are what is being throttled.

1 pod with docker engine as container runtime, cpu quota of the pod turned on, cpu quota of the cassandra container turned on
# periods 13737
# throttled 6664
% throttled 49%

1 pod with docker engine as container runtime, cpu quota of the pod and the cassandra container turned on, probes turned off
# periods 14289
# throttled 697
% throttled 5%

1 container run in docker engine directly, no Kubernetes, cpu quota of the cassandra container turned on
# periods 12454
# throttled 712
% throttled 6%

The following image shows the impact of these probes on latency.

[latency comparison chart]

It seems there is a high system load/cpu issue with Kubernetes because it constantly spawns runc for running the readiness/liveness checks: #82440. We also see that with the readiness probe of cassandra. The following picture shows the decrease in cpu utilization when removing the readiness probe:
[cpu utilization chart]

We also looked at the impact of turning off the cpu_quota of the pod's cgroup as follows:

cd /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod-id>.slice/
echo -1 > cpu.cfs_quota_us

We did see a small improvement in latency and throttling, but not as large an improvement as turning off the probes.

1 pod in docker, cpu-quota of pod turned off, cpu-quota of cassandra container turned on
# periods 14348
# throttled 6104
% throttled 43%


Here's the configuration and code of the readiness probe

readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - /ready-probe.sh
  failureThreshold: 3
  initialDelaySeconds: 15
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
ready_probe.sh:
if [[ $(nodetool status | grep $POD_IP) == *"UN"* ]]; then
  if [[ $DEBUG ]]; then
    echo "UN";
  fi
  exit 0;
else
  if [[ $DEBUG ]]; then
    echo "Not Up";
  fi
  exit 1;
fi

@n4j n4j added this to Triaged in SIG Node Bugs Jul 9, 2021
@odinuge
Member

odinuge commented Jul 13, 2021

@odinuge We looked into the possibility that the throttling increase is due to the readiness probes. Yes, it is!
The readiness probes execute as separate processes that show up in the cgroup.procs file of the cassandra container's cgroup, not in the higher-level pod cgroup.

Yes, by design, exec probes run inside the container, and the resources they use are accounted for. That is expected behavior. And yes, they are executed in the container control group, not the pod-level one. And yes, exec probes are just plain Linux processes being executed inside the container "sandbox"; there is nothing special about them.

Below you find a comparison of the number and percentage of throttled periods for 3 setups. This shows that the readiness probes are the ones being throttled.

Those numbers look reasonable, no doubt there.

It seems there is a high system load/cpu issue with Kubernetes because it constantly spawns runc for running the readiness/liveness checks: #82440. We also see that with the readiness probe of cassandra. The following picture shows the decrease in cpu utilization when removing the readiness probe:

Not 100% up to date on this, but the overhead of constantly using runc for exec probes is definitely not zero. In your example, I am not sure how much CPU time is used by runc compared to the scripts though. You can try to use crun instead of runc and see if there is any difference. Due to the nature of exec probes, they act very spiky on the CPU. E.g. if they take 300 ms to run, that translates to roughly one full CPU across three CFS bandwidth periods. If your application is already close to the limit, that will certainly cause throttling...

We also looked what is the impact of turning off the cpu_quota of the cgroup of the Pod as follows:

That also sounds expected. If you are being throttled in the container control group, you will not see much difference by "disabling" cfs bandwidth on the pod control group. But yeah, there might be a small amount of change.

But, overall, the ratio of throttled periods to total periods is not the best metric here. Looking at the time spent throttled is at least as interesting. It is possible to get a ratio of more than 90% without seeing any increase in latency; it all depends on the workload and configuration.
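To make that concrete, here is a small standalone sketch that reads a cgroup v1 cpu.stat file and reports both the throttled-period ratio and the total throttled time (the default path below is a placeholder; point it at the container cgroup you care about):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

func main() {
	// Placeholder default; pass the real container cgroup path as an argument,
	// e.g. /sys/fs/cgroup/cpu,cpuacct/kubepods/pod<uid>/<container-id>/cpu.stat
	statPath := "/sys/fs/cgroup/cpu,cpuacct/cpu.stat"
	if len(os.Args) > 1 {
		statPath = os.Args[1]
	}
	data, err := os.ReadFile(statPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Parse "key value" pairs: nr_periods, nr_throttled, throttled_time.
	vals := map[string]int64{}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		f := strings.Fields(line)
		if len(f) == 2 {
			n, _ := strconv.ParseInt(f[1], 10, 64)
			vals[f[0]] = n
		}
	}
	periods, throttled := vals["nr_periods"], vals["nr_throttled"]
	ratio := 0.0
	if periods > 0 {
		ratio = 100 * float64(throttled) / float64(periods)
	}
	fmt.Printf("throttled periods: %d/%d (%.1f%%)\n", throttled, periods, ratio)
	// cgroup v1 reports throttled_time in nanoseconds.
	fmt.Printf("throttled time:    %s\n", time.Duration(vals["throttled_time"]))
}

Comparing the throttled time against the wall-clock window it was collected over usually says more about real latency impact than the period ratio alone.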


Also, looking at more than just the 95th percentile would be interesting. Something like https://hdrhistogram.github.io/HdrHistogram/ is useful for looking at the latency effects.

edit: new paragraph

@ehashman ehashman moved this from Triaged to High Priority in SIG Node Bugs Aug 5, 2021
@k8s-triage-robot

k8s-triage-robot commented Oct 11, 2021

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2021
@2rs2ts
Contributor

2rs2ts commented Oct 14, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2021
@ehashman
Member

ehashman commented Dec 1, 2021

/assign @cynepco3hahue

@joshsten

joshsten commented Dec 16, 2021

We are also seeing a small, but unexpected amount of throttling with static cpu policy set and fully guaranteed QoS (with integer core count) on containers in AKS.

@reg0bs

reg0bs commented Jan 25, 2022

We are also seeing a small, but unexpected amount of throttling with static cpu policy set and fully guaranteed QoS (with integer core count) on containers in AKS.

We see exactly the same with an on-prem Rancher cluster.

@k8s-triage-robot

k8s-triage-robot commented Apr 25, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2022
@rbjorklin

rbjorklin commented Apr 25, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2022
@k8s-triage-robot

k8s-triage-robot commented Jul 24, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2022
@twz123

twz123 commented Jul 25, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@k8s-triage-robot

k8s-triage-robot commented Oct 23, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2022
@rbjorklin

rbjorklin commented Oct 23, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2022
@zvier
Contributor

zvier commented Dec 2, 2022

We also hit the same problem. The cpu.shares setting can cause some performance loss for a Guaranteed pod (using only cpuset). When we move the pid from the container's cpu cgroup to /sys/fs/cgroup/cpu/cgroup.procs, the process performs as expected.

Projects
SIG Node Bugs
High Priority