kubelet: do not set CPU quota for guaranteed pods #117030
base: master
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Mar 31 04:32:14 UTC 2023.
Welcome @MarSik!
Hi @MarSik. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from dec5391 to 51556e7
/ok-to-test
/triage accepted
/priority important-longterm
pkg/kubelet/cm/helpers_linux.go (Outdated)
@@ -167,6 +167,11 @@ func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64,
 	// determine the qos class
 	qosClass := v1qos.GetPodQOS(pod)
 
+	// quota is not capped when cfs quota is disabled or when the pod has guaranteed QoS class
+	if !enforceCPULimits || qosClass == v1.PodQOSGuaranteed {
What about guaranteed QoS pods which don't have exclusive cpus assigned (e.g. pods not requesting integer CPUs)?
In addition, I think we should merge this check with the one on line 163 above.
The exclusive cpus are assigned to specific containers. This code deals with the parent pod cgroup.
My understanding is that the pod limits are not that important in the greater scheme of things, because the containers themselves have limits too. The pod is just a sum of its containers, and the sum is needed due to cgroup inheritance (children cannot consume more than the parent).
Artyom used a simplified approach here and disabled the quota for all Guaranteed pod cgroups, while leaving it properly configured at the container level. It leaves a bit of a hole; however, it makes the code much simpler. A proper check would have to go deep into the cpu manager and find out whether exclusive cpus are, or will be, assigned to any child container of this pod.
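To make that concrete, here is a rough sketch of the pod-level quota computation this hunk short-circuits (simplified and renamed for illustration; the real logic lives in ResourceConfigForPod and MilliCPUToQuota, and v1/v1qos are the usual kubelet imports):

func podCPUQuota(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64) int64 {
	// With this PR: no pod-level CFS quota for Guaranteed pods (or when
	// enforcement is disabled); container-level quotas remain in place.
	if !enforceCPULimits || v1qos.GetPodQOS(pod) == v1.PodQOSGuaranteed {
		return -1 // -1 means "no CFS quota" on the pod cgroup
	}
	// Otherwise the pod quota is derived from the sum of the container
	// cpu limits, converted as quota = milliCPU * period / 1000.
	var milliCPU int64
	for _, c := range pod.Spec.Containers {
		milliCPU += c.Resources.Limits.Cpu().MilliValue()
	}
	return milliCPU * int64(cpuPeriod) / 1000
}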
Ok, this was already discussed in the original PR here: #107589 (comment)
Hi,
There is a lot more context around all of this in the original PR #107589, so I think people should read through it. I'll spend some time reviewing this again, but I think everything discussed in that PR still holds true.
The tl;dr is that when there is a cpu limit on all containers and on the pod cgroup, sometimes the pod cgroup is throttled, and that hurts the "important" containers even while they are still below their own cpu quota and should not be throttled. This is due to how the accounting is done in the kernel, and the fact that the refill does not happen at the same time for every level, along with a few other nuances (happy to elaborate when I have more time).
I personally want this as the default (removing the pod-level CFS limit for QoS Guaranteed pods), with an ability to opt out, but I also want it for pods where all containers have a limit, not only the Guaranteed ones; effectively, never set the pod-level cpu CFS limit by default unless opted in. However, we also need to think about users that want this pod cgroup limit, especially around the pod overhead support, where each pod-level cgroup gets a user-defined "bump" in its value. I'm thinking of things like Kata Containers and other container runtimes that are not "cgroup-only" (or whatever we call things using runc/crun).
So, personally I think a KEP would be a good format to ensure we deal with all the edge cases. But yeah, I really, really want this merged into mainline k8s, so I'm very happy to help out!
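To illustrate the failure mode with made-up numbers (a sketch, assuming cgroup v1 file names and the default 100ms CFS period):

kubepods/pod<uid>/      cpu.cfs_quota_us = 200000 (sum of the children)
├─ "important" ctr      cpu.cfs_quota_us = 100000
└─ "sidecar" ctr        cpu.cfs_quota_us = 100000

The kernel distributes the pod-level runtime to per-cpu caches in small slices (5ms by default, kernel.sched_cfs_bandwidth_slice_us) and only reclaims unused slack later, so the sidecar's cpus can strand enough of the pod-level pool that "important" gets throttled at the pod level even though it is still below its own 100000us quota.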
Hi,
I agree with pretty much everything @odinuge mentioned (including the fact that I need to re-read the context from #107589), but I kinda disagree about the KEP part. While I fully agree that a proper detailed discussion and a record of it are needed, I don't think the KEP process is best suited here (I'm thinking about graduation stages, PRR...). But we can evaluate adopting part of the process: create a more detailed doc to evaluate the cases, discuss them (the sig-node Slack channel? here in the comments?), link it here and/or anywhere else needed, and distill the outcome into code comments to wrap up the change.
Yeah, sounds good @ffromani, I think we are very much in line then! The main reason for a KEP is if we want to introduce some kind of new flag, so we need to deal with things like graduation, feature flags, test plans etc. If we end up not doing that, then I agree we probably don't need a KEP! :smile:
Thanks
If it turns out we need user-exposed flags, or the changes needed for a robust solution turn out to be too invasive, then yes, we will need to seriously evaluate whether we need a full-fledged KEP. In that case the prep work we are talking about will help, so it seems a good way forward anyway :)
Folks, I added a code comment explaining the reasoning behind the Guaranteed pod quota disablement. I was trying to be clear without making it too complicated, so feel free to suggest changes.
Another approach would be to disable the quota ONLY for (all the containers belonging to) Guaranteed QoS pods with exclusive cpus. This could require more extensive changes, but OTOH it should minimize the chances of regression and ill effects.
What if we handle this in the cpumanager reconcile loop?
- we have a reconcile loop anyway, and at that stage we will know exactly which pods/containers need to be treated, so the chance of side effects is minimized
- the reconcile loop is manipulating the cgroups (through CRI) anyway
We need to check that we won't race with other kubelet components, though. My knowledge of the innards of the kubelet outside the resource managers is limited, but here we seem to be in the pod creation flow, so it should be OK.
PoC (usual disclaimer: untested, partial, etc. etc.)
diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 443eecd2d36..316a1534edf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -460,7 +460,14 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
m.containerMap.Add(string(pod.UID), container.Name, containerID)
m.Unlock()
- cset := m.state.GetCPUSetOrDefault(string(pod.UID), container.Name)
+ cpuQuota := int64(0)
+ cset, ok := m.state.GetCPUSet(string(pod.UID), container.Name)
+ if ok {
+ cpuQuota = -1 // unlimited - we are assigning full CPUs anyway
+ } else {
+ cset = m.state.GetDefaultCPUSet()
+ }
+
if cset.IsEmpty() {
// NOTE: This should not happen outside of tests.
klog.V(4).InfoS("ReconcileState: skipping container; assigned cpuset is empty", "pod", klog.KObj(pod), "containerName", container.Name)
@@ -470,8 +477,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
lcset := m.lastUpdateState.GetCPUSetOrDefault(string(pod.UID), container.Name)
if !cset.Equals(lcset) {
- klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
- err = m.updateContainerCPUSet(ctx, containerID, cset)
+ klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset, "cpuQuota", cpuQuota)
+ err = m.updateContainerCPUSet(ctx, containerID, cset, cpuQuota)
if err != nil {
klog.ErrorS(err, "ReconcileState: failed to update container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
@@ -510,7 +517,7 @@ func findContainerStatusByName(status *v1.PodStatus, name string) (*v1.Container
return nil, fmt.Errorf("unable to find status for container with name %v in pod status (it may not be running)", name)
}
-func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet) error {
+func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet, cpuQuota int64) error {
// TODO: Consider adding a `ResourceConfigForContainer` helper in
// helpers_linux.go similar to what exists for pods.
// It would be better to pass the full container resources here instead of
@@ -521,6 +528,7 @@ func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string,
&runtimeapi.ContainerResources{
Linux: &runtimeapi.LinuxContainerResources{
CpusetCpus: cpus.String(),
+ CpuQuota: cpuQuota,
},
})
}
/assign
We need to fix the release note label.
/release-note Containers from Guaranteed QoS Pods with statically assigned cpus are no longer limited by the CFS quota.
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: MarSik. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
	}
}
// TODO Make sure the cpu manager reconcile loop executed already? |
there's no way to observe from the outside. We can only wait "long enough"
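A sketch of bounding that wait instead of a fixed sleep: poll the quota file until the reconcile loop (10s period by default, if I read --cpu-manager-reconcile-period right) has acted. cpuQuotaPath is the variable from the test below; the timeouts are guesses:

gomega.Eventually(func() string {
	out, err := os.ReadFile(cpuQuotaPath) // cgroup v2 "cpu.max"
	if err != nil {
		return "" // the file may not exist yet
	}
	return string(out)
}, 30*time.Second, 1*time.Second).Should(
	gomega.Or(gomega.HavePrefix("max"), gomega.HavePrefix("-1")),
	"expected the reconcile loop to lift the CFS quota")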
// podCgroupName, _ := m.GetPodContainerName(pod)
// containerConfig, _ := m.cgroupManager.GetCgroupConfig(podCgroupName, v1.ResourceCPU)
// Find the pinned container pid using the CRI Info interface
// This only works on linux
no bother, cpumanager is a linux-only feature
quotas := string(out)
gomega.Expect(quotas).To(gomega.Or(gomega.HavePrefix("max"), gomega.HavePrefix("-1")), "expected quota == max, got %q", quotas)
} else {
framework.Logf("could not read the cpu quota file: %s: %v", cpuQuotaPath, err)
should we hard fail here? If not, why not?
A missing cpu quota controller is a valid scenario afaik. Not every cgroup level has to have it enabled, so we can continue to the upper directory.
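For reference, the walk could look roughly like this (a hypothetical helper, cgroup v2 file name assumed; not the actual test code):

func readCPUQuota(dir string) (string, bool) {
	for dir != "/" && dir != "." {
		out, err := os.ReadFile(filepath.Join(dir, "cpu.max"))
		if err == nil {
			return strings.TrimSpace(string(out)), true
		}
		// the cpu controller may not be enabled at this level; try the parent
		dir = filepath.Dir(dir)
	}
	return "", false
}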
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
Can this PR be reviewed soon?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@MarSik: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: unknown.
/remove-lifecycle rotten
/retest
@swatisehgal @ffromani This PR looks forgotten. PTAL.
@MarSik: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Guaranteed pod containers with dedicated cpus assigned by the cpu manager should not be throttled by the Linux CFS quota, because those cpus are, well... exclusively and fully assigned.
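For illustration, a pod along these lines (names and image are placeholders) is classified as Guaranteed and, under the static cpu manager policy, gets exclusive cpus because it requests an integer cpu count with requests equal to limits:

pod := &v1.Pod{
	Spec: v1.PodSpec{
		Containers: []v1.Container{{
			Name:  "pinned",
			Image: "registry.k8s.io/pause:3.9",
			Resources: v1.ResourceRequirements{
				// integer cpu, requests default to limits => Guaranteed QoS
				Limits: v1.ResourceList{
					v1.ResourceCPU:    resource.MustParse("2"),
					v1.ResourceMemory: resource.MustParse("256Mi"),
				},
			},
		}},
	},
}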
Which issue(s) this PR fixes:
Fixes #70585
Special notes for your reviewer:
This is a revival of the almost-approved but abandoned PR #107589, originally posted by a colleague no longer working on the project.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: