kubelet: do not set CPU quota for guaranteed pods #117030

MarSik · 2023-03-31T07:09:53Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Guaranteed pod containers with dedicated cpus assigned by cpu manager should not be throttled by the linux CFS quota, because the cpus are well.. exclusively and fully assigned.

Which issue(s) this PR fixes:

Fixes #70585

Special notes for your reviewer:

This is a revive of an almost approved abandoned PR #107589 that was posted by a colleague no longer working on the project.

Does this PR introduce a user-facing change?

Containers from Guaranteed QoS Pods with statically assigned cpus are no longer limited by the CFS quota.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2023-03-31T07:09:57Z

Please note that we're already in Test Freeze for the release-1.27 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.27.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Mar 31 04:32:14 UTC 2023.

k8s-ci-robot · 2023-03-31T07:10:03Z

Welcome @MarSik!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-03-31T07:10:04Z

Hi @MarSik. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ffromani

/ok-to-test
/triage accepted
/priority important-longterm

ffromani · 2023-03-31T08:00:23Z

pkg/kubelet/cm/helpers_linux.go

@@ -167,6 +167,11 @@ func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64,
 	// determine the qos class
 	qosClass := v1qos.GetPodQOS(pod)

+	// quota is not capped when cfs quota is disabled or when the pod has guaranteed QoS class
+	if !enforceCPULimits || qosClass == v1.PodQOSGuaranteed {


What about guaranteed QoS PODs which don't have exclusive cpus assigned? (e.g. not requesting integer CPUs?)
In addition, I think we should merge this check with the one on line 163 above

The exclusive cpus are assigned to specific containers. This code deals with the parent pod cgroup.

My understanding is that the pod limits are not that important in the greater scheme of things, because it is the containers themselves that have limits too. The pod is just a sum of containers and the sum is needed due to cgroup inheritance (children cannot consume more than the parent).

Artyom used a simplified aproach here and disabled the quota for all Guaranteed Pod cgroups, while leaving it properly configured on the container level. It leaves a bit of a hole, however it makes the code much simpler. The proper check would have to go deep into cpu manager and find out whether exclusive cpus are or will be assigned to any child container of this pod.

Ok, this was already discussed in the original PR here: #107589 (comment)

Hi,

There is a lot more context around all of this in the original PR #107589, so I think people should read through it. I'll spend some time reviewing this again, but I think everything discussed in that PR still holds true.

the tl;dr is that when there is a cpu limit on all containers and the pod cgroup, sometimes the pod cgroup is throttled, and it hurts "important" containers, while they are still below their cpu quota, and is not throttled. This is due to how the accounting in the kernel is done, and the fact that the refill does not happen at the same time, along with a few other nuances around it (happy to elaborate when I have more time).

I personally want this as the default (removing the pod level cfs limit for qos guarnteed), with an ability to opt-out, but I also want it for pods where all containers have a limit, not only for the qos guarnteed one; effectively never set the pod level cpu cfs limit by default - unless opted-in. However, we need to also think about users that want this pod cgroup limit, and especially around the pod overhead support, where each pod level cgroup get a user defined "bump" in its value. I assume things like kata containers and other non "cgroup-only" (or what to call things using runc/crun) container runtimes.

Soo, I think personally a KEP would be a good format to ensure we deal with all the edge cases. But yeah, I really relly want this merged into mainline k8s, so I'm very happy to help out!

Hi,

I agree with pretty much all @odinuge mentioned (including the fact I need to re-read context from #107589 ) but I kinda disagree about the KEP part. While I fully agree proper detailed discussion and record of it are needed, I don't think the KEP process is the best suited (I'm thinking about graduation stages, PRR...). But We can evaluate options like adopting part of the process: create a more detailed doc to evaluate the cases, discuss them (sig-node slack chan? here in the comments?), link it here and/or anywhere needed and distill the outcome as code comments to wrap up the change.

Yeah, sounds good @ffromani, I think we are very much in line then! Main reason for a KEP is if we want introduce some kind of new flag that we need to deal with things like graduation, feature flags, test plans etc. If we end up not doing that, then I agree we probably don't need a KEP! smile

Thanks

If it turns out we need user-exposed flags or the changes we need to make a robust solution turn out too invasive, then yes, we will need to seriously evaluate if we need a full fledged KEP. In that case the prep work we are talking about will help, so it seems a good way forward anyway :)

Folks, I added a code comment explaining the reasoning behind the Guaranteed Pod quota disablement. I was trying to be clear enough without making it too complicated so feel free to suggest changes.

Another approach would be disable the quota ONLY for (all the containers belonging to) Guaranteed QoS Pods with exclusive cpus only. This could require more extensive changes but OTOH should minimize the chances of regression and ill effects.

What if we handle this in the cpumanager reconcile loop?

we have a reconcile loop anyway, and at that stage we will now exactly which pods/containers need to be treated, the chance of side effect is minimized

the reconcile loop is manipulating the cgroups (through CRI) anyway

we need to check we won't race with other kubelet components though. My knowledge of innards of kubelet outside the resource managers is limited, but here we seem to be in the pod creation flow, so it should be OK

PoC (usual disclaimer: untested, partial, etc. etc.)

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go index 443eecd2d36..316a1534edf 100644 --- a/pkg/kubelet/cm/cpumanager/cpu_manager.go +++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go @@ -460,7 +460,14 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec m.containerMap.Add(string(pod.UID), container.Name, containerID) m.Unlock() - cset := m.state.GetCPUSetOrDefault(string(pod.UID), container.Name) + cpuQuota := int64(0) + cset, ok := m.state.GetCPUSet(string(pod.UID), container.Name) + if ok { + cpuQuota = -1 // unlimited - we are assigning full CPUs anyway + } else { + cset = m.state.GetDefaultCPUSet() + } + if cset.IsEmpty() { // NOTE: This should not happen outside of tests. klog.V(4).InfoS("ReconcileState: skipping container; assigned cpuset is empty", "pod", klog.KObj(pod), "containerName", container.Name) @@ -470,8 +477,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec lcset := m.lastUpdateState.GetCPUSetOrDefault(string(pod.UID), container.Name) if !cset.Equals(lcset) { - klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset) - err = m.updateContainerCPUSet(ctx, containerID, cset) + klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset, "cpuQuota", cpuQuota) + err = m.updateContainerCPUSet(ctx, containerID, cset, cpuQuota) if err != nil { klog.ErrorS(err, "ReconcileState: failed to update container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset) failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID}) @@ -510,7 +517,7 @@ func findContainerStatusByName(status *v1.PodStatus, name string) (*v1.Container return nil, fmt.Errorf("unable to find status for container with name %v in pod status (it may not be running)", name) } -func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet) error { +func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet, cpuQuota int64) error { // TODO: Consider adding a `ResourceConfigForContainer` helper in // helpers_linux.go similar to what exists for pods. // It would be better to pass the full container resources here instead of @@ -521,6 +528,7 @@ func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, &runtimeapi.ContainerResources{ Linux: &runtimeapi.LinuxContainerResources{ CpusetCpus: cpus.String(), + CpuQuota: cpuQuota, }, }) }

ffromani · 2023-03-31T08:01:17Z

/assign

ffromani · 2023-03-31T08:01:46Z

we need to fix the release note label

ffromani · 2023-03-31T08:02:03Z

/release-note Containers from Guaranteed QoS Pods with statically assigned cpus are no longer limited by the CFS quota.

adilGhaffarDev · 2023-11-14T08:15:44Z

/retest

k8s-ci-robot · 2023-11-20T14:20:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MarSik
Once this PR has been reviewed and has the lgtm label, please ask for approval from klueska. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS
test/e2e_node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ffromani · 2023-11-24T08:31:56Z