
kubelet: do not set CPU quota for guaranteed pods #117030

Open
wants to merge 6 commits into master from do_not_set_quota_guaranteed_pod
Conversation


@MarSik commented Mar 31, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Containers of Guaranteed pods with dedicated CPUs assigned by the CPU manager should not be throttled by the Linux CFS quota, because those CPUs are, well, exclusively and fully assigned to them.
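For context, a minimal sketch of the quota arithmetic involved (illustrative names only, not the kubelet code; the kubelet's own helper is more complete and also enforces a minimum quota):

```go
package main

import "fmt"

// milliCPUToQuota mirrors, in simplified form, the arithmetic used to turn a
// CPU limit in millicores into a CFS quota for a given period.
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
	if milliCPU == 0 {
		return 0
	}
	return milliCPU * periodUs / 1000
}

func main() {
	const periodUs = 100000 // default CFS period: 100ms
	// A Guaranteed container requesting (and limited to) 2 full CPUs.
	quota := milliCPUToQuota(2000, periodUs)
	fmt.Printf("cpu.cfs_period_us=%d cpu.cfs_quota_us=%d\n", periodUs, quota)
	// Prints cpu.cfs_quota_us=200000: even though the container owns two
	// exclusive CPUs, this quota can still throttle it, which is what this
	// PR avoids by not setting the quota in that case.
}
```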

Which issue(s) this PR fixes:

Fixes #70585

Special notes for your reviewer:

This revives the almost-approved but abandoned PR #107589, which was posted by a colleague no longer working on the project.

Does this PR introduce a user-facing change?

Containers from Guaranteed QoS Pods with statically assigned cpus are no longer limited by the CFS quota.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 31, 2023
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.27 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.27.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Mar 31 04:32:14 UTC 2023.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 31, 2023
@k8s-ci-robot
Contributor

Welcome @MarSik!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 31, 2023
@k8s-ci-robot
Contributor

Hi @MarSik. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 31, 2023
@MarSik MarSik force-pushed the do_not_set_quota_guaranteed_pod branch from dec5391 to 51556e7 Compare March 31, 2023 07:10
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 31, 2023
Contributor

@ffromani left a comment


/ok-to-test
/triage accepted
/priority important-longterm

@@ -167,6 +167,11 @@ func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64,
// determine the qos class
qosClass := v1qos.GetPodQOS(pod)

// quota is not capped when cfs quota is disabled or when the pod has guaranteed QoS class
if !enforceCPULimits || qosClass == v1.PodQOSGuaranteed {
Contributor


What about guaranteed QoS Pods which don't have exclusive CPUs assigned (e.g. ones not requesting integer CPUs)?
In addition, I think we should merge this check with the one on line 163 above.

Author


The exclusive cpus are assigned to specific containers. This code deals with the parent pod cgroup.

My understanding is that the pod limits are not that important in the greater scheme of things, because it is the containers themselves that have limits too. The pod is just a sum of containers and the sum is needed due to cgroup inheritance (children cannot consume more than the parent).

Artyom used a simplified approach here and disabled the quota for all Guaranteed Pod cgroups, while leaving it properly configured on the container level. It leaves a bit of a hole, but it makes the code much simpler. The proper check would have to go deep into the cpu manager and find out whether exclusive cpus are or will be assigned to any child container of this pod.
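To illustrate the relationship described above, a minimal hypothetical sketch (not the kubelet code; the -1 "unlimited" convention follows the PoC diff later in this thread) of the pod-level quota as a sum of the container quotas, and of the Guaranteed shortcut this PR takes:

```go
package main

import "fmt"

// containerLimit is a container's CPU limit in millicores (0 = no limit).
type containerLimit struct {
	name     string
	milliCPU int64
}

// podCPUQuota sketches the pod-level aggregation: the pod cgroup quota is the
// sum of the per-container quotas, and under the approach taken in this PR it
// is skipped entirely (-1 meaning "unlimited") for Guaranteed pods, while the
// container-level settings stay as they are.
func podCPUQuota(containers []containerLimit, periodUs int64, guaranteed bool) int64 {
	if guaranteed {
		return -1 // no pod-level cap
	}
	var total int64
	for _, c := range containers {
		total += c.milliCPU * periodUs / 1000
	}
	return total
}

func main() {
	pod := []containerLimit{{"app", 2000}, {"sidecar", 500}}
	fmt.Println(podCPUQuota(pod, 100000, false)) // 250000: pod cap = sum of container quotas
	fmt.Println(podCPUQuota(pod, 100000, true))  // -1: Guaranteed pods get no pod-level cap
}
```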

Author


Ok, this was already discussed in the original PR here: #107589 (comment)

Member

@odinuge commented Mar 31, 2023


Hi,

There is a lot more context around all of this in the original PR #107589, so I think people should read through it. I'll spend some time reviewing this again, but I think everything discussed in that PR still holds true.

The tl;dr is that when there is a CPU limit on all containers and on the pod cgroup, the pod cgroup is sometimes throttled, and that hurts "important" containers even though they are still below their own CPU quota and should not be throttled. This is due to how the accounting is done in the kernel and the fact that the quota refill does not happen at the same time for each cgroup, along with a few other nuances (happy to elaborate when I have more time).

I personally want this as the default (removing the pod-level CFS limit for QoS Guaranteed), with an ability to opt out, but I also want it for pods where all containers have a limit, not only for the QoS Guaranteed ones; effectively never set the pod-level CPU CFS limit by default unless opted in. However, we also need to think about users that want this pod cgroup limit, especially around pod overhead support, where each pod-level cgroup gets a user-defined "bump" in its value. I assume this matters for things like Kata Containers and other container runtimes that are not "cgroup-only" (unlike runc/crun).

So, personally I think a KEP would be a good format to ensure we deal with all the edge cases. But yeah, I really want this merged into mainline k8s, so I'm very happy to help out!

Contributor


Hi,

I agree with pretty much everything @odinuge mentioned (including the fact that I need to re-read the context from #107589), but I kinda disagree about the KEP part. While I fully agree that a proper detailed discussion and a record of it are needed, I don't think the KEP process is best suited here (I'm thinking about graduation stages, PRR...). But we can evaluate options like adopting part of the process: create a more detailed doc to evaluate the cases, discuss them (the sig-node Slack channel? here in the comments?), link it here and/or anywhere else needed, and distill the outcome into code comments to wrap up the change.

Contributor


Yeah, sounds good @ffromani, I think we are very much in line then! The main reason for a KEP would be if we want to introduce some kind of new flag, where we would need to deal with things like graduation, feature flags, test plans, etc. If we end up not doing that, then I agree we probably don't need a KEP! 🙂

Thanks

If it turns out we need user-exposed flags, or the changes needed for a robust solution turn out to be too invasive, then yes, we will need to seriously evaluate whether we need a full-fledged KEP. In that case the prep work we are talking about will help, so it seems a good way forward anyway :)

Author


Folks, I added a code comment explaining the reasoning behind the Guaranteed Pod quota disablement. I was trying to be clear without making it too complicated, so feel free to suggest changes.

Contributor


Another approach would be to disable the quota only for (all the containers belonging to) Guaranteed QoS Pods with exclusive CPUs. This could require more extensive changes, but OTOH it should minimize the chances of regressions and ill effects.

Contributor


What if we handle this in the cpumanager reconcile loop?

  1. we have a reconcile loop anyway, and at that stage we will know exactly which pods/containers need to be treated, so the chance of side effects is minimized
  2. the reconcile loop is manipulating the cgroups (through CRI) anyway

We need to check that we won't race with other kubelet components, though. My knowledge of the kubelet internals outside the resource managers is limited, but here we seem to be in the pod creation flow, so it should be OK.

Contributor

@ffromani commented Apr 5, 2023


PoC (usual disclaimer: untested, partial, etc. etc.)

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 443eecd2d36..316a1534edf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -460,7 +460,14 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        m.containerMap.Add(string(pod.UID), container.Name, containerID)
                        m.Unlock()
 
-                       cset := m.state.GetCPUSetOrDefault(string(pod.UID), container.Name)
+                       cpuQuota := int64(0)
+                       cset, ok := m.state.GetCPUSet(string(pod.UID), container.Name)
+                       if ok {
+                               cpuQuota = -1 // unlimited - we are assigning full CPUs anyway
+                       } else {
+                               cset = m.state.GetDefaultCPUSet()
+                       }
+
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.V(4).InfoS("ReconcileState: skipping container; assigned cpuset is empty", "pod", klog.KObj(pod), "containerName", container.Name)
@@ -470,8 +477,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
 
                        lcset := m.lastUpdateState.GetCPUSetOrDefault(string(pod.UID), container.Name)
                        if !cset.Equals(lcset) {
-                               klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
-                               err = m.updateContainerCPUSet(ctx, containerID, cset)
+                               klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset, "cpuQuota", cpuQuota)
+                               err = m.updateContainerCPUSet(ctx, containerID, cset, cpuQuota)
                                if err != nil {
                                        klog.ErrorS(err, "ReconcileState: failed to update container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
                                        failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
@@ -510,7 +517,7 @@ func findContainerStatusByName(status *v1.PodStatus, name string) (*v1.Container
        return nil, fmt.Errorf("unable to find status for container with name %v in pod status (it may not be running)", name)
 }
 
-func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet) error {
+func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string, cpus cpuset.CPUSet, cpuQuota int64) error {
        // TODO: Consider adding a `ResourceConfigForContainer` helper in
        // helpers_linux.go similar to what exists for pods.
        // It would be better to pass the full container resources here instead of
@@ -521,6 +528,7 @@ func (m *manager) updateContainerCPUSet(ctx context.Context, containerID string,
                &runtimeapi.ContainerResources{
                        Linux: &runtimeapi.LinuxContainerResources{
                                CpusetCpus: cpus.String(),
+                               CpuQuota:   cpuQuota,
                        },
                })
 }

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 31, 2023
@ffromani
Contributor

/assign

@ffromani
Contributor

we need to fix the release note label

@ffromani
Contributor

/release-note Containers from Guaranteed QoS Pods with statically assigned cpus are no longer limited by the CFS quota.

@bart0sh bart0sh added this to Needs Reviewer in SIG Node PR Triage Mar 31, 2023
@adilGhaffarDev
Contributor

/retest

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 20, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MarSik
Once this PR has been reviewed and has the lgtm label, please ask for approval from klueska. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

}
}

// TODO Make sure the cpu manager reconcile loop executed already?
Contributor


There's no way to observe that from the outside. We can only wait "long enough".
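One possible way to bound that wait in a test, instead of sleeping for a fixed time, is to poll the cgroup until the quota is gone. A minimal sketch only, assuming the test file's usual imports (gomega, os, strings, time), a cgroup v2 node, and a cpuMaxPath already resolved by the test; this is not part of the PR:

```go
// waitForQuotaRemoved polls the container's cgroup v2 cpu.max file until the
// quota field reads "max", i.e. until the cpumanager reconcile loop has
// dropped the CFS quota, instead of sleeping blindly and hoping it ran.
func waitForQuotaRemoved(cpuMaxPath string) {
	gomega.Eventually(func() string {
		out, err := os.ReadFile(cpuMaxPath)
		if err != nil {
			return "" // the file may not exist yet; keep polling
		}
		fields := strings.Fields(string(out))
		if len(fields) == 0 {
			return ""
		}
		return fields[0] // first field of cpu.max is the quota: "max" or a number
	}, 1*time.Minute, 2*time.Second).Should(gomega.Equal("max"),
		"expected the CFS quota to be removed once the reconcile loop ran")
}
```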

// podCgroupName, _ := m.GetPodContainerName(pod)
// containerConfig, _ := m.cgroupManager.GetCgroupConfig(podCgroupName, v1.ResourceCPU)
// Find the pinned container pid using the CRI Info interface
// This only works on linux
Contributor


no bother, cpumanager is a linux-only feature

quotas := string(out)
gomega.Expect(quotas).To(gomega.Or(gomega.HavePrefix("max"), gomega.HavePrefix("-1")), "expected quota == max, got %q", quotas)
} else {
framework.Logf("could not read the cpu quota file: %s: %v", cpuQuotaPath, err)
Contributor


should we hard fail here? If not, why not?

Author


A missing cpu quota controller is a valid scenario AFAIK. Not every cgroup level has to have it enabled, so we can continue to the upper (parent) dir.
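For illustration, a sketch of what "continue to the upper dir" could look like; a hypothetical helper assuming a cgroup v2 layout and the usual fmt, os, path/filepath, strings imports, not the code in this PR:

```go
// findCPUMax walks up the cgroup hierarchy from the container's cgroup
// directory and returns the first cpu.max it finds. Levels where the cpu
// controller is not enabled simply do not have the file, so we skip them
// and try the parent instead of failing hard.
func findCPUMax(cgroupDir string) (string, error) {
	for dir := cgroupDir; strings.HasPrefix(dir, "/sys/fs/cgroup"); dir = filepath.Dir(dir) {
		data, err := os.ReadFile(filepath.Join(dir, "cpu.max"))
		if err == nil {
			return strings.TrimSpace(string(data)), nil
		}
		if !os.IsNotExist(err) {
			return "", err // a real error, not just a missing controller at this level
		}
	}
	return "", fmt.Errorf("no cpu.max found above %s", cgroupDir)
}
```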

@swatisehgal
Contributor

/test pull-kubernetes-node-kubelet-serial-cpu-manager

@swatisehgal
Contributor

/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Nov 29, 2023
@inho-cho-redhat

Can this PR be reviewed soon?

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 5, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

/close

SIG Node PR Triage automation moved this from Needs Reviewer to Done Apr 4, 2024
@MarSik
Author

MarSik commented Apr 5, 2024

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Apr 5, 2024
SIG Node PR Triage automation moved this from Done to Triage Apr 5, 2024
@k8s-ci-robot
Contributor

@MarSik: Reopened this PR.

In response to this:

/reopen


@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: unknown.

@MarSik
Author

MarSik commented Apr 5, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 5, 2024
@bart0sh
Contributor

bart0sh commented Apr 5, 2024

/retest

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Apr 5, 2024
@bart0sh
Contributor

bart0sh commented Apr 5, 2024

@swatisehgal @ffromani This PR looks forgotten. PTAL.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 5, 2024

@MarSik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-linter-hints bb5fef5 link false /test pull-kubernetes-linter-hints
pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial bb5fef5 link false /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
pull-kubernetes-node-kubelet-serial-containerd bb5fef5 link false /test pull-kubernetes-node-kubelet-serial-containerd
pull-kubernetes-node-kubelet-serial-crio-cgroupv1 bb5fef5 link false /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Labels
  • area/kubelet
  • area/test
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/bug Categorizes issue or PR as related to a bug.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/node Categorizes an issue or PR as relevant to SIG Node.
  • sig/testing Categorizes an issue or PR as relevant to SIG Testing.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
  • triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
SIG Node PR Triage
Needs Reviewer
Development

Successfully merging this pull request may close these issues.

Disable cpu quota(use only cpuset) for pod Guaranteed