
Fix the issue in CPU static policy when containers shouldn't be sched… #118021

Open

dastonzerg wants to merge 1 commit into master from zw/fix-kubelet-reserved-cpu

Conversation

dastonzerg

…uled on reserved CPUs.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #115994

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 15, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented May 15, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dastonzerg / name: Weipeng Zheng (5c71de8)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2023
@k8s-ci-robot
Contributor

Welcome @dastonzerg!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @dastonzerg. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 15, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 15, 2023
@haircommander
Contributor

/ok-to-test
/triage accepted

@dastonzerg can you sign off your commit and sign the EasyCLA per the kubernetes contributor guidelines?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2023
@dastonzerg
Author

Hey @haircommander, I signed 10 minutes ago, but I haven't received the "Look for an email indicating successful signup." email described in https://github.com/kubernetes/community/blob/master/CLA.md#5-look-for-an-email-indicating-successful-signup . Should I wait longer?

@haircommander
Contributor

can you try running

git commit --amend -s
git push -f origin zw/fix-kubelet-reserved-cpu

@dastonzerg dastonzerg force-pushed the zw/fix-kubelet-reserved-cpu branch 2 times, most recently from 5ffc472 to 92860d2 Compare May 15, 2023 20:01
@dastonzerg
Author

/easycla

@dastonzerg
Author

@haircommander Thanks! After that I added the Signed-off-by: Weipeng Zheng <zweipeng@google.com> line to the end of the commit message, and I received an email from admin@lfx.linuxfoundation.org with my signed PDF, which has the full name "Weipeng Zheng" and the email zweipeng@google.com. I then ran /easycla as per https://github.com/kubernetes/community/blob/master/CLA.md#6-validate-your-cla , but the CLA check is still not updated.

@dastonzerg dastonzerg force-pushed the zw/fix-kubelet-reserved-cpu branch 2 times, most recently from c945742 to b6dc80c Compare September 16, 2023 05:17
Signed-off-by: Weipeng Zheng <zweipeng@google.com>
@dastonzerg
Author

/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2

@@ -255,7 +256,7 @@ func (p *staticPolicy) GetAllocatableCPUs(s state.State) cpuset.CPUSet {
 
 // GetAvailableCPUs returns the set of unassigned CPUs minus the reserved set.
 func (p *staticPolicy) GetAvailableCPUs(s state.State) cpuset.CPUSet {
-	return s.GetDefaultCPUSet().Difference(p.reservedCPUs)
+	return s.GetDefaultCPUSet()
Contributor

The logic does not match the comment.
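For illustration only, a minimal sketch of how the code and its doc comment could be brought back in line, assuming the intent of this PR is that the default CPU set no longer contains the reserved CPUs (hypothetical wording, not the committed change):

// GetAvailableCPUs returns the set of unassigned CPUs.
// Assumption: with this PR the default CPU set already excludes the reserved
// CPUs, so no Difference(p.reservedCPUs) is needed here anymore.
func (p *staticPolicy) GetAvailableCPUs(s state.State) cpuset.CPUSet {
	return s.GetDefaultCPUSet()
}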

@SergeyKanzhelev
Member

/approve cancel

as discussed at yesterday's SIG Node meeting

@swatisehgal
Contributor

Regarding the "shared pool cannot be empty" problem, I think after the commit reserved CPUs can be empty, so we can just remove https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_manager.go#L179-L186 . And later on if a gu pod wants to take some exclusive CPUs away from shared pool in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L348, it will just fail by CPU accumulator's check if we change the > to >= https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422. So shared pool can never be empty if we add = to consider the shared pool empty case.

I reviewed the code and can see that the isFailed check happens in the allocation stage; all the pods that are not guaranteed, or that are guaranteed but have a non-integral CPU request, are ruled out before we get to this check, so we would get to it only when trying to allocate exclusive CPUs. So I stand corrected on my comment about best-effort pods (with no CPU request) in this check. I can see how this would ensure that the default pool is never empty, but IIUC we get this by preventing all of the CPUs on the node from being used? We essentially always keep a buffer of 1 CPU on the node. Correct?

If I understand the intention above correctly, I think we need to consider node allocatable here too. Consider this scenario: we have a node with 8 CPUs, and 2 CPUs are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to make scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for placement of the pod, but if it is scheduled on the node it would fail at admission time.

In summary there are 2 problems we are trying to deal with:

  • non-gu pods shouldn't run on reserved CPUs
  • Shared CPU pool should never be empty

I want to share a very compelling point made by @MarSik in our discussion yesterday about the default pool not being empty:

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.

This is copied verbatim from the cgroup-v2 documentation and essentially means that if we try to make the default set empty, the cgroup would inherit the cpuset from its nearest non-empty ancestor, or get all the CPUs, which in the worst case can jeopardize the exclusivity guarantees of all the pinned pods. I couldn't figure out whether the implications are the same for cgroups v1. But with cgroups v2 supported as a stable feature in Kubernetes, we can't ignore the fact that even trying to make the default pool empty could completely backfire. I wonder if this was the very reason for ensuring that the default pool is never empty from the get-go.
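For illustration, a small self-contained Go sketch of one way to observe this behavior on a cgroup-v2 node: it prints cpuset.cpus next to cpuset.cpus.effective for a given cgroup directory (the path below is just an assumed example), and an empty cpuset.cpus with a non-empty effective set is exactly the inheritance described above.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed example path; point this at the pod or container cgroup you care about.
	dir := "/sys/fs/cgroup/kubepods.slice"
	for _, name := range []string{"cpuset.cpus", "cpuset.cpus.effective"} {
		data, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			fmt.Printf("%s: %v\n", name, err)
			continue
		}
		value := strings.TrimSpace(string(data))
		if value == "" {
			value = "(empty: falls back to the nearest non-empty ancestor)"
		}
		fmt.Printf("%s = %s\n", name, value)
	}
}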

@jonyhy96
Contributor

jonyhy96 commented Sep 25, 2023

/approve cancel

as discussed at yesterday's SIG Node meeting

@SergeyKanzhelev Can you share the conclusion of that discussion? Why was this proposal canceled?

@iancoolidge
Contributor

I hope we can find some time to discuss this at sig-node today; I will only be able to join for a few minutes due to some overlapping meetings.

I appreciate the reference to cgroups v2 documentation.

To me, if someone exhausts the shared pool with exclusive allocations, that is WAI, and should be a supported thing. I don't think that anyone is arguing against that point, but I am simply stating my assumption.

Now, as exclusive allocations consume CPUs out of the shared pool and the shared pool becomes empty, I hear the argument about cgroups v2 behavior: the documentation seems to indicate that the parent cgroup's resources will be used instead.

If that's the case, couldn't we simply have a specific check to see if cpuset.cpus is empty, and fail the allocation that way? It seems like a relatively straightforward problem to solve.
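As a rough sketch of the kind of guard being suggested here (names are hypothetical, not existing kubelet code; it only assumes the cpuset package's Difference and IsEmpty helpers):

// guardNonEmptySharedPool is a hypothetical pre-allocation check: if handing
// out the requested exclusive CPUs would leave the shared (default) pool with
// no CPUs at all, reject the allocation instead of writing an empty cpuset.
func guardNonEmptySharedPool(defaultSet, requested cpuset.CPUSet) error {
	if defaultSet.Difference(requested).IsEmpty() {
		return fmt.Errorf("allocation would leave the shared CPU pool empty")
	}
	return nil
}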

@dastonzerg
Author

dastonzerg commented Oct 3, 2023

Regarding the "shared pool cannot be empty" problem, I think after the commit reserved CPUs can be empty, so we can just remove https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_manager.go#L179-L186 . And later on if a gu pod wants to take some exclusive CPUs away from shared pool in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L348, it will just fail by CPU accumulator's check if we change the > to >= https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422. So shared pool can never be empty if we add = to consider the shared pool empty case.

I reviewed the code and can see that isFailed check happens in the allocation stage and all the pods that are not guaranteed or guaranteed with non-integral CPU request are ruled out before we get to this check and we would get to this only if we are trying to allocate exclusive CPUs only. So I stand corrected on my comment about best effort pods (with no CPU request) in this check. I can see how this would ensure that the default pool is never empty but IIUC we get this by preventing all the CPUs from being used on the node? We essentially keep a buffer of 1 CPU always on a node. Correct?

If I understand the intention above correctly, I think we need to consider the node allocatable here too. Consider this scenario: we have a node with 8 cpus, 2 cpus are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to perform scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for the placement of the pod but if it is scheduled on the node it would fail at admission time.

In summary there are 2 problems we are trying to deal with:

  • non-gu pods shouldn't run on reserved CPUs
  • Shared CPU pool should never be empty

I want to share a very compelling point made by @MarSik in our discussion yesterday about default pool not being empty:

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.

This is copied verbatim from cgroup-v2 documentation and essentially means that if we try to make the defaultset empty it would inherit cgroup from its nearest non-empty ancestor or get all the CPUs which in the worst case can jeopardize exclusivity guarantees of all the pinned pods. I couldn't figure out if the implications are same in case of cgroupsv1 or not. But with cgroupsv2 supported as a stable feature in Kubernetes we can't ignore the fact that even if we try to make the default pool empty it could completely backfire. I wonder if this was the very reason for ensuring that the default pool is never empty from the get go.

@swatisehgal Thanks for the follow-up discussions and context sharing! I am clear on, and agree with, the point that the shared CPU pool should never go empty, for the following reasons based on our discussions so far:

  • It already behaves that way, and we don't want to change the behavior and impact users who already assume it; doing so would be somewhat of a "regression".
  • We support cgroup v2, and allowing the shared CPU pool to go empty could jeopardize the exclusivity guarantees of all the pinned pods.

I think I can address it in this PR too. I was thinking of throwing an error at admission time, e.g. by changing the > to >= in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422, so the shared pool can never be empty once we add = to cover the empty-shared-pool case. Let's discuss whether this approach is good enough, or whether there is a better way, in today's SIG Node meeting.
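As a sketch of that idea (assuming the accumulator's failure predicate has roughly this shape; the exact upstream code may differ), the proposal is to also treat "the request would take every remaining CPU" as a failure, so an exclusive allocation can never drain the pool completely:

// isFailed reports whether the request can no longer be satisfied.
// With ">" only an over-sized request fails; with ">=" a request that would
// consume every remaining CPU is rejected as well, keeping the shared pool
// non-empty at admission time.
func (a *cpuAccumulator) isFailed() bool {
	return a.numCPUsNeeded >= a.details.CPUs().Size()
}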

@swatisehgal
Contributor

swatisehgal commented Oct 4, 2023

I think I can address it in this PR too. I was thinking of throwing an error at admission time, e.g. by changing the > to >= in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422, so the shared pool can never be empty once we add = to cover the empty-shared-pool case. Let's discuss whether this approach is good enough, or whether there is a better way, in today's SIG Node meeting.

Hi, sorry, I was out sick yesterday so I couldn't join the discussion in the SIG Node meeting, and the recording hasn't been uploaded yet. Did we manage to discuss the implications of this approach from a scheduling and end-user perspective? I am referring to my point below:

I think we need to consider node allocatable here too. Consider this scenario: we have a node with 8 CPUs, and 2 CPUs are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to make scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for placement of the pod, but if it is scheduled on the node it would fail at admission time.

Was the disconnect between the scheduler and the kubelet brought up in the meeting? With the approach you are suggesting, we could end up in runaway pod-creation scenarios similar to #84869.
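Purely as an illustration of that disconnect, a tiny self-contained Go sketch with the numbers from the example above (8 CPUs, 2 reserved), assuming the proposed ">=" admission guard:

package main

import "fmt"

func main() {
	totalCPUs := 8
	reservedCPUs := 2
	allocatable := totalCPUs - reservedCPUs // 6, advertised to the scheduler

	podRequest := 6 // guaranteed pod asking for 6 exclusive CPUs

	schedulerFits := podRequest <= allocatable    // true: the scheduler places the pod on this node
	admissionRejects := podRequest >= allocatable // true under the proposed ">=" guard: the kubelet rejects it

	fmt.Printf("scheduler fits: %v, admission rejects: %v\n", schedulerFits, admissionRejects)
	// Both true means the scheduler keeps selecting the node while the kubelet
	// keeps failing admission, i.e. the runaway pod-creation concern above.
}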

@jonyhy96
Contributor

jonyhy96 commented Oct 11, 2023

I think this PR already fixes the issue that non-gu pods shouldn't run on reserved CPUs, which is a known bug. Better to push this PR forward and open another issue to discuss how to make sure the shared CPU pool is never empty.

@dims dims removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2023
@pacoxu
Member

pacoxu commented Jan 19, 2024

Recording of the Oct 3rd, 2023 SIG Node weekly meeting: https://www.youtube.com/watch?v=HdIURTQSm7Q

@bart0sh
Contributor

bart0sh commented Apr 13, 2024

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2024
@ffromani
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
@SergeyKanzhelev
Member

/remove-lifecycle stale

@ffromani removing stale assuming you still want to follow up here

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
Labels
  • area/kubelet
  • area/test
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • sig/node (Categorizes an issue or PR as relevant to SIG Node.)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

containers get scheduled on --reserved-cpus