
Fix the issue in CPU static policy when containers shouldn't be sched… #118021

Open

dastonzerg wants to merge 1 commit into master from zw/fix-kubelet-reserved-cpu

Conversation

dastonzerg

…uled on reserved CPUs.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #115994

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 15, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented May 15, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dastonzerg / name: Weipeng Zheng (5c71de8)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2023
@k8s-ci-robot
Contributor

Welcome @dastonzerg!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @dastonzerg. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 15, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 15, 2023
@haircommander
Contributor

/ok-to-test
/triage accepted

@dastonzerg can you sign off your commit and sign the EasyCLA per the kubernetes contributor guidelines?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 15, 2023
@dastonzerg
Author

Hey @haircommander, I signed 10 minutes ago, but I haven't received the "Look for an email indicating successful signup." email described in https://github.com/kubernetes/community/blob/master/CLA.md#5-look-for-an-email-indicating-successful-signup . Should I wait longer?

@haircommander
Contributor

can you try running

git commit --amend -s
git push -f origin zw/fix-kubelet-reserved-cpu

@dastonzerg dastonzerg force-pushed the zw/fix-kubelet-reserved-cpu branch 2 times, most recently from 5ffc472 to 92860d2 Compare May 15, 2023 20:01
@dastonzerg
Author

/easycla

@dastonzerg
Author

@haircommander Thanks! After that I added the Signed-off-by: Weipeng Zheng <zweipeng@google.com> line to the end of the commit message, and I received an email from admin@lfx.linuxfoundation.org with my signed PDF, which has the full name "Weipeng Zheng" and the email zweipeng@google.com. I then ran /easycla as per https://github.com/kubernetes/community/blob/master/CLA.md#6-validate-your-cla , but the CLA check is still not updated.

@dastonzerg dastonzerg force-pushed the zw/fix-kubelet-reserved-cpu branch 2 times, most recently from c945742 to b6dc80c Compare September 16, 2023 05:17
Signed-off-by: Weipeng Zheng <zweipeng@google.com>
@dastonzerg
Author

/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2

@@ -255,7 +256,7 @@ func (p *staticPolicy) GetAllocatableCPUs(s state.State) cpuset.CPUSet {
 
 // GetAvailableCPUs returns the set of unassigned CPUs minus the reserved set.
 func (p *staticPolicy) GetAvailableCPUs(s state.State) cpuset.CPUSet {
-	return s.GetDefaultCPUSet().Difference(p.reservedCPUs)
+	return s.GetDefaultCPUSet()
Contributor

The logic does not match the comment.
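For illustration only, a minimal sketch of how the code and its doc comment could be brought back in line, assuming the intent of this PR is that the default CPU set no longer contains the reserved CPUs (hypothetical wording, not the committed change):

// GetAvailableCPUs returns the set of unassigned CPUs.
// Assumption: with this PR the default CPU set already excludes the reserved
// CPUs, so no Difference(p.reservedCPUs) is needed here anymore.
func (p *staticPolicy) GetAvailableCPUs(s state.State) cpuset.CPUSet {
	return s.GetDefaultCPUSet()
}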

@SergeyKanzhelev
Member

/approve cancel

as discussed at yesterday's SIG Node meeting

@swatisehgal
Contributor

Regarding the "shared pool cannot be empty" problem, I think after the commit reserved CPUs can be empty, so we can just remove https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_manager.go#L179-L186 . And later on if a gu pod wants to take some exclusive CPUs away from shared pool in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L348, it will just fail by CPU accumulator's check if we change the > to >= https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422. So shared pool can never be empty if we add = to consider the shared pool empty case.

I reviewed the code and can see that the isFailed check happens in the allocation stage; all the pods that are not guaranteed, or that are guaranteed but have a non-integral CPU request, are ruled out before we get to this check, so we would get to it only when trying to allocate exclusive CPUs. So I stand corrected on my comment about best-effort pods (with no CPU request) in this check. I can see how this would ensure that the default pool is never empty, but IIUC we get this by preventing all of the CPUs on the node from being used? We essentially always keep a buffer of 1 CPU on the node. Correct?

If I understand the intention above correctly, I think we need to consider node allocatable here too. Consider this scenario: we have a node with 8 CPUs, and 2 CPUs are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to make scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for placement of the pod, but if it is scheduled on the node it would fail at admission time.

In summary there are 2 problems we are trying to deal with:

  • non-gu pods shouldn't run on reserved CPUs
  • Shared CPU pool should never be empty

I want to share a very compelling point made by @MarSik in our discussion yesterday about the default pool not being empty:

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.

This is copied verbatim from the cgroup-v2 documentation and essentially means that if we try to make the default set empty, the cgroup would inherit the cpuset from its nearest non-empty ancestor, or get all the CPUs, which in the worst case can jeopardize the exclusivity guarantees of all the pinned pods. I couldn't figure out whether the implications are the same for cgroups v1. But with cgroups v2 supported as a stable feature in Kubernetes, we can't ignore the fact that even trying to make the default pool empty could completely backfire. I wonder if this was the very reason for ensuring that the default pool is never empty from the get-go.
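For illustration, a small self-contained Go sketch of one way to observe this behavior on a cgroup-v2 node: it prints cpuset.cpus next to cpuset.cpus.effective for a given cgroup directory (the path below is just an assumed example), and an empty cpuset.cpus with a non-empty effective set is exactly the inheritance described above.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed example path; point this at the pod or container cgroup you care about.
	dir := "/sys/fs/cgroup/kubepods.slice"
	for _, name := range []string{"cpuset.cpus", "cpuset.cpus.effective"} {
		data, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			fmt.Printf("%s: %v\n", name, err)
			continue
		}
		value := strings.TrimSpace(string(data))
		if value == "" {
			value = "(empty: falls back to the nearest non-empty ancestor)"
		}
		fmt.Printf("%s = %s\n", name, value)
	}
}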

@jonyhy96
Contributor

jonyhy96 commented Sep 25, 2023

/approve cancel

as discussed at yesterday's SIG Node meeting

@SergeyKanzhelev Can you share the conclusion of that discussion? Why was this proposal canceled?

@iancoolidge
Contributor

I hope we can find some time to discuss this at sig-node today; I will only be able to join for a few minutes due to some overlapping meetings.

I appreciate the reference to cgroups v2 documentation.

To me, if someone exhausts the shared pool with exclusive allocations, that is WAI, and should be a supported thing. I don't think that anyone is arguing against that point, but I am simply stating my assumption.

Now, as exclusive allocations consume CPUs out of the shared pool and the shared pool becomes empty, I hear the argument about cgroups v2 behavior: the documentation seems to indicate that the parent cgroup's resources will be used instead.

If that's the case, couldn't we simply have a specific check to see if cpuset.cpus is empty, and fail the allocation that way? It seems like a relatively straightforward problem to solve.
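As a rough sketch of the kind of guard being suggested here (names are hypothetical, not existing kubelet code; it only assumes the cpuset package's Difference and IsEmpty helpers):

// guardNonEmptySharedPool is a hypothetical pre-allocation check: if handing
// out the requested exclusive CPUs would leave the shared (default) pool with
// no CPUs at all, reject the allocation instead of writing an empty cpuset.
func guardNonEmptySharedPool(defaultSet, requested cpuset.CPUSet) error {
	if defaultSet.Difference(requested).IsEmpty() {
		return fmt.Errorf("allocation would leave the shared CPU pool empty")
	}
	return nil
}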

@dastonzerg
Author

dastonzerg commented Oct 3, 2023

Regarding the "shared pool cannot be empty" problem, I think after the commit reserved CPUs can be empty, so we can just remove https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_manager.go#L179-L186 . And later on if a gu pod wants to take some exclusive CPUs away from shared pool in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L348, it will just fail by CPU accumulator's check if we change the > to >= https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422. So shared pool can never be empty if we add = to consider the shared pool empty case.

I reviewed the code and can see that isFailed check happens in the allocation stage and all the pods that are not guaranteed or guaranteed with non-integral CPU request are ruled out before we get to this check and we would get to this only if we are trying to allocate exclusive CPUs only. So I stand corrected on my comment about best effort pods (with no CPU request) in this check. I can see how this would ensure that the default pool is never empty but IIUC we get this by preventing all the CPUs from being used on the node? We essentially keep a buffer of 1 CPU always on a node. Correct?

If I understand the intention above correctly, I think we need to consider the node allocatable here too. Consider this scenario: we have a node with 8 cpus, 2 cpus are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to perform scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for the placement of the pod but if it is scheduled on the node it would fail at admission time.

In summary there are 2 problems we are trying to deal with:

  • non-gu pods shouldn't run on reserved CPUs
  • Shared CPU pool should never be empty

I want to share a very compelling point made by @MarSik in our discussion yesterday about default pool not being empty:

An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.

This is copied verbatim from cgroup-v2 documentation and essentially means that if we try to make the defaultset empty it would inherit cgroup from its nearest non-empty ancestor or get all the CPUs which in the worst case can jeopardize exclusivity guarantees of all the pinned pods. I couldn't figure out if the implications are same in case of cgroupsv1 or not. But with cgroupsv2 supported as a stable feature in Kubernetes we can't ignore the fact that even if we try to make the default pool empty it could completely backfire. I wonder if this was the very reason for ensuring that the default pool is never empty from the get go.

@swatisehgal Thanks for the follow-up discussions and context sharing! I am clear on, and agree with, the point that the shared CPU pool should never go empty, for the following reasons based on our discussions so far:

  • It already behaves that way, and we don't want to change the behavior and impact users who already assume it; doing so would be somewhat of a "regression".
  • We support cgroup v2, and allowing the shared CPU pool to go empty could jeopardize the exclusivity guarantees of all the pinned pods.

I think I can address it in this PR too. I was thinking of throwing an error at admission time, e.g. by changing the > to >= in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422, so the shared pool can never be empty once we add = to cover the empty-shared-pool case. Let's discuss whether this approach is good enough, or whether there is a better way, in today's SIG Node meeting.
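As a sketch of that idea (assuming the accumulator's failure predicate has roughly this shape; the exact upstream code may differ), the proposal is to also treat "the request would take every remaining CPU" as a failure, so an exclusive allocation can never drain the pool completely:

// isFailed reports whether the request can no longer be satisfied.
// With ">" only an over-sized request fails; with ">=" a request that would
// consume every remaining CPU is rejected as well, keeping the shared pool
// non-empty at admission time.
func (a *cpuAccumulator) isFailed() bool {
	return a.numCPUsNeeded >= a.details.CPUs().Size()
}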

@swatisehgal
Contributor

swatisehgal commented Oct 4, 2023

I think I can address it in this PR too. I was thinking of throwing an error at admission time, e.g. by changing the > to >= in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/cpu_assignment.go#L420-L422, so the shared pool can never be empty once we add = to cover the empty-shared-pool case. Let's discuss whether this approach is good enough, or whether there is a better way, in today's SIG Node meeting.

Hi, sorry, I was out sick yesterday so I couldn't join the discussion in the SIG Node meeting, and the recording hasn't been uploaded yet. Did we manage to discuss the implications of this approach from a scheduling and end-user perspective? I am referring to my point below:

I think we need to consider node allocatable here too. Consider this scenario: we have a node with 8 CPUs, and 2 CPUs are reserved on the node => 6 allocatable CPUs as per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L252. The nodeAllocatable amount (6) is advertised on the node and subsequently used by the scheduler to make scheduling decisions. Now, if we have a guaranteed pod requesting 6 CPUs, the scheduler would consider the aforementioned node a suitable candidate for placement of the pod, but if it is scheduled on the node it would fail at admission time.

Was the disconnect between the scheduler and the kubelet brought up in the meeting? With the approach you are suggesting, we could end up in runaway pod-creation scenarios similar to #84869.
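Purely as an illustration of that disconnect, a tiny self-contained Go sketch with the numbers from the example above (8 CPUs, 2 reserved), assuming the proposed ">=" admission guard:

package main

import "fmt"

func main() {
	totalCPUs := 8
	reservedCPUs := 2
	allocatable := totalCPUs - reservedCPUs // 6, advertised to the scheduler

	podRequest := 6 // guaranteed pod asking for 6 exclusive CPUs

	schedulerFits := podRequest <= allocatable    // true: the scheduler places the pod on this node
	admissionRejects := podRequest >= allocatable // true under the proposed ">=" guard: the kubelet rejects it

	fmt.Printf("scheduler fits: %v, admission rejects: %v\n", schedulerFits, admissionRejects)
	// Both true means the scheduler keeps selecting the node while the kubelet
	// keeps failing admission, i.e. the runaway pod-creation concern above.
}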

@jonyhy96
Contributor

jonyhy96 commented Oct 11, 2023

I think this PR already fixes the issue that non-gu pods shouldn't run on reserved CPUs, which is a known bug. Better to push this PR forward and open another issue to discuss how to make sure the shared CPU pool is never empty.

@dims dims removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2023
@pacoxu
Member

pacoxu commented Jan 19, 2024

Recording of the Oct 3rd, 2023 SIG Node weekly meeting: https://www.youtube.com/watch?v=HdIURTQSm7Q

@bart0sh
Contributor

bart0sh commented Apr 13, 2024

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2024
@ffromani
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
@SergeyKanzhelev
Member

/remove-lifecycle stale

@ffromani removing stale assuming you still want to follow up here

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
Labels
  • area/kubelet
  • area/test
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • sig/node (Categorizes an issue or PR as relevant to SIG Node.)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

containers get scheduled on --reserved-cpus