staticCPUs kubelet option to specify a restricted list of CPUs for exclusive assignment #87532
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @vladikr!
Hi @vladikr. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vladikr. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
davidvossel left a comment
The approach seems reasonable to me.
From my perspective, this is the least invasive way for us to express the ability to assign guaranteed QoS containers to a CPU range, making use of the isolcpus kernel setting. This would be a situation where we need to guarantee certain CPUs are ineligible for system work, and we want those CPUs to run high-performance, low-latency workloads.
cmd/kubelet/app/server.go
Outdated
It's probably worth adding some error checking here to validate that the CPU range falls within the range actually available on the host, and that it doesn't overlap with something like --reserved-cpus or any other conflicting options.
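A minimal sketch of what such a check might look like, using the kubelet's cpuset helpers; validateStaticCPUs and its exact call site are hypothetical, not part of this PR:

```go
package cpumanager

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// validateStaticCPUs (hypothetical) checks that the requested static CPU set
// only names CPUs that exist on the host and does not overlap with the CPUs
// reserved via --reserved-cpus.
func validateStaticCPUs(staticCPUs, hostCPUs, reservedCPUs cpuset.CPUSet) error {
	if !staticCPUs.IsSubsetOf(hostCPUs) {
		return fmt.Errorf("static CPUs %q are not all present on this host (%q)",
			staticCPUs.String(), hostCPUs.String())
	}
	if overlap := staticCPUs.Intersection(reservedCPUs); overlap.Size() > 0 {
		return fmt.Errorf("static CPUs %q overlap with reserved CPUs %q",
			staticCPUs.String(), reservedCPUs.String())
	}
	return nil
}
```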
…ve assignment Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
This option restricts the list of CPUs that will be used for exclusive assignment. Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
… cm static policy Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
…icCPUs Signed-off-by: Vladik Romanovsky <vromanso@redhat.com>
/cc @derekwaynecarr
cc @kubernetes/sig-node-pr-reviews
/ok-to-test
/cc @ConnorDoyle
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
derekwaynecarr left a comment
I am finding this slightly confusing, but it's a complicated domain.
Can you help me understand how a typical node administrator would configure all the specified values together to meet the desired effect? It seems like this would ultimately restrict the number of non-integral-core pods that could be placed on the node, since the scheduler would not know the number of CPUs that appear to be implicitly reserved for integral requests.
fs.Int32Var(&c.PodsPerCore, "pods-per-core", c.PodsPerCore, "Number of Pods per core that can run on this Kubelet. The total number of Pods on this Kubelet cannot exceed max-pods, so max-pods will be used if this calculation results in a larger number of Pods allowed on the Kubelet. A value of 0 disables this limit.")
fs.BoolVar(&c.ProtectKernelDefaults, "protect-kernel-defaults", c.ProtectKernelDefaults, "Default kubelet behaviour for kernel tuning. If set, kubelet errors if any of kernel tunables is different than kubelet defaults.")
fs.StringVar(&c.ReservedSystemCPUs, "reserved-cpus", c.ReservedSystemCPUs, "A comma-separated list of CPUs or CPU ranges that are reserved for system and kubernetes usage. This specific list will supersede cpu counts in --system-reserved and --kube-reserved.")
fs.StringVar(&c.StaticCPUs, "static-cpus", c.StaticCPUs, "A comma-separated list of CPUs or CPU ranges that will be used for exclusive assignment. This list restricts the CPUs available for exclusive assignment. Containers with guaranteed QOS will be assigned CPUs only from this list.")
I am trying to understand this proposal better.
Today, if I use either --system-reserved/kube-reserved OR reserved-cpus, I am reducing my node allocatable capacity. If used in concert with a cpu manager policy of static, the specified or low-numbered cores are excluded from exclusive scheduling but remain available to non-G QoS pods.
If I use static-cpus, I am not reducing allocatable, but I am implicitly restricting the number of G versus non-G QoS pods I can run on a node, without any corresponding signal back to the scheduler. Is this an accurate understanding of the proposal? The net is that a kubelet could report allocatable cpu=6, but if static-cpus=1-4, it's really allocatable=6 where 4 can be used by G pods and 2 are left for non-G pods. This seems confusing, but it's possible I am reading this wrong.
If you could show a concrete configuration of isolcpus, reserved-cpus, and static-cpus used in concert, it would help.
Another way to phrase this: prior to this static-cpus proposal, the list of CPU ranges that could be handed out for exclusive assignment dynamically adjusted with the pods that were scheduled to the node. This change makes that dynamism go away, and restricts where non-G pods are able to execute and fit.
Thanks for the review @derekwaynecarr
I am trying to understand this proposal better.
Today, if I use either --system-reserved/kube-reserved OR reserved-cpus, I am reducing my node allocatable capacity. If used in concert with a cpu manager policy of static, the specified or low-numbered cores are excluded from exclusive scheduling but remain available to non-G QoS pods.
This is true from the k8s perspective; however, when the isolcpus kernel option is configured, it will prevent processes running in these non-G QoS pods from executing on the isolated CPUs.
If there are 6 CPUs in total and we configure --reserved-cpus=0-1 and isolcpus=2-5, non-G pods will only be able to execute on CPUs 0-1.
If I use static-cpus, I am not reducing allocatable, but I am implicitly restricting the number of G versus non-G QoS pods I can run on a node, without any corresponding signal back to the scheduler. Is this an accurate understanding of the proposal? The net is that a kubelet could report allocatable cpu=6, but if static-cpus=1-4, it's really allocatable=6 where 4 can be used by G pods and 2 are left for non-G pods. This seems confusing, but it's possible I am reading this wrong.
Yes, that's correct; however, static-cpus only applies to G QoS pods, the same way reserved-cpus only applies to G QoS pods.
If you could show a concrete configuration of isolcpus, reserved-cpus, and static-cpus used in concert, it would help.
I think there will be no need to configure reserved-cpus with the new static-cpus option.
If we have 6 CPUs, with (kernel) isolcpus=2-5 and static-cpus=2-5, exclusive assignments for G pods would come only from CPUs 2-5, while non-G pods keep running on CPUs 0-1.
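For illustration only, a small Go sketch of the set arithmetic this example implies, using the kubelet's cpuset package; the variable names and the exact split are assumptions based on the discussion above, not code from this PR:

```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

func main() {
	// Assumed example values: 6 CPUs total, kernel isolcpus=2-5,
	// --static-cpus=2-5, no --reserved-cpus.
	allCPUs := cpuset.MustParse("0-5")
	staticCPUs := cpuset.MustParse("2-5") // proposed --static-cpus

	// Pool the static policy would draw exclusive (G QoS) assignments from.
	exclusivePool := staticCPUs

	// CPUs outside isolcpus, where non-G pods effectively keep running,
	// since kernel load balancing still applies there.
	sharedCPUs := allCPUs.Difference(staticCPUs)

	fmt.Println("exclusive pool:", exclusivePool) // 2-5
	fmt.Println("shared CPUs:", sharedCPUs)       // 0-1
}
```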
You are right about the scheduler. I don't know how to solve this; I didn't find a way to account for only the G CPUs with the static-cpus option. However, I thought that this would mimic the behavior of the Affinity Manager, which errors if it can't align pod CPUs to the correct NUMA node.
Another way to phrase this: prior to this static-cpus proposal, the list of CPU ranges that could be handed out for exclusive assignment dynamically adjusted with the pods that were scheduled to the node. This change makes that dynamism go away, and restricts where non-G pods are able to execute and fit.
I think the non-G pods are not affected by this change from the k8s perspective; at least I didn't intend to change this behavior.
As for the dynamism, you're right, but it is also gone when the kernel isolcpus option is used in combination with reserved-cpus.
My take on this, and it may be wrong, is that those who configure isolcpus on the node know that they are eventually limiting their G pods to only this list of CPUs. However, without the static-cpus option they didn't have a way to effectively restrict the pool of G CPUs to only the isolcpus set without reducing the node's allocatable capacity.
Q: can the static CPU set ever be used by non-G pods?
Another possible issue: say you have 6 allocatable CPUs on the system but static-cpus=4-5; a pod which requests 4 exclusive CPUs can be scheduled to this node, but since 4 > 2 the kubelet will fail to accommodate the pod. Right?
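A rough sketch of the mismatch being described, assuming a hypothetical helper inside the static policy; nothing here is code from this PR:

```go
package cpumanager

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// canFitExclusive (hypothetical) shows why the kubelet would reject the pod:
// with --static-cpus=4-5 the exclusive pool holds only 2 CPUs, so a request
// for 4 exclusive CPUs passes scheduling (6 allocatable CPUs) but fails here.
func canFitExclusive(requestedCPUs int, staticPool cpuset.CPUSet) error {
	if requestedCPUs > staticPool.Size() {
		return fmt.Errorf("cannot allocate %d exclusive CPUs: only %d in static pool %q",
			requestedCPUs, staticPool.Size(), staticPool.String())
	}
	return nil
}
```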
The need for this feature is mostly driven by the effects of the isolcpus kernel tuning parameter. isolcpus is, for now, the only real way to get very, very low interruption on a CPU, but it also comes at the cost of disabling load-balancing. So if one were to use isolcpus today, the outcome could be very undesirable if, for example, --reserved-cpus and the kernel's isolcpus do not consume all of the CPUs on a host. Let's start with some examples:

Scenario-1: A pod is deployed with a guaranteed policy (it asks for 7 CPUs and it wants very low latency/interruption), and it happens to get placed on CPUs 2-8. Another pod is deployed, but is not guaranteed (it wants 7 CPUs, it is fine with some interruption, but expects load-balancing), and it can run on CPUs 0-1,9-15 (anywhere a guaranteed pod is not running). The outcome here is that the guaranteed pod did not get the very-low latency/interruption CPUs, and the non-guaranteed pod, while seemingly getting lots of CPUs, can really only run on CPUs 0-1 because load-balancing is disabled on CPUs 9-15. So in this case everyone is disappointed :)

Scenario-2:

Scenario-3:

All of these scenarios have issues. With a new feature, --static-cpus, we can do the following:

Scenario-4:

I should note that eventually, when we can get the kernel to make all CPUs except the ones used for --reserved-cpus very low latency/interruption (no need for the isolcpus kernel param), it would not be necessary to use --static-cpus, unless you had a very specific use case, like specifying only the CPUs on one NUMA node, so you could be confident the guaranteed pod got exclusive use of an L3 cache.
@vladikr: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
ipuustin left a comment
I think this is another approach to the CPU pool problem, which is about dividing the available CPUs in the system to various groups with some semantics. The real issue is the workload placement: how to get the right pods running on the right CPU pools?
I submitted a proposal about a pretty similar problem some time ago: kubernetes/community#2739. There the idea was to split the CPUs using QoS classes, so that we could have the right allocatable number of CPUs for the node and get "best effort" workloads running on separate cores, and I seem to remember that isolcpus was also part of the solution space. Also @Levovar has a KEP proposal about making CPU Manager respect the isolcpus setting: kubernetes/community#2435.
var staticCPUs cpuset.CPUSet
var errParse error
if s.StaticCPUs != "" {
I think the idea is that --static-cpus is consumed by the static CPU Manager policy, right? Should there rather be a generic way of configuring the CPU Manager policies? For example, being able to assign a map of configuration parameters to them, or adding a configuration file next to the CPU Manager state file?
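A rough sketch of the kind of generic policy configuration being suggested; PolicyOptions and staticCPUsFromOptions are hypothetical names, not an existing kubelet API:

```go
package cpumanager

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
)

// PolicyOptions is a hypothetical generic bag of per-policy settings,
// e.g. {"static-cpus": "2-5"}, passed to the selected CPU Manager policy.
type PolicyOptions map[string]string

// staticCPUsFromOptions parses an optional "static-cpus" entry; an absent or
// empty entry means "no restriction", mirroring the PR's behaviour when
// --static-cpus is unset.
func staticCPUsFromOptions(opts PolicyOptions) (cpuset.CPUSet, error) {
	v, ok := opts["static-cpus"]
	if !ok || v == "" {
		return cpuset.NewCPUSet(), nil
	}
	parsed, err := cpuset.Parse(v)
	if err != nil {
		return cpuset.NewCPUSet(), fmt.Errorf("invalid static-cpus %q: %v", v, err)
	}
	return parsed, nil
}
```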
reservedCPUsFloat := float64(reservedCPUs.MilliValue()) / 1000
numReservedCPUs := int(math.Ceil(reservedCPUsFloat))
policy, err = NewStaticPolicy(topo, numReservedCPUs, specificCPUs, affinity)
policy, err = NewStaticPolicy(topo, numReservedCPUs, specificCPUs, staticCPUs, affinity)
I have a pending PR (#87522) which also adds a new parameter to static policy creation indicating the available CPUs (using a cpuset cgroup file). These seem to be related approaches.
@vladikr: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@vladikr @davidvossel @derekwaynecarr @ipuustin @atheurer This is a PR from a long time ago, but I am very interested in this topic; the problem really happened with Scenarios 1 to 3.
What type of PR is this?
What this PR does / why we need it:
#83592 introduced an option to set --reserved-cpus to explicitly define the list of CPUs which the kubelet will reserve for system use. These reserved CPUs will not be used by the CPU manager for exclusive assignment to containers with guaranteed QoS. However, when this option is used together with the isolcpus kernel option on a node, all pods with a non-guaranteed QoS will only be able to execute on the CPUs defined by --reserved-cpus, since the CPUs defined by the isolcpus parameter are removed from kernel SMP balancing and scheduling.
If the number of reserved CPUs is low, there will be few CPUs available for non-guaranteed pods to run on. On the other hand, increasing reserved-cpus reduces the Capacity reported by this node and creates a scheduling problem.
This PR introduces a new kubelet argument --static-cpus and a StaticCPUs config option, in order to restrict the list of CPUs that will be used for exclusive assignment to containers with guaranteed QoS to only the specified list.
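As a sketch of how the StaticCPUs config option might sit next to the existing reserved-cpus field; the struct below is heavily abridged and illustrative, only the two field names come from the flag registrations shown earlier in this thread:

```go
package kubeletconfig

// KubeletConfiguration (abridged, illustrative).
type KubeletConfiguration struct {
	// ReservedSystemCPUs is the existing explicit list of CPUs or CPU ranges
	// reserved for system and kubernetes usage (--reserved-cpus).
	ReservedSystemCPUs string

	// StaticCPUs restricts exclusive (guaranteed QoS) CPU assignment to the
	// listed CPUs or CPU ranges (--static-cpus).
	StaticCPUs string
}
```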
With this parameter, the amount of reserved-cpus can stay low while still providing more CPUs for non-guaranteed pods, without reducing the node Capacity.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: