KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3334

arpitsardhana · 2022-06-02T16:45:20Z

One-line PR description: Adding new KEP
Issue link: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3327

linux-foundation-easycla · 2022-06-02T16:45:22Z

The committers listed above are authorized under a signed CLA.

✅ login: arpitsardhana / name: Arpit Singh (32b4019)

k8s-ci-robot · 2022-06-02T16:45:27Z

Welcome @arpitsardhana!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2022-06-02T16:45:28Z

Hi @arpitsardhana. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

klueska · 2022-06-02T16:58:37Z

/ok-to-test

swatisehgal

Thanks for the PR.

The KEP needs minor updates related to feature gates but overall looks promising. The feature seems useful as it provides us with another knob to tweak the CPU Management behavior.

We should carefully consider how this change would work on systems with multiple NUMA nodes per socket (e.g. DualSocketMultiNumaPerSocketHT) as well as systems with multiple NUMA sockets per NUMA (e.g. DualNumaMultiSocketPerNumaHT). Would be nice to see that captured as part of the KEP.

keps/sig-node/3327-align-by-socket/README.md

keps/sig-node/3327-align-by-socket/kep.yaml

swatisehgal · 2022-06-03T10:17:50Z

keps/sig-node/3327-align-by-socket/README.md

+```
+At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
+
+In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.


This is true for systems with multiple NUMA nodes per Socket but not for the ones with multiple sockets per NUMA. In the latter case with TopologyManager “single-numa-node” policy, we could potentially get CPUs that span multiple sockets and with “align-by-socket” policy option enabled the goal should be to try to align those by socket?

This is something, i will need to think more!
This KEP covers cases of DualSocketMultiNUMAPerSocketHT
Let me think on situation where a single NUMA can span multiple sockets (ex DualNumaMultiSocketPerNumaHT) and update once i have some definitive answer

Can we please have few examples of architectures we are targeting where a single NUMA can span multiple sockets?
I'm aware such architecture existed in the past, but I'm not sure the hardware is heading in this direction.

I think we could do one of 2 things here:

Change the flag to have semantics of align-by-socket-or-numa-whichever-wraps-the-other(bad name, but you get the point).

Error out if align-by-socket is set and the machine has more than one socket per numa node.

I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.

Updated the KEP, i am leaning towards erring out in case there are more than one socket per NUMA
We can address this use case if we see real world architecture leaning in that direction.
I am open to suggestions

I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.

+1 on erroring out for now. It is a rare case, but might still exist for VMs. Some large VM might span multiple sockets using vNUMA.

klueska

I left a few comments. In general, the change is fairly straightforward and easy to reason about. The main open questions are around the edge cases and how this new option will interact with existing options.

That said, the text of the KEP could still use some polishing, with two specific general improvements that could be made:

Split paragraphs into more than one line so that we can comment on specific sentences instead of a whole paragraph.
Put backticks around all of the things that should have backticks (e.g. CPU'Manager, DeviceManager, etc.)

keps/prod-readiness/sig-node/3327.yaml

keps/sig-node/3327-align-by-socket/README.md

klueska · 2022-06-07T14:56:04Z

keps/sig-node/3327-align-by-socket/README.md

+```
+At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
+
+In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.


I think we could do one of 2 things here:

Change the flag to have semantics of align-by-socket-or-numa-whichever-wraps-the-other(bad name, but you get the point).

Error out if align-by-socket is set and the machine has more than one socket per numa node.

I would lean towards 2, as I think it is rare to find a machine with this configuration nowadays.

arpitsardhana · 2022-06-09T05:28:14Z

Thanks for comments! I have addressed majority of review comments.
However, few of things that may require further updates are 1) Addressing corner cases 2) more explanation on allocateCPU() changes 3) May be some more polishing around framing of KEP

linux-foundation-easycla · 2022-06-13T15:40:33Z

The committers listed above are authorized under a signed CLA.

✅ login: arpitsardhana / name: Arpit Singh (32b4019, baa7ccd)
✅ login: sanjaychatterjee / name: Sanjay Chatterjee (3cef13f)

klueska · 2022-06-15T13:10:03Z

@johnbelamaric can we get a PRR review on this please?

arpitsardhana · 2022-06-15T19:24:23Z

/assign @johnbelamaric

ffromani

non-binding lgtm
non-blocking comments about possible improvements and clarifications
ping me anytime to get a final review (and so the proper LGTM, likely)

ffromani · 2022-06-17T08:36:31Z

keps/sig-node/3327-align-by-socket/README.md

+```
+At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
+
+In case TopologyManager “single-numa-node” policy is enabled, the policy option of “align-by-socket” is redundant since allocation guarantees within the same numa are by definition socket aligned. Hence, we will error out in case the policy option of “align-by-socket” is enabled in conjunction with TopologyManager single-numa-node policy.


I agree with @klueska and I prefer approach number 2 as well. Would be fine by me.

ffromani · 2022-06-17T08:37:33Z

keps/sig-node/3327-align-by-socket/README.md

+###### How can someone using this feature know that it is working for their instance?
+
+In order to verify this feature is working, one should:
+Pick a node with at least 2 Sockets and 8 NUMA nodes


non-blocking, asking for my understanding. Would a system with 2 sockets and 4 NUMA nodes be OK as well?

Yes, 2 socket and 4 NUMA is ok as well.
Above line is fixed in latest version.

ffromani · 2022-06-17T08:39:15Z

keps/sig-node/3327-align-by-socket/README.md

+
+Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.
+
+###### How can someone using this feature know that it is working for their instance?


AFAIK, but I'll obviously defer the autorithative answer to PRRs, that this is meant to ask for more readily observable options, like metrics, or object statuses, or system properties. This (detailed) tests works of course.

Thanks for pointing out! There was not no relevant metric for CPUManager policy option, so put down the test case taking a cue from other policy options KEP.

ffromani · 2022-06-17T08:40:36Z

keps/sig-node/3327-align-by-socket/README.md

+
+To verify the list of CPUs allocated to the container, one can either:
+- `exec` into uthe container and run `taskset -cp 1` (assuming this command is available in the container).
+- Call the `GetCPUS()` method of the `CPUProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).


I think it is sufficient to connect to the podresources API and send a List call; inspecting the input should provide all the needed informations.
I know of a few command line podresources clients, but not an recommended tool (no crictl equivalent to say).

arpitsardhana · 2022-06-17T15:50:11Z

/assign @derekwaynecarr @wojtek-t

klueska · 2022-06-21T09:10:10Z

Thanks for putting this enhancement together @arpitsardhana.

In general, I think allowing alignment at boundaries other than a single NUMA node is a useful feature to have. This change, in particular, is designed to increase the scope of alignment for CPUs to the socket boundary on architectures where there are more than one NUMA node per socket. In this way, we are able to treat all CPUs on a socket as perfectly aligned with other resources attached to any individual NUMA node on that socket.

I can see how this would be useful given that the ratio of requested CPUs to other requested resources by any given pod tends to be quite high (e.g. 30 CPUs for every 1 GPU). Since the number of CPUs per NUMA node decreases as the number of NUMA nodes per socket increases, it can be hard to achieve perfect alignment unless the scope for CPU alignment is increased in some way.

My only (non-blocking) concerns with the proposed approach are the following:

Is it too limiting to force alignment to the socket boundary? Are there cases where we would want a different boundary (e.g. if there are 4 NUMA nodes per socket, do we ever want to reduce the scope of alignment to just 2 NUMA nodes instead of all 4?). I don't think this concern should block this proposal for alpha, but we may want to consider a redesign of the flagname / API for beta.
Does it really make sense to apply this new alignment to just the CPUManager? It seems as if we could have introduced a new TopologyManager policy called single-socket (mimicking the existing single-numa-node policy) that maybe would have achieved the same thing. Again, I think it's fine to move forward with the proposed approach as an alpha implementation, but we should consider this alternative a bit more closely before moving to beta.

/lgtm
/approve

ffromani · 2022-06-21T12:06:18Z

/lgtm
I agree with the approach suggested by @klueska

dchen1107 · 2022-06-21T16:33:22Z

keps/sig-node/3327-align-by-socket/README.md

+
+| Risk                                             | Impact | Mitigation |
+| -------------------------------------------------| -------| ---------- |
+| Bugs in the implementation lead to kubelet crash | High   | Disable the policy option and restart the kubelet. The workload will run but CPU allocations can spread across socket in cases when allocation could have been within same socket |


Are there any feedback / status back to the users if the kubelet meet its CPU alignment request or not?

dchen1107 · 2022-06-21T18:26:21Z

keps/sig-node/3327-align-by-socket/README.md

+
+### Proposed Change
+
+When `align-by-socket` is enabled as a policy option,  the `CPUManager`’s function `GetTopologyHints` will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes.


The proposal only defines the behavior with the newly introduced option is in addition to the default CPUManager behavior.

It is unclear to me what behavior would be the user configs the follow cases:

align-by-socket=true & full-pcups-only=true

align-by-socket=true & distribute-cpus-across-numa=true

align-by-socket=true & full-pcups-only=true & distribute-cpus-across-numa=true

Valid vs. invalid? Additive?

I didn't thought extensively to all the cases, but I expect full-pcpus-only to be fully compatible with this new option, like it is with distribute-cpus-across-numa.

I agree with @fromanirh , full-pcpu-only & distribute-cpus-across-numa can work in conjunction with align-by-socket.
To put in simpler words, align-by-socket influences how NUMA nodes are assigned(try to ensure that selected NUMA nodes are aligned by socket) while other two policy options influence how cpu's are distributed across NUMA nodes once assignment(hint) is finalized

I plan to raise another PR to address all comments

Agreed with @fromanirh above that align-by-socket=true & full-pcups-only=true can treat as an additive. But we need to iron out more details on align-by-socket=true & distribute-cpus-across-numa=true. Also agreed with @arpitsardhana above on the simple interpret. Please ping me if you have follow PR to address the comments. Thanks!

dchen1107 · 2022-06-21T18:33:36Z

Agreed with @klueska above at: #3334 (comment)

Added some comments as the non blocker for alpha too.

/lgtm
/approve

arpitsardhana · 2022-06-21T19:42:08Z

Thanks a lot @dchen1107 @fromanirh @klueska for review and approval.
I will work on comments and update the KEP.

johnbelamaric · 2022-06-21T22:14:47Z

/approve

k8s-ci-robot · 2022-06-21T22:15:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arpitsardhana, dchen1107, johnbelamaric, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
~~keps/sig-node/OWNERS~~ [dchen1107,johnbelamaric]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cathchu · 2022-07-27T16:49:00Z

Hi @klueska

1.25 Release Docs Shadow here. Does this enhancement work planned for 1.25 require any new docs or modification to existing docs?
If so, please follows the steps here to open a PR against dev-1.25 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before August 4.
Also, take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release.
Thank you!

arpitsardhana · 2022-07-30T07:33:49Z

Thanks @cathchu Will look over documentation requirement

arpitsardhana · 2022-07-30T08:14:33Z

@cathchu added Placeholder PR for website update: kubernetes/website#35569

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 2, 2022

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 2, 2022

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 2, 2022

k8s-ci-robot requested review from dchen1107 and derekwaynecarr June 2, 2022 16:45

k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 2, 2022

arpitsardhana mentioned this pull request Jun 2, 2022

Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3327

Open

4 tasks

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2022

Initial commit of KEP-3327

32b4019

arpitsardhana force-pushed the master branch from 7c3d953 to 32b4019 Compare June 2, 2022 17:49

swatisehgal reviewed Jun 3, 2022

View reviewed changes

klueska reviewed Jun 7, 2022

View reviewed changes

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 8, 2022

Addressing review comments

baa7ccd

dchen1107 added this to the v1.25 milestone Jun 9, 2022

Minor edit

3cef13f

arpitsardhana force-pushed the master branch from 9c6575e to 7133fed Compare June 15, 2022 06:05

k8s-ci-robot assigned johnbelamaric Jun 15, 2022

ffromani reviewed Jun 17, 2022

View reviewed changes

k8s-ci-robot assigned derekwaynecarr and wojtek-t Jun 17, 2022

pacoxu added this to Needs Reviewer in SIG Node PR Triage Jun 20, 2022

k8s-ci-robot assigned klueska Jun 21, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2022

pacoxu moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Jun 21, 2022

k8s-ci-robot assigned ffromani Jun 21, 2022

dchen1107 reviewed Jun 21, 2022

View reviewed changes

k8s-ci-robot assigned dchen1107 Jun 21, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2022

k8s-ci-robot merged commit fb10b2b into kubernetes:master Jun 21, 2022

SIG Node PR Triage automation moved this from Needs Approver to Done Jun 21, 2022

klueska mentioned this pull request Feb 9, 2024

KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs #4177

Merged

klueska mentioned this pull request May 6, 2024

kubelet: Add a TopologyManager policy option: max-allowable-numa-nodes kubernetes/kubernetes#124148

Merged


		Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

		###### How can someone using this feature know that it is working for their instance?


		### Proposed Change

		When `align-by-socket` is enabled as a policy option, the `CPUManager`’s function `GetTopologyHints` will prefer hints which are socket aligned in addition to hints which require minimum number of NUMA nodes.

KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3334

KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node #3334

Conversation

arpitsardhana commented Jun 2, 2022 • edited Loading

linux-foundation-easycla bot commented Jun 2, 2022 • edited Loading

k8s-ci-robot commented Jun 2, 2022

k8s-ci-robot commented Jun 2, 2022

klueska commented Jun 2, 2022

swatisehgal left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arpitsardhana Jun 4, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klueska left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arpitsardhana commented Jun 9, 2022

linux-foundation-easycla bot commented Jun 13, 2022 • edited Loading

klueska commented Jun 15, 2022

arpitsardhana commented Jun 15, 2022

ffromani left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arpitsardhana commented Jun 17, 2022

klueska commented Jun 21, 2022 • edited Loading

ffromani commented Jun 21, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchen1107 commented Jun 21, 2022

arpitsardhana commented Jun 21, 2022

johnbelamaric commented Jun 21, 2022

k8s-ci-robot commented Jun 21, 2022

cathchu commented Jul 27, 2022 • edited Loading

arpitsardhana commented Jul 30, 2022

arpitsardhana commented Jul 30, 2022

arpitsardhana commented Jun 2, 2022 •

edited

Loading

linux-foundation-easycla bot commented Jun 2, 2022 •

edited

Loading

swatisehgal left a comment •

edited

Loading

arpitsardhana Jun 4, 2022 •

edited

Loading

klueska left a comment •

edited

Loading

linux-foundation-easycla bot commented Jun 13, 2022 •

edited

Loading

klueska commented Jun 21, 2022 •

edited

Loading

cathchu commented Jul 27, 2022 •

edited

Loading