# KEP-2621: Add LLC Affinity to CPU manager

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Future Work](#future-work)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
    - [Beta to GA Graduation](#beta-to-ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [x] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website
## Summary

Caches are not considered by the current Kubernetes CPU Manager. On some architectures, each socket/package owns more than one L3 cache, so containers may suffer performance degradation from L3 cache interference and a lower hit rate.
We propose supporting L3 cache affinity during container CPU allocation: within a given package/socket, try to satisfy a container's demand with CPUs that share an L3 cache, rather than choosing freely from all CPUs in the package/socket.
## Motivation

The Kubernetes CPU Manager tries to allocate CPUs within the same core and socket/package to gain better performance. In the traditional architecture, where the L3 cache is shared by the whole socket, the current CPU allocator works well.
However, the allocation algorithm can run into problems on processors like the 2nd Gen AMD EPYC™, where each `ccx` (a term used by AMD for a cluster of physical cores along with their shared L3 cache) owns its own L3 cache, so more than one L3 cache exists per socket/package. We call such L3 caches "uncore-cache" throughout this design. Relying on the current CPU allocation can therefore cause uncore-cache interference. For example, with 4 hyper-threaded cores per ccx, a container demanding 8 CPUs may not get a whole ccx but instead some CPUs from another ccx (see the figure below); containers A and B may then affect each other whenever either one flushes the uncore-cache. In our opinion, a container's CPU cache locality should be considered.

*(figure: a container's CPUs spread across two ccx instead of filling one)*
### Goals

Support uncore-cache affinity in CPU allocation on architectures with more than one L3 cache per socket.
### Future Work

Crossing die boundaries may also degrade process performance. We will add die affinity in the future, along with the corresponding CPU assignment algorithm implementation.
## Proposal

To make allocation decisions with uncore-cache affinity, kubelet must be aware of the uncore-cache information. Kubelet currently obtains the CPU topology from cAdvisor, which does not expose these details, so we add cache id and uncore-cache items to cAdvisor (all merged).
- Add cache id to cAdvisor
  In cAdvisor PR https://github.com/google/cadvisor/pull/2847/, use `/sys/devices/system/cpu/cpu*/cache/index3/id` to get the L3 cache id of each CPU and store it in the CPU topology.
```go
type Cache struct {
+	// Id of memory cache
+	Id int `json:"id"`
	// Size of memory cache in bytes.
	Size uint64 `json:"size"`
	// Type of memory cache: data, instruction, or unified.
	Type string `json:"type"`
	// Level (distance from cpus) in a multi-level cache hierarchy.
	Level int `json:"level"`
}
```
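The id comes straight from sysfs. As a minimal, runnable sketch of reading it (the `l3CacheID` helper is hypothetical, not cAdvisor's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// l3CacheID reads the L3 cache id of a logical CPU from sysfs,
// e.g. /sys/devices/system/cpu/cpu0/cache/index3/id.
func l3CacheID(cpu int) (int, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu)
	data, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

func main() {
	id, err := l3CacheID(0)
	if err != nil {
		fmt.Println("cannot read L3 cache id:", err)
		return
	}
	fmt.Println("cpu0 shares L3 cache id", id)
}
```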
- Add uncore-cache to cAdvisor
  In cAdvisor PR https://github.com/google/cadvisor/pull/2849, add L3 caches that are not shared by the whole socket (uncore-cache) to the core info in the CPU topology. From this we can derive core-to-uncore-cache mappings.
```go
type Core struct {
	Id       int     `json:"core_id"`
	Threads  []int   `json:"thread_ids"`
	Caches   []Cache `json:"caches"`
+	UncoreCaches []Cache `json:"uncore_caches"`
	SocketID int     `json:"socket_id"`
}
```
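For illustration, a small sketch of deriving an uncore-cache-id to CPUs table from the `Core` and `Cache` types above (the helper name is hypothetical, not part of the cAdvisor API):

```go
// uncoreCacheToCPUs inverts the per-core topology into a map from
// uncore-cache id to the logical CPUs (threads) sharing that cache.
// It assumes the Core/Cache types shown above.
func uncoreCacheToCPUs(cores []Core) map[int][]int {
	cpusByCache := map[int][]int{}
	for _, core := range cores {
		for _, uc := range core.UncoreCaches {
			cpusByCache[uc.Id] = append(cpusByCache[uc.Id], core.Threads...)
		}
	}
	return cpusByCache
}
```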
### User Stories (Optional)

Before this change, when kubelet allocated CPUs for containers, uncore-cache was not considered, and a container could get CPUs across caches even when there were free CPUs sharing an uncore-cache.
We ran a benchmark with `stream2` DAXPY; as shown below, crossing ccx (i.e., crossing uncore-cache) yields lower bandwidth.

*(figure: stream2 DAXPY bandwidth, shared uncore-cache vs. cross uncore-cache)*

When the workload is memory sensitive, this feature can improve memory bandwidth significantly (above 20%).
We also tested real applications such as DPDK and Redis and collected performance data: in DPDK, throughput degraded from 19 Mpps (shared uncore-cache) to 12 Mpps (cross uncore-cache).
### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

- No risks have been found so far.
- The feature is enabled by a gate, a new kube feature with default false, so potential risk effects are limited.
## Design Details

- Feature Gate
  - C1: Add `CPUManagerUncoreCacheAlign` to kubelet's feature gates to enable (true) / disable (false) the feature.
> **Contributor:** So this feature (per my previous comment, see history) will need to depend on a feature gate. But you don't necessarily need a separate feature gate; you can depend on the one we added in #2626. This CPUManager feature gate will most likely be alpha in the 1.23 cycle, which fits the needs of this KEP. I think you can conditionally enable this optimization depending on a CPU Manager policy option. This way you keep the conditional logic you already need to support the feature gate, without extra burden. Fitting into the new CPUManager policy options is also a very nice and clean design.

> **Author:** Hi @fromanirh, since I've posted a patch, kubernetes/kubernetes#102307, it may help you understand why I chose a kubelet running option rather than a separate CPU Manager policy. IMHO, attaching to a policy is also an option, but it would cause some redundant logic.

> **Contributor:** Thanks for sharing the implementation. I don't yet see where we need redundant logic, however. We could:
> IMHO this is also a nicer and more integrated design. So reshaping to use the cpuManagerOptions still seems a totally valid option, and I think it is cleaner from an overall design perspective.
  - C2: More than one L3 cache exists in a single socket/package (i.e., uncore-cache exists).
> **Contributor:** I'm not quite sure what "C1" and "C2" mean in this context.
- Putting these together:
  - C1=true, C2=true -> Enabled
  - C1=true, C2=false -> Disabled
  - C1=false, C2=true -> Disabled
  - C1=false, C2=false -> Disabled
- General Design
  - Logic Elaboration
    Try to allocate CPUs sharing the same cache when the demand is larger than one core: add an L3 cache affinity pass before the core-affinity best-fit. If no LLC-satisfying CPUs can be found, continue with the original process (find available cores). A sketch of this pass follows the figure below.
    *(figure: allocation flow with the uncore-cache affinity pass before core best-fit)*
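    As an illustration, a minimal, self-contained sketch of such a pass (hypothetical helper name and simplified data shapes, not the actual kubelet code): pick the best-fitting uncore-cache whose free CPUs cover the whole demand, and report failure so the caller can fall back to the original core/socket best-fit.

    ```go
    package main

    import "fmt"

    // takeByUncoreCache tries to satisfy `demand` CPUs from a single
    // uncore-cache. uncoreFree maps uncore-cache id -> free logical CPUs
    // sharing that cache. Best fit: prefer the smallest free set that
    // still satisfies the demand, keeping larger cache-aligned blocks
    // intact for later containers.
    func takeByUncoreCache(uncoreFree map[int][]int, demand int) ([]int, bool) {
        bestID := -1
        for id, cpus := range uncoreFree {
            if len(cpus) >= demand && (bestID == -1 || len(cpus) < len(uncoreFree[bestID])) {
                bestID = id
            }
        }
        if bestID == -1 {
            // No single uncore-cache can hold the demand: the caller falls
            // back to the original allocation (available cores, then sockets).
            return nil, false
        }
        return uncoreFree[bestID][:demand], true
    }

    func main() {
        // Two uncore-caches, e.g. two ccx with 8 threads each, one of them
        // already partially allocated.
        free := map[int][]int{0: {0, 1, 2, 3, 4, 5, 6, 7}, 1: {10, 11, 12, 13}}
        if cpus, ok := takeByUncoreCache(free, 4); ok {
            fmt.Println("CPUs sharing one L3 cache:", cpus) // best fit: cache 1
        }
    }
    ```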
  - Feature gate `CPUManagerUncoreCacheAlign`
    `CPUManagerUncoreCacheAlign` should be set to `false` in `defaultKubernetesFeatureGates`. `takeByTopology` then branches on it: enabled -> do the L3 cache affinity best-fit; disabled -> skip it.
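For instance, the registration could look like this, following the usual kubelet feature-gate pattern (the gate name is from this KEP; the exact file placement and maturity level here are assumptions):

```go
package features

import "k8s.io/component-base/featuregate"

// CPUManagerUncoreCacheAlign enables the L3/uncore-cache affinity pass
// in the CPU Manager allocation (this KEP); off by default.
const CPUManagerUncoreCacheAlign featuregate.Feature = "CPUManagerUncoreCacheAlign"

var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	CPUManagerUncoreCacheAlign: {Default: false, PreRelease: featuregate.Alpha},
}
```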
### Test Plan

- Unit tests for the newly added allocation algorithm.
- E2E tests should cover two scenarios:
  - For AMD Rome/Milan or other architectures with more than one L3 cache per socket, CPU allocation for a container should always try to get all demanded CPUs sharing one L3 cache. Check the containers' cpuset.cpus for verification.
  - For other architectures, CPU allocation should be the same as before.
### Graduation Criteria

> **Contributor:** it seems this section is not complete? Could you please fill in the details?
#### Alpha

- Implement the new policy.

> According to the implementation, strictly speaking it is not a new policy but an enhancement to the existing static policy, right?

- Ensure proper e2e node tests are in place.
#### Alpha to Beta Graduation

- Gather feedback from consumers of the policy.
- No major bugs reported in the previous cycle.

#### Beta to GA Graduation

- Allow time for feedback (1 year).
- Risks have been addressed.
### Upgrade / Downgrade Strategy

We expect no impact. The new kube feature is opt-in.

### Version Skew Strategy

No changes needed.
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

- Feature gate
  We add a feature gate `CPUManagerUncoreCacheAlign`, with default `false`, to kubelet's kube features.
### Rollout, Upgrade and Rollback Planning

### Monitoring Requirements

### Dependencies

A recent cAdvisor version is required, one in which cache id and uncore-cache info are stored in the CPU topology.

### Scalability

### Troubleshooting
## Implementation History

- 2021-05-10: KEP updated; added CPUManagerUncoreCacheAlign as a kube feature.
- 2021-05-06: KEP created.
- 2021-05-01: Original design doc with solutions considered: https://docs.google.com/document/d/1BuiBgsittUnU3heKHRCQ66YYxzAItT5gcPlu3N83PfA/edit#
---

KEP metadata:

```yaml
title: Add llc affinity to cpu manager
kep-number: 2621
authors:
  - "@ranchothu"
owning-sig: sig-node
participating-sigs:
  - sig-node
status: provisional
creation-date: 2021-05-06
reviewers:
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - "@derekwaynecarr"
  - TBD
prr-approvers:
  - TBD
labels:
  - sig/node

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.22"
  beta: "v1.23"
  stable: "v1.24"

# The following PRR answers are required at beta release
metrics:
  - N/A
```
> Hmm, this looks like an unnecessary/confusing kludge. Would it be possible to present this similarly to how the Linux kernel does, i.e. add the information about CPUs in the `Cache` structure (like `/sys/devices/system/cpu/cpu*/cache/index*/shared_cpu_list`)?

> Hi @marquiz, it is an interesting problem. I gave it some consideration previously, and I wanted the design here to be decoupled, but it seems no other structure is a better fit for the uncore-cache information.
> I think "uncore" is not a property of the cache itself: the info in `/sys/devices/system/cpu/cpu*/cache/index*/` cannot tell us whether a cache is uncore without information about the socket/core. Discussion is welcome, thanks.

> Hi @ranchothu, in my understanding you'd like to differentiate between node-level cache (typical on Intel chips) and uncore-cache (typical on AMD chips). If so, why do we need to differentiate between them? They are both L3 cache, and it seems they can both be aligned when assigning CPUs.

> Here I agree that we need an additional layer of abstraction, since grouping CPUs per L3 cache is not a feature of AMD processors only; for example, the Kunpeng 920 (armv8) has 4 CPUs per L3 cache, and has 24 such blocks.