keps/sig-node/2621-cpu-allocation-llc-affinity/README.md

# KEP-2621: Add LLC Affinity to CPU manager

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Future Work](#future-work)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
    - [Beta to GA Graduation](#beta-to-ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [x] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

The current Kubernetes CPU manager does not take caches into account. On some architectures, each socket/package contains more than one L3 cache, so containers may suffer performance degradation from L3 cache interference and a lower hit rate.
We propose supporting L3 cache affinity during container CPU allocation: within a package/socket, try to satisfy a container's demand with CPUs that share an L3 cache, rather than choosing freely from all CPUs in the package/socket.

## Motivation

The Kubernetes CPU manager tries to allocate CPUs on the same core and socket/package to gain better performance. In traditional architectures, the L3 cache is shared by the whole socket, and the current CPU allocator works well.
However, the allocation algorithm may run into problems on processors like the `2nd Gen AMD EPYC™`, where each `ccx` (a term used by AMD for a cluster of physical cores together with their shared L3 cache) owns its own L3 cache, so more than one L3 cache exists per socket/package (we call such L3 caches "uncore caches" throughout this design). With the current CPU allocation, workloads may face uncore-cache interference. For example, with 4 hyper-threaded cores per ccx, a container demanding 8 CPUs may not get a whole ccx, but instead some CPUs from another ccx (see figure below); containers A and B may then affect each other whenever the other flushes the uncore cache. In our opinion, a container's CPU cache locality should be considered.

![allocation_motivation](allocation_motivation.png "allocation_motivation")

### Goals

Support uncore-cache affinity in CPU allocation on architectures where a socket contains more than one L3 cache.

### Future Work

Cross-die placement may also decrease process performance. We will add die affinity in the future, along with the corresponding CPU assignment algorithm implementation.

## Proposal

To allocate CPUs with uncore-cache affinity, the kubelet must know the uncore-cache topology. The kubelet currently obtains the CPU topology from cadvisor, which did not expose these details, so we added cache id and uncore-cache information to cadvisor (both PRs merged).
- Add cache id to cadvisor
  In cadvisor PR https://github.com/google/cadvisor/pull/2847, use `/sys/devices/system/cpu/cpu*/cache/index3/id` to get the L3 cache id of each CPU and store it in the CPU topology.
```go
type Cache struct {
+	// Id of memory cache
+	Id int `json:"id"`
	// Size of memory cache in bytes.
	Size uint64 `json:"size"`
	// Type of memory cache: data, instruction, or unified.
	Type string `json:"type"`
	// Level (distance from cpus) in a multi-level cache hierarchy.
	Level int `json:"level"`
}
```
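
For illustration only (this is not the cadvisor implementation), a minimal Go sketch of reading the L3 cache id from the sysfs path above might look like the following; the helper name `l3CacheID` is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// l3CacheID returns the L3 cache id of the given logical CPU by reading
// /sys/devices/system/cpu/cpu<N>/cache/index3/id, as described above.
// Illustrative sketch only.
func l3CacheID(cpu int) (int, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu)
	raw, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(raw)))
}

func main() {
	id, err := l3CacheID(0)
	if err != nil {
		fmt.Println("could not read L3 cache id:", err)
		return
	}
	fmt.Println("cpu0 L3 cache id:", id)
}
```
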
- Add uncore cache to cadvisor
  In cadvisor PR https://github.com/google/cadvisor/pull/2849, add L3 caches that are not shared by the whole socket (uncore caches) to the core info in the CPU topology, which gives us core→uncore-cache mappings.
```go
type Core struct {
	Id           int     `json:"core_id"`
	Threads      []int   `json:"thread_ids"`
	Caches       []Cache `json:"caches"`
+	UncoreCaches []Cache `json:"uncore_caches"`
	SocketID     int     `json:"socket_id"`
}
```

> **Reviewer (@marquiz):** Hmm, this looks like an unnecessary/confusing kludge. Would it be possible to present this similarly to how the Linux kernel does, i.e. add the information about CPUs to the `Cache` structure (like `/sys/devices/system/cpu/cpu*/cache/index*/shared_cpu_list`)?

> **Author (@ranchothu):** Hi @marquiz, it is an interesting problem. I considered this earlier and wanted the design here to stay decoupled, but no other structure seems a better fit for the uncore-cache information. I think:
> 1. A cache itself should not know whether it is "uncore"; that is not a property of the cache. The information under `/sys/devices/system/cpu/cpu*/cache/index*/` cannot tell us whether a cache is uncore without knowledge of the socket/core layout.
> 2. A core, however, is aware of the caches and uncore caches it uses, so for a core it makes sense to include {id, threads, caches, uncore caches}.
>
> Discussion is welcome, thanks.

> **Reviewer:** Hi @ranchothu, in my understanding you'd like to differentiate between node-level cache (typically on Intel chips) and uncore cache (typically on AMD chips). If so, why do we need to differentiate between them? They are both L3 cache, and it seems they can both be aligned to when assigning CPUs.

> **Reviewer:** I agree that we need an additional layer of abstraction, since grouping CPUs per L3 cache is not a feature of only AMD processors; for example, the Kunpeng 920 (ARMv8) has 4 CPUs per L3 cache, and 24 such blocks.
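
Returning to the proposal: the two structures above let a consumer derive per-CPU cache locality. As a hedged illustration (not cadvisor or kubelet code), the sketch below builds a logical-CPU→uncore-cache-id map from the per-core topology; the trimmed type copies and the function name `cpuToUncoreCache` are assumptions:

```go
// Cache and Core mirror the cadvisor structures shown above
// (illustrative copies, trimmed to the fields used here).
type Cache struct {
	Id    int
	Level int
}

type Core struct {
	Id           int
	Threads      []int
	UncoreCaches []Cache
	SocketID     int
}

// cpuToUncoreCache builds a logical-CPU -> uncore-cache-id map from the
// per-core topology; CPUs on cores that report no uncore cache are omitted.
func cpuToUncoreCache(cores []Core) map[int]int {
	m := make(map[int]int)
	for _, core := range cores {
		if len(core.UncoreCaches) == 0 {
			continue
		}
		// Assumption for illustration: one uncore (L3) cache per core.
		uncoreID := core.UncoreCaches[0].Id
		for _, cpu := range core.Threads {
			m[cpu] = uncoreID
		}
	}
	return m
}
```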

### User Stories (Optional)

Before this change, when the kubelet allocates CPUs for containers, the uncore cache is not considered, and a container may receive CPUs spread across caches even when there are enough free CPUs sharing one uncore cache.
We ran a benchmark with `stream2` DAXPY; as the figure shows, crossing a ccx (crossing an uncore cache) yields lower bandwidth.

![stream2_daxpy](stream2_daxpy.png "stream2_daxpy")

For memory-sensitive workloads, this feature can improve memory bandwidth significantly (above 20%).
We also ran tests with real applications such as DPDK and Redis and collected performance data; in DPDK, throughput degraded from 19 Mpps (sharing an uncore cache) to 12 Mpps (crossing uncore caches).

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

+ No risks have been identified so far.
+ The feature is enabled behind a feature gate (a new kubelet feature gate, default false), so any potential impact is limited.

## Design Details

- Feature Gate
  - C1: Add `CPUManagerUncoreCacheAlign` to the kubelet's feature gates to enable (true) or disable (false) the feature.
> **Reviewer (@fromanirh):** So this feature (per my previous comment, see history) will need to depend on a feature gate. But you don't necessarily need a separate feature gate; you can depend on the one we added in #2626. That CPUManager feature gate will most likely be alpha in the 1.23 cycle, which fits the needs of this KEP. I think you can conditionally enable this optimization through a CPU manager policy option. This way you keep the conditional logic you already need to support the feature gate, without extra burden. Fitting into the new CPUManager policy options is also a very nice and clean design.

> **Author (@ranchothu):** Hi @fromanirh, I've posted a patch, kubernetes/kubernetes#102307; it may help explain why I chose a kubelet-level option rather than a separate CPU manager policy. IMHO, attaching this to a policy is also an option, but it would cause some redundant logic.

> **Reviewer (@fromanirh):** Thanks for sharing the implementation. I don't yet see where we would need redundant logic, however. We could:
> 1. Get the enable/disable flag from the CPU manager options (the option should probably be on by default).
> 2. Propagate the flag down to the cpuAccumulator.
> 3. Consume the flag in isUncoreCacheAlignEnabled.
>
> IMHO this is also a nicer and more integrated design. Implementation-wise, this flow seems compliant with the production-readiness review (see the details in https://kubernetes.slack.com/archives/CPNHUMN74/p1620312071045800), because new features should depend on a compatible and related feature gate; a new feature gate would be alpha level, and the cpuManagerOptions feature gate will still be alpha in 1.23. So reshaping this to use the cpuManagerOptions still seems a totally valid option, and I think it is cleaner from an overall design perspective. Let's see what the other reviewers (@klueska :) ) think, and please let me know if there are concerns or requirements I missed.

  - C2: More than one L3 cache exists in a single socket/package (i.e., uncore caches are present).
> **Reviewer:** I'm not quite sure what "C1" and "C2" mean in this context.

  - Combining the two conditions:
    - C1=true, C2=true -> Enabled
    - C1=true, C2=false -> Disabled
    - C1=false, C2=true -> Disabled
    - C1=false, C2=false -> Disabled
- General Design
  - Logic Elaboration
    When the demand is larger than one core, try to allocate CPUs that share the same uncore cache: add an L3 cache affinity best-fit step before the existing core affinity best-fit. If no LLC-satisfying set of CPUs can be found, continue with the original process (find available cores). A simplified sketch follows this list.

![design_overview](design_overview.png "design_overview")

  - Feature gate `CPUManagerUncoreCacheAlign`
    `CPUManagerUncoreCacheAlign` is set to `false` by default in `defaultKubernetesFeatureGates`. `takeByTopology` checks the gate: enabled -> perform the L3 cache affinity best fit; disabled -> skip it.

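To make the flow above concrete, here is a hedged Go sketch of the proposed extra step: try an uncore-cache best fit before the existing core best fit, and skip it entirely when the gate is disabled. The function names and signatures (`takeByUncoreCache`, `freeCPUsByUncoreCache`) are illustrative assumptions, not the kubelet's actual static-policy code:

```go
// freeCPUsByUncoreCache groups the currently free logical CPUs by uncore-cache
// id, using a mapping such as the one sketched earlier in this document.
func freeCPUsByUncoreCache(free []int, cpuToUncore map[int]int) map[int][]int {
	groups := make(map[int][]int)
	for _, cpu := range free {
		if id, ok := cpuToUncore[cpu]; ok {
			groups[id] = append(groups[id], cpu)
		}
	}
	return groups
}

// takeByUncoreCache tries to satisfy a request for `demand` CPUs from a single
// uncore cache (best fit: the smallest group that still fits). It returns nil
// when the gate is off or no single uncore cache can satisfy the request, in
// which case the caller falls back to the original core/socket best-fit logic.
func takeByUncoreCache(free []int, cpuToUncore map[int]int, demand int, gateEnabled bool) []int {
	if !gateEnabled || demand <= 0 {
		return nil
	}
	var best []int
	for _, cpus := range freeCPUsByUncoreCache(free, cpuToUncore) {
		if len(cpus) >= demand && (best == nil || len(cpus) < len(best)) {
			best = cpus
		}
	}
	if best == nil {
		return nil
	}
	return best[:demand]
}
```

In the actual implementation this step would be wired into `takeByTopology`, guarded by the feature gate as described above.
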
### Test Plan
- Unit tests for the newly added allocation algorithm.
- E2E tests should cover two scenarios:
  - For AMD Rome/Milan or other architectures with more than one L3 cache per socket, CPU allocation for a container should always try to take all requested CPUs from a single L3 cache. Check the containers' `cpuset.cpus` for verification (a sketch of such a check follows this list).
  - For other architectures, CPU allocation should behave exactly as before.
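
As a hedged illustration of the first E2E check, a helper like the one below (the name is an assumption) could assert that every CPU in a container's `cpuset.cpus` maps to the same uncore-cache id, using a mapping like the one sketched earlier:

```go
// sharesOneUncoreCache reports whether every CPU in cpus maps to the same
// uncore-cache id according to cpuToUncore. Illustrative e2e helper only.
func sharesOneUncoreCache(cpus []int, cpuToUncore map[int]int) bool {
	if len(cpus) == 0 {
		return true
	}
	want, ok := cpuToUncore[cpus[0]]
	if !ok {
		return false
	}
	for _, cpu := range cpus[1:] {
		if got, ok := cpuToUncore[cpu]; !ok || got != want {
			return false
		}
	}
	return true
}
```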

### Graduation Criteria
> **Reviewer:** It seems this section is not complete? Could you please fill in the details?

#### Alpha

- Implement the new policy.
> **Reviewer:** According to the implementation, strictly speaking it is not a new policy but an enhancement to the existing static policy, right?

- Ensure proper e2e node tests are in place.
#### Alpha to Beta Graduation

- Gather feedback from the consumer of the policy.
- No major bugs reported in the previous cycle.
#### Beta to GA Graduation

- Allowing time for feedback (1 year).
- Risks have been addressed.
### Upgrade / Downgrade Strategy

We expect no impact; the new feature gate is opt-in.

### Version Skew Strategy

No changes needed.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

- Feature gate
  The feature is guarded by the kubelet feature gate `CPUManagerUncoreCacheAlign`, which defaults to `false`; it can be enabled per node via the kubelet's `--feature-gates` flag.

### Rollout, Upgrade and Rollback Planning

### Monitoring Requirements

### Dependencies

A sufficiently recent cadvisor version is required, one that stores the cache id and uncore-cache info in the CPU topology.

### Scalability

### Troubleshooting

## Implementation History
- 2021.5.10: KEP updated; add `CPUManagerUncoreCacheAlign` as a kubelet feature gate.
- 2021.5.6: KEP created.
- 2021.5.1: Original design doc with solutions considered: https://docs.google.com/document/d/1BuiBgsittUnU3heKHRCQ66YYxzAItT5gcPlu3N83PfA/edit#
keps/sig-node/2621-cpu-allocation-llc-affinity/kep.yaml

title: Add llc affinity to cpu manager
kep-number: 2621
authors:
- "@ranchothu"
owning-sig: sig-node
participating-sigs:
- sig-node
status: provisional
creation-date: 2021-05-06
reviewers:
- "@derekwaynecarr"
- "@dchen1107"
approvers:
- "@derekwaynecarr"
- TBD
prr-approvers:
- TBD
labels:
- sig/node

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.22"
beta: "v1.23"
stable: "v1.24"

# The following PRR answers are required at beta release
metrics:
- N/A