KEP: Per QoS-Class CPU Affinity Hints #2739

ipuustin · 2018-10-02T14:26:32Z

This KEP addresses partitioning the CPUs present in the node to help mitigate the side effects caused by SIMD instructions (such as AVX instruction sets). The current version is meant to be a basis for discussion rather than a finalized proposal.

@ConnorDoyle
@kad
@klihub

k8s-ci-robot · 2018-10-02T14:26:40Z

Hi @ipuustin. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2018-10-02T14:26:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jdumars

If they are not already assigned, you can assign the PR to them by writing /assign @jdumars in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

keps/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vishh

At a high level, I'd recommend rethinking design considerations.

Can nodes be dedicated for AVX workloads?
Can AVX be disabled in clusters where they don't expect them to be consumed? - how many users run both AVX workloads today?
Can CPUs be dynamically re-assigned? - We can then group AVX apps into separate sockets dynamically (when possible)
Can applications be moved to other nodes if they cannot be locally optimized due to noisy-neighbor problems? - What if we expose an application performance score that could inform the de-scheduler (or re-scheduler) to move the application based on PDB?

vishh · 2018-10-10T19:46:40Z

keps/sig-node/0030-qos-class-cpu-affinity.md

+
+#### Variant 2: Workloads targeting AVX, run as Burstable or Guaranteed
+
+Enable the existing CPU manager static policy. Use existing ability to advertise extended resources to allow a limited number of "AVX cores". Enforce that if a container consumes the "AVX cores" resource:


how will a user choose a limited number?

This is the mechanism for advertising extended resources for a node: https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/ . The user can check the node CPU capacity, and then based on that number and some policy choose a suitable number of AVX cores. An example policy might say that "half of the cores should be AVX cores" or something similar.

vishh · 2018-10-10T19:47:31Z

keps/sig-node/0030-qos-class-cpu-affinity.md

+
+Enable the existing CPU manager static policy. Use existing ability to advertise extended resources to allow a limited number of "AVX cores". Enforce that if a container consumes the "AVX cores" resource:
+- AVX Cores = CPU
+- QoS class is Guaranteed


What guarantees that a Gu pods consuming Avx will not disrupt another Avx pod not consuming Avx?

They still might, but at least the AVX and non-AVX pods wouldn't run on the same core. The current algorithm for choosing cores (takeByTopology() function) tries to first allocate full cores and sockets based on the request size.

ipuustin · 2018-10-11T08:52:31Z

Thanks for the comments @vishh !

Regarding the points you made:

This depends a lot how the cluster is used and how much spare nodes it has. We have heard from customers that they do not want to have nodes dedicated only to AVX workloads. Also, splitting the cluster based on the node features or kubelet configuration leads to combinatoric problems: if (let's say) the cluster is split based on whether nodes have a fast or a slow disk, a GPU or no GPU, and AVX or no AVX support, then that's already the cluster split into 2^3 parts. Still, this is definitely a viable alternative and we added it to the Alternatives section.
There is a kernel boot parameter which removes AVX flags from /proc/cpuinfo so that the applications can choose not to run AVX instructions based on the flag. I think that doesn't disable AVX support from the processor though. Workloads can be of course written or recompiled not use AVX instructions.
This is a good idea, but we don't at the moment have a good way to dynamically determine if a process uses AVX instructions or not. The key idea in this proposal is to find a way to indicate that a pod requires special handling, and that is via the QoS classes, so that pod spec can stay untouched. The actual mechanism to separate the containers to different CPUs is another question -- here we use kubelet options to assign QoS classes to CPU sets. However, later I can see a Dynamic CPU manager policy doing a good effort for this. We planned this proposal to leave enough design space for such a policy.
Maybe yes -- we already can know if a processor has the AVX pipeline active or not, meaning that there is some process which run AVX instructions there. But it might be difficult to do the heuristics right (for example to prevent the pods from being ping-ponged around), and the complexity would likely be much larger than with this proposal. This would be a very interesting topic for further investigation though.

justaugustus · 2018-10-13T03:03:40Z

/kind kep

justaugustus · 2018-11-20T04:35:53Z

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

justaugustus · 2018-12-01T08:03:46Z

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

k8s-ci-robot · 2018-12-01T08:03:47Z

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 2, 2018

k8s-ci-robot requested review from calebamiles and derekwaynecarr October 2, 2018 14:26

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 2, 2018

k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 2, 2018

vishh reviewed Oct 10, 2018

View reviewed changes

ipuustin mentioned this pull request Oct 12, 2018

usage of AVX instructions by a container can affect other on the same host kubernetes/kubernetes#67355

Closed

k8s-ci-robot added the kind/kep label Oct 13, 2018

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2018

ipuustin added 2 commits October 23, 2018 15:15

Increased NEXT_KEP_NUMBER.

e7dc90c

Add KEP 'Per-QoS-Class CPU Affinity Hints'

630acc4

ipuustin force-pushed the qos-partitioning branch from 9546d0a to 630acc4 Compare October 23, 2018 12:16

k8s-ci-robot added sig/pm and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 23, 2018

k8s-ci-robot closed this Dec 1, 2018


		#### Variant 2: Workloads targeting AVX, run as Burstable or Guaranteed

		Enable the existing CPU manager static policy. Use existing ability to advertise extended resources to allow a limited number of "AVX cores". Enforce that if a container consumes the "AVX cores" resource:

KEP: Per QoS-Class CPU Affinity Hints #2739

KEP: Per QoS-Class CPU Affinity Hints #2739

Uh oh!

Conversation

ipuustin commented Oct 2, 2018

Uh oh!

k8s-ci-robot commented Oct 2, 2018

Uh oh!

k8s-ci-robot commented Oct 2, 2018

Uh oh!

vishh left a comment

Choose a reason for hiding this comment

Uh oh!

vishh Oct 10, 2018

Choose a reason for hiding this comment

Uh oh!

ipuustin Oct 11, 2018

Choose a reason for hiding this comment

Uh oh!

vishh Oct 10, 2018

Choose a reason for hiding this comment

Uh oh!

ipuustin Oct 11, 2018

Choose a reason for hiding this comment

Uh oh!

ipuustin commented Oct 11, 2018

Uh oh!

justaugustus commented Oct 13, 2018

Uh oh!

justaugustus commented Nov 20, 2018

Uh oh!

justaugustus commented Dec 1, 2018

Uh oh!

k8s-ci-robot commented Dec 1, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants