Skip to content

Conversation

@ipuustin
Copy link

@ipuustin ipuustin commented Oct 2, 2018

This KEP addresses partitioning the CPUs present in the node to help mitigate the side effects caused by SIMD instructions (such as AVX instruction sets). The current version is meant to be a basis for discussion rather than a finalized proposal.

@ConnorDoyle
@kad
@klihub

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 2, 2018
@k8s-ci-robot
Copy link
Contributor

Hi @ipuustin. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 2, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jdumars

If they are not already assigned, you can assign the PR to them by writing /assign @jdumars in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 2, 2018
Copy link
Contributor

@vishh vishh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level, I'd recommend rethinking design considerations.

  1. Can nodes be dedicated for AVX workloads?
  2. Can AVX be disabled in clusters where they don't expect them to be consumed? - how many users run both AVX workloads today?
  3. Can CPUs be dynamically re-assigned? - We can then group AVX apps into separate sockets dynamically (when possible)
  4. Can applications be moved to other nodes if they cannot be locally optimized due to noisy-neighbor problems? - What if we expose an application performance score that could inform the de-scheduler (or re-scheduler) to move the application based on PDB?


#### Variant 2: Workloads targeting AVX, run as Burstable or Guaranteed

Enable the existing CPU manager static policy. Use existing ability to advertise extended resources to allow a limited number of "AVX cores". Enforce that if a container consumes the "AVX cores" resource:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will a user choose a limited number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the mechanism for advertising extended resources for a node: https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/ . The user can check the node CPU capacity, and then based on that number and some policy choose a suitable number of AVX cores. An example policy might say that "half of the cores should be AVX cores" or something similar.


Enable the existing CPU manager static policy. Use existing ability to advertise extended resources to allow a limited number of "AVX cores". Enforce that if a container consumes the "AVX cores" resource:
- AVX Cores = CPU
- QoS class is Guaranteed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees that a Gu pods consuming Avx will not disrupt another Avx pod not consuming Avx?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They still might, but at least the AVX and non-AVX pods wouldn't run on the same core. The current algorithm for choosing cores (takeByTopology() function) tries to first allocate full cores and sockets based on the request size.

@ipuustin
Copy link
Author

Thanks for the comments @vishh !

Regarding the points you made:

  1. This depends a lot how the cluster is used and how much spare nodes it has. We have heard from customers that they do not want to have nodes dedicated only to AVX workloads. Also, splitting the cluster based on the node features or kubelet configuration leads to combinatoric problems: if (let's say) the cluster is split based on whether nodes have a fast or a slow disk, a GPU or no GPU, and AVX or no AVX support, then that's already the cluster split into 2^3 parts. Still, this is definitely a viable alternative and we added it to the Alternatives section.
  2. There is a kernel boot parameter which removes AVX flags from /proc/cpuinfo so that the applications can choose not to run AVX instructions based on the flag. I think that doesn't disable AVX support from the processor though. Workloads can be of course written or recompiled not use AVX instructions.
  3. This is a good idea, but we don't at the moment have a good way to dynamically determine if a process uses AVX instructions or not. The key idea in this proposal is to find a way to indicate that a pod requires special handling, and that is via the QoS classes, so that pod spec can stay untouched. The actual mechanism to separate the containers to different CPUs is another question -- here we use kubelet options to assign QoS classes to CPU sets. However, later I can see a Dynamic CPU manager policy doing a good effort for this. We planned this proposal to leave enough design space for such a policy.
  4. Maybe yes -- we already can know if a processor has the AVX pipeline active or not, meaning that there is some process which run AVX instructions there. But it might be difficult to do the heuristics right (for example to prevent the pods from being ping-ponged around), and the complexity would likely be much larger than with this proposal. This would be a very interesting topic for further investigation though.

@justaugustus
Copy link
Member

/kind kep

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2018
@k8s-ci-robot k8s-ci-robot added sig/pm and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 23, 2018
@justaugustus
Copy link
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Copy link
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Copy link
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants