Usage of AVX instructions by a container can affect others on the same host #67355

Open
pdeva opened this Issue Aug 13, 2018 · 7 comments

pdeva commented Aug 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
Use of AVX instructions by one container can severely degrade the performance of other containers on the same host, because on Intel processors these instructions affect the entire processor.

Relevant tweets:
https://twitter.com/kellabyte/status/1028883520293203968
https://twitter.com/var_tec/status/1029139635706818560

What you expected to happen:
Other containers shouldn't be affected.

How to reproduce it (as minimally and precisely as possible):
Run a program using AVX instructions in a container

Anything else we need to know?:
This is also a potential security issue

Environment:

  • Kubernetes version (use kubectl version): 1.11
  • Cloud provider or hardware configuration:
    On AWS, C5 instances have custom Xeon processors that support AVX.
    https://twitter.com/rbranson/status/1029129721320075264
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
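
For anyone trying to confirm that a node's CPU advertises AVX at all, here is a minimal sketch that parses `/proc/cpuinfo`-style text. It assumes an x86 Linux host, where the feature list sits on a `flags` line; the helper names `cpu_flags` and `avx_features` and the sample text are made up for illustration. Note that this only shows what the CPU supports, not whether a given workload actually executes AVX instructions.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # The first "flags" line is enough for this sketch.
            return set(value.split())
    return set()


def avx_features(cpuinfo_text):
    """Return the sorted AVX-related flags (avx, avx2, avx512f, ...)."""
    return sorted(flag for flag in cpu_flags(cpuinfo_text) if flag.startswith("avx"))


# Hypothetical sample; on a real node, read the text from /proc/cpuinfo instead.
SAMPLE = "processor : 0\nflags : fpu sse2 avx avx2 avx512f\n"
print(avx_features(SAMPLE))
```

On a real host the input would come from `open("/proc/cpuinfo").read()`.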
pdeva commented Aug 13, 2018

@kubernetes/sig-node-bugs

k8s-ci-robot added sig/node and removed needs-sig labels Aug 13, 2018

Contributor

k8s-ci-robot commented Aug 13, 2018

@pdeva: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

Contributor

ipuustin commented Aug 14, 2018

This has been discussed a lot in the Resource Management WG. There were two proposals to implement CPU pooling to fix this. See the documents

https://docs.google.com/document/d/1ZKBw6R3nM_1oTqZftCUiqjQ05LrMdW_6oe4OjBjZSRk/edit#heading=h.1ajrn5it5138

and

https://docs.google.com/document/d/1EZQJdV9OObt8rA2epZDDLDVtoIAUQrqEZHy4oku0MFk/edit#heading=h.odw27dov1mwy

The documents have been shared with at least kubernetes-dev, but please request access if you can't open them. The idea (briefly) is to split the CPUs in the system into pools. Containers are pinned to certain CPU pools, meaning that they can run only on those cores. The pools could be configured so that AVX workloads run on dedicated cores and other workloads run on the remaining cores. A PoC exists, but it has met some resistance in the Resource Management WG.
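
To make the pooling idea concrete, here is a rough sketch of splitting CPUs into two pools and pinning a process to one of them. This is not the PoC from the proposals; the pool names, the `split_into_pools` and `pin_to_pool` helpers, and the "last N CPUs" policy are all assumptions for illustration. It uses Python's `os.sched_setaffinity`, which is Linux only; a container runtime would more likely apply the same restriction through cpuset cgroups.

```python
import os


def split_into_pools(cpus, avx_count):
    """Split CPU ids into a 'default' pool and an 'avx' pool.

    Assumed policy: the last avx_count CPUs are reserved for AVX workloads.
    """
    ordered = sorted(cpus)
    if not 0 <= avx_count <= len(ordered):
        raise ValueError("avx_count must be between 0 and the number of CPUs")
    cut = len(ordered) - avx_count
    return {"default": set(ordered[:cut]), "avx": set(ordered[cut:])}


def pin_to_pool(pid, pools, pool_name):
    """Restrict a process to the CPUs of one pool (Linux only; pid 0 = caller)."""
    os.sched_setaffinity(pid, pools[pool_name])


# Hypothetical 8-CPU node with two CPUs set aside for AVX workloads.
pools = split_into_pools(range(8), avx_count=2)
print(pools)
```

An AVX-heavy process would then be pinned with `pin_to_pool(pid, pools, "avx")` and everything else with `pin_to_pool(pid, pools, "default")`. The sketch treats logical CPUs as independent; a real pool layout would also have to account for hyperthreads that share a physical core.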

The thinking in the WG is that the easiest solution is to use taints and tolerations to place AVX and non-AVX workloads on separate nodes. If a CPU pooling scheme is adopted, it should be automated (or based on SLOs) rather than user-configured.
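
As an illustration of the taints-and-tolerations approach, the sketch below models how a toleration is matched against a node taint. It is a simplified re-implementation of the matching rules for illustration, not the scheduler's code, and the taint key `example.com/avx` is a made-up name.

```python
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(toleration, taint):
    """Return True if a toleration matches a taint (simplified semantics)."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key with Exists matches any taint.
        return True
    # Equal requires a key and an identical value.
    return bool(key) and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints):
    """A pod fits only if every hard taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in HARD_EFFECTS
    )


# Hypothetical taint placed on nodes reserved for AVX workloads.
AVX_TAINT = {"key": "example.com/avx", "value": "true", "effect": "NoSchedule"}
AVX_TOLERATION = {"key": "example.com/avx", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}

print(can_schedule([AVX_TOLERATION], [AVX_TAINT]))  # True: AVX pod tolerates the taint
print(can_schedule([], [AVX_TAINT]))                # False: ordinary pod is kept off
```

A taint only keeps non-tolerating pods off the AVX nodes. Keeping AVX pods off the ordinary nodes would additionally need something like a node selector or node affinity, since a toleration does not attract a pod to a node.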

pdeva commented Aug 15, 2018

@ipuustin I requested access to both linked docs but still haven't received it.

Contributor

ipuustin commented Aug 16, 2018

I edited the document sharing settings a bit; it might work now.

Member

vishh commented Aug 23, 2018

This issue exists in any application environment that uses Intel's recent CPUs. What is the general solution offered by Intel for shared environments? Would it be possible to dynamically disable AVX instructions on nodes where they are not needed?
Kubernetes offers more flexibility for hiding this issue, but I'd like to first identify simple solutions that I'm hoping Intel has already identified.

Contributor

ipuustin commented Oct 12, 2018

@pdeva: There's now a KEP which aims to mitigate the AVX issue (kubernetes/community#2739). Please take a look and tell us whether it might help you.
