
usage of AVX instructions by a container can affect others on the same host #67355

Closed
pdeva opened this issue Aug 13, 2018 · 11 comments

Comments

@pdeva commented Aug 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
Usage of AVX instructions in one container can kill the performance of other containers, since on Intel processors heavy AVX use triggers frequency throttling that affects the entire processor, not just the container executing the instructions.

Relevant tweets:
https://twitter.com/kellabyte/status/1028883520293203968
https://twitter.com/var_tec/status/1029139635706818560

What you expected to happen:
Other containers shouldn't be affected.

How to reproduce it (as minimally and precisely as possible):
Run a program using AVX instructions in a container
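
A minimal sketch of such a reproduction, assuming a hypothetical image example.com/avx-burn:latest that runs a tight AVX/AVX-512 loop (any AVX-heavy benchmark image would do):

```yaml
# Reproduction sketch: one Pod that hammers the AVX units on the node.
# The image name is hypothetical; substitute any AVX-heavy workload.
apiVersion: v1
kind: Pod
metadata:
  name: avx-burn
spec:
  restartPolicy: Never
  containers:
  - name: burn
    image: example.com/avx-burn:latest   # hypothetical AVX-heavy image
    resources:
      limits:
        cpu: "2"          # bounds CPU time, but not the frequency side effect
        memory: 256Mi
```

Run it next to a latency-sensitive Pod on the same node and compare that Pod's latency or throughput with and without avx-burn running.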

Anything else we need to know?:
This is also a potential security issue: a container can deliberately degrade the performance of co-located containers (a noisy-neighbour / denial-of-service vector).

Environment:

  • Kubernetes version (use kubectl version): 1.11
  • Cloud provider or hardware configuration:
    AWS C5 instances have custom Xeon processors that support AVX
    https://twitter.com/rbranson/status/1029129721320075264
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@pdeva (Author) commented Aug 13, 2018

@kubernetes/sig-node-bugs

@k8s-ci-robot added sig/node and removed needs-sig labels Aug 13, 2018
@k8s-ci-robot (Contributor) commented Aug 13, 2018

@pdeva: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

@kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ipuustin (Member) commented Aug 14, 2018

This has been discussed a lot in the Resource Management WG. There was a proposal (actually two proposals) to implement CPU pooling to fix this. See the documents

https://docs.google.com/document/d/1ZKBw6R3nM_1oTqZftCUiqjQ05LrMdW_6oe4OjBjZSRk/edit#heading=h.1ajrn5it5138

and

https://docs.google.com/document/d/1EZQJdV9OObt8rA2epZDDLDVtoIAUQrqEZHy4oku0MFk/edit#heading=h.odw27dov1mwy

The documents have been shared with at least kubernetes-dev, but please request access if you can't open them. The idea (briefly) is to split the CPUs in the system into pools. Containers are pinned to certain CPU pools, meaning that they can run only on those cores. The pools could be configured so that AVX workloads run on dedicated cores and other workloads on the remaining cores. A PoC exists, but it has met some resistance in the Resource Management WG.
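
For comparison, the closest mechanism that already exists in-tree is the kubelet CPU Manager static policy (--cpu-manager-policy=static), which assigns exclusive cores to Guaranteed-QoS containers that request an integer number of CPUs. A minimal sketch, assuming that kubelet setting and a hypothetical image:

```yaml
# Guaranteed QoS (requests == limits, whole CPUs), so with the static
# CPU Manager policy this container is pinned to exclusive cores.
apiVersion: v1
kind: Pod
metadata:
  name: avx-pinned
spec:
  containers:
  - name: avx-workload
    image: example.com/avx-workload:latest   # hypothetical image
    resources:
      requests:
        cpu: "4"
        memory: 1Gi
      limits:
        cpu: "4"
        memory: 1Gi
```

Exclusive cores remove scheduling interference, but on some CPU generations the AVX-induced frequency reduction can still spill over to other cores on the same socket, which is the gap the pooling proposals aim to close.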

The thinking in the WG is that the easiest solution is to use taints and tolerations to separate AVX and non-AVX workloads onto separate nodes. If a CPU pooling scheme is adopted, it should be automated (or based on SLOs) rather than user-configured.
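
For illustration, a minimal sketch of that taints-and-tolerations approach; the avx=true taint/label key and the image name are only examples:

```yaml
# Reserve a set of nodes for AVX workloads, e.g.:
#   kubectl taint nodes <avx-node> avx=true:NoSchedule
#   kubectl label nodes <avx-node> avx=true
# Only Pods that tolerate the taint (and select the label) land there.
apiVersion: v1
kind: Pod
metadata:
  name: avx-workload
spec:
  nodeSelector:
    avx: "true"            # steer the Pod onto the labelled AVX nodes
  tolerations:
  - key: "avx"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"   # permit scheduling onto the tainted nodes
  containers:
  - name: avx-workload
    image: example.com/avx-workload:latest   # hypothetical image
```

The toleration only permits scheduling onto the tainted nodes; the nodeSelector (or an equivalent node affinity) is what actually steers the AVX workload there.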

@pdeva (Author) commented Aug 15, 2018

@ipuustin I requested access to both of the linked docs but still haven't received it.

@ipuustin (Member) commented Aug 16, 2018

I edited the document sharing settings a bit; it might work now.

@vishh (Member) commented Aug 23, 2018

This issue exists in any application environment that uses Intel's recent CPUs. What is the general solution offered by Intel for shared environments? Would it be possible to dynamically disable AVX instructions on nodes where they are not needed?
Kubernetes offers more flexibility for hiding this issue, but I'd first like to identify simple solutions that I'm hoping Intel has already found.

@ipuustin (Member) commented Oct 12, 2018

@pdeva: There's now a KEP which aims to mitigate the AVX issue (kubernetes/community#2739). Please take a look and tell us whether it might help you.

@fejta-bot commented Jan 10, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Feb 9, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@fejta-bot commented Mar 11, 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented Mar 11, 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
