
Enabling ResourceQuotas increases controller-manager CPU usage ~4x #60988

Open
shyamjvs opened this issue Mar 9, 2018 · 13 comments
Assignees
shyamjvs

Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

Comments


shyamjvs commented Mar 9, 2018

Based on #60589 (comment)

After enabling quotas (specifically, a pods-per-namespace quota) by default in our tests, we're continuously seeing failures like this:

Container kube-controller-manager-e2e-big-master/kube-controller-manager is using 0.531144723/0.5 CPU
not to have occurred

This may be a serious scalability issue with quotas and needs some digging.
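For context, a pods-per-namespace quota is just a ResourceQuota object with a hard `pods` count, which the quota controller then has to keep reconciled as pods come and go. Here is a minimal client-go sketch of the kind of object the tests turn on; the `createPodsQuota` helper, the quota name, and the limit of 100 are illustrative, not the tests' actual configuration:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsQuota creates a ResourceQuota capping the number of pods in
// a namespace. The quota controller must recompute and publish the used
// count as pods churn, which is the work being measured in these tests.
func createPodsQuota(ctx context.Context, cs kubernetes.Interface, ns string) error {
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "pods-quota", Namespace: ns},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				// Illustrative limit; the load test picks its own value.
				corev1.ResourcePods: resource.MustParse("100"),
			},
		},
	}
	_, err := cs.CoreV1().ResourceQuotas(ns).Create(ctx, quota, metav1.CreateOptions{})
	return err
}
```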

/assign @gmarek
(who enabled it; feel free to redirect this as appropriate)

cc @kubernetes/sig-scalability-bugs

k8s-ci-robot added the sig/scalability and kind/bug labels on Mar 9, 2018
shyamjvs added the priority/important-soon label on Mar 9, 2018

shyamjvs commented Mar 9, 2018

I'm not marking this as a release-blocker, since these failures only started once we began testing something new (which we weren't testing before).
However, feel free to override me if you think otherwise.


shyamjvs commented Mar 9, 2018

For now I'll send a revert disabling it. However, you can easily experiment against a 100-node cluster with the presubmit job I recently added. See #56032 (comment) for details.


shyamjvs commented Mar 9, 2018

One thing to note here: even before this change, we were seeing occasional flakes of this kind.

But after this change, we're seeing those failures continuously. My feeling is that this change made things worse on top of something that had already regressed earlier.

shyamjvs changed the title from "Performance test failing on 100-node cluster with ResourceQuotas enabled" to "Performance test failing on big cluster with ResourceQuotas enabled" on Mar 9, 2018
k8s-github-robot pushed a commit that referenced this issue Mar 9, 2018
…-e2es

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Use quotas in default performance tests"

This reverts commit c3c1020.

Ref #60988

/cc @gmarek 
/kind bug
/sig scalability
/priority critical-urgent

```release-note
NONE
```

tpepper commented Mar 9, 2018

@kubernetes/sig-scalability-bugs do you see this as a v1.10 milestone issue?


shyamjvs commented Mar 9, 2018

Not really, for the reason I mentioned in #60988 (comment).

shyamjvs commented:

Gathered some interesting evidence today. Look at how the CPU usage of controller-manager increased ~4x across runs 454 and 455:

[Graph: kube-controller-manager CPU usage with quotas enabled, increasing ~4x between runs 454 and 455]

shyamjvs commented:

From the commit diff, it seems most likely that enabling quotas (#60421) caused the mischief. I also looked at controller-manager logs from one of the newer runs and found many lines like these:

E0310 00:16:53.089141       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.137643       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.181949       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.257855       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
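Those errors are apiserver optimistic-concurrency conflicts: the quota controller's status write loses a race with another writer, so it has to re-fetch the object and try again, and each retry is another GET, usage recompute, and UPDATE. Under the load test's pod churn, nearly every sync can conflict, which would account for a chunk of the extra CPU. Below is a minimal sketch of the general retry-on-conflict pattern with client-go, not the controller's actual code; `cs`, `ctx`, `syncQuotaStatus`, and the recompute step are placeholders:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// syncQuotaStatus sketches the write path behind the errors above: a
// Conflict response ("the object has been modified") forces a fresh GET
// and another UpdateStatus attempt, so frequent conflicts turn one
// logical update into several round trips plus recomputation.
func syncQuotaStatus(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		quota, err := cs.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Placeholder: the real controller recomputes quota.Status.Used
		// from its caches here before writing.
		_, err = cs.CoreV1().ResourceQuotas(ns).UpdateStatus(ctx, quota, metav1.UpdateOptions{})
		return err // a Conflict error here triggers another full attempt
	})
}
```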

shyamjvs self-assigned this on Mar 12, 2018
shyamjvs commented:

Found one final piece of evidence proving that the problem is indeed caused by resource quotas. Look at how our scalability job running against HEAD saw CPU usage fall back to normal (across runs 11485 and 11486), with the only change being that quotas were disabled:

[Graph: kube-controller-manager CPU usage returning to normal between runs 11485 and 11486 after quotas were disabled]

shyamjvs changed the title from "Performance test failing on big cluster with ResourceQuotas enabled" to "Enabling ResourceQuotas increases controller-manager CPU usage ~4x" on Mar 12, 2018
shyamjvs commented:

cc @davidopp
(because I somehow remember you being associated with this feature :)


gmarek commented Mar 12, 2018

cc @deads2k


deads2k commented Mar 12, 2018

> cc @deads2k

@derekwaynecarr point of interest

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 23, 2018
shyamjvs commented:

/remove-lifecycle stale
/lifecycle frozen

@yliaog

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Aug 23, 2018