
Enabling ResourceQuotas increases controller-manager CPU usage ~4x #60988

Open
shyamjvs opened this issue Mar 9, 2018 · 13 comments
Assignees
shyamjvs

Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

Comments


shyamjvs commented Mar 9, 2018

Based on #60589 (comment)

After enabling quotas (specifically, a pods-per-namespace quota) by default in our tests, we're continuously seeing failures like this:

Container kube-controller-manager-e2e-big-master/kube-controller-manager is using 0.531144723/0.5 CPU
not to have occurred

This may be a serious scalability issue with quotas and needs some digging.
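For context, a pods-per-namespace quota is just a ResourceQuota object with a hard `pods` count, which the quota controller then has to keep reconciled as pods come and go. Here is a minimal client-go sketch of the kind of object the tests turn on; the `createPodsQuota` helper, the quota name, and the limit of 100 are illustrative, not the tests' actual configuration:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsQuota creates a ResourceQuota capping the number of pods in
// a namespace. The quota controller must recompute and publish the used
// count as pods churn, which is the work being measured in these tests.
func createPodsQuota(ctx context.Context, cs kubernetes.Interface, ns string) error {
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "pods-quota", Namespace: ns},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				// Illustrative limit; the load test picks its own value.
				corev1.ResourcePods: resource.MustParse("100"),
			},
		},
	}
	_, err := cs.CoreV1().ResourceQuotas(ns).Create(ctx, quota, metav1.CreateOptions{})
	return err
}
```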

/assign @gmarek
(who enabled it; feel free to redirect this as appropriate)

cc @kubernetes/sig-scalability-bugs

k8s-ci-robot added the sig/scalability and kind/bug labels on Mar 9, 2018
shyamjvs added the priority/important-soon label on Mar 9, 2018

shyamjvs commented Mar 9, 2018

I'm not marking this as a release-blocker, since these failures only started once we began testing something new (which we weren't testing before).
However, feel free to override me if you think otherwise.


shyamjvs commented Mar 9, 2018

For now I'll send a revert disabling it. However, you can easily experiment against a 100-node cluster with the presubmit job I recently added. See #56032 (comment) for details.


shyamjvs commented Mar 9, 2018

One thing to note here: even before this change, we were seeing occasional flakes of this kind.

But after this change, we're seeing those failures continuously. My feeling is that this change made things worse on top of something that had already regressed earlier.

shyamjvs changed the title from "Performance test failing on 100-node cluster with ResourceQuotas enabled" to "Performance test failing on big cluster with ResourceQuotas enabled" on Mar 9, 2018
k8s-github-robot pushed a commit that referenced this issue Mar 9, 2018
…-e2es

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Use quotas in default performance tests"

This reverts commit c3c1020.

Ref #60988

/cc @gmarek 
/kind bug
/sig scalability
/priority critical-urgent

```release-note
NONE
```

tpepper commented Mar 9, 2018

@kubernetes/sig-scalability-bugs do you see this as a v1.10 milestone issue?


shyamjvs commented Mar 9, 2018

Not really, for the reason I mentioned in #60988 (comment).

shyamjvs commented:

Gathered some interesting evidence today. Look at how the CPU usage of controller-manager increased ~4x across runs 454 and 455:

[Graph: kube-controller-manager CPU usage with quotas enabled, increasing ~4x between runs 454 and 455]

shyamjvs commented:

From the commit diff, it seems most likely that enabling quotas (#60421) caused the mischief. I also looked at controller-manager logs from one of the newer runs and found many lines like these:

E0310 00:16:53.089141       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.137643       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.181949       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.257855       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
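Those errors are apiserver optimistic-concurrency conflicts: the quota controller's status write loses a race with another writer, so it has to re-fetch the object and try again, and each retry is another GET, usage recompute, and UPDATE. Under the load test's pod churn, nearly every sync can conflict, which would account for a chunk of the extra CPU. Below is a minimal sketch of the general retry-on-conflict pattern with client-go, not the controller's actual code; `cs`, `ctx`, `syncQuotaStatus`, and the recompute step are placeholders:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// syncQuotaStatus sketches the write path behind the errors above: a
// Conflict response ("the object has been modified") forces a fresh GET
// and another UpdateStatus attempt, so frequent conflicts turn one
// logical update into several round trips plus recomputation.
func syncQuotaStatus(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		quota, err := cs.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Placeholder: the real controller recomputes quota.Status.Used
		// from its caches here before writing.
		_, err = cs.CoreV1().ResourceQuotas(ns).UpdateStatus(ctx, quota, metav1.UpdateOptions{})
		return err // a Conflict error here triggers another full attempt
	})
}
```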

shyamjvs self-assigned this on Mar 12, 2018
shyamjvs commented:

Found one final piece of evidence proving that the problem is indeed caused by resource quotas. Look at how our scalability job running against HEAD saw CPU usage fall back to normal (across runs 11485 and 11486), with the only change being that quotas were disabled:

[Graph: kube-controller-manager CPU usage returning to normal between runs 11485 and 11486 after quotas were disabled]

shyamjvs changed the title from "Performance test failing on big cluster with ResourceQuotas enabled" to "Enabling ResourceQuotas increases controller-manager CPU usage ~4x" on Mar 12, 2018
shyamjvs commented:

cc @davidopp
(because I somehow remember you being associated with this feature :)


gmarek commented Mar 12, 2018

cc @deads2k


deads2k commented Mar 12, 2018

> cc @deads2k

@derekwaynecarr point of interest

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 23, 2018
shyamjvs commented:

/remove-lifecycle stale
/lifecycle frozen

@yliaog

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Aug 23, 2018