Enabling ResourceQuotas increases controller-manager CPU usage ~4x #60988

Open
shyamjvs opened this Issue Mar 9, 2018 · 13 comments

Member

shyamjvs commented Mar 9, 2018

Based on #60589 (comment)

After enabling quotas (specifically pods per namespace) by default in our tests, we're consistently seeing failures like this:

Container kube-controller-manager-e2e-big-master/kube-controller-manager is using 0.531144723/0.5 CPU
not to have occurred
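For readers skimming the numbers, the message above reports measured usage against the configured limit. Here is a minimal sketch of pulling the two values out and computing the overshoot. The regex and the `cpu_overshoot` helper are hypothetical and written only against the message shape quoted above; they are not part of the e2e framework.

```python
import re

# Assumed message shape, taken from the failure quoted above (illustrative only).
USAGE_RE = re.compile(
    r"Container (?P<name>\S+) is using (?P<used>[\d.]+)/(?P<limit>[\d.]+) CPU"
)

def cpu_overshoot(message):
    """Return (container, used, limit, used/limit), or None if the line does not match."""
    m = USAGE_RE.search(message)
    if m is None:
        return None
    used = float(m.group("used"))
    limit = float(m.group("limit"))
    return m.group("name"), used, limit, used / limit

msg = ("Container kube-controller-manager-e2e-big-master/kube-controller-manager "
       "is using 0.531144723/0.5 CPU")
name, used, limit, ratio = cpu_overshoot(msg)
# Report how far above the limit the measured usage is, as a percentage.
print(f"{name}: {used} vs limit {limit} ({(ratio - 1) * 100:.1f}% over)")
```

In this sample the measured usage is only about 6% above the limit, which is why the test fails narrowly but repeatedly.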

This may be a serious issue with the scalability of quotas and needs some digging into.

/assign @gmarek
(who enabled it.. feel free to redirect this as apt)

cc @kubernetes/sig-scalability-bugs

Member

shyamjvs commented Mar 9, 2018

I'm not marking this as a release blocker, since these failures only started appearing because we began testing something we weren't testing before.
However, feel free to override me if you think otherwise.

Member

shyamjvs commented Mar 9, 2018

For now I'll send a revert disabling it. However, you can easily experiment against a 100-node cluster with the presubmit job I recently added. See #56032 (comment) for details.

Member

shyamjvs commented Mar 9, 2018

One thing to note here is that, even before this change we were seeing flakes of this kind. E.g:

But after this change, we're seeing them continuously. My feeling is that this made things worse on top of something that had already gone wrong earlier.

@shyamjvs shyamjvs changed the title from Performance test failing on 100-node cluster with ResourceQuotas enabled to Performance test failing on big cluster with ResourceQuotas enabled Mar 9, 2018

k8s-merge-robot added a commit that referenced this issue Mar 9, 2018

Merge pull request #60989 from shyamjvs/disable-quotas-in-scalability-e2es

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Use quotas in default performance tests"

This reverts commit c3c1020.

Ref #60988

/cc @gmarek 
/kind bug
/sig scalability
/priority critical-urgent

```release-note
NONE
```
Contributor

tpepper commented Mar 9, 2018

@kubernetes/sig-scalability-bugs do you see this as a v1.10 milestone issue?

Member

shyamjvs commented Mar 9, 2018

Not really.. for the reason I mentioned in #60988 (comment)

Member

shyamjvs commented Mar 12, 2018

Gathered some interesting evidence today. Look at how the CPU usage of controller-manager increased ~4x across runs 454 and 455:

[Graph: controller-manager CPU usage with quotas enabled]

Member

shyamjvs commented Mar 12, 2018

From the commit diff, it seems most likely that enabling quotas (#60421) caused the regression. Also, I looked at controller-manager logs from one of the newer runs and found many lines like these:

E0310 00:16:53.089141       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.137643       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.181949       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
E0310 00:16:53.257855       1 resource_quota_controller.go:255] Operation cannot be fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": the object has been modified; please apply your changes to the latest version and try again
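To gauge how often these update conflicts occur, the log lines can be tallied per quota object. The following is a minimal sketch, assuming the line format shown above; `LINE_RE` and `summarize_conflicts` are hypothetical names, and the sample input simply rebuilds the four quoted lines.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern written against the log lines quoted above (an assumption about
# the format, not a general klog parser).
LINE_RE = re.compile(
    r'^E\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ '
    r'resource_quota_controller\.go:\d+\] Operation cannot be fulfilled on '
    r'resourcequotas "(?P<quota>[^"]+)": the object has been modified'
)

def summarize_conflicts(lines):
    """Count conflict errors per quota and return the time span they cover.

    Only the time of day is parsed, so a log that crosses midnight would
    need extra handling.
    """
    counts = Counter()
    stamps = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip unrelated log lines
        counts[m.group("quota")] += 1
        stamps.append(datetime.strptime(m.group("time"), "%H:%M:%S.%f"))
    span = (max(stamps) - min(stamps)).total_seconds() if len(stamps) > 1 else 0.0
    return counts, span

# Rebuild the four lines quoted above from their timestamps.
TEMPLATE = ('E0310 {t}       1 resource_quota_controller.go:255] Operation cannot be '
            'fulfilled on resourcequotas "e2e-tests-load-30-nodepods-1-vxb5m-quota": '
            'the object has been modified; please apply your changes to the latest '
            'version and try again')
sample = [TEMPLATE.format(t=t) for t in
          ("00:16:53.089141", "00:16:53.137643", "00:16:53.181949", "00:16:53.257855")]

counts, span = summarize_conflicts(sample)
for quota, n in counts.items():
    print(f"{quota}: {n} conflicts within {span:.3f}s")
```

In the excerpt, all four conflicts hit the same quota object within roughly 0.17 seconds. That is consistent with repeated updates to a single ResourceQuota, though the excerpt alone does not establish the cause.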

@shyamjvs shyamjvs self-assigned this Mar 12, 2018

Member

shyamjvs commented Mar 12, 2018

Found one final piece of evidence that this is indeed a problem caused by resource quotas. Look at how our scalability job running against HEAD fell back to normal CPU usage (across runs 11485 and 11486), with the only change being the disabling of quotas:

[Graph: controller-manager CPU usage without quotas]

@shyamjvs shyamjvs changed the title from Performance test failing on big cluster with ResourceQuotas enabled to Enabling ResourceQuotas increases controller-manager CPU usage ~4x Mar 12, 2018

Member

shyamjvs commented Mar 12, 2018

cc @davidopp
(because I somehow remember you being associated with the feature :)

Member

gmarek commented Mar 12, 2018

@deads2k

Contributor

deads2k commented Mar 12, 2018

cc @deads2k

@derekwaynecarr point of interest

fejta-bot commented Aug 23, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Member

shyamjvs commented Aug 23, 2018

/remove-lifecycle stale
/lifecycle frozen
