kube-controller-manager v1.26 high cpu usage #118706
Comments
/sig api-machinery |
/sig scheduling @alculquicondor I think sig-scheduling is handling the cronjob and job controllers, right? |
No :) I personally look at Job, though /remove-sig scheduling |
Cronjob v2 got to stable in 1.22, so if anything, it's something more recent. |
My wild guess would be that somehow kcm is reconciling all CronJobs even though they are not due for schedule. @zigmund any chance you can look at the logs and see whether a particular cronjob with a low frequency is being reconciled continuously? |
@alculquicondor But I see multiple logs about the same jobs:
Also there are many error logs:
But I also see all the same logs in other healthy clusters. |
I started kcm locally for debug and that's what I found so far:
Now is
Now is |
Workaround - recreated these cronjobs and now kcm works fine. |
@zigmund thanks for the detailed debugging. Just for clarification Feel free to submit a PR if you find the fix. /triage accepted |
It happened again after another controller-manager restart. |
Did you have a chance to test in 1.27? |
@alculquicondor, unfortunately it's not possible to upgrade the cluster in the near future. |
I see. In any case @soltysh probably has more context about recent changes in CronJob that could solve the issue and can potentially be cherry-picked. |
/assign |
This af1c9e4 seems to have fixed this problem. I'm looking at ways to fix it. I will give some conclusions later. |
We need to pay attention to the calculation method of This is the unit test case I use for testing. You can use it for verification if you want. |
- bug-case reproduced
There is another interesting thing here: kubernetes/pkg/controller/cronjob/utils.go Lines 140 to 142 in b46a3f8
kubernetes/pkg/controller/cronjob/cronjob_controllerv2.go Lines 656 to 665 in b46a3f8
|
don't set TZ in the string schedule (that will be forbidden in future releases), set it in the cronjob spec |
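For anyone migrating existing schedules, here is a minimal sketch of pulling a time zone prefix out of a schedule string so it can be moved to the `timeZone` field of the CronJob spec. `splitScheduleTZ` is a hypothetical helper and not part of Kubernetes; it assumes the prefix has the form `TZ=<zone>` or `CRON_TZ=<zone>` followed by whitespace. The zone and schedule in `main` are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

// splitScheduleTZ separates a leading "TZ=<zone>" or "CRON_TZ=<zone>" prefix
// from a cron schedule string. Hypothetical helper for illustration only.
// When no well-formed prefix is present it returns the trimmed input unchanged
// and found == false.
func splitScheduleTZ(schedule string) (zone, rest string, found bool) {
	s := strings.TrimSpace(schedule)
	for _, prefix := range []string{"CRON_TZ=", "TZ="} {
		if !strings.HasPrefix(s, prefix) {
			continue
		}
		body := s[len(prefix):]
		// The zone name ends at the first whitespace after the prefix.
		i := strings.IndexAny(body, " \t")
		if i <= 0 {
			return "", s, false // malformed: empty zone or nothing after it
		}
		rest = strings.TrimSpace(body[i:])
		if rest == "" {
			return "", s, false // malformed: zone without a schedule
		}
		return body[:i], rest, true
	}
	return "", s, false
}

func main() {
	zone, sched, found := splitScheduleTZ("CRON_TZ=Europe/Moscow 0 3 * * *")
	if found {
		// These two values would go into the CronJob spec separately.
		fmt.Printf("spec.timeZone: %q\n", zone)
		fmt.Printf("spec.schedule: %q\n", sched)
	}
}
```

The sketch only splits the string; it does not check that the zone name is valid or that the remaining schedule parses.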
@sxllwx thx for your effort with the reproducer. I'll make sure to prioritize this next week on Monday to figure out possible fixes. |
Indeed #110838 fixed a lot of issues around the calculations of time schedules, especially around its performance. Additionally, PRs #118724 and #118940, plus #121327, which is still awaiting merge (and where I've just added the test case from @sxllwx), improve both accuracy and performance. I'll check which ones we can safely backport to previous releases once the last one merges. |
/assign |
@soltysh nice, looking forward to the patch release. |
All related PRs are merged, looks like we can close it now. |
What happened?
Upgraded cluster from v1.24.6 to v1.26.4 (via v1.25.9) and after that kube-controller-manager started to eat all available CPU:
Also I see a massive increase in the rate of the workqueue_retries_total{name="cronjob"} metric - from 2-3 per second to 20-30k:
Dumped a pprof profile from kube-controller-manager and also see massive cronjob-related load:
What did you expect to happen?
Same CPU usage of kube-controller-manager.
How can we reproduce it (as minimally and precisely as possible)?
IDK. We have 7 similar clusters of the same version - the issue is present in only one of them.
Anything else we need to know?
~170 cronjobs
~300 jobs
Deleted some cronjobs and old failed jobs - didn't see any effect.
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)