OomKilled when starting kyverno on existing clusters #1540
Comments
@JimBugwadia Could this error result in OOM, as the client cannot connect to the server?
We had a few issues reported for the metrics server; we fixed those by explicitly ignoring these errors. Not sure why we are still seeing this.
Yes, that is suspicious. In the prior issues, those calls were causing the init container to exit. @megakid - can you try removing the invalid API service registration entry as described here:
Hi - thanks for the prompt responses. A quick update: the GOGC env var definitely helped things, but it doesn't guarantee a successful startup. I will try your suggestion next week @JimBugwadia and report back.
No dice. OOMKilled again.
@megakid - thanks for checking. We will investigate this further and get back.
No problem. We have this pod running seemingly fine in another cluster, so it's definitely something environmental.
@megakid Sorry for the late response, can you upgrade to v1.3.3 and try again? We have optimized policy report generation in this version.
Thanks @realshuting - I upgraded to v1.3.3 and still got the same OOMKilled error.
Thanks for your feedback! Can you share more info about your setup?
Can you uninstall all policies and see if Kyverno can come up successfully?
Still OOMKilled after deleting the policies.
Interesting... Kyverno wouldn't do anything if there's no policy installed. I'm curious to see profiling data, but in this case I don't know if that would work, as the Pod never comes up. Do you have the default GOGC (set to 100) in the uat cluster?
I'm also experiencing OOMKilled, also with very few policies (2).
@snir911 - you can add GOGC to the env list of the Kyverno container:

```yaml
spec:
  containers:
    - env:
        - name: GOGC
          value: "25"
```

If the pod comes up successfully, can you share heap and goroutine dumps?
I found a similar issue in fluxcd, where the user had Dex installed along with a massive amount of other resources. Do you have Dex installed by any chance?
I just thought I'd mention that the Helm chart does not appear to support passing additional environment variables into the containers. I mention it because I'm trying to troubleshoot a similar, or the same, issue on a 70-node cluster running ~800 pods, where I see plenty of memory available. Also, I don't have any policies installed, and I've disabled those that started shipping with 1.3.3, since I've set podSecurityStandard: disabled.
@windowsrefund - thanks for adding your case.
Can you please open a feature request to support this? For now, have you tried adding the env manually to the Kyverno Pod?
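A quick way to do that without editing the manifest by hand (a sketch; the Deployment name assumes the default install):

```sh
# kubectl set env patches the Deployment in place, which triggers a rollout
# with the new variable set on the container.
kubectl -n kyverno set env deployment/kyverno GOGC=25
```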
Just a follow-up to mention that 1.3.4 is still OOMing even after I manually injected the GOGC=25 environment variable. That said, I never saw that as a real solution but wanted to see the behavior. In case the question comes up: I injected the var into the deployment, then scaled it to 0 and back to 1. Once I saw the same OOMKilled status, I verified GOGC was set in the pod.
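For anyone repeating that check, one way to confirm the variable inside the running container (resource names assume the default install):

```sh
# Print the container environment and filter for GOGC.
kubectl -n kyverno exec deploy/kyverno -- env | grep GOGC
```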
Actually, to my surprise, the pod has been running for about 2 hours after 17 restarts. I'll report back when I have a better understanding of whether or not GOGC has a hand in this success.
It does appear that lowering GOGC from the default of 100 is needed on the particular cluster I'm working with. However, reducing this value still takes many restarts (I've seen as high as 21) before the kyverno pod is left in a Running status. Can anyone provide any insight as to what is happening until then?
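One way to watch the restart loop and confirm each exit really is the OOM killer (a sketch; the label selector is an assumption, adjust for your install):

```sh
# Watch restarts in real time, then inspect the last terminated state.
# An OOM kill shows Reason: OOMKilled and exit code 137 (128 + SIGKILL).
kubectl -n kyverno get pods -w
kubectl -n kyverno describe pod -l app=kyverno | grep -A5 'Last State'
```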
@windowsrefund - I'm trying to identify what exactly causes the issue. As mentioned above, do you have Dex installed by any chance?
No
@realshuting hi, does #1804 fix this OOMKilled issue?
When I upgraded from 1.3.4 to 1.3.5, the kyverno preinstaller was restarting several times, running OOM. I deleted the helm deployment and redeployed it but still had the issue. It only worked after deleting all the CRDs related to kyverno and then reinstalling; not sure if anyone else has encountered this. The reportchangerequests.kyverno.io CRD had over 4800 items; not sure what it's tracking.
Hi @vakkur - this could be a similar issue to #1759 (comment). Do you have CronJobs running in your cluster?
@realshuting thank you for your reply. I do not have any cron jobs running in my cluster. The report change requests are even higher in my test cluster; would this cause the issue when I am trying to upgrade?
It could take time for the init container to clean up all these resources. What you can do is delete the CRDs and reinstall Kyverno from a clean state:

```
✗ k get crd | grep report
clusterpolicyreports.wgpolicyk8s.io        2021-04-28T01:08:38Z
clusterreportchangerequests.kyverno.io     2021-04-28T01:08:38Z
policyreports.wgpolicyk8s.io               2021-04-28T01:08:39Z
reportchangerequests.kyverno.io            2021-04-28T01:08:39Z
```

Is your test cluster running 1.3.4?
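A minimal sketch of that cleanup, covering the four report CRDs listed above (note that deleting a CRD also deletes every instance of it, so reinstall Kyverno afterwards to recreate them):

```sh
# Remove the Kyverno/wgpolicyk8s report CRDs and all of their instances;
# a fresh Kyverno install will recreate the CRDs from a clean state.
kubectl delete crd \
  clusterpolicyreports.wgpolicyk8s.io \
  clusterreportchangerequests.kyverno.io \
  policyreports.wgpolicyk8s.io \
  reportchangerequests.kyverno.io
```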
Yes, my cluster is running on 1.3.4. Currently, I am running this script to delete the change requests:

```sh
for each in $(kubectl get reportchangerequests -n kyverno -o jsonpath="{.items[*].metadata.name}"); do
  kubectl delete reportchangerequest "$each" -n kyverno
done
```
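If the intent is simply to clear the whole backlog, an equivalent one-liner (assuming every RCR in the namespace should go):

```sh
# Delete all report change requests in the kyverno namespace in one call.
kubectl delete reportchangerequests --all -n kyverno
```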
That script should do as well. If you see this issue again with 1.3.5, can you please attach the Kyverno logs and the RCR details? If there are lots of RCRs, at least check the resources they report on.
@realshuting from v1.3.5 onwards, will it automatically delete the RCRs after a period of time?
Yes, RCRs are cleaned up after they are merged into policy reports.
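A quick way to watch that backlog drain (the full plural resource name is used here; the rcr short name should also resolve, but that's an assumption):

```sh
# Count outstanding report change requests; the number should trend toward
# zero as they are merged into policy reports.
kubectl -n kyverno get reportchangerequests --no-headers 2>/dev/null | wc -l
```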
@snir911 - sorry, I mixed this issue up with #1731 (comment). I sent a PR #1878 to remove "Secret" from the default resource cache. In the long term, we should find a way to customize the informer cache to reduce memory usage.
@snir911 - the PR was merged, here's the image tag with the fix:
@realshuting thanks, I'll update once I'm able to test it.
Happy to report I've revisited this OOM issue and am now running 1.3.6 on my cluster. The pod fired up quickly and appears to be stable. I'll let it burn in for a bit but wanted to share the news.
Great @windowsrefund, thanks for verifying! I'm closing it; please reopen if the same issue is observed again.
Software version numbers
State the version numbers of applications involved in the bug.
Describe the bug
OOMKilled on startup of the main Kyverno pod. We tried increasing the memory limit to 2 GiB, but no change. The only workaround for us was to introduce the environment variable GOGC=25.

To Reproduce
Steps to reproduce the behavior:

```
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/v1.3.2-rc1/definitions/release/install.yaml
```

The Deployment/Pod is killed due to OOMKilled; it then enters a CrashLoopBackOff as expected but never starts successfully. This is the main container, not the kyverno-pre init container.

We've tried this on 2 of our internal clusters, dev and uat. Both have plenty of nodes with resources to spare; dev runs 175 pods and uat runs 215 pods, so they are pretty similar in size.

The only way we can get Kyverno to start is to add the env var GOGC="25" (default is "100"), which causes the GC to be triggered more often inside the pod.
Expected behavior
Does not get OOMKilled.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Logs from pod