[Bug] Add circuit breakers for temporary resources to avoid bringing down the system #10308
|
This pretty much took down my cluster. We had ephemeralreports piling up in etcd (60k+, several gigabytes), overloading etcd and thus sprinkling failures all over the cluster. I had to delete thousands of them manually.
|
This is related to the issue where ClusterPolicies with many generate rules create far more UpdateRequest resources than expected. That issue must be fixed first; otherwise, anyone with many generate rules will hit any reasonable limit almost immediately.
|
@black-snow - are you seeing any errors or issues in the reports-controller? For ephemeralreports, this situation occurs when the admission-controller or background-controller produces ephemeralreports faster than the reports-controller can consume them, or when the reports-controller stops processing for any reason. v1.12.3 should help prevent this, and we are working on an improvement to add circuit breakers to the producers. However, we still need to characterize why the reports-controller stops processing ephemeralreports, so any logs or additional data will help.
|
Sadly, I didn't see anything suspicious on that end. And by now, magically, everything's fine again. I see ephemeral reports popping up, but they usually disappear again after about 20 seconds.
|
@realshuting are changes being made besides the circuit breakers/cleanup jobs to improve the handling of ephemeralreports? In our case, we've had several issues with them in 1.12.2, and some with 1.12.3.

**1.12.2**

In 1.12.2, policies targeting resources with a high rate of admission requests would generate a ton of ephemeralreports. In our case, we were using 2 policies for ArgoCD applications, which would trigger at a high rate. (We can reduce that rate; we just want to provide this as an example of high-rate admission requests leading to the problem.) Initially everything would work fine; however, after a couple of hours, ephemeralreports would eventually start to accumulate, despite Kyverno's pods not seeing much resource usage. Important: even after we removed the 2 ArgoCD policies, the accumulation continued. Logs were normal during all of this, except for the admission controllers reporting a high amount of errors.

With 1.12.2, we reproduced the same problem detailed above in 3 different clusters, all leading to eventual degradation of the cluster and infinite accumulation of ephemeralreports.

**1.12.3**

In 1.12.3, we haven't been able to reproduce the problem yet. That being said, clusters that were already overloaded with ephemeralreports can't rely on a naive cleanup such as:

```sh
COUNT=$(kubectl get ephemeralreports.reports.kyverno.io -A | wc -l)
if [ "$COUNT" -gt 10000 ]; then
  echo "too many ephemeralreports found ($COUNT), cleaning up..."
  kubectl delete ephemeralreports.reports.kyverno.io -A --all
fi
```

as this will simply time out. We deleted them in parallel instead:

```sh
kubectl get ephemeralreport.reports.kyverno.io -n argocd -o=name | \
  parallel -j 20 'kubectl delete --ignore-not-found=true {}'
```

**Conclusion**

Would love to hear if anything more was done or is planned.
|
@jemag - can you please share the complete logs for the reports-controller? To answer your question: yes, circuit breakers are an improvement over the cleanup jobs, but they don't really address the root cause. Something @eddycharly recently found, and is in the process of fixing, is that the reports-controller can stop processing requests when its lease is disrupted, or when the api-server is unresponsive under high load, and does not gracefully recover. Hence, the logs from your reports-controller will help.
|
Thank you for the added information @JimBugwadia. I can also add the logs of one of the admission controllers, in case there is interest.
|
Thanks @jemag! Yes, this is the issue that @eddycharly is chasing down. Here are the relevant lines in the log: the controller stops its worker pool but does not exit, which means ephemeralreports are no longer processed. We will plan a patch with these fixes.
|
@jemag - there is a 1.12.4-rc1 release available with a fix that addresses the issue you are seeing. Is it possible for you to test this in your environment?
|
I'm seeing a similar issue: ephemeral reports are being created in massive numbers, rapidly. I've tried 1.12.4-rc1 and it isn't resolving it; I'm still seeing the numbers increase quite rapidly, and the reports controller is crashlooping.

Logs from the reports controller:

```
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.engine","caller":"internal/engine.go:45","msg":"setup engine..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v2alpha1.GlobalContextEntry from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup","caller":"internal/controller.go:32","msg":"starting controller","name":"kyverno-events","workers":3}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"EventGenerator","caller":"event/controller.go:101","msg":"start"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease kyverno/kyverno-reports-controller..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup","caller":"internal/controller.go:32","msg":"starting controller","name":"global-context","workers":1}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:28Z","logger":"global-context","caller":"controller/run.go:58","msg":"starting ..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:28Z","logger":"global-context.worker","caller":"controller/run.go:71","msg":"starting worker","id":0}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"leaderelection/leaderelection.go:260","msg":"successfully acquired lease kyverno/kyverno-reports-controller"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.leader-election","caller":"leaderelection/leaderelection.go:97","msg":"still leading","id":"kyverno-reports-controller-5578558d9b-5wg8k"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.leader-election","caller":"leaderelection/leaderelection.go:83","msg":"started leading","id":"kyverno-reports-controller-5578558d9b-5wg8k"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v2beta1.PolicyException from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.Policy from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.ClusterPolicy from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.Namespace from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:41Z","logger":"klog","caller":"trace/trace.go:236","msg":"Trace[1689897412]: \"Reflector ListAndWatch\" name:k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229 (12-Jun-2024 10:59:28.534) (total time: 13088ms):\nTrace[1689897412]: ---\"Objects listed\" error:<nil> 12598ms (10:59:41.132)\nTrace[1689897412]: [13.08835733s] [13.08835733s] END"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:41Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:293","msg":"kind is not supported","gvk":"events.k8s.io/v1, Kind=Event"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:186","msg":"start watcher ...","gvr":"/v1, Resource=persistentvolumes","gvk":"/v1, Kind=PersistentVolume","resourceVersion":"1965490989"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:188","msg":"creating watcher...","gvr":"/v1, Resource=persistentvolumes","gvk":"/v1, Kind=PersistentVolume","resourceVersion":"1965490989"}
```

Testing the use of the new reports controller in a different cluster at least appears to make it so etcd isn't filling up anymore.
|
@JimBugwadia will do.

@knechtionscoding as per my comment above, the current cleanup strategy likely won't work for such a large amount of ephemeralreports. You will probably need to clean these up manually before things can get back to normal.
|
For those who encountered the ephemeralreports pile-up issue, check whether your application updates its deployments frequently. You can set a watch on (cluster)ephemeralreports and grep the labels for the target resource. If you see multiple ephemeralreports created for the same resource UID label, you are likely hitting this issue.

To temporarily "fix" the issue, you can turn admission reports off. The policy reports will still be generated through the background scan process and refreshed upon the next reconciliation (1h by default). There are several ways to turn admission reports off, e.g. via the Helm chart:
https://github.com/kyverno/kyverno/blob/main/charts/kyverno/values.yaml#L617
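A minimal sketch of such a check, assuming the reports expose the target resource's UID in a label (the exact label name varies by version; `audit.kyverno.io/resource.uid` in the commented usage below is an assumption). The testable part is just the pipeline that reads `NAME UID` pairs and prints any UID that appears on more than one report:

```shell
#!/bin/sh
# Print resource UIDs that own more than one ephemeralreport.
# Input: lines of "REPORT_NAME RESOURCE_UID" on stdin.
count_dupes() {
  awk '{ seen[$2]++ } END { for (u in seen) if (seen[u] > 1) print u }'
}

# Hypothetical usage against a live cluster (label name is an assumption):
# kubectl get ephemeralreports.reports.kyverno.io -A --no-headers \
#   -o custom-columns='NAME:.metadata.name,UID:.metadata.labels.audit\.kyverno\.io/resource\.uid' \
#   | count_dupes
```

If the same UID keeps showing up, the producing controller is emitting reports for that resource faster than they are consumed.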
|
|
@JimBugwadia haven't seen the error happen again with the RC so far. Can we expect an official 1.12.4 version coming out soon?
|
Hi @jemag, yes, we plan to release the GA version soon, unless any other issues come up.
Hi @FernandoMiguel - have you checked my comment above? Do you have reports enabled for admission events? |
|
@realshuting we have the default upstream config.
|
@FernandoMiguel - does the total number of ephemeralreports grow? From the last-30-minutes graph, I can see there were more UPDATE events on the Deployment, but I can't tell the count of ephemeralreports from the graph. If the total count grows, can you tune the report controller's flag?
|
@FernandoMiguel - do you have rough numbers for the cluster size, i.e., the number of nodes, pods (pod controllers), and policies?
|
@realshuting these metrics are from a test cluster, with not much traffic or node churn.

@realshuting is this it? https://github.com/kyverno/kyverno/blob/e64df59df/charts/kyverno/values.yaml#L632-L636
It's a container flag, see here:
You can configure it via extraArgs.
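For reference, a values override for this could look something like the sketch below. Both the key path (`reportsController.extraArgs`) and the flag name (`--aggregationWorkers`) are assumptions based on this thread, not confirmed names; check your chart version's `values.yaml` and the reports-controller's `--help` output for the exact spelling.

```yaml
# Hypothetical Helm values sketch: raise the reports-controller worker count.
# Key path and flag name are illustrative assumptions; verify before use.
reportsController:
  extraArgs:
    - --aggregationWorkers=20
```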
Increasing the number of workers helped, as the spikes in the graph below show. However, Kubernetes API resource usage is still significantly elevated. (Edit: fixed by Reconcile Optimization in Argo CD.)
|
Thanks for confirming @ustuehler!
We have introduced the reports-server in 1.12 as an alternative storage backend; you can check the blog post for details: https://kyverno.io/blog/2024/05/29/kyverno-reports-server-the-ultimate-solution-to-scale-reporting/ There's also a CNCF livestream this Wednesday (Jun 19, 9:00 – 10:00 AM PT) on this topic.
|
Updates: if you are still seeing constant creation of ephemeralreports after upgrading to 1.12.4, you can:
|
|
#10595 is scheduled for 1.12.6. The other 2 tasks are completed, so closing.











Kyverno Version
1.12.0
Description
Temporary resources such as UpdateRequests and EphemeralReports, which are used to apply policies, can overload etcd/the API server if the background controller or the reports controller is compromised.
To deal with unexpected situations gracefully without bringing down the cluster, it would be good to add circuit breakers that stop creating these resources once the total object count exceeds a threshold. We can have a periodic job query the count for both resources and write it to a ConfigMap, so that the current object count can be shared across the different instances/controllers (admission, background, reports).
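The proposed flow could be sketched as below. The threshold value (10000), the ConfigMap name (`kyverno-object-counts`), and the commented `kubectl` invocations are illustrative assumptions, not the actual design:

```shell
#!/bin/sh
# Sketch of the proposed circuit breaker: trip when the object count
# exceeds a threshold. All names and values here are illustrative.
count_exceeds_threshold() {
  # $1 = current count, $2 = threshold; succeeds when the breaker should trip
  [ "$1" -gt "$2" ]
}

# Periodic job (illustrative): publish the current count to a ConfigMap
# so every controller instance can read the same shared value.
# COUNT=$(kubectl get ephemeralreports.reports.kyverno.io -A --no-headers | wc -l)
# kubectl -n kyverno create configmap kyverno-object-counts \
#   --from-literal=ephemeralreports="$COUNT" \
#   --dry-run=client -o yaml | kubectl apply -f -

# Producer side (illustrative): skip creation while the breaker is open.
if count_exceeds_threshold "${COUNT:-0}" 10000; then
  echo "circuit open: skipping ephemeralreport creation"
fi
```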
Reported issues:
Tasks list:
Slack discussion
No response
Troubleshooting