
[Bug] Add circuit breakers for temporary resources to avoid bringing down the system #10308

Closed
5 tasks done
realshuting opened this issue May 24, 2024 · 28 comments
Assignees
Labels
enhancement New feature or request

Comments

@realshuting
Member

realshuting commented May 24, 2024

Kyverno Version

1.12.0

Description

Temporary resources such as updaterequests and ephemeralreports, which are used to apply policies, could potentially overload etcd/the API server if the background or report controller is compromised.

To handle unexpected situations gracefully without bringing down the cluster, it would be good to add circuit breakers that stop creating these resources once the total object count exceeds a threshold. We could have a periodic job query the count for both resources and write it to a ConfigMap, so that the current object count can be shared across the different instances/(admission, background, report) controllers.
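A minimal sketch of the threshold check at the heart of such a breaker; the names, threshold, and ConfigMap mechanics below are illustrative assumptions, not Kyverno's implementation:

```shell
# Hypothetical sketch of the proposed circuit breaker. A periodic job would
# count the temporary resources and publish the result; producers would
# consult the shared state before creating new objects.
THRESHOLD=10000   # example limit; a real value would be configurable

breaker_state() {
  # Prints "open" (stop creating resources) when the observed count
  # exceeds the threshold, "closed" otherwise.
  if [ "$1" -gt "$THRESHOLD" ]; then
    echo "open"
  else
    echo "closed"
  fi
}

# In a real cluster, the count and the shared state might look like:
#   COUNT=$(kubectl get ephemeralreports -A --no-headers | wc -l)
#   kubectl create configmap kyverno-breaker \
#     --from-literal=state="$(breaker_state "$COUNT")" \
#     --dry-run=client -o yaml | kubectl apply -f -
```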

Reported issues:

Tasks list:

Slack discussion

No response

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
@realshuting realshuting added the bug Something isn't working label May 24, 2024
@realshuting realshuting added this to the Kyverno Release 1.13.0 milestone May 24, 2024
@realshuting realshuting added enhancement New feature or request and removed bug Something isn't working labels May 24, 2024
@black-snow

This pretty much took down my cluster. We had ephemeralreports piling up in etcd (60k+, several gigs), overloading etcd and thus sprinkling failures all over the cluster.
I find ephemeralreports somewhat elusive - I can't find much documentation about them. How come they pile up like that? Should they be cleaned up by some cronjob as well? We ran a couple thousand pods simultaneously, so I get that they might spike, but a week later they still sit there in etcd.

I had to delete thousands of them manually.

@thesuperzapper
Contributor

This is related to the issue where ClusterPolicies with lots of generate rules create many times more UpdateRequest resources than expected:

That issue must be fixed first, as otherwise, anyone with lots of generate rules will hit any reasonable limits almost immediately.

@JimBugwadia
Member

@black-snow - are you seeing any errors or issues in the reports-controller?

For ephemeralreports, this situation occurs when the admission-controller or background-controller produce ephemeralreports faster than they can be consumed by the reports-controller, or the reports-controller stops processing for any reason.

v1.12.3 should help prevent this, and we are working on an improvement to add circuit breakers to the producers.

However, we still need to characterize why the reports-controller stops processing ephemeralreports. Hence, any logs or additional data will help.

@black-snow

black-snow commented Jun 3, 2024

Sadly, I didn't see anything suspicious on that end. And by now, magically, everything's fine again. I see ephemeral reports popping up but usually they disappear again after about 20 seconds.
It looks like they were generated so quickly that they piled up to Everest size and I guess that any request about reports to my tiny control plane then just timed out, so that they couldn't be cleaned up again. I deleted them down to like 11k reports the other day, then didn't look at it over the weekend and now it's smooth as butter again.

@JimBugwadia

@jemag
Contributor

jemag commented Jun 10, 2024

@realshuting are changes being made beside the circuit breakers/cleanup jobs to improve the handling of ephemeralreports?

In our case, we've had several issues with them in 1.12.2, and some with 1.12.3


1.12.2

In 1.12.2, policies matching resources with a high rate of admission requests would generate a ton of ephemeralreports. In our case, we were using the following 2 policies for ArgoCD applications, which would trigger at a high rate:

https://raw.githubusercontent.com/kyverno/policies/release-1.11/argo/application-field-validation/application-field-validation.yaml               
https://raw.githubusercontent.com/kyverno/policies/release-1.11/argo/application-prevent-default-project/application-prevent-default-project.yaml 

(we can reduce that rate, just want to provide this as example of high rate admission requests leading to the problem)
Interestingly, we can see many simultaneous evaluations of the same resource (not sure if this could be a path for optimization).

Initially everything would work fine; after a couple of hours, however, ephemeralreports would start accumulating and not getting deleted.

This was despite Kyverno's pods not seeing much resource usage.

From that point on, ephemeralreports would just grow continuously, sometimes as fast as 200 reports per 30 seconds. They would grow indefinitely until the cluster was impacted.

Important: one thing to note is that even after we removed the 2 ArgoCD Application policies, the number of ephemeralreports stayed the same over time (new reports would get created and deleted just fine, but the old/stuck reports would remain stuck).

Logs were normal during all of this, except for the admission controllers reporting a high number of failed to create report, ephemeralreports.reports.kyverno.io \"0523a111-e7f0-4088-8695-25eff8dd12d5\" already exists errors.

With 1.12.2, we reproduced the same problem detailed above in 3 different clusters, all leading to eventual degradation of the cluster and unbounded accumulation of ephemeralreports.

1.12.3

In 1.12.3, we haven't been able to reproduce the problem yet. That said, clusters that were already overloaded with a high number of ephemeralreports do not seem to be helped by the current cleanup strategy, e.g.:

      COUNT=$(kubectl get ephemeralreports.reports.kyverno.io -A | wc -l)
      if [ "$COUNT" -gt 10000 ]; then
        echo "too many ephemeralreports found ($COUNT), cleaning up..."
        kubectl delete ephemeralreports.reports.kyverno.io -A --all
      fi
As this will simply time out, the cleanup-ephemeral-reports job will keep crashing indefinitely. Perhaps it would be best to consider an alternative strategy, maybe something with parallel:

kubectl get ephemeralreport.reports.kyverno.io -n argocd -o=name | \
parallel -j 20 'kubectl delete --ignore-not-found=true {}'
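A hedged variant of the same bounded-parallelism idea using xargs (present on most systems, unlike GNU parallel); the kubectl invocation below is only illustrative:

```shell
# Run a command template once per input line, at most 20 at a time;
# "{}" in the template is replaced by the line. (Sketch only: the template
# is passed to sh -c, so it should come from trusted input.)
delete_each() {
  xargs -P 20 -I {} sh -c "$1"
}

# Real usage would be something like:
#   kubectl get ephemeralreport.reports.kyverno.io -n argocd -o=name | \
#     delete_each 'kubectl delete --ignore-not-found=true {}'
```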

Conclusion

Given the failed to create report errors from the admission controller, and the fact that ephemeralreports were not getting deleted even though their growth stopped once the policies were removed, I am a bit concerned that the cleanup job merely treats the symptoms rather than the root cause.

Would love to hear if anything more was done or is planned.

@JimBugwadia
Member

@jemag - can you please share the complete logs for the reports controller?

To answer your question, yes circuit breakers are an improvement on the cleanup jobs but not really addressing the root cause of the issues.

Something @eddycharly recently found, and is in the process of fixing, is that the reports-controller can stop processing requests when the lease is disrupted, or when the api-server is non-responsive due to high loads, and does not gracefully recover. Hence, the logs from your reports-controller will help.

@jemag
Contributor

jemag commented Jun 10, 2024

Thank you for the added information @JimBugwadia
Here are the reports-controller logs:
kyverno-reports-controller.json
Feel free to ignore the trivy reports that cannot be deleted since these are another issue being tracked by #10335

I can also add the logs of one of the admission-controller, in case there is interest in the failed to create report error:
kyverno-admission1.zip

@JimBugwadia
Member

Thanks @jemag!

Yes, this is the issue that @eddycharly is chasing down. Here are the relevant lines in the log:

{"level":"error","ts":"2024-05-29T06:23:56Z","logger":"klog","caller":"leaderelection/leaderelection.go:332","msg":"error retrieving resource lock kyverno/kyverno-reports-controller: Get \"https://192.168.0.1:443/apis/coordination.k8s.io/v1/namespaces/kyverno/leases/kyverno-reports-controller\": context deadline exceeded"}
{"level":"info","ts":"2024-05-29T06:23:56Z","logger":"klog","caller":"leaderelection/leaderelection.go:285","msg":"failed to renew lease kyverno/kyverno-reports-controller: timed out waiting for the condition"}
{"level":"info","ts":"2024-05-29T06:23:56Z","logger":"setup.leader-election","caller":"leaderelection/leaderelection.go:90","msg":"leadership lost, stopped leading","id":"kyverno-reports-controller-5785c58ddd-7ztgx"}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller","caller":"controller/run.go:88","msg":"waiting for workers to terminate ..."}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"background-scan-controller","caller":"controller/run.go:88","msg":"waiting for workers to terminate ..."}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"resource-report-controller","caller":"controller/run.go:88","msg":"waiting for workers to terminate ..."}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller","caller":"controller/run.go:88","msg":"waiting for workers to terminate ..."}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":2}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":5}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":8}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":4}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":9}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":6}
{"level":"debug","ts":"2024-05-29T06:23:56Z","logger":"aggregate-report-controller.worker","caller":"controller/run.go:75","msg":"worker stopped","id":3}

The controller stops its worker pool, but does not exit. This creates the issue where ephemeralreports are no longer processed.

We will plan a patch with these fixes.

@JimBugwadia
Member

The relevant PRs are:

@JimBugwadia
Member

JimBugwadia commented Jun 11, 2024

@jemag - there is a 1.12.4-rc1 release available with a fix that addresses the issue you are seeing.

Is it possible for you to test this in your environment?

@knechtionscoding

knechtionscoding commented Jun 12, 2024

I'm seeing a similar issue in that ephemeral reports are being created in massive numbers, rapidly. Hit apiserver_storage_objects{resource="ephemeralreports.reports.kyverno.io"} 1.141405e+06 in one cluster.

I've tried the 1.12.4-rc1 and it isn't resolving it. I'm still seeing the numbers increase quite rapidly and reports controller is crashlooping.

Logs from Reports Controller
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.engine","caller":"internal/engine.go:45","msg":"setup engine..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v2alpha1.GlobalContextEntry from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup","caller":"internal/controller.go:32","msg":"starting controller","name":"kyverno-events","workers":3}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"EventGenerator","caller":"event/controller.go:101","msg":"start"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"leaderelection/leaderelection.go:250","msg":"attempting to acquire leader lease kyverno/kyverno-reports-controller..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup","caller":"internal/controller.go:32","msg":"starting controller","name":"global-context","workers":1}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:28Z","logger":"global-context","caller":"controller/run.go:58","msg":"starting ..."}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:28Z","logger":"global-context.worker","caller":"controller/run.go:71","msg":"starting worker","id":0}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"leaderelection/leaderelection.go:260","msg":"successfully acquired lease kyverno/kyverno-reports-controller"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.leader-election","caller":"leaderelection/leaderelection.go:97","msg":"still leading","id":"kyverno-reports-controller-5578558d9b-5wg8k"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:28Z","logger":"setup.leader-election","caller":"leaderelection/leaderelection.go:83","msg":"started leading","id":"kyverno-reports-controller-5578558d9b-5wg8k"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v2beta1.PolicyException from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.Policy from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.ClusterPolicy from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.Namespace from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:28Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"info","ts":"2024-06-12T10:59:41Z","logger":"klog","caller":"trace/trace.go:236","msg":"Trace[1689897412]: \"Reflector ListAndWatch\" name:k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229 (12-Jun-2024 10:59:28.534) (total time: 13088ms):\nTrace[1689897412]: ---\"Objects listed\" error:<nil> 12598ms (10:59:41.132)\nTrace[1689897412]: [13.08835733s] [13.08835733s] END"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"Level(-2)","ts":"2024-06-12T10:59:41Z","logger":"klog","caller":"cache/reflector.go:351","msg":"Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:293","msg":"kind is not supported","gvk":"events.k8s.io/v1, Kind=Event"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:186","msg":"start watcher ...","gvr":"/v1, Resource=persistentvolumes","gvk":"/v1, Kind=PersistentVolume","resourceVersion":"1965490989"}
kyverno-reports-controller-5578558d9b-5wg8k controller {"level":"debug","ts":"2024-06-12T10:59:42Z","logger":"resource-report-controller","caller":"resource/controller.go:188","msg":"creating watcher...","gvr":"/v1, Resource=persistentvolumes","gvk":"/v1, Kind=PersistentVolume","resourceVersion":"1965490989"}

Testing the new reports controller in a different cluster at least appears to keep etcd from filling up anymore.

@jemag
Contributor

jemag commented Jun 12, 2024

@JimBugwadia will do

@knechtionscoding as per my comment above, the current cleanup strategy likely won't work for such a large number of ephemeralreports. You will probably need to clean these up manually before things can get back to normal.

@realshuting
Member Author

realshuting commented Jun 14, 2024

For those who encountered the ephemeralreports pile-up issue, check whether your application updates its deployment frequently.

You can set a watch on (cluster)ephemeralreport and grep the labels for the target resource:

kubectl get ephemeralreport -w -o yaml | grep labels -A5

If you see multiple ephemeralreports created for the same uid (label audit.kyverno.io/resource.uid), then you have reporting enabled for admission events (this is on by default).
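To make duplicates easier to spot, the label values can be piped through a small counting helper (a sketch; the jsonpath expression and label key would need to match your version):

```shell
# Count duplicate lines on stdin and print "count value" for any value
# seen more than once.
count_dupes() {
  sort | uniq -c | awk '$1 > 1 {print $1, $2}'
}

# Illustrative usage against a cluster:
#   kubectl get ephemeralreports -A \
#     -o jsonpath='{range .items[*]}{.metadata.labels.audit\.kyverno\.io/resource\.uid}{"\n"}{end}' \
#     | count_dupes
```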

To temporarily "fix" the issue, you can turn admission reports off. The policy reports will still be generated through the background scan process and refreshed upon the next reconciliation (1h by default).

There are several ways to turn admission reports off:

  1. globally through Helm:

https://github.com/kyverno/kyverno/blob/main/charts/kyverno/values.yaml#L617

  2. configure at the policy level by setting cpol.spec.admission=false:
✗ k explain cpol.spec.admission
KIND:     ClusterPolicy
VERSION:  kyverno.io/v1

FIELD:    admission <boolean>

DESCRIPTION:
     Admission controls if rules are applied during admission. Optional. Default
     value is "true".
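For reference, the same setting expressed in a policy manifest (a sketch with an illustrative policy name; the rules are elided):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: application-field-validation   # illustrative
spec:
  admission: false     # no reports from admission events for this policy
  background: true     # background scans still refresh the policy reports
  rules: []            # policy rules elided
```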

@jemag
Contributor

jemag commented Jun 14, 2024

@JimBugwadia haven't seen the error happen again in the RC so far. Can we expect an official 1.12.4 version coming out soon?

@JimBugwadia
Member

Hi @jemag yes we plan to release the GA version soon, unless any other issues come up.

@FernandoMiguel

I'm seeing small creeping again.
Not as bad as before, but it's definitely there.

@realshuting
Member Author

I'm seeing small creeping again. Not as bad as before, but it's definitely there.

Hi @FernandoMiguel - have you checked my comment above? Do you have reports enabled for admission events?

@FernandoMiguel

FernandoMiguel commented Jun 18, 2024

@realshuting we have the default upstream config.
I'll temporarily disable reports as recommended and see how it performs.
What is the expected permanent fix?

@FernandoMiguel

3h window

In the last 30 min, we can still see it increasing, with ephemeralreports disabled.

@realshuting
Member Author

@FernandoMiguel - does the total number of ephemeralreports grow? From the last 30 min graph, I can see there were more UPDATE events on the Deployment, but I can't tell the count of ephemeralreports from reading the graph.

If the total counts grow, can you tune the report controller's flag --aggregationWorkers=20 (default is 10) to increase the consuming capacity and see if it helps?

@realshuting
Member Author

@FernandoMiguel - do you have rough numbers for the cluster size? i.e., the number of nodes, pods (pod controllers), and policies.

@FernandoMiguel

@realshuting these metrics are from a test cluster, with not much traffic or node change.
Should be 3 MNG nodes and 3 Karpenter nodes.
Maybe 40 pods.

@FernandoMiguel

If the total counts grow, can you tune the report controller's flag --aggregationWorkers=20 (default is 10) to increase the consuming capacity and see if it helps?

@realshuting is this it? https://github.com/kyverno/kyverno/blob/e64df59df/charts/kyverno/values.yaml#L632-L636
The default seems to be 2... I couldn't find any other aggregation or workers value to change.

@realshuting
Member Author

If the total counts grow, can you tune the report controller's flag --aggregationWorkers=20 (default is 10) to increase the consuming capacity and see if it helps?

@realshuting is this it? https://github.com/kyverno/kyverno/blob/e64df59df/charts/kyverno/values.yaml#L632-L636 The default seems to be 2... I couldn't find any other aggregation or workers value to change.

It's a container flag, see here:

Configures the number of internal worker threads used to perform reports aggregation (1.12.3+).

You can configure via extraArgs.
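A hedged sketch of the Helm override (shown as a list; verify the reportsController extraArgs key layout against your chart version):

```yaml
reportsController:
  extraArgs:
    - --aggregationWorkers=20
```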

@ustuehler

ustuehler commented Jun 19, 2024

If the total counts grow, can you tune the report controller's flag --aggregationWorkers=20 (default is 10) to increase the consuming capacity and see if it helps?

Increasing the number of workers helped. The spikes in the graph below are EphemeralReports created for Argo CD Applications every hour when the background scan occurred, in addition to a steady flood of about 2.5K/s admission webhook calls. Once we passed --aggregationWorkers=20 (via extraArgs), there were no more spikes.


However, Kubernetes API resource usage is still significantly elevated. (Edit: Fixed by Reconcile Optimization in Argo CD.)


@realshuting
Member Author

realshuting commented Jun 19, 2024

Thanks for confirming, @ustuehler!

However, Kubernetes API resource usage is still significantly elevated.

We have introduced the reports server in 1.12 as an alternative storage backend; you can check the blog post for details:

https://kyverno.io/blog/2024/05/29/kyverno-reports-server-the-ultimate-solution-to-scale-reporting/

There's a CNCF livestream this Wednesday (Jun 19, 9:00 – 10:00 AM PT) for this topic:

https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cloud-native-live-kyvernos-report-server-a-new-approach-to-policy-report-management/

@realshuting
Member Author

Updates -

If you are still seeing constant creation of ephemeralreports after upgrading to 1.12.4, you can:

  1. disable reporting for admission events, please see comment here
  2. tune --aggregationWorkers to increase the capacity of consuming ephemeralreports, see comment here. It can be configured directly via the container flag, or through Helm extraArgs.

@realshuting
Member Author

#10595 is scheduled for 1.12.6.

2 other tasks are completed, closing.
