[Bug] Webhook Controller Endless Loop #6539
Thanks for opening your first issue here! Be sure to follow the issue template!
Sounds like it may be networking related. Look at the troubleshooting section of the docs and see if that helps.
The controller is expected to run often because it's part of our health check system (IIRC every 10s).
Can you post the full webhook config?
Hey @chipzoller thanks for the quick response. I already tried deleting only the webhooks, deleting the complete kyverno application, and also just scaling the kyverno deployment. Always the same result of endless loops. What exactly I have tried already:
Second try:
In general the networking should be fine, because the webhooks are created (with correct settings) and are working, e.g. testing an enforce policy and applying some failing example pod definitions.
Maybe you could capture the webhook config before and after an update to see what changed in between?
@eddycharly sure: kyverno-resource-validating-webhook-cfg
kyverno-resource-mutating-webhook-cfg
The controller running often is expected; as long as it doesn't update the webhook configs when nothing changed, it shouldn't be an issue.
I had deployed it again for the export (it was a fresh deployment). But as you mentioned, the diff also shows no special changes: Diff of the
Same for the
if you
If the watch command works the same as on other resources, then to me it means the whole webhook is replaced completely every time (I got the complete output of the
Interestingly, the interval at which the update happens is "faster" than 1 second...
Only the leader updates the webhook configs. If you capture the webhook configs in two sequential changes, is there something interesting in the diff? I mean the diff between one version and the one that comes immediately after.
This could be multiple controllers fighting each other, but I don't see why another controller would try to reconcile those webhook configs...
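Assuming kubectl access to the affected cluster, capturing two consecutive versions for such a diff can be sketched like this (resource name as used in this thread):

```shell
# Sketch: capture two consecutive versions of the webhook config and diff them.
# Assumes kubectl access to the affected cluster; with updates reportedly
# happening about every second, a short sleep is enough to catch two versions.
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg -o yaml > v1.yaml
sleep 2
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg -o yaml > v2.yaml
diff v1.yaml v2.yaml
```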
@eddycharly the hint to try to get two sequential changes in a row is helpful. Now we can see that the
Diff Output:
Relevant part of the webhook:
VS
Yeah, that's what I was suspecting. Someone is constantly trying to add:
Then kyverno tries to restore the desired state.
Quick Google search: AKS clusters have some kind of auto-protection that removes that selector from webhooks... I will try to dig deeper into it.
Where is this selector coming from in the first place?
@eddycharly that's a good question. I'm currently trying to find the source of the selector.
I think aks tries to add the selector and kyverno tries to remove it. |
@eddycharly you are absolutely correct. Inside AKS there is a default admission enforcer (https://learn.microsoft.com/en-us/azure/aks/faq#can-i-use-admission-controller-webhooks-on-aks) that adds the following selector to a webhook:
After that, kyverno tries to delete that selector -> welcome, endless loop. In order to solve the problem for managed AKS clusters, we have two possible options:
Is it possible to change the deployed webhooks in some way, so that the endless loop can be prevented?
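For context, the selector the AKS admission enforcer injects appears to be a namespaceSelector that keeps webhooks away from control-plane namespaces such as kube-system. A hedged reconstruction based on the Microsoft FAQ linked above (exact keys may differ by AKS version):

```yaml
# Hypothetical reconstruction of the selector injected by AKS's admission
# enforcer (see the Microsoft FAQ linked above); exact keys may differ.
namespaceSelector:
  matchExpressions:
    - key: control-plane
      operator: DoesNotExist
```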
I think you can add it in the configmap. |
Okay, found it (https://github.com/kyverno/kyverno/blob/main/charts/kyverno/values.yaml#L97). I will try it and post an update on the issue.
@dhemeier this is kyverno/charts/kyverno/values.yaml Line 361 in df5774f
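For anyone following along, a hedged sketch of what such a webhook override in the chart's values.yaml can look like; the key names and structure vary by chart version, so the line referenced above should be treated as the authoritative source:

```yaml
# Hypothetical values.yaml fragment: overriding the webhook namespaceSelector
# via the Kyverno config. Key names vary by chart version; verify against the
# values.yaml line referenced in this thread.
config:
  webhooks:
    - namespaceSelector:
        matchExpressions:
          - key: control-plane
            operator: DoesNotExist
```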
Okay, we have now solved a part of the problem: the
But the other "internal" webhooks that are created by kyverno (
I tried to get the differences between two sequential changes again, i.e. for the
Kyverno sets the following selector:
AKS admission again adds the
It looks like the configuration only applies to the
Intended behavior of kyverno, or maybe some kind of missing feature?
Okay, I'll try to see if we can extend this to other webhooks. I find this quite dangerous though; I just have to add the
I think you can add an annotation manually; kyverno should preserve it.
Hey @eddycharly thanks again for your response. The
In order to solve this problem, I also added the annotation
So for me personally, the problem is completely solved, and we currently have no need to extend the selectors to the internal webhooks. The only reason we could think of would be enabling users to add
For other users that have the same problem: execute the following to fix the endless loops on managed Azure AKS:
The issue can be closed if the other ideas are not currently relevant. Again, thank you very much for your support! Great work. Have a great evening.
I checked some older installations of kyverno. The problem has existed all the time... at least for the last couple of months. But before Kyverno 1.9.x, the webhook resources were not displayed inside ArgoCD, so I didn't see the endless loops happening. And because "aks simply manages the api server for us", we don't check metrics for the requests to the api server. You are right, that could be a great extension of the troubleshooting guide. If it's okay for you, I can create a merge request for this, but not today anymore; I can start on it on Monday.
Fun fact: found an admission webhook at generation 45,678,532 :-D
What I'm asking is about your AKS environment. Are you doing anything that's out of the ordinary? It seems like there must be something unique here or else we would have learned of this long ago. Kyverno has managed webhooks dynamically now for several versions. |
We have nothing special in our clusters. Maybe the webhooks are only visible since the last ArgoCD update (done yesterday). I also found some other projects with similar problems / workarounds:
I agree that it would be nice to be able to configure annotations. Maybe we could add that in the configmap. Some webhook configurations could be part of our helm chart and don't need to be managed programmatically (only the resource-related ones do), but we need to configure the CA bundles (we don't want to rely on cert-manager).
@dhemeier can you raise an issue in kyverno/website with instructions that are needed to get past this for AKS users? We can add it to the Troubleshooting section for others. |
I agree! Having the option to set the annotations with the configmap would help my team to manage everything with GitOps. |
Will be fixed by #6579 |
Any chance this might get released as a patch version (e.g. 1.9.3) before the current 1.10 milestone? As a side note, in our case some annotations were missing from the commands provided above, and one of them was duplicated; we ended up using:
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-verify-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-validating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-validating-webhook-cfg "admissions.enforcer/disabled"=true
# Missing annotations for us:
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-exception-validating-webhook-cfg "admissions.enforcer/disabled"=true
Glad I found this issue as we at @swisspost also face this issue on our AKS clusters. Like @dhemeier posted in some comments above, I also found it inside the FAQ of AKS: https://learn.microsoft.com/en-us/azure/aks/faq#can-admission-controller-webhooks-impact-kube-system-and-internal-aks-namespaces
-> We would also appreciate a backported fix in 1.9.x |
Any update on getting this backported to 1.9.x? |
Also looking for this in 1.9.x. Version 1.10.x does not look ready for production clusters.
@chipzoller I can speak from our experience of how we found this so late. Kyverno continues to operate normally even while it's fighting over this selector. The selector is there to prevent Kyverno from interacting with the
We only saw this because of the high amount of API server requests coming from the webhooks. Until our API server started to get stressed by this, it was basically unnoticeable. I would think most won't see it until it stresses the API server.
We have decided to backport this to 1.9.3 which should be released today or early next week. |
1.9.3 is available with this backport: https://github.com/kyverno/kyverno/releases/tag/v1.9.3 |
Kyverno Version
1.9.0
Kubernetes Version
1.25.x
Kubernetes Platform
AKS
Kyverno Rule Type
Other
Description
Currently we have the problem that the webhook controller recreates all mutating and validating webhooks in an endless loop. We currently have kyverno installed in version 1.9.1 (also tested 1.9.0 with the same results).
It looks like the same problem as the following issues:
which are all closed. After further investigation and changing the log level to
--v=4
we have seen that the webhook-controller is the misbehaving part. Interestingly, there are no errors inside the logs about updating or creating the webhook, so normally, after the first update, the controller should stop updating the webhook if I understand the functionality correctly. As an example, I have filtered the logs for webhook-controller events:
As a follow-up, we got throttling messages from the API server:
To get the number of updates performed by the webhook-controller, we watched the current version of the webhook. After around 5 minutes, we had already counted 2846 updates:
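A hedged sketch of one way to watch such update counts, assuming kubectl access to the cluster; an object's `metadata.generation` increments on every change, so a rapidly growing number reflects the loop:

```shell
# Sketch: watch the webhook config and print its generation on each update.
# metadata.generation increments whenever the object changes, so a rapidly
# growing number reflects the endless update loop described above.
kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  -w -o custom-columns=NAME:.metadata.name,GENERATION:.metadata.generation
```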
We also tested deleting all ClusterPolicies and Policies, but that changed nothing about the endless loop behavior. As a last hint, we also see this behavior with other AKS clusters and different Kubernetes versions (1.24 and 1.25).
Maybe someone has an idea what we can check or try in order to get the correct behavior back.
Thanks in advance
Dennis
PS: The full logs can be found on pastebin: https://pastebin.com/MYpxAKe4
Steps to reproduce
Expected behavior
After the first correct creation of the webhook configs, the webhook-controller should stop updating the webhooks.
Screenshots
No response
Kyverno logs
No response
Slack discussion
No response
Troubleshooting