
[Bug] Webhook Controller Endless Loop #6539

Closed
2 tasks done
dhemeier opened this issue Mar 10, 2023 · 43 comments · Fixed by #6579
Labels
bug (Something isn't working) · Documentation (Update Documentation) · end user (This label is used to track the issue that is raised by the end user) · webhook

Comments

@dhemeier

Kyverno Version

1.9.0

Kubernetes Version

1.25.x

Kubernetes Platform

AKS

Kyverno Rule Type

Other

Description

We currently have the problem that the webhook controller recreates all mutating and validating webhooks in an endless loop. We have Kyverno 1.9.1 installed (we also tested 1.9.0, with the same results).

It looks like the same problem as several earlier issues, which are all closed. After investigating further and raising the log level to --v=4, we have seen that the webhook-controller is the misbehaving part. Interestingly, the logs contain no errors about updating or creating the webhooks, so if I understand the functionality correctly, the controller should stop updating a webhook after the first update.

As an example, I have filtered the logs for webhook-controller events:

kubectl logs -n kyverno aks-infra-kyverno-7f8bd6764b-6rgkt -f | grep webhook-controller
Defaulted container "kyverno" out of: kyverno, kyverno-pre (init)
I0310 15:34:04.892045       1 controller.go:28] setup/leader/controllers "msg"="starting controller" "name"="webhook-controller" "workers"=2
I0310 15:34:04.892061       1 run.go:19] webhook-controller "msg"="starting ..."
I0310 15:34:04.892089       1 run.go:39] webhook-controller/routine "msg"="starting routine" "id"=0
I0310 15:34:04.892477       1 run.go:30] webhook-controller/worker "msg"="starting worker" "id"=0
I0310 15:34:04.892512       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="kyverno-policy-mutating-webhook-cfg" "name"="kyverno-policy-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.892696       1 run.go:30] webhook-controller/worker "msg"="starting worker" "id"=1
I0310 15:34:04.892727       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-resource-mutating-webhook-cfg" "name"="kyverno-resource-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.912706       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="23.402µs" "id"=1 "key"="kyverno-resource-mutating-webhook-cfg" "name"="kyverno-resource-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.912737       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-verify-mutating-webhook-cfg" "name"="kyverno-verify-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.914684       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="35.003µs" "id"=0 "key"="kyverno-policy-mutating-webhook-cfg" "name"="kyverno-policy-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.914711       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-node-mutating-webhook" "name"="aks-node-mutating-webhook" "namespace"=""
I0310 15:34:04.914725       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="13.801µs" "id"=0 "key"="aks-node-mutating-webhook" "name"="aks-node-mutating-webhook" "namespace"=""
I0310 15:34:04.914742       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-webhook-admission-controller" "name"="aks-webhook-admission-controller" "namespace"=""
I0310 15:34:04.914761       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="15.902µs" "id"=0 "key"="aks-webhook-admission-controller" "name"="aks-webhook-admission-controller" "namespace"=""
I0310 15:34:04.914781       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="cert-manager-webhook" "name"="cert-manager-webhook" "namespace"=""
I0310 15:34:04.914806       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="21.701µs" "id"=0 "key"="cert-manager-webhook" "name"="cert-manager-webhook" "namespace"=""
I0310 15:34:04.914826       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="ingress-nginx-admission" "name"="ingress-nginx-admission" "namespace"=""
I0310 15:34:04.914850       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="22.402µs" "id"=0 "key"="ingress-nginx-admission" "name"="ingress-nginx-admission" "namespace"=""
I0310 15:34:04.914871       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="ingress-nginx-public-admission" "name"="ingress-nginx-public-admission" "namespace"=""
I0310 15:34:04.914889       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="22.802µs" "id"=0 "key"="ingress-nginx-public-admission" "name"="ingress-nginx-public-admission" "namespace"=""
I0310 15:34:04.914904       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="kyverno-policy-validating-webhook-cfg" "name"="kyverno-policy-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.925285       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="16.101µs" "id"=1 "key"="kyverno-verify-mutating-webhook-cfg" "name"="kyverno-verify-mutating-webhook-cfg" "namespace"=""
I0310 15:34:04.925315       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=1 "key"="kyverno-resource-validating-webhook-cfg" "name"="kyverno-resource-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.931332       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="17.502µs" "id"=0 "key"="kyverno-policy-validating-webhook-cfg" "name"="kyverno-policy-validating-webhook-cfg" "namespace"=""
I0310 15:34:04.931373       1 run.go:94] webhook-controller/worker "msg"="reconciling ..." "id"=0 "key"="aks-node-validating-webhook" "name"="aks-node-validating-webhook" "namespace"=""
I0310 15:34:04.931396       1 run.go:96] webhook-controller/worker "msg"="done" "duration"="24.202µs" "id"=0 
[ . . . ]

As a follow-up symptom, we got client-side throttling messages for requests to the API server:

I0310 15:37:58.508759       1 request.go:614] Waited for 73.498116ms due to client-side throttling, not priority and fairness, request: PUT:https://192.168.0.1:443/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/kyverno-policy-mutating-webhook-cfg

To gauge the number of updates performed by the webhook-controller, we watched the current version of the webhook. After around 5 minutes, it had already reached 2846 updates:

$ kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg
NAME                                     WEBHOOKS   AGE
kyverno-cleanup-validating-webhook-cfg   1          5m4s

$ kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  creationTimestamp: "2023-03-10T14:49:19Z"
  generation: 2846
  labels:
    webhook.kyverno.io/managed-by: kyverno
  name: kyverno-cleanup-validating-webhook-cfg
  resourceVersion: "245823835"
  uid: 06eb8596-4d13-4264-a651-c8fabef64cdd
[ . . . ]
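(A quick way to watch the generation counter climb in real time; a rough sketch using kubectl's custom-columns output:)

kubectl get validatingwebhookconfiguration kyverno-cleanup-validating-webhook-cfg -w \
  -o custom-columns=NAME:.metadata.name,GENERATION:.metadata.generation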

We also tried deleting all ClusterPolicies and Policies, but that did not change the endless-loop behavior.

As a last hint, we also see this behavior on other AKS clusters and with different Kubernetes versions (1.24 and 1.25).

Maybe someone has an idea of what we can check or try in order to restore the correct behavior.

Thanks in advance
Dennis

PS: The full logs can be found on pastebin: https://pastebin.com/MYpxAKe4

Steps to reproduce

  1. Install Kyverno 1.9.X in an AKS-managed Kubernetes cluster with 3 replicas (Helm chart version 2.7.1)
  2. Watch the webhooks being updated in an endless loop

Expected behavior

After the webhook configs have been created correctly the first time, the webhook-controller should stop updating them.

Screenshots

No response

Kyverno logs

No response

Slack discussion

No response

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
dhemeier added the bug (Something isn't working) and triage (Default label assigned to all new issues indicating label curation is needed to fully organize) labels on Mar 10, 2023
@welcome

welcome bot commented Mar 10, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@chipzoller
Member

Sounds like it may be networking related. Look at the troubleshooting section of the docs and see if that helps:

https://kyverno.io/docs/troubleshooting/

@eddycharly
Member

The controller is expected to run often because it's part of our health check system (IIRC every 10s).
But there should be no updates, as the webhook config should be up to date most of the time.

@eddycharly
Member

Can you post the full webhook config?

@dhemeier
Author

dhemeier commented Mar 10, 2023

Hey @chipzoller

thanks for the quick response. I already tried deleting the webhooks alone, deleting the complete Kyverno application, and also just scaling the Kyverno deployment. Always the same result: endless loops.

What exactly I have tried so far:

  1. Scale to 0
  2. Delete webhooks (kyverno-resource-validating-webhook-cfg, kyverno-resource-mutating-webhook-cfg)
  3. Scale to 3 (also tried the same procedure with 1 replica)

Second Try:

  1. Delete Deployment
  2. Delete all webhooks (to simulate a fresh install)
  3. Apply Deployment again

In general, networking should be fine, because the webhooks are created (with the correct settings) and are working, e.g. when testing an enforce policy and applying some "failing example pod definitions".

@eddycharly
Member

Maybe you could capture the webhook config before and after an update to see what changed in between?

@dhemeier
Author

@eddycharly sure:

kyverno-resource-validating-webhook-cfg

$ kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-validating-webhook-cfg -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  creationTimestamp: "2023-03-10T17:10:58Z"
  generation: 27
  labels:
    webhook.kyverno.io/managed-by: kyverno
  name: kyverno-resource-validating-webhook-cfg
  ownerReferences:
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    name: aks-infra-kyverno:webhook
    uid: 1d3bb4c9-ff4c-4f89-8bb6-73e3a82c0b0a
  resourceVersion: "246010314"
  uid: d10a0165-a435-4685-b9a5-1ec8ac9ee0e0
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: LS0tLS1CRUdXXXXXXXXXXXXXXX
    service:
      name: aks-infra-kyverno-svc
      namespace: kyverno
      path: /validate/fail
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: validate.kyverno.svc-fail
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kyverno
    - key: control-plane
      operator: DoesNotExist
  objectSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - configmaps
    scope: '*'
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - replicationcontrollers
    scope: '*'
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - services
    scope: '*'
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - pods
    - pods/ephemeralcontainers
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - daemonsets
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - deployments
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - replicasets
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - statefulsets
    scope: '*'
  - apiGroups:
    - batch
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - cronjobs
    scope: '*'
  - apiGroups:
    - batch
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - jobs
    scope: '*'
  - apiGroups:
    - networking.k8s.io
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    - DELETE
    - CONNECT
    resources:
    - ingresses
    scope: '*'
  sideEffects: NoneOnDryRun
  timeoutSeconds: 10

kyverno-resource-mutating-webhook-cfg

$ kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-mutating-webhook-cfg -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  creationTimestamp: "2023-03-10T17:10:58Z"
  generation: 469
  labels:
    webhook.kyverno.io/managed-by: kyverno
  name: kyverno-resource-mutating-webhook-cfg
  ownerReferences:
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    name: aks-infra-kyverno:webhook
    uid: 1d3bb4c9-ff4c-4f89-8bb6-73e3a82c0b0a
  resourceVersion: "246016947"
  uid: 041622ad-ef2a-4e2d-b60a-ab632cc5b7d7
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUXXXXXXXX
    service:
      name: aks-infra-kyverno-svc
      namespace: kyverno
      path: /mutate/fail
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: mutate.kyverno.svc-fail
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kyverno
    - key: control-plane
      operator: DoesNotExist
  objectSelector: {}
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - replicationcontrollers
    scope: '*'
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
    - pods/ephemeralcontainers
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - daemonsets
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - deployments
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - replicasets
    scope: '*'
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - statefulsets
    scope: '*'
  - apiGroups:
    - batch
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - cronjobs
    scope: '*'
  - apiGroups:
    - batch
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - jobs
    scope: '*'
  sideEffects: NoneOnDryRun
  timeoutSeconds: 10

@eddycharly
Member

generation: 27 looks like it didn't change that much

@eddycharly
Member

The controller running often is expected; as long as it doesn't update the webhook configs when nothing has changed, it shouldn't be an issue.

@dhemeier
Author

I had deployed it again for the export (it was a fresh deployment). But the diff also shows no unusual changes, as you mentioned:

Diff of the kyverno-resource-mutating-webhook-cfg (exported with a 5-second interval):

diff test1 test3
5c5
<   generation: 1119
---
>   generation: 1155
14c14
<   resourceVersion: "246027113"
---
>   resourceVersion: "246027696"

Same for the kyverno-resource-validating-webhook-cfg (exported with a 5-second interval):

<   generation: 1157
---
>   generation: 1177
14c14
<   resourceVersion: "246038780"
---
>   resourceVersion: "246039047"

@eddycharly
Member

If you run k get validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-validating-webhook-cfg -o yaml -w, do you see it changing continuously?

@dhemeier
Author

If the watch command works the same as for other resources, then to me it means the whole webhook changes completely every time (I got the complete output of the kyverno-resource-validating-webhook-cfg as a result).

Interestingly, the interval at which the updates happen is "faster" than 1 second...
If I remember correctly, I have seen somewhere in the source code that the controller works on a 1-second interval. So maybe all 3 replicas are updating at the same time?!? :-(

@eddycharly
Member

So maybe all 3 replicas updating at the same time?!? :-(

Only the leader updates the webhook configs.

If you capture the webhook configs across two sequential changes, is there something interesting in the diff? I mean the diff between one version and the one that comes immediately after.

@eddycharly
Member

This could be multiple controllers fighting each other, but I don't see why another controller would try to reconcile those webhook configs...

@dhemeier
Author

@eddycharly the hint to capture two sequential changes in a row helped. Now we can see that the namespaceSelector is changing:

Diff Output:

diff test1 test2
5c5
<   generation: 5014
---
>   generation: 5013
14c14
<   resourceVersion: "246100802"
---
>   resourceVersion: "246100800"
35,36d34
<     - key: control-plane
<       operator: DoesNotExist

Relevant part of the webhook:

  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kyverno
    - key: control-plane
      operator: DoesNotExist

VS

  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kyverno

@eddycharly
Member

Yeah, that's what I was suspecting. Someone is constantly trying to add:

    - key: control-plane
      operator: DoesNotExist

Then kyverno tries to restore the desired state.

@dhemeier
Author

A quick Google search: AKS clusters have some kind of automatic protection that injects that selector into webhooks... I will try to dig deeper into it.

@eddycharly
Member

aks clusters have some kind of auto protection to remove that selector from webhooks

Where is this selector coming from in the first place?

@dhemeier
Author

@eddycharly that's a good question. I'm currently trying to find the source of the selector.

@eddycharly
Member

I think AKS tries to add the selector and Kyverno tries to remove it.

@dhemeier
Author

@eddycharly you are absolutely correct.

Inside AKS there is a default admission enforcer (https://learn.microsoft.com/en-us/azure/aks/faq#can-i-use-admission-controller-webhooks-on-aks) that adds the following selector to a webhook:

namespaceSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist

After that, Kyverno tries to delete that selector -> welcome, endless loop.

To solve the problem for managed AKS clusters, we have two possible options:

  1. Set the control-plane selector directly on creation (so that the Kyverno-created webhook matches what AKS expects)
  2. Disable the AKS admission enforcer by adding a label or annotation ("admissions.enforcer/disabled": true) to the created webhooks (https://learn.microsoft.com/en-us/azure/aks/faq#can-admission-controller-webhooks-impact-kube-system-and-internal-aks-namespaces)

Is it possible to change the deployed webhooks in some way, so that the endless loop can be prevented?

@eddycharly
Member

I think you can add it in the configmap.

@dhemeier
Author

Okay, found it (https://github.com/kyverno/kyverno/blob/main/charts/kyverno/values.yaml#L97)

I will try it and post an update on the issue.
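For reference, a minimal sketch of what I plan to set (assuming the chart's config.webhooks value takes a list of webhook overrides, as in the linked values.yaml):

config:
  webhooks:
    # Reproduce the selector the AKS admission enforcer injects, plus the default
    # kyverno namespace exclusion, so the desired state already matches what AKS rewrites it to.
    - namespaceSelector:
        matchExpressions:
          - key: kubernetes.io/metadata.name
            operator: NotIn
            values:
              - kyverno
          - key: control-plane
            operator: DoesNotExist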

@eddycharly
Member

@dhemeier note that this links to main; for 1.9, use the corresponding file from the 1.9 branch.

@dhemeier
Author

Okay, we have now solved part of the problem. The kyverno-resource-validating-webhook-cfg and the kyverno-resource-mutating-webhook-cfg are now working correctly and no longer show the endless-loop problem.

But the other "internal" webhooks created by Kyverno (kyverno-policy-mutating-webhook-cfg, kyverno-policy-validating-webhook-cfg and kyverno-verify-mutating-webhook-cfg) still have the same endless-loop problem.

I tried to get the differences between two sequential changes again, e.g. for the kyverno-verify-mutating-webhook-cfg:

Kyverno sets the following selector:

  namespaceSelector: {}
  objectSelector:
    matchLabels:
      app.kubernetes.io/name: kyverno

The AKS admission enforcer again adds the namespaceSelector:

  namespaceSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist
  objectSelector:
    matchLabels:
      app.kubernetes.io/name: kyverno

It looks like the configuration only applies to the kyverno-resource-validating-webhook-cfg and the kyverno-resource-mutating-webhook-cfg, and not to the "internal" webhooks.

Is this intended behavior of Kyverno, or maybe some kind of missing feature?

@eddycharly
Member

Okay, I'll try to see if we can extend this to the other webhooks.

I find this quite dangerous though: I just have to add the control-plane label to a namespace to bypass Kyverno?

@eddycharly
Member

I think you can add an annotation manually; Kyverno should preserve it.

@dhemeier
Author

Hey @eddycharly

thanks again for your response. The annotation is preserved by Kyverno, so that solution works. As you also mentioned, using the control-plane label to bypass Kyverno is quite dangerous.

To solve this, I also added the annotation "admissions.enforcer/disabled": true to all webhooks and removed the Kyverno Helm chart config.webhooks setting completely.

So for me personally the problem is completely solved, and we currently have no need to extend the selectors to the internal webhooks.

The only reason we could think of for letting users add namespaceSelectors, or annotations/labels, to the webhooks is that some users will want to deploy Kyverno completely via GitOps without manual changes to the webhooks.

For other users that hit the same problem: execute the following to fix the endless loops on managed Azure AKS:

kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-verify-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-validating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-validating-webhook-cfg "admissions.enforcer/disabled"=true
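Alternatively (an untested sketch), the webhook.kyverno.io/managed-by=kyverno label shown in the configs above should let you annotate every Kyverno-managed webhook config in one pass:

# Annotate all Kyverno-managed webhook configurations via their label selector.
for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do
  kubectl annotate "$kind" -l webhook.kyverno.io/managed-by=kyverno \
    admissions.enforcer/disabled=true --overwrite
done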

The issue can be closed if the other ideas are not currently relevant.

Again thank you very much for your support! Great work.

Have a great evening
Dennis

@dhemeier
Author

I checked some older installations of kyverno.

The problem has existed the whole time... at least for the last couple of months. But before Kyverno 1.9.X, the webhook resources were not displayed inside ArgoCD, so I didn't see the endless loops happening. And because "AKS simply manages the API server for us", we don't check metrics for requests to the API server.

You are right, that could be a great extension of the troubleshooting guide. If it's okay with you, I can create a merge request for this. But not today anymore; I can start on it on Monday.

@dhemeier
Author

Fun fact: I found an admission webhook at generation 45,678,532 :-D

@chipzoller
Member

What I'm asking is about your AKS environment. Are you doing anything that's out of the ordinary? It seems like there must be something unique here or else we would have learned of this long ago. Kyverno has managed webhooks dynamically now for several versions.

@dhemeier
Author

dhemeier commented Mar 10, 2023

We have nothing special in our clusters. Maybe the webhooks have only been visible since the last ArgoCD update (done yesterday).

I also found some other projects with similar problems/workarounds:

@eddycharly
Member

I agree that it would be nice to be able to configure annotations.

Maybe we could add that in the configmap.

Some webhook configurations could be part of our Helm chart and wouldn't need to be managed programmatically (only the resource-related ones need to be), but we need to configure the CA bundles (we don't want to rely on cert-manager).

@chipzoller
Member

@dhemeier can you raise an issue in kyverno/website with instructions that are needed to get past this for AKS users? We can add it to the Troubleshooting section for others.

chipzoller added the Documentation (Update Documentation), end user, and webhook labels and removed the triage label on Mar 13, 2023
@JoHaHu

JoHaHu commented Mar 15, 2023

I agree that it would be nice to be able to configure annotations.

Maybe we could add that in the configmap.

I agree! Having the option to set the annotations via the configmap would help my team manage everything with GitOps.

@eddycharly
Member

Will be fixed by #6579

@jemag
Contributor

jemag commented Mar 23, 2023

Any chance this might get released as a patch version (e.g. 1.9.3) before the current 1.10 milestone?


As a side note, in our case some annotations were missing from the commands provided above, and one of them was duplicated. We ended up using:

kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-verify-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate mutatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-mutating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-resource-validating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-policy-validating-webhook-cfg "admissions.enforcer/disabled"=true
# Missing annotations for us:
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-cleanup-validating-webhook-cfg "admissions.enforcer/disabled"=true
kubectl annotate validatingwebhookconfigurations.admissionregistration.k8s.io kyverno-exception-validating-webhook-cfg "admissions.enforcer/disabled"=true

@mkilchhofer
Contributor

mkilchhofer commented Apr 5, 2023

Glad I found this issue, as we at @swisspost also face it on our AKS clusters.

As @dhemeier posted in the comments above, it is also covered in the AKS FAQ: https://learn.microsoft.com/en-us/azure/aks/faq#can-admission-controller-webhooks-impact-kube-system-and-internal-aks-namespaces

Can admission controller webhooks impact kube-system and internal AKS namespaces?

To protect the stability of the system and prevent custom admission controllers from impacting internal services in the kube-system namespace, AKS has an Admissions Enforcer, which automatically excludes kube-system and AKS internal namespaces. This service ensures the custom admission controllers don't affect the services running in kube-system.

If you have a critical use case for deploying something on kube-system (not recommended) in support of your custom admission webhook, you may add the following label or annotation so that Admissions Enforcer ignores it.

Label: "admissions.enforcer/disabled": "true" or Annotation: "admissions.enforcer/disabled": true
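(As a sketch, the label variant from the FAQ would look like this; we have only tried the annotation form ourselves:)

kubectl label validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
  admissions.enforcer/disabled=true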

-> We would also appreciate a backported fix in 1.9.x

@MattiasPernhult

Any update on getting this backported to 1.9.x?

@maxwell-gregory

Also looking for this in 1.9.X.

Version 1.10.X does not look ready for production clusters.

What I'm asking is about your AKS environment. Are you doing anything that's out of the ordinary? It seems like there must be something unique here or else we would have learned of this long ago. Kyverno has managed webhooks dynamically now for several versions.

@chipzoller I can speak from our experience about how we found this so late. Kyverno continues to operate normally even while it's fighting over this selector. The selector is meant to prevent Kyverno from interacting with the kube-system namespace, but we already configure Kyverno to ignore kube-system. The AKS doc mentioned above explains this in more detail.

We only saw this because of the high number of API server requests coming from the webhook updates. Until your API server starts to get stressed by this, it is basically unnoticeable. I would think most users won't see it until it stresses the API server.
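For anyone wanting to confirm this, a rough way to see the update volume in the API server's own metrics (a sketch; assumes you can query the /metrics endpoint and that the standard apiserver_request_total metric is exposed):

kubectl get --raw /metrics | grep '^apiserver_request_total' | grep webhookconfigurations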

@chipzoller
Member

We have decided to backport this to 1.9.3, which should be released today or early next week.

@chipzoller
Member

1.9.3 is available with this backport: https://github.com/kyverno/kyverno/releases/tag/v1.9.3
