
[BUG] Kyverno slowness makes K9s time out #918

Closed
sgandon opened this issue Jun 5, 2020 · 47 comments
Labels: bug (Something isn't working)

Comments

@sgandon (Contributor) commented Jun 5, 2020

Describe the bug
I tracked down this bug for a while and it happens on v1.1.5. I know you have been working hard since and I have not had the time to test the latest versions, but I will.
I used the Helm chart to install Kyverno.
I also installed 3 policies.
What I observed after a while is that K9s (a CLI to manage clusters) was failing to start because of some timeout.
I tried to debug it, and it turns out one POST request was taking 2 to 3 seconds:

I0605 00:01:55.146161   40327 request.go:1068] Request Body: {"kind":"SelfSubjectAccessReview","apiVersion":"authorization.k8s.io/v1","metadata":{"creationTimestamp":null},"spec":{"resourceAttributes":{"verb":"list","resource":"nodes"}},"status":{"allowed":false}}
I0605 00:01:55.146190   40327 round_trippers.go:420] POST https://localhost:6443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews
I0605 00:01:55.146194   40327 round_trippers.go:427] Request Headers:
I0605 00:01:55.146198   40327 round_trippers.go:431]     User-Agent: k9s/v0.0.0 (darwin/amd64) kubernetes/$Format
I0605 00:01:55.146201   40327 round_trippers.go:431]     Accept: application/json, */*
I0605 00:01:55.146203   40327 round_trippers.go:431]     Content-Type: application/json
I0605 00:01:55.146210   40327 round_trippers.go:431]     Authorization: Basic <masked>
I0605 00:01:58.149696   40327 round_trippers.go:446] Response Status: 201 Created in 3003 milliseconds

And this made the tool time out.
After uninstalling Kyverno, the same POST takes 4 milliseconds.

To Reproduce
Steps to reproduce the behavior:

  1. install Kyverno with the chart from commit 2bf51494
  2. install some policies (I don't know if it is related to my policies, I can provide them for reproduction if required)
  3. wait for a while (1 hour I'd say)
  4. try to launch K9s

Expected behavior
K9s should start fine.

Additional context
Here are the K9s (k8s client) logs.

sgandon added the bug label Jun 5, 2020
@JimBugwadia (Member)

@sgandon - yes, please share the policies. Not sure which rule types they contain - can this be checked via the CLI? Also, can you run a test with 1.1.6 as well?

We are running tests with all of the sample policies and not seeing such a delay. The webhook is configured with a 3s timeout by default via the webhooktimeout flag, so if Kyverno takes longer the API server should not block.
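
For reference, the timeout is a container argument on the Kyverno Deployment; a minimal sketch of raising it (flag name as quoted above, exact chart wiring may differ by version):

spec:
  template:
    spec:
      containers:
        - name: kyverno
          args:
            - --webhooktimeout=10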

@mchebitou commented Jun 8, 2020

I have exactly the same problem on v1.6.0, even without any policy deployed. I also noticed synchronization timeouts when using ArgoCD.

@sgandon (Contributor, Author) commented Jun 10, 2020

I have been running the latest 1.1.6 for 5 days with those 3 policies and it works like a charm now, so I guess this has been solved with the latest version, on my side at least :)
Here are the policies.
One mutation policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: mutate-cloud-engine-pods
spec:
  rules:
    - name: "remove default kube services environment variables and service account token"
      match:
        resources:
          kinds:
            - Pod
      mutate:
        overlay:
          spec:
            enableServiceLinks: false
            automountServiceAccountToken: false

One generation policy

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-cloud-engine-resources
spec:
  rules:
    - name: create docker secrets
      match:
        resources:
          kinds:
            - Namespace
          selector:
            matchLabels:
                remote-engine: "true"
      generate:
        kind: Secret # Kind of resource 
        name: docker-registry # Name of the new Resource
        namespace: "{{request.object.metadata.name}}" # namespace that triggers this rule
        clone:
          namespace: default
          name: docker-registry
    - name: create default network policy to deny all inbound and outbound traffic
      match:
        resources:
          kinds:
            - Namespace
          selector:
            matchLabels:
                remote-engine: "true"
      generate:
        kind: NetworkPolicy # Kind of resource 
        name: default-deny-all # Name of the new Resource
        namespace: "{{request.object.metadata.name}}" # namespace that triggers this rule
        data:
          spec:
            # select all pods in the namespace
            podSelector: {}
            policyTypes:
            - Ingress
            - Egress

and one validation policy

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: validation-ce-helmrelease-configmap
spec:
  validationFailureAction: enforce
  background: false #disable check on existing resources
  rules:
    - name: check-configmap-has-the-right-key
      match:
        resources:
          # Kind specifies one or more resource types to match
          kinds:
            - ConfigMap
          # Selector is optional
          selector:
            matchLabels:
                remote-engine: "true"
      validate:
        # Message is optional, used to report custom message if the rule condition fails
        message: "The key cloud-engine-release.yaml is required"    
        pattern:
          data:
            cloud-engine-release.yaml: "?*"
    - name: check-crd-follows-the-right-template
      match:
        resources:
          # Kind specifies one or more resource types to match
          kinds:
            - ConfigMap
          # Selector is optional
          selector:
            matchLabels:
                remote-engine: "true"
      validate:
        # Message is optional, used to report custom message if the rule condition fails
        message: | 
          The cloud engine CRD in the configmap does not respect the following template
          apiVersion: helm.fluxcd.io/v1
        pattern:
          data:
            cloud-engine-release.yaml: >
              *kind: HelmRelease*
              namespace: [[tenantNamespace]]*
              engineId: [[engineId]]*
              releaseName:*
              wait: true*
              timeout:*

@JimBugwadia (Member)

> I have exactly the same problem on v1.6.0, even without any policy deployed. I also noticed synchronization timeouts when using ArgoCD.

@mchebitou - this is not expected: if no policies are deployed, Kyverno does not register any admission control webhooks. Can you please provide the API server logs and the Kyverno logs? Thanks!

@JimBugwadia (Member)

@sgandon - thanks for the update and details. Glad to hear that you are not seeing any issues with 1.1.6 in your setup. Any additional data points over the last few days?

@sgandon (Contributor, Author) commented Jun 12, 2020

I have not used Kyverno lately, just left it running on the cluster; it is running OK and no longer prevents K9s from starting.

@sgandon (Contributor, Author) commented Jun 16, 2020

Unfortunately it took longer, but I have the same misbehavior again. I did not touch anything and left Kyverno (1.1.6) running constantly.
The pod has been restarted 4 times (I don't know why) and has now been running for 11 days, and I have the same behaviour with K9s, which I cannot start anymore.
After deleting Kyverno, K9s starts fine again. So there is still an issue slowing the kube API server responses.

@JimBugwadia (Member)

@sgandon - do you happen to have the logs? We will also check on clusters where Kyverno 1.1.6 has been running.

@sgandon (Contributor, Author) commented Jun 16, 2020

@JimBugwadia yes, I stored them before deleting the pod. I am just wondering how to send them to you because they are too big for pastebin.

@JimBugwadia (Member)

@sgandon - I will share a GDrive link via slack.

realshuting self-assigned this Jun 24, 2020
realshuting added this to the Kyverno Release 1.1.7 milestone Jun 24, 2020
@realshuting (Member)

Hi @sgandon, I could not reproduce this slowness issue while deleting/creating resources. What were you trying to do with k9s?

Until we figure out the exact issue, we have improved the webhook logic to reduce the processing time when receiving an admission request, see PR #967. We also print out the processing time per admission request at log level 4. This could potentially help us find the issue.
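
To capture that, the verbosity can be raised on the Kyverno container and the logs tailed; a minimal sketch, assuming the standard klog-style -v flag:

kubectl -n kyverno edit deploy kyverno    # add "-v=4" to the container args
kubectl -n kyverno logs deploy/kyverno -f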

@sgandon (Contributor, Author) commented Jul 7, 2020

@realshuting, the issue happens after a while and prevents k9s from launching. The last time I reproduced it, I just left Kyverno running and did nothing at all but launch K9s regularly to check whether it could start or not.
So I guess I'll wait for the next release to try to reproduce it.

@JimBugwadia (Member)

@sgandon - any data on behaviors with 1.1.8?

@sgandon (Contributor, Author) commented Aug 3, 2020

@JimBugwadia I'll be trying 1.1.8 to check on this again. I'll let you know.

@mbarrien (Contributor)

I'm definitely seeing this behavior on EKS 1.16 + Kyverno 1.1.9 + k9s 0.21.7. If I uninstall Kyverno, k9s works properly. If I re-install, k9s errors out after a long wait with an error message Post "https://(censored).eks.amazonaws.com/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded (where censored is the EKS cluster domain name).

The discussion in derailed/k9s#718 suggests that Kyverno may be interfering with SubjectAccessReview calls.
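
(For what it's worth, the slow call can be reproduced outside of k9s, since kubectl auth can-i issues a SelfSubjectAccessReview under the hood; timing it directly is a rough but useful check:)

time kubectl auth can-i list nodes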

Also noticed when installing something via "kubectl apply -f" that what would normally be an instantaneous creation of about 20 resources was slowed down considerably (several seconds between creating each resource). I highly suspect Kyverno is massively slowing things down in the same way it's slowing down k9s.

Replacing the kyverno pod does nothing to affect the performance or ability to run k9s. Neither does removing all clusterpolicies.

@JimBugwadia (Member)

@mbarrien thanks for the details.

Any idea who (which user / group) is invoking the SubjectAccessReview?

Kyverno should already be excluding SubjectAccessReview. Can you please check the deployment configuration to make sure this is set as expected?

data:
  excludeGroupRole: system:serviceaccounts:kube-system,system:nodes,system:kube-scheduler
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*][Binding,*,*][ReplicaSet,*,*]'
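
For example, the active filters can be dumped with (a sketch; the ConfigMap is named init-config or kyverno depending on how Kyverno was installed):

kubectl -n kyverno get cm kyverno -o jsonpath='{.data.resourceFilters}'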

@mbarrien (Contributor)

This was with a stock Helm installation, so resourceFilters is whatever's in the Helm chart.

According to kubectl get cm -n (non-standard-namespace) kyverno -o yaml:

data:
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'

(No excludeGroupRole)

For the user, since we're using EKS, it's using aws-auth to map an IAM role to something that is in group system:masters.

Note that between uninstall and reinstall, I manually deleted any leftover bits from Kyverno to get it as clean as possible (removed the apiservice, removed all configmaps, secrets, and CRDs), and this is still happening.

@mbarrien (Contributor)

Also, for what it's worth, I have Kyverno installed in a completely separate cluster where k9s+kyverno is not exhibiting this behavior. There is nothing really of note that differs between the two clusters (they're both fairly empty testing clusters created around the same time, with the same versions of the same system applications installed; the only difference is user workloads, none of which do anything notable with Kubernetes other than hosting standard containers).

@mbarrien (Contributor) commented Aug 14, 2020

In Kyverno's logs at the standard logging level I see this exactly every minute:

E0814 04:46:33.329088       1 checker.go:99] LastReqTime "msg"="webhook check failed" "error"="Admission control configuration error"  "deadline"=180000000000
E0814 04:46:38.332034       1 memcache.go:111] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0814 04:46:38.332336       1 client.go:326] dclient "msg"="failed to get registered preferred resources" "error"="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
I0814 04:46:38.332358       1 client.go:275] dclient "msg"="schema not found"  "kind"="Deployment"
E0814 04:46:38.334943       1 status.go:61] LastReqTime/StatusControl "msg"="failed to get deployment" "error"="the server could not find the requested resource" "name"="kyverno" "namespace"="(namespace kyverno is installed in)"

These log messages do not show up in the cluster where k9s+kyverno works properly.

EDIT: However, after uninstalling and reinstalling the metrics server, these error messages no longer appear, but k9s still hits the same error as above when Kyverno is installed.
EDIT 2: Correction, once I reinstalled the Kyverno policies, these error messages started showing up again. k9s+kyverno doesn't work together, with or without policies installed.

@mbarrien (Contributor)

Further progress on this. I got k9s+kyverno working... but the only thing that got it to work was replacing every node in the cluster (I did a full node rotation).

I suspect something was going wrong at the lowest level in the cluster that was blocking traffic from reaching something that backed an apiservice (based on the error messages above, the metrics server was a big suspect).

In any event, I still think there is something Kyverno-related here, in that Kyverno is blocking some API calls that k9s uses, even read-only ones, if something Kyverno relies on is broken. Unfortunately, I can't help with reproducing any more now that the broken cluster is fixed.

@sgandon (Contributor, Author) commented Sep 2, 2020

I am back from my vacation and tried to launch K9s, without success, on my local k3d cluster using version v1.1.8.

Post "https://0.0.0.0:58792/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded

So this is happening again after some time.
The only error that Kyverno logs is the following:

E0901 12:12:29.024570       1 crdSync.go:160]  "msg"="could not parse crd schema" "error"="ERROR schema.properties.spec.properties.rules.items.\u003carray\u003e.properties.validate.properties.deny.properties.conditions.items.\u003carray\u003e.properties.value has invalid property: anyOf"  "name"="ClusterPolicy"

@realshuting (Member)

@sgandon
We found this issue with the Helm install and fixed it in 1.1.10. Can you upgrade to this version?

Also, the error you saw is related to the OpenAPI schema version (issue logged); it should be fixed in the next patch release, 1.1.11.
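
If the original install was via Helm, the upgrade would look something like this (a sketch; repo and release names are assumptions and may differ in your setup):

helm repo update
helm upgrade kyverno kyverno/kyverno -n kyverno   # pick the chart version that ships Kyverno 1.1.10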

@JimBugwadia (Member)

@sgandon @mbarrien - any additional data on this with v1.1.10+?

@realshuting (Member)

Closing this issue for now.

Feel free to reopen it if needed.

@dirien commented Nov 23, 2020

Hi @realshuting,

Can we reopen this issue? I get the same error mentioned in the posts above:

Post "https://xxx.azmk8s.io:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded

Only removing Kyverno gets k9s up and running again.

@realshuting (Member)

Which version of Kyverno are you running?
Can you also share the output of kubectl -n kyverno get configmap init-config -o yaml?

realshuting reopened this Nov 23, 2020
@JimBugwadia (Member)

See: derailed/k9s#718 and #918 for more context.

@realshuting (Member)

@dirien

Another thing I want to confirm: was selfsubjectaccessreviews the only resource that reported "context deadline"? If so, can you try adding SelfSubjectAccessReview to the exclude list and check whether the error happens again?

Run $ kubectl -n kyverno edit configmap init-config, then add [SelfSubjectAccessReview,*,*] to resourceFilters.
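
After the edit, the data section of the ConfigMap should look something like this (a sketch based on the default filters shown earlier in this thread):

data:
  resourceFilters: '[SelfSubjectAccessReview,*,*][Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'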

@dirien commented Nov 24, 2020

Hi @realshuting,

there is no init-config (Helm install), but there is a configmap with the name kyverno:

apiVersion: v1
data:
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: kyverno-kyverno
    meta.helm.sh/release-namespace: kyverno
  creationTimestamp: "2020-11-23T06:51:16Z"
  labels:
    app.kubernetes.io/instance: kyverno-kyverno
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kyverno
    helm.sh/chart: kyverno-1.2.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:resourceFilters: {}
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:helm.sh/chart: {}
    manager: Go-http-client
    operation: Update
    time: "2020-11-23T06:51:16Z"
  name: kyverno
  namespace: kyverno
  resourceVersion: "1903250"
  selfLink: /api/v1/namespaces/kyverno/configmaps/kyverno
  uid: 82fbdf03-7a49-4826-a5b3-f3ac0c5641bd

Regarding adding [SelfSubjectAccessReview,*,*] to resourceFilters: done.

@realshuting (Member)

Do you see the same error again?

@dirien commented Nov 24, 2020

not for now... have to wait :)

@realshuting (Member) commented Nov 24, 2020

If SelfSubjectAccessReview was added to the exclude list, there should be a log message for this.
What's the output of k logs deploy/kyverno -n kyverno -f | grep ConfigData?

Also, are you seeing these 3 mutating webhook configurations?

✗ k get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
kyverno-policy-mutating-webhook-cfg     1          114m
kyverno-resource-mutating-webhook-cfg   1          113m
kyverno-verify-mutating-webhook-cfg     1          114m

@dirien commented Nov 24, 2020

I1123 06:51:21.245243       1 dynamicconfig.go:255] ConfigData "msg"="Init resource "  "excludeRoles"=""
I1123 06:51:24.611039       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=null
I1124 04:58:52.380251       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]
I1124 04:59:44.014026       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]
I1124 05:03:30.991975       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"},{"Kind":"Binding","Namespace":"*","Name":"*"},{"Kind":"ReplicaSet","Namespace":"*","Name":"*"},{"Kind":"ReportChangeRequest","Namespace":"*","Name":"*"},{"Kind":"ClusterReportChangeRequest","Namespace":"*","Name":"*"}] "oldFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]

And yes, I can see the three webhooks :)

@realshuting (Member)

Seems the error is now gone; leaving the issue open for a few more days.

@dirien - Feel free to update the issue if you see it again.

@dirien commented Nov 25, 2020

Unfortunately it came back...

I noticed this happens when the Kyverno pod is crashing for unknown reasons:

kyverno-55459799c5-6w45x 0/1 CrashLoopBackOff 8 46h

Logs are

I1125 05:02:55.309288       1 version.go:17]  "msg"="Kyverno"  "Version"="v1.2.1"
I1125 05:02:55.309534       1 version.go:18]  "msg"="Kyverno"  "BuildHash"="(HEAD/af07e66edea834bf8bbc12580b57345ff9cb543a"
I1125 05:02:55.309657       1 version.go:19]  "msg"="Kyverno"  "BuildTime"="2020-10-22_08:08:42PM"
I1125 05:02:55.310031       1 config.go:84] CreateClientConfig "msg"="Using in-cluster configuration"
I1125 05:02:55.311343       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
E1125 05:02:55.464826       1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1125 05:03:00.467669       1 memcache.go:199] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

The metrics server is up and running.

metrics-server-58fdc875d5-bb8wt 1/1 Running 0 46h

@dirien commented Nov 25, 2020

Unfortunately a restart of the Kyverno pod is not helping at all. Somehow Kyverno gets stuck...

@dirien commented Nov 25, 2020

The only thing that helps is to reinstall Kyverno. Then everything works fine:

I1125 05:15:56.654509       1 config.go:84] CreateClientConfig "msg"="Using in-cluster configuration"
I1125 05:15:56.655921       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
I1125 05:15:59.062480       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="ClusterPolicy"
I1125 05:15:59.062937       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="ClusterPolicyViolation"
I1125 05:15:59.063264       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="PolicyViolation"
I1125 05:15:59.065051       1 client.go:280] dclient "msg"="schema not found"  "kind"="NamespacedPolicyViolation"
I1125 05:15:59.065143       1 dynamicconfig.go:255] ConfigData "msg"="Init resource "  "excludeRoles"=""
I1125 05:15:59.066248       1 certificates.go:26] dclient "msg"="Building key/certificate pair for TLS"
I1125 05:15:59.458919       1 certificates.go:169] dclient/CAcert "msg"="secret updated"  "name"="kyverno-svc.kyverno.svc.kyverno-tls-ca" "namespace"="kyverno"
I1125 05:15:59.748016       1 certificates.go:217] dclient/WriteTlsPair "msg"="secret updated"  "name"="kyverno-svc.kyverno.svc.kyverno-tls-pair" "namespace"="kyverno"
I1125 05:15:59.748056       1 registration.go:252] WebhookRegistrationClient "msg"="removing prior webhook configurations"
I1125 05:16:02.065753       1 controller.go:210] GenerateCleanUpController "msg"="starting"
I1125 05:16:02.065760       1 controller.go:326] PolicyController "msg"="starting"
I1125 05:16:02.065786       1 informer.go:109] PolicyCacheController "msg"="starting"
I1125 05:16:02.065817       1 generator.go:148] PolicyViolationGenerator "msg"="start"
I1125 05:16:02.065837       1 controller.go:232] GenerateController "msg"="starting"
I1125 05:16:02.065827       1 controller.go:107] EventGenerator "msg"="start"
I1125 05:16:02.442666       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessRevi
I1125 05:16:02.654660       1 rwebhookregister.go:107] ResourceWebhookRegister "msg"="created validating webhook"  "name"="kyverno-resource-validating-webhook-cfg"
I1125 05:16:02.853627       1 rwebhookregister.go:82] ResourceWebhookRegister "msg"="created mutating webhook"  "name"="kyverno-resource-mutating-webhook-cfg"
I1125 05:16:03.057849       1 rwebhookregister.go:82] ResourceWebhookRegister "msg"="created mutating webhook"  "name"="kyverno-resource-mutating-webhook-cfg"
I1125 05:16:04.231436       1 server.go:516] WebhookServer "msg"="starting"

@realshuting (Member)

Found similar issues: kubernetes-sigs/metrics-server#188 and kubernetes-sigs/metrics-server#448. Is there any error in metrics-server?

Are you running an EKS cluster? If so, do you have the flag --kubelet-preferred-address-types=InternalIP set? See detail.

It would be helpful if you could share your metrics-server configuration; I'll try to reproduce the same on my end.

@dirien commented Nov 25, 2020

Hi,

I stumbled on the same issues :D regarding the metrics server. But this one looks fine, just some logs regarding the ACI virtual nodes:

E1125 05:20:46.018975       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:21:19.537495       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:21:46.079975       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:22:19.739157       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:22:46.026240       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:23:19.936476       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:23:46.024302       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory

Reinstalling Kyverno without touching the metrics server works fine.

I am running an AKS (1.18.x) stack, and the metrics-server deployment is managed by Microsoft.

Here is a snippet of the deployment:


   containers:
      - command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        image: mcr.microsoft.com/oss/kubernetes/metrics-server:v0.3.6
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:

@realshuting (Member)

OK, I installed the metrics-server on my cluster; so far it all looks good. I'll get back to you if I see the error.

spec:
  containers:
  - command:
    - /metrics-server
    - --source=kubernetes.summary_api:https://kubernetes.default?kubeletHttps=true&kubeletPort=10250&insecure=true
    image: k8s.gcr.io/metrics-server-amd64:v0.2.1
    imagePullPolicy: Always
    name: metrics-server

@dirien commented Nov 25, 2020

curious to see the results... :)

@dirien commented Nov 30, 2020

Still crashing... with

I1130 07:05:59.751872       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
E1130 07:05:59.870142       1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:04.872074       1 memcache.go:199] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.874085       1 memcache.go:111] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.876580       1 memcache.go:111] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.876956       1 main.go:52] resourcecache "msg"="error in udateGVRCache function" "error"="unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
E1130 07:06:09.877002       1 main.go:112] setup "msg"="Failed to create resource cache" "error"="unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

Is this related to #1325?

@realshuting (Member)

Yes, it's related to #1324; you can track the fix in PR #1342.

For the slowness issue, I left the metrics server running in my cluster for over a day, and I could see that Kyverno stopped working, or rather, behaved relatively slowly. We suspect it is due to the flood of throttled requests, see #1319.

We are applying several fixes to see if they resolve the slowness issue; I'll test this again after the fix.
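
In the meantime, a quick way to check whether Kyverno itself is being throttled client-side is to grep its logs (a sketch; the exact log wording may vary):

kubectl -n kyverno logs deploy/kyverno | grep -i throttl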

@dirien commented Dec 1, 2020

Thanks @realshuting for the feedback. I could see the throttling too, under flux2:

flux get helmreleases -A
I1201 05:52:38.575285   14972 request.go:645] Throttling request took 1.149248624s, request: GET:https://xxx.westeurope.azmk8s.io:443/apis/monitoring.coreos.com/v1?timeout=32s

@realshuting (Member)

@dirien
Can you upgrade Kyverno to v1.3.0-rc6? You should not see the crashing anymore.

And we have trimmed some of the throttling requests, which should potentially help solve the slowness issue.

@realshuting (Member)

@dirien Any updates on this?

@realshuting (Member)

Closing this.

Feel free to reopen if needed.
