
[BUG] Kyverno slowness makes K9s time out #918

Closed
sgandon opened this issue Jun 5, 2020 · 47 comments
Labels: bug (Something isn't working)

Comments

@sgandon (Contributor) commented Jun 5, 2020

Describe the bug
I tracked down this bug for a while and it happens on v1.1.5. I know you have been working hard since and I have not had the time to test the latest versions, but I will.
I used the Helm chart to install Kyverno.
I also installed 3 policies.
What I observed after a while is that K9s (a CLI to manage clusters) was failing to start because of some timeout.
I tried to debug it, and it turns out one POST request was taking 2 to 3 seconds:

I0605 00:01:55.146161   40327 request.go:1068] Request Body: {"kind":"SelfSubjectAccessReview","apiVersion":"authorization.k8s.io/v1","metadata":{"creationTimestamp":null},"spec":{"resourceAttributes":{"verb":"list","resource":"nodes"}},"status":{"allowed":false}}
I0605 00:01:55.146190   40327 round_trippers.go:420] POST https://localhost:6443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews
I0605 00:01:55.146194   40327 round_trippers.go:427] Request Headers:
I0605 00:01:55.146198   40327 round_trippers.go:431]     User-Agent: k9s/v0.0.0 (darwin/amd64) kubernetes/$Format
I0605 00:01:55.146201   40327 round_trippers.go:431]     Accept: application/json, */*
I0605 00:01:55.146203   40327 round_trippers.go:431]     Content-Type: application/json
I0605 00:01:55.146210   40327 round_trippers.go:431]     Authorization: Basic <masked>
I0605 00:01:58.149696   40327 round_trippers.go:446] Response Status: 201 Created in 3003 milliseconds

And this made the tool time out.
After uninstalling Kyverno, the same POST takes 4 milliseconds.

To Reproduce
Steps to reproduce the behavior:

  1. install Kyverno with the chart from commit 2bf51494
  2. install some policies (I don't know if it is related to my policies, I can provide them for reproduction if required)
  3. wait for a while (1 hour I'd say)
  4. try to launch K9s

Expected behavior
K9s should start fine.

Additional context
Here are the K9s (k8s client) logs.

sgandon added the bug label Jun 5, 2020
@JimBugwadia (Member)

@sgandon - yes, please share the policies. Not sure which rule types they contain - can this be checked via the CLI? Also, can you run a test with 1.1.6 as well?

We are running tests with all of the sample policies and not seeing such a delay. The webhook is configured with a 3s timeout by default via the webhooktimeout flag, so if Kyverno takes longer the API server should not block.
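
For reference, the timeout is a container argument on the Kyverno Deployment; a minimal sketch of raising it (flag name as quoted above, exact chart wiring may differ by version):

spec:
  template:
    spec:
      containers:
        - name: kyverno
          args:
            - --webhooktimeout=10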

@mchebitou commented Jun 8, 2020

I have exactly the same problem on v1.6.0, even without any policy deployed. I also noticed synchronization timeouts when using ArgoCD.

@sgandon (Contributor, Author) commented Jun 10, 2020

I have been running the latest 1.1.6 for 5 days with those 3 policies and it works like a charm now, so I guess this has been solved with the latest version, on my side at least :)
Here are the policies.
One mutation policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: mutate-cloud-engine-pods
spec:
  rules:
    - name: "remove default kube services environment variables and service account token"
      match:
        resources:
          kinds:
            - Pod
      mutate:
        overlay:
          spec:
            enableServiceLinks: false
            automountServiceAccountToken: false

One generation policy

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-cloud-engine-resources
spec:
  rules:
    - name: create docker secrets
      match:
        resources:
          kinds:
            - Namespace
          selector:
            matchLabels:
                remote-engine: "true"
      generate:
        kind: Secret # Kind of resource 
        name: docker-registry # Name of the new Resource
        namespace: "{{request.object.metadata.name}}" # namespace that triggers this rule
        clone:
          namespace: default
          name: docker-registry
    - name: create default network policy to deny all inbound and outbound traffic
      match:
        resources:
          kinds:
            - Namespace
          selector:
            matchLabels:
                remote-engine: "true"
      generate:
        kind: NetworkPolicy # Kind of resource 
        name: default-deny-all # Name of the new Resource
        namespace: "{{request.object.metadata.name}}" # namespace that triggers this rule
        data:
          spec:
            # select all pods in the namespace
            podSelector: {}
            policyTypes:
            - Ingress
            - Egress

and one validation policy

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: validation-ce-helmrelease-configmap
spec:
  validationFailureAction: enforce
  background: false #disable check on existing resources
  rules:
    - name: check-configmap-has-the-right-key
      match:
        resources:
          # Kind specifies one or more resource types to match
          kinds:
            - ConfigMap
          # Selector is optional
          selector:
            matchLabels:
                remote-engine: "true"
      validate:
        # Message is optional, used to report custom message if the rule condition fails
        message: "The key cloud-engine-release.yaml is required"    
        pattern:
          data:
            cloud-engine-release.yaml: "?*"
    - name: check-crd-follows-the-right-template
      match:
        resources:
          # Kind specifies one or more resource types to match
          kinds:
            - ConfigMap
          # Selector is optional
          selector:
            matchLabels:
                remote-engine: "true"
      validate:
        # Message is optional, used to report custom message if the rule condition fails
        message: | 
          The cloud engine CRD in the configmap does not respect the following template
          apiVersion: helm.fluxcd.io/v1
        pattern:
          data:
            cloud-engine-release.yaml: >
              *kind: HelmRelease*
              namespace: [[tenantNamespace]]*
              engineId: [[engineId]]*
              releaseName:*
              wait: true*
              timeout:*

@JimBugwadia (Member)

> I have exactly the same problem on v1.6.0, even without any policy deployed. I also noticed synchronization timeouts when using ArgoCD.

@mchebitou - this is not expected: if no policies are deployed, Kyverno does not register any admission control webhooks. Can you please provide the API server logs and the Kyverno logs? Thanks!

@JimBugwadia (Member)

@sgandon - thanks for the update and details. Glad to hear that you are not seeing any issues with 1.1.6 in your setup. Any additional data points over the last few days?

@sgandon (Contributor, Author) commented Jun 12, 2020

I have not used Kyverno lately, just left it running on the cluster; it is running OK and no longer prevents K9s from starting.

@sgandon (Contributor, Author) commented Jun 16, 2020

Unfortunately it took longer, but I have the same misbehavior again. I did not touch anything and left Kyverno (1.1.6) running constantly.
The pod has been restarted 4 times (I don't know why) and has now been running for 11 days, and I have the same behaviour with K9s, which I cannot start anymore.
After deleting Kyverno, K9s starts fine again. So there is still an issue slowing the kube API server responses.

@JimBugwadia (Member)

@sgandon - do you happen to have the logs? We will also check on clusters where Kyverno 1.1.6 has been running.

@sgandon (Contributor, Author) commented Jun 16, 2020

@JimBugwadia yes, I stored them before deleting the pod. I am just wondering how to send them to you because they are too big for pastebin.

@JimBugwadia (Member)

@sgandon - I will share a GDrive link via slack.

realshuting self-assigned this Jun 24, 2020
realshuting added this to the Kyverno Release 1.1.7 milestone Jun 24, 2020
@realshuting (Member)

Hi @sgandon, I could not reproduce this slowness issue while deleting/creating resources. What were you trying to do with k9s?

Until we figure out the exact issue, we have improved the webhook logic to reduce the processing time when receiving an admission request, see PR #967. We also print out the processing time per admission request at log level 4. This could potentially help us find the issue.
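
To capture that, the verbosity can be raised on the Kyverno container and the logs tailed; a minimal sketch, assuming the standard klog-style -v flag:

kubectl -n kyverno edit deploy kyverno    # add "-v=4" to the container args
kubectl -n kyverno logs deploy/kyverno -f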

@sgandon (Contributor, Author) commented Jul 7, 2020

@realshuting, the issue happens after a while and prevents k9s from launching. The last time I reproduced it, I just left Kyverno running and did nothing at all but launch K9s regularly to check whether it could start or not.
So I guess I'll wait for the next release to try to reproduce it.

@JimBugwadia (Member)

@sgandon - any data on behaviors with 1.1.8?

@sgandon (Contributor, Author) commented Aug 3, 2020

@JimBugwadia I'll be trying 1.1.8 to check on this again. I'll let you know.

@mbarrien (Contributor)

I'm definitely seeing this behavior on EKS 1.16 + Kyverno 1.1.9 + k9s 0.21.7. If I uninstall Kyverno, k9s works properly. If I re-install, k9s errors out after a long wait with an error message Post "https://(censored).eks.amazonaws.com/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded (where censored is the EKS cluster domain name).

The discussion in derailed/k9s#718 suggests that Kyverno may be interfering with SubjectAccessReview calls.
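
(For what it's worth, the slow call can be reproduced outside of k9s, since kubectl auth can-i issues a SelfSubjectAccessReview under the hood; timing it directly is a rough but useful check:)

time kubectl auth can-i list nodes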

Also noticed when installing something via "kubectl apply -f" that what would normally be an instantaneous creation of about 20 resources was slowed down considerably (several seconds between creating each resource). I highly suspect Kyverno is massively slowing things down in the same way it's slowing down k9s.

Replacing the kyverno pod does nothing to affect the performance or ability to run k9s. Neither does removing all clusterpolicies.

@JimBugwadia (Member)

@mbarrien thanks for the details.

Any idea who (which user / group) is invoking the SubjectAccessReview?

Kyverno should already be excluding SubjectAccessReview. Can you please check the deployment configuration to make sure this is set as expected?

data:
  excludeGroupRole: system:serviceaccounts:kube-system,system:nodes,system:kube-scheduler
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*][Binding,*,*][ReplicaSet,*,*]'
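
For example, the active filters can be dumped with (a sketch; the ConfigMap is named init-config or kyverno depending on how Kyverno was installed):

kubectl -n kyverno get cm kyverno -o jsonpath='{.data.resourceFilters}'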

@mbarrien (Contributor)

This was with a stock Helm installation, so resourceFilters is whatever's in the Helm chart.

According to kubectl get cm -n (non-standard-namespace) kyverno -o yaml:

data:
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'

(No excludeGroupRole)

For the user, since we're using EKS, it's using aws-auth to map an IAM role to something that is in group system:masters.

Note that between uninstall and reinstall, I manually deleted any leftover bits from Kyverno to get it as clean as possible (removed the apiservice, removed all configmaps, secrets, and CRDs), and this is still happening.

@mbarrien (Contributor)

Also, for what it's worth, I have Kyverno installed in a completely separate cluster where k9s+kyverno is not exhibiting this behavior. There is nothing really of note that differs between the two clusters (they're both fairly empty testing clusters created around the same time, with the same versions of the same system applications installed; the only difference is user workloads, none of which do anything notable with Kubernetes other than hosting standard containers).

@mbarrien (Contributor) commented Aug 14, 2020

In Kyverno's logs at the standard logging level I see this exactly every minute:

E0814 04:46:33.329088       1 checker.go:99] LastReqTime "msg"="webhook check failed" "error"="Admission control configuration error"  "deadline"=180000000000
E0814 04:46:38.332034       1 memcache.go:111] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0814 04:46:38.332336       1 client.go:326] dclient "msg"="failed to get registered preferred resources" "error"="unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
I0814 04:46:38.332358       1 client.go:275] dclient "msg"="schema not found"  "kind"="Deployment"
E0814 04:46:38.334943       1 status.go:61] LastReqTime/StatusControl "msg"="failed to get deployment" "error"="the server could not find the requested resource" "name"="kyverno" "namespace"="(namespace kyverno is installed in)"

These log messages do not show up in the cluster where k9s+kyverno works properly.

EDIT: However, after uninstalling and reinstalling the metrics server, these error messages no longer appear, but k9s still hits the same error as above when Kyverno is installed.
EDIT 2: Correction, once I reinstalled the Kyverno policies, these error messages started showing up again. k9s+kyverno doesn't work together, with or without policies installed.

@mbarrien (Contributor)

Further progress on this. I got k9s+kyverno working... but the only thing that got it to work was replacing every node in the cluster (I did a full node rotation).

I suspect something was going wrong at the lowest level in the cluster that was blocking traffic from reaching something that backed an apiservice (based on the error messages above, the metrics server was a big suspect).

In any event, I still think there is something Kyverno-related here, in that Kyverno is blocking some API calls that k9s uses, even read-only ones, if something Kyverno relies on is broken. Unfortunately, I can't help with reproducing any more now that the broken cluster is fixed.

@sgandon (Contributor, Author) commented Sep 2, 2020

I am back from my vacation and tried to launch K9s, without success, on my local k3d cluster using version v1.1.8.

Post "https://0.0.0.0:58792/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded

So this is happening again after some time.
The only error that Kyverno logs is the following:

E0901 12:12:29.024570       1 crdSync.go:160]  "msg"="could not parse crd schema" "error"="ERROR schema.properties.spec.properties.rules.items.\u003carray\u003e.properties.validate.properties.deny.properties.conditions.items.\u003carray\u003e.properties.value has invalid property: anyOf"  "name"="ClusterPolicy"

@realshuting (Member)

@sgandon
We found this issue with the Helm install and fixed it in 1.1.10. Can you upgrade to this version?

Also, the error you saw is related to the OpenAPI schema version (issue logged); it should be fixed in the next patch release, 1.1.11.
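
If the original install was via Helm, the upgrade would look something like this (a sketch; repo and release names are assumptions and may differ in your setup):

helm repo update
helm upgrade kyverno kyverno/kyverno -n kyverno   # pick the chart version that ships Kyverno 1.1.10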

@JimBugwadia (Member)

@sgandon @mbarrien - any additional data on this with v1.1.10+?

@realshuting (Member)

Closing this issue for now.

Feel free to reopen it if needed.

@dirien commented Nov 23, 2020

Hi @realshuting,

Can we reopen this issue? I get the same error mentioned in the posts above:

Post "https://xxx.azmk8s.io:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews": context deadline exceeded

Only removing Kyverno gets k9s up and running again.

@realshuting (Member)

Which version of Kyverno are you running?
Can you also share the output of kubectl -n kyverno get configmap init-config -o yaml?

realshuting reopened this Nov 23, 2020
@JimBugwadia (Member)

See: derailed/k9s#718 and #918 for more context.

@realshuting (Member)

@dirien

Another thing I want to confirm: was selfsubjectaccessreviews the only resource that reported "context deadline"? If so, can you try adding SelfSubjectAccessReview to the exclude list and check whether the error happens again?

Run $ kubectl -n kyverno edit configmap init-config, then add [SelfSubjectAccessReview,*,*] to resourceFilters.
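
After the edit, the data section of the ConfigMap should look something like this (a sketch based on the default filters shown earlier in this thread):

data:
  resourceFilters: '[SelfSubjectAccessReview,*,*][Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'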

@dirien commented Nov 24, 2020

Hi @realshuting,

there is no init-config (Helm install), but there is a configmap with the name kyverno:

apiVersion: v1
data:
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*][Node,*,*][APIService,*,*][TokenReview,*,*][SubjectAccessReview,*,*][*,kyverno,*]'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: kyverno-kyverno
    meta.helm.sh/release-namespace: kyverno
  creationTimestamp: "2020-11-23T06:51:16Z"
  labels:
    app.kubernetes.io/instance: kyverno-kyverno
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kyverno
    helm.sh/chart: kyverno-1.2.1
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:resourceFilters: {}
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:helm.sh/chart: {}
    manager: Go-http-client
    operation: Update
    time: "2020-11-23T06:51:16Z"
  name: kyverno
  namespace: kyverno
  resourceVersion: "1903250"
  selfLink: /api/v1/namespaces/kyverno/configmaps/kyverno
  uid: 82fbdf03-7a49-4826-a5b3-f3ac0c5641bd

Regarding adding [SelfSubjectAccessReview,*,*] to resourceFilters: done.

@realshuting (Member)

Do you see the same error again?

@dirien commented Nov 24, 2020

not for now... have to wait :)

@realshuting (Member) commented Nov 24, 2020

If SelfSubjectAccessReview was added to the exclude list, there should be a log message for this.
What's the output of k logs deploy/kyverno -n kyverno -f | grep ConfigData?

Also, are you seeing these 3 mutating webhook configurations?

✗ k get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
kyverno-policy-mutating-webhook-cfg     1          114m
kyverno-resource-mutating-webhook-cfg   1          113m
kyverno-verify-mutating-webhook-cfg     1          114m

@dirien commented Nov 24, 2020

I1123 06:51:21.245243       1 dynamicconfig.go:255] ConfigData "msg"="Init resource "  "excludeRoles"=""
I1123 06:51:24.611039       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=null
I1124 04:58:52.380251       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]
I1124 04:59:44.014026       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}] "oldFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]
I1124 05:03:30.991975       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"},{"Kind":"Binding","Namespace":"*","Name":"*"},{"Kind":"ReplicaSet","Namespace":"*","Name":"*"},{"Kind":"ReportChangeRequest","Namespace":"*","Name":"*"},{"Kind":"ClusterReportChangeRequest","Namespace":"*","Name":"*"}] "oldFilters"=[{"Kind":"Event","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kube-system","Name":"*"},{"Kind":"*","Namespace":"kube-public","Name":"*"},{"Kind":"*","Namespace":"kube-node-lease","Name":"*"},{"Kind":"Node","Namespace":"*","Name":"*"},{"Kind":"APIService","Namespace":"*","Name":"*"},{"Kind":"TokenReview","Namespace":"*","Name":"*"},{"Kind":"SubjectAccessReview","Namespace":"*","Name":"*"},{"Kind":"*","Namespace":"kyverno","Name":"*"}]

And yes, I can see the three webhooks :)

@realshuting (Member)

Seems the error is now gone; leaving the issue open for a few more days.

@dirien - Feel free to update the issue if you see it again.

@dirien commented Nov 25, 2020

Unfortunately it came back...

I noticed this happens when the Kyverno pod is crashing for unknown reasons:

kyverno-55459799c5-6w45x 0/1 CrashLoopBackOff 8 46h

Logs are

I1125 05:02:55.309288       1 version.go:17]  "msg"="Kyverno"  "Version"="v1.2.1"
I1125 05:02:55.309534       1 version.go:18]  "msg"="Kyverno"  "BuildHash"="(HEAD/af07e66edea834bf8bbc12580b57345ff9cb543a"
I1125 05:02:55.309657       1 version.go:19]  "msg"="Kyverno"  "BuildTime"="2020-10-22_08:08:42PM"
I1125 05:02:55.310031       1 config.go:84] CreateClientConfig "msg"="Using in-cluster configuration"
I1125 05:02:55.311343       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
E1125 05:02:55.464826       1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1125 05:03:00.467669       1 memcache.go:199] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

The metrics server is up and running.

metrics-server-58fdc875d5-bb8wt 1/1 Running 0 46h

@dirien commented Nov 25, 2020

Unfortunately a restart of the Kyverno pod is not helping at all. Somehow Kyverno gets stuck...

@dirien commented Nov 25, 2020

The only thing that helps is to reinstall Kyverno. Then everything works fine:

I1125 05:15:56.654509       1 config.go:84] CreateClientConfig "msg"="Using in-cluster configuration"
I1125 05:15:56.655921       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
I1125 05:15:59.062480       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="ClusterPolicy"
I1125 05:15:59.062937       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="ClusterPolicyViolation"
I1125 05:15:59.063264       1 util.go:76] CRDInstalled "msg"="CRD found"  "kind"="PolicyViolation"
I1125 05:15:59.065051       1 client.go:280] dclient "msg"="schema not found"  "kind"="NamespacedPolicyViolation"
I1125 05:15:59.065143       1 dynamicconfig.go:255] ConfigData "msg"="Init resource "  "excludeRoles"=""
I1125 05:15:59.066248       1 certificates.go:26] dclient "msg"="Building key/certificate pair for TLS"
I1125 05:15:59.458919       1 certificates.go:169] dclient/CAcert "msg"="secret updated"  "name"="kyverno-svc.kyverno.svc.kyverno-tls-ca" "namespace"="kyverno"
I1125 05:15:59.748016       1 certificates.go:217] dclient/WriteTlsPair "msg"="secret updated"  "name"="kyverno-svc.kyverno.svc.kyverno-tls-pair" "namespace"="kyverno"
I1125 05:15:59.748056       1 registration.go:252] WebhookRegistrationClient "msg"="removing prior webhook configurations"
I1125 05:16:02.065753       1 controller.go:210] GenerateCleanUpController "msg"="starting"
I1125 05:16:02.065760       1 controller.go:326] PolicyController "msg"="starting"
I1125 05:16:02.065786       1 informer.go:109] PolicyCacheController "msg"="starting"
I1125 05:16:02.065817       1 generator.go:148] PolicyViolationGenerator "msg"="start"
I1125 05:16:02.065837       1 controller.go:232] GenerateController "msg"="starting"
I1125 05:16:02.065827       1 controller.go:107] EventGenerator "msg"="start"
I1125 05:16:02.442666       1 dynamicconfig.go:197] ConfigData "msg"="Updated resource filters" "name"="kyverno" "namespace"="kyverno" "newFilters"=[{"Kind":"SelfSubjectAccessRevi
I1125 05:16:02.654660       1 rwebhookregister.go:107] ResourceWebhookRegister "msg"="created validating webhook"  "name"="kyverno-resource-validating-webhook-cfg"
I1125 05:16:02.853627       1 rwebhookregister.go:82] ResourceWebhookRegister "msg"="created mutating webhook"  "name"="kyverno-resource-mutating-webhook-cfg"
I1125 05:16:03.057849       1 rwebhookregister.go:82] ResourceWebhookRegister "msg"="created mutating webhook"  "name"="kyverno-resource-mutating-webhook-cfg"
I1125 05:16:04.231436       1 server.go:516] WebhookServer "msg"="starting"

@realshuting (Member)

Found similar issues: kubernetes-sigs/metrics-server#188 and kubernetes-sigs/metrics-server#448. Is there any error in metrics-server?

Are you running an EKS cluster? If so, do you have the flag --kubelet-preferred-address-types=InternalIP set? See detail.

It would be helpful if you could share your metrics-server configuration; I'll try to reproduce the same on my end.

@dirien commented Nov 25, 2020

Hi,

I stumbled on the same issues :D regarding the metrics server. But this one looks fine, just some logs regarding the ACI virtual nodes:

E1125 05:20:46.018975       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:21:19.537495       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:21:46.079975       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:22:19.739157       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:22:46.026240       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory
E1125 05:23:19.936476       1 reststorage.go:135] unable to fetch node metrics for node "virtual-node-aci-linux": no metrics known for node
E1125 05:23:46.024302       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:virtual-node-aci-linux: unable to get valid timestamp for metric point for node "10.34.0.7", discarding data: no non-zero timestamp on either CPU or memory

Reinstalling Kyverno without touching the metrics server works fine.

I am running an AKS (1.18.x) stack, and the metrics-server deployment is managed by Microsoft.

Here is a snippet of the deployment:


   containers:
      - command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        image: mcr.microsoft.com/oss/kubernetes/metrics-server:v0.3.6
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:

@realshuting (Member)

OK, I installed the metrics-server on my cluster; so far it all looks good. I'll get back to you if I see the error.

spec:
  containers:
  - command:
    - /metrics-server
    - --source=kubernetes.summary_api:https://kubernetes.default?kubeletHttps=true&kubeletPort=10250&insecure=true
    image: k8s.gcr.io/metrics-server-amd64:v0.2.1
    imagePullPolicy: Always
    name: metrics-server

@dirien commented Nov 25, 2020

curious to see the results... :)

@dirien commented Nov 30, 2020

Still crashing... with

I1130 07:05:59.751872       1 client.go:257] dclient/Poll "msg"="starting registered resources sync"  "period"=300000000000
E1130 07:05:59.870142       1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:04.872074       1 memcache.go:199] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.874085       1 memcache.go:111] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.876580       1 memcache.go:111] couldn't get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1130 07:06:09.876956       1 main.go:52] resourcecache "msg"="error in udateGVRCache function" "error"="unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
E1130 07:06:09.877002       1 main.go:112] setup "msg"="Failed to create resource cache" "error"="unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

Is this related to #1325?

@realshuting (Member)

Yes, it's related to #1324; you can track the fix in PR #1342.

For the slowness issue, I left the metrics server running in my cluster for over a day, and I could see that Kyverno stopped working, or rather, behaved relatively slowly. We suspect it is due to the flood of throttled requests, see #1319.

We are applying several fixes to see if they resolve the slowness issue; I'll test this again after the fix.
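
In the meantime, a quick way to check whether Kyverno itself is being throttled client-side is to grep its logs (a sketch; the exact log wording may vary):

kubectl -n kyverno logs deploy/kyverno | grep -i throttl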

@dirien commented Dec 1, 2020

Thanks @realshuting for the feedback. I could see the throttling too, under flux2:

flux get helmreleases -A
I1201 05:52:38.575285   14972 request.go:645] Throttling request took 1.149248624s, request: GET:https://xxx.westeurope.azmk8s.io:443/apis/monitoring.coreos.com/v1?timeout=32s

@realshuting (Member)

@dirien
Can you upgrade Kyverno to v1.3.0-rc6? You should not see the crashing anymore.

And we have trimmed some of the throttling requests, which should potentially help solve the slowness issue.

@realshuting (Member)

@dirien Any updates on this?

@realshuting (Member)

Closing this.

Feel free to reopen if needed.
