[BUG] Kyverno slowness makes K9s time out #918
Comments
@sgandon - yes, please share the policies. Not sure which rule types they contain - can this be checked via the CLI? Also, can you run a test with 1.1.6 as well? We are running tests with all of the sample policies and not seeing such a delay. The webhook is configured with a 3s timeout by default.
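For reference, a quick way to inspect the configured webhook timeouts is shown below (a sketch; webhook configuration names vary by Kyverno version):

```sh
# List all admission webhook configurations and their per-webhook timeouts.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,TIMEOUT:.webhooks[*].timeoutSeconds'
```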
I have exactly the same problem on v1.6.0, even without any policy deployed. I also noticed synchronization timeouts when using ArgoCD.
I have been running the latest 1.1.6 for 5 days with those 3 policies, and it works like a charm now, so I guess this has been solved with the latest version, at least on my side :)
One generation policy, and one validation policy.
@mchebitou - this is not expected: if no policies are deployed, Kyverno does not register webhooks for admission control. Can you please provide the API server logs and the Kyverno logs? Thanks!
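A quick way to verify whether any Kyverno webhooks are registered (a sketch, assuming a default install):

```sh
# With no policies deployed, this should return no Kyverno resource webhooks.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i kyverno
```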
@sgandon - thanks for the update and details. Glad to hear that you are not seeing any issues with 1.1.6 in your setup. Any additional data points over the last few days?
I have not used Kyverno lately - I just left it running on the cluster, and it is running OK and no longer preventing K9s from starting.
Unfortunately it took longer, but I am seeing the same misbehavior again. I did not touch anything and left Kyverno (1.1.6) running constantly.
@sgandon - do you happen to have the logs? We will also check on clusters where Kyverno 1.1.6 has been running.
@JimBugwadia yes, I have stored them before deleting the pod. I am just wondering how to send them to you, because they are too big for pastebin.
@sgandon - I will share a GDrive link via slack.
Hi @sgandon, I could not reproduce this slowness issue while deleting/creating resources - what were you trying to do with k9s? Until we figure out the exact issue, we have improved the webhook logic to reduce the processing time when receiving an admission request, see PR #967. We also print out the processing time of each admission request at log level 4. This could potentially help us find the issue.
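To capture those per-request timings, one option is to raise the verbosity of the Kyverno container (a sketch, assuming the Deployment and namespace are both named kyverno; adjust for your install):

```sh
# Append -v=4 to the Kyverno container args (assumes an args array already exists),
# then follow the logs to see the per-request processing times.
kubectl -n kyverno patch deployment kyverno --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"-v=4"}]'
kubectl -n kyverno logs deploy/kyverno -f
```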
@realshuting, the issue happens after a while and prevents k9s from launching. The last time I reproduced the issue, I just left Kyverno running and did nothing at all except launching K9s regularly to check whether it could start or not.
@sgandon - any data on behaviors with 1.1.8?
@JimBugwadia I'll be starting with 1.1.8 to check on this again. I'll let you know.
I'm definitely seeing this behavior on EKS 1.16 + Kyverno 1.1.9 + k9s 0.21.7. If I uninstall Kyverno, k9s works properly. If I re-install, k9s errors out after a long wait with an error message. The discussion in derailed/k9s#718 suggests maybe something where Kyverno is interfering with SubjectAccessReview calls? I also noticed when installing something via "kubectl apply -f" that what would normally be an instantaneous creation of about 20 resources was slowed down considerably (several seconds between creating each resource). I highly suspect Kyverno is massively slowing things down in the same way it's slowing down k9s. Replacing the kyverno pod does nothing to affect the performance or ability to run k9s. Neither does removing all clusterpolicies.
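For anyone trying to quantify this: `kubectl auth can-i` issues the same kind of SelfSubjectAccessReview request that k9s relies on at startup, so it can serve as a rough before/after probe (a sketch; the exact calls k9s makes may differ):

```sh
# Time an authorization check with Kyverno installed, then again after
# uninstalling, and compare the wall-clock times.
time kubectl auth can-i get pods
```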
@mbarrien thanks for the details. Any idea who (which user / group) is invoking the SubjectAccessReview? Kyverno should already be excluding this via its resourceFilters.
This was with a stock Helm installation, so resourceFilters is whatever's in the Helm chart. According to `kubectl get cm -n <non-standard-namespace> kyverno -o yaml`:
(No excludeGroupRole.) For the user: since we're using EKS, it's using aws-auth to map an IAM role to something that is in the group system:masters. Note that between uninstall and reinstall, I manually deleted any leftover bits from Kyverno to try to get it as clean as possible (removed the apiservice and all configmaps, secrets, and CRDs), and this is still happening.
Also for what it's worth, I have Kyverno installed in a completely separate cluster where k9s+kyverno is not exhibiting this behavior. There is nothing really of note that's different between the 2 clusters (they're both fairly empty testing clusters that were created around the same time, with the same versions of the same system applications installed; the only difference is user workloads, none of which do anything notable with Kubernetes other than using it to host standard containers).
In Kyverno's logs at the standard logging level, I see this exactly every minute:
These log messages do not show up in the cluster where k9s+kyverno works properly. EDIT: However, after uninstalling and reinstalling metrics-server, these error messages no longer appear, but k9s still fails with the same error as above when Kyverno is installed.
Further progress on this. I got k9s+kyverno working... but the only thing that got it to work was replacing every node in the cluster (I did a full node rotation). I suspect something was going wrong at the lowest level in the cluster that was somehow blocking traffic from reaching something that backed an apiservice (based on the error messages above, metrics-server was a big suspect). In any event, I still think there's something Kyverno-related here, in that Kyverno is blocking some API call that k9s uses if something Kyverno relies on is broken, even for API calls that are read-only. Unfortunately, I can't help with reproducing any more now that the broken cluster is fixed.
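A quick way to spot a broken aggregated API like this is to check APIService availability (a sketch, assuming metrics-server registers the usual v1beta1.metrics.k8s.io APIService):

```sh
# Any APIService whose AVAILABLE column is False can stall API discovery
# for clients such as k9s.
kubectl get apiservices
# Inspect the status conditions of the metrics-server APIService directly.
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
```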
I am back from my vacation and tried to launch K9s, without success, on my local k3d cluster using version v1.1.8.
So this is happening again after some time.
Closing this issue for now. Feel free to reopen it if needed.
Hi @realshuting, can we reopen this issue again? I get the same error mentioned in the posts above.
Only removing Kyverno gets k9s up and running again.
Which version of Kyverno are you running?
See: derailed/k9s#718 and #918 for more context.
Another thing I want to confirm - was Kyverno installed via Helm?
Hi @realshuting, there is no init-config (Helm install), but there is a ConfigMap with the name kyverno.
Regarding adding [SelfSubjectAccessReview,,] to resourceFilters: done.
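For anyone following along, a sketch of that change (assuming the ConfigMap is named kyverno in the kyverno namespace, as above):

```sh
# Show the current filters, then edit the ConfigMap and prepend
# [SelfSubjectAccessReview,,] to the existing resourceFilters value.
kubectl -n kyverno get cm kyverno -o jsonpath='{.data.resourceFilters}'
kubectl -n kyverno edit cm kyverno
```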
Do you see the same error again?
Not for now... have to wait :)
Also, are you seeing these 3 mutating webhook configurations?
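A quick way to check (the exact webhook configuration names vary by version, so grep is safest):

```sh
# List Kyverno's mutating webhook configurations; three entries are expected here.
kubectl get mutatingwebhookconfigurations | grep -i kyverno
```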
And yes, I can see the three webhooks :)
Seems the error is gone now; leaving the issue open for a few more days. @dirien - feel free to update the issue if you see it again.
Unfortunately it came back... I noticed this happens when the Kyverno pod crashes for unknown reasons:
Logs are:
kube-metrics is up and running.
Unfortunalty a restart of the kyverno pod is not helping at all. Somehow kyverno get stuck... |
The only thing that helps is to reinstall Kyverno. Then everything works fine:
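(A plausible clean-reinstall sequence - release name, namespace, and chart location are assumptions based on the Helm install mentioned earlier:)

```sh
# Remove the release, clear any leftover webhook configurations, then reinstall.
helm uninstall kyverno -n kyverno
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i kyverno
# Delete any Kyverno entries listed above, then:
helm install kyverno kyverno/kyverno -n kyverno
```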
Found similar issues kubernetes-sigs/metrics-server#188 and kubernetes-sigs/metrics-server#448 - is there any error in metrics-server? Are you running an EKS cluster? If so, do you have this flag set? It would be helpful if you could share your metrics-server configuration; I'll try to reproduce the same on my end.
Hi, I stumbled on the same issues :D regarding the metrics server. But this one looks fine - just some logs regarding the ACI virtual nodes:
Reinstalling Kyverno without touching kube-metrics works fine. I am running an AKS (1.18.x) stack, and the kube-metrics deployment is managed by Microsoft. Here is the snippet of the deployment:
OK, I installed the metrics-server on my cluster; so far it all looks good. I'll get back to you if I see the error.
Curious to see the results... :)
Still crashing... with:
Is this related to #1325?
Yes, it's related to #1324; you can track the fix in PR #1342. For the slowness issue, I have left the metrics server running in my cluster for over a day, and I could see Kyverno stopped working - or rather, behaved relatively slowly. We suspect it is due to the flood of throttled requests, see #1319. We are patching several fixes to see if they resolve the slowness issue; I'll test this again after the fix.
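One way to look for that client-side throttling in the logs (a sketch; the exact log wording depends on the client-go version):

```sh
# client-go prints "Throttling request took ..." messages when it rate-limits itself.
kubectl -n kyverno logs deploy/kyverno | grep -i throttling
```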
Thanks @realshuting for the feedback - I could see the throttling under flux2 too:
@dirien We have also trimmed some of the throttling requests, which should help resolve the slowness issue.
@dirien Any updates on this?
Closing this. Feel free to reopen if needed.
Describe the bug
I tracked down this bug for a while, and it happens on v1.1.5. I know you have been working hard since, and I have not had the time to test the latest versions, but I will.
I used the Helm chart to install Kyverno.
I also installed 3 policies.
What I observed after a while is that K9s (a CLI to manage clusters) was failing to start because of a timeout.
I tried to debug it, and it turns out one POST request was taking 2 to 3 seconds. This made the tool time out.
After uninstalling Kyverno, the same POST takes 4 milliseconds.
To Reproduce
Steps to reproduce the behavior:
2bf51494
Expected behavior
K9s should start fine.
Additional context
Here are the k9s (k8s client) logs: