Scale issues over 1k ksvc with an external default domain #13247
Comments
Related: #13201. Out of curiosity, did you test this with a non-Istio networking layer? (No worries if not, I'll try with Kourier later.)
We did do some testing with Kourier and didn't see the same slowness when adding a new ksvc with more than 1k ksvc on the cluster; the new ksvc took around 6 seconds to become ready. We did see that when we removed a ksvc from the cluster, the Kourier controller would start to reconcile all of the ksvc, and if you tried to add a new ksvc at that point it would take a long time (around 10 minutes, if I recall correctly) before the ksvc became ready. I'm assuming the controller finishes its current job of reconciling everything before it reconciles the ingress for the newly added ksvc. At the time we weren't aware that this was only happening when the default domain was set to an external domain, so we didn't do any testing with domain mappings.
FYI, someone from your team dug into this - here's a slack thread. But thanks for posting an issue, as it makes this discussion more accessible. Ingress Ready time is heavily dependent on the underlying networking layer (Istio/Contour/Kourier). For example, Contour had a regression (projectcontour/contour#4058), and even with a fix it takes about 1-2 minutes for a service to be ready when the cluster has ~1000 Knative Services.
We also gate the Ingress being ready by probing proxies and ensuring the networking is rolled out. I'm curious if …
Thanks @dprotaso. To add some details from that slack conversation: we did some further testing where we created a ksvc with the default domain as svc.cluster.local, then added a domain mapping with an external domain and waited for both to become ready. There wasn't actually a noticeable difference between "default domain is svc.cluster.local + external domain mapping" and "default domain is external domain". We did see a big improvement in the time for the ingress to become ready when we upgraded to Istio 1.15, so that has helped a lot. Here are some graphs from testing one of my colleagues did:
I'm going to close this out as the problem appears to have been Istio related. |
Wow that's a big perf gain! |
What version of Knative?
Expected Behavior
We can use an external domain name as the default domain for our ksvc without a deterioration in the time it takes for the Ingress to become ready once we reach 1200 ksvc.
Actual Behavior
We have a GKE cluster with 1200 ksvc, and now when we create a new revision of any of the ksvc, we have to wait around 3 minutes (it varies and at times can take 10 minutes) for the Ingress to become ready.
We are using Knative 1.5.0 with Istio 1.14.3 on this cluster but have also tested with Knative 1.6.0 on a test cluster.
We upgraded Knative and Istio one minor release at a time over a period of two weeks, so that Knative went from 0.26.0 to 1.5.0 and Istio from 1.12.6 to 1.14.3. We noticed the problem on this cluster around this time; we have other clusters with several hundred ksvc that were not affected in the same way. It's possible the issue existed to some degree previously and we didn't detect it. Our testing on test clusters did seem to point to the issue appearing during the upgrade from 1.11.x to 1.12.x.
We also noticed that our kube-dns pods went into CrashLoopBackOff with messages that they were getting too many concurrent requests and running out of memory. We increased the number of replicas and this mitigated that issue, but there seems to be some connection between the upgrades to the newer versions and the volume of DNS requests on the cluster.
Steps to Reproduce the Problem
We reproduced the problem on a test cluster by creating the cluster with the latest versions of Knative and Istio (no mesh) and setting the default domain in the config-domain ConfigMap in knative-serving to an external domain:
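For illustration only (the exact domain we used isn't reproduced here), a config-domain set to a sslip.io-style external wildcard domain would look something like this:

```sh
# Hedged sketch: make an external domain the cluster-wide default.
# "35.x.x.x.sslip.io" is a placeholder; substitute the real external domain.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  # Any route without a more specific match gets this domain suffix.
  35.x.x.x.sslip.io: ""
EOF
```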
Add 1200 ksvc (usually it's somewhere around the 800 mark that things start to slow down) with kperf. While the ksvc are being added, the time for the Ingress to become ready gets steadily worse. After it settles back down, when I add one ksvc or change a ksvc, it takes over 60 seconds for the Ingress to become ready.
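As a hedged sketch of an equivalent load (not the exact kperf invocation we used), a plain kn loop like the following creates a similar number of services; the names and image are placeholders:

```sh
# Create 1200 Knative Services and wait for each to become ready.
# Zero initial scale assumes allow-zero-initial-scale: "true" in config-autoscaler.
for i in $(seq 1 1200); do
  kn service create "scale-test-$i" \
    --image gcr.io/knative-samples/helloworld-go \
    --annotation autoscaling.knative.dev/initial-scale=0 \
    --wait-timeout 600
done
```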
If I change the default domain in config-domain to svc.cluster.local, all the ingresses and ksvc reconcile, and then I can create a new ksvc and everything, including the ingress, is ready in 1 second (with zero initial scale set). Checking the routes using the command below, I can see that all the port 8080 and 8443 routes have gone and the ingress no longer has an external route.
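The exact command isn't reproduced here; a plausible way to inspect the gateway's route table for the 8080/8443 external routes is something like:

```sh
# Hypothetical route check against the ingress gateway pod (names assumed).
GW_POD=$(kubectl -n istio-system get pod -l app=istio-ingressgateway \
  -o jsonpath='{.items[0].metadata.name}')
# Dump the Envoy route configuration and count the 8080/8443 route entries.
istioctl proxy-config routes "$GW_POD.istio-system" | grep -cE '8080|8443'
```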
I was able to add all 1700 of these routes back as domain mappings instead, however, and the time to ready stays quick at around 1 second when I add a ksvc/revision or a new domain mapping.
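For reference, each of those mappings is just a DomainMapping object pointing an external hostname at a ksvc; a minimal sketch (hostname and service name are illustrative) looks like:

```sh
# Minimal DomainMapping sketch; "hello.example.com" and "hello" are placeholders.
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  name: hello.example.com   # the external hostname to map
  namespace: default
spec:
  ref:
    name: hello             # the Knative Service receiving the traffic
    kind: Service
    apiVersion: serving.knative.dev/v1
EOF
```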
The only difference I can see between the two configurations is the routes in Istio:
This is an example of one ksvc with the sslip.io domain as the default domain:
This is with the default domain set to svc.cluster.local:
The domain mappings are present on the cluster in both of the examples above.
I tried turning on debug mode for the logs on the ingress gateway proxy and could see lots of DNS requests to resolve knative-local-gateway.istio-system.svc.cluster.local. I'm not sure if that's normal or not.
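The exact steps aren't shown above; one way the proxy debug logs can be raised and inspected (an assumption, with names as placeholders) is:

```sh
# Raise Envoy log verbosity on the gateway pod and count DNS lookups of the
# local gateway service.
GW_POD=$(kubectl -n istio-system get pod -l app=istio-ingressgateway \
  -o jsonpath='{.items[0].metadata.name}')
istioctl proxy-config log "$GW_POD.istio-system" --level debug
kubectl -n istio-system logs "$GW_POD" \
  | grep -c 'knative-local-gateway.istio-system.svc.cluster.local'
```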
Any help or direction would be appreciated; I'm not sure if this is a bug or a misconfiguration somehow. Thanks!