Scale issues over 1k ksvc with an external default domain #13247

Closed · daraghlowe opened this issue Aug 23, 2022 · 7 comments
Labels: kind/bug

@daraghlowe

What version of Knative?

1.6.0 Knative
1.14.3 Istio

Expected Behavior

We can use an external domain name as the default domain for our ksvc without the time it takes for the Ingress to become ready deteriorating once we reach 1200 ksvc.

Actual Behavior

We have a GKE cluster with 1200 ksvc, and now when we create a new revision of any of the ksvc, we have to wait around 3 minutes (it varies, and at times can take 10 minutes) for the Ingress to become ready.

We are using Knative 1.5.0 with Istio 1.14.3 on this cluster but have also tested with Knative 1.6.0 on a test cluster.

We upgraded Knative and Istio one minor release at a time over a period of two weeks, taking Knative from 0.26.0 to 1.5.0 and Istio from 1.12.6 to 1.14.3. We noticed the problem on this cluster around that time; we have other clusters with several hundred ksvc that were not affected in the same way. It's possible the issue existed to some degree previously and we didn't detect it. Our testing on test clusters did seem to point to the issue being introduced during the upgrade from Istio 1.11.x to 1.12.x.

We also noticed that our kube-dns pods went into CrashLoopBackOff with messages that they were receiving too many concurrent requests and running out of memory. We increased the number of replicas and that mitigated the issue, but there seems to be some connection between the upgrades to the newer versions and the volume of DNS requests on the cluster.

Steps to Reproduce the Problem

We reproduced the problem on a test cluster by creating the cluster with the latest versions of Knative and Istio (no mesh) and setting the default domain in the config-domain ConfigMap in knative-serving to:

test.XX.XX.XX.XX.sslip.io: ""
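For anyone unfamiliar with the setting, a minimal sketch of what that ConfigMap looks like (the sslip.io address is a placeholder mirroring the redacted value above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  # Placeholder external domain; any ksvc without a more specific
  # domain selector gets this as its default URL suffix.
  test.XX.XX.XX.XX.sslip.io: ""
```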

Add 1200 ksvc with kperf (usually it's somewhere around the 800 mark that things start to slow down), as sketched below. While the ksvc are being added, the time for the Ingress to become ready gets steadily worse. After it settles back down, when I add one ksvc or change a ksvc, it takes over 60 seconds for the Ingress to become ready.
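For illustration, a kperf invocation of this shape; the flag names follow kperf's `service generate` README and may differ between kperf versions, and the prefixes/namespaces here are hypothetical:

```bash
# Create 1200 ksvc in batches of 20 with 10 concurrent creators,
# spread across 10 namespaces; treat all names as illustrative.
kperf service generate -n 1200 -b 20 -c 10 \
  --namespace-prefix ktest --namespace-range 1,10 \
  --svc-prefix kbase1k --min-scale 0 --max-scale 1
```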

If I change the default domain in config-domain to svc.cluster.local, all the Ingresses and ksvc reconcile, and then I can create a new ksvc and everything, including the Ingress, is ready in 1 second (with zero initial scale set).
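For reference on the "zero initial scale" setting: in Knative this takes a cluster-level opt-in plus a per-revision annotation. A minimal sketch, with a service name borrowed from the route output below and a placeholder image:

```yaml
# Cluster-level opt-in, set in the config-autoscaler ConfigMap
# (knative-serving namespace):
#   allow-zero-initial-scale: "true"
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: kbase1k-97              # example name taken from the routes below
  namespace: headless-customer
spec:
  template:
    metadata:
      annotations:
        # Start new revisions with zero pods instead of one.
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go  # placeholder image
```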

Checking the routes using the command below, I can see that all the port 8080 and 8443 routes are gone and the Ingress no longer has an external route.

istioctl proxy-config routes deploy/istio-ingressgateway.istio-system
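To narrow that output to a single ksvc, the same command can be piped through grep (the service name here is taken from the examples further down):

```bash
istioctl proxy-config routes deploy/istio-ingressgateway.istio-system \
  | grep kbase1k-97
```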

I was able to add all of these routes back as domain mappings instead (1,700 of them), and the time to ready remains really quick at 1 second when I add a ksvc/revision or a new domain mapping.
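A sketch of one such DomainMapping, using a hostname and namespace from the route listing below; note the API version was serving.knative.dev/v1beta1 around Knative 1.6 (earlier releases used v1alpha1):

```yaml
apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  # The mapped hostname doubles as the object name.
  name: kbase1k-97.default.34.140.49.163.sslip.io
  namespace: headless-customer
spec:
  ref:
    apiVersion: serving.knative.dev/v1
    kind: Service
    name: kbase1k-97
```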

The only difference I can see between the two configurations is the routes in Istio:

This is an example of one ksvc with the sslip.io domain as the default domain:

NAME                                                               DOMAINS                                                                           MATCH                  VIRTUAL SERVICE
https.443.https-server.knative-ingress-gateway.knative-serving     kbase1k-97.default.34.140.49.163.sslip.io, kbase1k-97.headless-customer         /*                     kbase1k-97-ingress.headless-customer
http.8081                                                          kbase1k-97.default.34.140.49.163.sslip.io, kbase1k-97.headless-customer         /*                     kbase1k-97-ingress.headless-customer
http.8080                                                          kbase1k-97.default.34.140.49.163.sslip.io, kbase1k-97.headless-customer         /*                     kbase1k-97-ingress.headless-customer

https.443.https-server.knative-ingress-gateway.knative-serving     kbase1k-97.wpe.34.140.49.163.sslip.io                                            /*                     kbase1k-97.wpe.34.140.49.163.sslip.io-ingress.headless-customer
http.8080                                                          kbase1k-97.wpe.34.140.49.163.sslip.io                                            /*                     kbase1k-97.wpe.34.140.49.163.sslip.io-ingress.headless-customer

This is with the default domain set to svc.cluster.local:

NAME                                                               DOMAINS                                    MATCH                  VIRTUAL SERVICE
http.8081                                                          kbase2k-97.headless-customer               /*                     kbase2k-97-ingress.headless-customer

https.443.https-server.knative-ingress-gateway.knative-serving     kbase2k-97.wpe.34.140.49.163.sslip.io      /*                     kbase2k-97.wpe.34.140.49.163.sslip.io-ingress.headless-customer
http.8080                                                          kbase2k-97.wpe.34.140.49.163.sslip.io      /*                     kbase2k-97.wpe.34.140.49.163.sslip.io-ingress.headless-customer

The domain mappings are present on the cluster in both of the examples above.

I tried turning on debug logging on the ingress gateway proxy and could see lots of DNS requests to resolve knative-local-gateway.istio-system.svc.cluster.local. I'm not sure whether that's normal.
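For anyone reproducing this, raising the Envoy log level on the gateway can be done with istioctl; a sketch (the `dns` scope is an Envoy logger name, so double-check against your istioctl version):

```bash
# Set all Envoy loggers on the ingress gateway to debug:
istioctl proxy-config log deploy/istio-ingressgateway.istio-system --level debug

# Or scope it to DNS lookups only:
istioctl proxy-config log deploy/istio-ingressgateway.istio-system --level dns:debug
```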

Any help or direction would be appreciated; I'm not sure if this is a bug or a misconfiguration somehow. Thanks!

daraghlowe added the kind/bug label on Aug 23, 2022
@psschwei (Contributor)

related #13201

Out of curiosity, did you test this with a non-Istio networking layer? (no worries if not, I'll try with kourier later)

@daraghlowe (Author)

> related #13201
>
> Out of curiosity, did you test this with a non-Istio networking layer? (no worries if not, I'll try with kourier later)

We did do some testing with Kourier and didn't see the same slowness when adding a new ksvc with more than 1k ksvc on the cluster; it took around 6 seconds to become ready. We did see that when we removed a ksvc from the cluster, the Kourier controller would start to reconcile all of the ksvc, and if you tried to add a new ksvc at that point, it would take a long time (around 10 minutes, if I recall correctly) before the ksvc would become ready. I'm assuming the controller finishes its current job of reconciling everything before it reconciles the ingress for the new ksvc that was added...

At the time, however, we weren't aware that this was only happening when the default domain was set to an external domain, so we didn't do any testing with domain mappings at all.

@dprotaso (Member)

dprotaso commented Nov 9, 2022

FYI, someone from your team dug into this - here's a slack thread. But thanks for posting an issue, as it makes this discussion more accessible.

Ingress Ready time is heavily dependent on the underlying networking layer, i.e. Istio/Contour/Kourier.

e.g. Contour had a regression (projectcontour/contour#4058), and even with a fix it takes about 1-2 minutes for a service to be ready when the cluster has ~1000 Knative Services.

@dprotaso (Member)

dprotaso commented Nov 9, 2022

> If I change the default domain in config-domain to svc.cluster.local, all the Ingresses and ksvc reconcile, and then I can create a new ksvc and everything, including the Ingress, is ready in 1 second (with zero initial scale set).

We also gate the Ingress being ready by probing proxies and ensuring the networking is rolled out. I'm curious if having zero initial scale set is what's causing this to happen so quickly.

@daraghlowe (Author)

Thanks @dprotaso

To add some details from that Slack conversation: we did some further testing where we created a ksvc with the default domain as svc.cluster.local, then added a domain mapping with an external domain and waited for both to become ready. There wasn't actually a noticeable difference between "default domain is svc.cluster.local + external domain mapping" and "default domain is an external domain".

We did see a big improvement in the time for the Ingress to become ready when we upgraded to Istio 1.15, so that has helped a lot. Here are some graphs from testing one of my colleagues did:

Istio 1.14.x
[graph: time for Ingress to become ready]

Istio 1.15.x
[graph: time for Ingress to become ready]

@daraghlowe (Author)

I'm going to close this out as the problem appears to have been Istio-related.

@dprotaso (Member)

dprotaso commented Nov 9, 2022

Wow that's a big perf gain!
