Excessive pod startup times #1297
Any idea where the time is spent? I've been looking into the pod startup latency, which might account for a fair share of the observed delay.
No specifics other than the ideas (Istio init container and sidecar injection) discussed in Slack. It would be great to get some metrics to dig into this. Cold-start times are a concern for FaaS use cases. /assign @georgeharley
@nikkithurmond and I have been deep-diving on this. The pod gets scheduled and added to the Endpoints object as unready within a second. Surprisingly, it then takes a long time for the readiness checks to go healthy (6 seconds: 3 retries at 2-second intervals). The app under test starts in 600ms, so the latency is somewhere in (1) starting the Docker containers, (2) Envoy initialization, or (3) the Docker networking that wires the containers together into the pod.
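For reference, the readiness-probe arithmetic behind that 6-second window looks roughly like the sketch below. It uses the k8s.io/api/core/v1 probe type with assumed values; check the revision's generated deployment for the real probe settings.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical probe settings matching the timing observed above (2s period,
	// ~3 probe cycles before Ready). Illustrative values only, not necessarily
	// what the controller generates.
	probe := corev1.Probe{
		InitialDelaySeconds: 0,
		PeriodSeconds:       2,
		TimeoutSeconds:      1,
		SuccessThreshold:    1,
		FailureThreshold:    3,
	}

	// If the app needs ~600ms to come up but the first probes fire before it is
	// listening, time-to-Ready is roughly:
	//   initialDelay + (failed probes * period) + one successful probe cycle
	failedProbes := int32(2)
	estimate := probe.InitialDelaySeconds + failedProbes*probe.PeriodSeconds + probe.PeriodSeconds
	fmt.Printf("estimated seconds until the pod flips to Ready: %d\n", estimate) // 6
}
```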
Another part of the latency is 2-5 seconds for iptables programming. Once the pod goes healthy, the Endpoints object updates its ready set and kube-proxy notices the change. kube-proxy then programs iptables with the pod's IP address behind the service. We will also be digging into this interval to figure out where the time goes. Is it in batched updates from the kubelet? Is it in the actual iptables programming?
@nikkithurmond, are any of your results in a state where they would be useful to share? It would be nice to get a second opinion on our findings, and maybe to divide the work.
The iptables programming should happen very quickly on average (some burst is allowed), but it is rate-limited and batched if the burst rate is exceeded. If there are a lot of updates happening it could hit the batch backoff, which I think is 10 seconds, so 5 seconds on average makes sense.
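To make the rate-limited/batched behaviour concrete, here is a minimal sketch of the coalesce-and-backoff pattern described above. It is an illustration only, not kube-proxy's actual implementation, and the 10-second interval is simply the backoff figure quoted in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// batchedSyncer coalesces sync requests and enforces a minimum interval between
// runs of the sync function.
type batchedSyncer struct {
	pending chan struct{}
}

func newBatchedSyncer(minInterval time.Duration, sync func()) *batchedSyncer {
	b := &batchedSyncer{pending: make(chan struct{}, 1)}
	go func() {
		for range b.pending {
			sync()                  // program iptables once for all coalesced requests
			time.Sleep(minInterval) // rate limit: later requests wait out the interval
		}
	}()
	return b
}

// Request asks for a sync. Bursts collapse into one pending request, so a caller
// arriving mid-interval waits anywhere from 0 up to minInterval -- about 5s on
// average for a 10s interval, which matches the delay observed in this issue.
func (b *batchedSyncer) Request() {
	select {
	case b.pending <- struct{}{}:
	default: // a sync is already queued; coalesce
	}
}

func main() {
	s := newBatchedSyncer(10*time.Second, func() {
		fmt.Println("iptables rules reprogrammed at", time.Now().Format(time.RFC3339))
	})
	for i := 0; i < 5; i++ { // simulate a burst of endpoint updates
		s.Request()
		time.Sleep(500 * time.Millisecond)
	}
	time.Sleep(12 * time.Second) // let the coalesced sync run
}
```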
This could be fuel to the fire for changes in kube-proxy, or even for recommending IPVS mode rather than iptables mode (which should program faster).
@thockin thanks for the info. Do you know if the iptables programming backoff/retry wait time is configurable? For some use cases (e.g. user-facing functions) a 10s wait will feel like a denial of service.
Here's a detailed timeline of a single call to a scaled-to-zero revision: https://docs.google.com/spreadsheets/d/1QfhSPvNu_LXzTx3cMjLUkpdr1UGT7krUAKbmU1WLPO4/edit?usp=sharing
Thanks @nikkithurmond, this is really great data!
/kind bug
@nikkithurmond @josephburnett hey - can you provide me access to that spreadsheet? jeder@redhat.com. Also, can you share any tooling used to generate the timeline? What was the environment setup (hardware/software versions)?
This seems related to the more specific bug #1345, "Envoy adds 5-6 seconds to pod startup".
@jeremyeder, done. And I shared with elafros-dev@googlegroups.com as well. @glyn, #1345 is part of this, but there are other issues as well, such as the 3-second delay propagating the ready pod's IP address to all nodes. We'll cut a separate issue for that specifically as we get more details. How about we keep this issue for discussing the overall problem?
+1. Let's post links to the more specific issues here so we can keep track of what's going on.
@glyn, roger. I've started a list in the top-level issue description.
Some analysis of the Envoy startup delay in a little more detail: when the Envoy GrpcMuxImpl is started it tries to open a new gRPC bidi stream to request resource discovery details. That initial invocation can fail, and the stream is only re-established after Envoy's retry delay expires. Below is a typical snippet of log from an affected Envoy sidecar:
Not sure if there are any plans to make the Envoy retry delay configurable in the near future. It certainly seems like it would be useful to do so. GrpcMuxImpl source available here.
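For anyone reading along, the effect of a fixed retry delay on startup can be sketched like this. Envoy's real logic is the C++ GrpcMuxImpl linked above; the 5-second value here is an assumption consistent with the 5-6s reported in #1345, not a verified Envoy constant.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// establishStream stands in for opening the ADS bidi stream; here it simply
// fails until the control plane is "up". Purely illustrative.
func establishStream(controlPlaneReady func() bool) error {
	if !controlPlaneReady() {
		return errors.New("upstream not available yet")
	}
	return nil
}

func main() {
	start := time.Now()
	ready := func() bool { return time.Since(start) > 1*time.Second } // pilot up after ~1s

	// With a fixed retry delay, a failure at t=0 is not retried until t=retryDelay,
	// even though the control plane became reachable well before that.
	const retryDelay = 5 * time.Second // assumption, see lead-in
	for {
		if err := establishStream(ready); err != nil {
			fmt.Printf("t=%v stream failed (%v); retrying in %v\n",
				time.Since(start).Round(time.Millisecond), err, retryDelay)
			time.Sleep(retryDelay)
			continue
		}
		fmt.Printf("t=%v stream established\n", time.Since(start).Round(time.Millisecond))
		return
	}
}
```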
Nice find George! Any idea why?
@georgeharley this is a very nice finding. I've forwarded this to the Istio perf team to see if they can assist in customizing this retry.
@josephburnett no idea just yet. Will post here if I find anything.
@georgeharley, actually will you post on #1345? We'll keep this issue to track the whole end-to-end startup latency. Thanks!
Minor nit: we should be using knative-dev@googlegroups.com for ACLs now.
For the 3-5 seconds between the time the Pod becomes Ready and the time the revision's Service is accessible: I found that if we switch to using an Istio VirtualService instead of a real K8s Service for the revisions, I see a propagation delay of less than 1 second most of the time.
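A small polling harness along these lines is enough to timestamp when a revision actually becomes routable and so measure that propagation delay; the URL is a placeholder, not anything defined in this thread.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical target URL -- substitute the revision's route (and any Host
	// header your ingress needs).
	const url = "http://helloworld.default.example.com"

	client := &http.Client{Timeout: 500 * time.Millisecond}
	start := time.Now()
	for {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 500 {
				fmt.Printf("first routable response (HTTP %d) after %v\n",
					resp.StatusCode, time.Since(start).Round(time.Millisecond))
				return
			}
		}
		time.Sleep(100 * time.Millisecond) // poll until the route is actually programmed
	}
}
```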
* Set the activator timeout to 60s. There is a bug in Istio 0.8 that prevents the timeout from being set to more than 15 seconds. The bug is now fixed at HEAD but not yet released. 15 seconds is too short for our 0->1 case (see #1297). This applies the workaround suggested in http://github.com/istio/istio/issues/6230 to set the activator timeout to 60s.
* Update typo
@josephburnett Let's make sure that there are tracking issues coming out of this work and we can close this uber issue.
I think we have cut this down into several pieces meanwhile and can close this overarching issue. Please reopen if you feel differently. /close
@markusthoemmes: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Expected Behavior
Revisions scaled to zero should take under a second to start up and serve traffic (assuming images are cached).
Actual Behavior
Observed startup times for the helloworld sample appear to be >10s
Steps to Reproduce the Problem
Additional Info
I'm running a development build on localhost in Docker for Mac; expect performance of release builds on GKE to be different.
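A minimal way to reproduce the measurement is to time a single request against the scaled-to-zero helloworld route, for example as below (the hostname is a placeholder for whatever your route resolves to):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder route for the helloworld sample after it has scaled to zero;
	// substitute the hostname (and Host header, if routing through the gateway).
	const url = "http://helloworld.default.example.com"

	client := &http.Client{Timeout: 60 * time.Second} // cold starts can exceed 10s
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("cold-start response %d after %v\n",
		resp.StatusCode, time.Since(start).Round(time.Millisecond))
}
```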
Specific issues