With default settings, a single request causes flipping activator in/out of path with min-scale=2 #11926
Comments
Funky! I'm not opposed to changing the default. Another observation: should we change the default TBC to 0 if containerConcurrency is 0? Seems useless to have the activator in path, like, ever in that case.
Noob question: how can one see this?
@sdhoward
I experienced this same bug(?)/config mishap using the default autoscaler capacity options. This was also causing istiod to do full pushes every time, which used a lot of CPU.
This is indeed the exact problem we were debugging, which led us to track the above down. FWIW, we set the default TBC to -1 (activator always in path) in our environments, reserving TBC != -1 for use cases where a user knows they need the (pretty minor, tbh) latency gain of removing the activator from the path, and we have so far seen no issues with that config. I think I might open a PR to change this default, and we can discuss whether it makes sense to make it -1, or just have it kick in at a higher and less accidentally-available value.
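For anyone wanting to replicate that setup, a minimal sketch of the cluster-wide override (assuming the stock `config-autoscaler` ConfigMap in the `knative-serving` namespace):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # -1 keeps the activator in the request path permanently,
  # trading a small latency overhead for stable endpoints
  target-burst-capacity: "-1"
```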
Should we have a follow-up issue to smooth out / provide optionality for how the activator transitions in/out of the data path?
Would introducing hysteresis help to smooth out that kind of fluctuation? I mean not having a single value for switching the activator in and out of the path, but separate thresholds for each direction.

Also (not sure that helps), avoiding having cc be a divisor of the burst capacity might help to avoid this kind of situation (but this is just a gut feeling, needs to be thought through).

Setting burst capacity to -1 effectively kills a key feature (one that has been documented in detail not only on knative.dev but also e.g. in "Knative in Action") for everyone except insiders who could re-enable it. Also, as Knative is a very opinionated approach, we should have an opinion here too instead of surrendering (and potentially removing the setting).
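To make the hysteresis idea concrete, here is a minimal sketch in Go; this is not Knative's actual decider code, and the type, field names, and thresholds are all hypothetical:

```go
package main

import "fmt"

// pathDecider is a hypothetical two-threshold hysteresis: the activator is
// only removed from the path once spare capacity comfortably exceeds the
// burst target, and only re-inserted once it falls well below it.
type pathDecider struct {
	inPath        bool
	removeAbove   float64 // spare capacity needed to take the activator out
	reinsertBelow float64 // spare capacity at which it goes back in
}

func (d *pathDecider) update(spare float64) bool {
	switch {
	case d.inPath && spare > d.removeAbove:
		d.inPath = false
	case !d.inPath && spare < d.reinsertBelow:
		d.inPath = true
	}
	return d.inPath
}

func main() {
	d := &pathDecider{inPath: true, removeAbove: 220, reinsertBelow: 180}
	// a single in-flight request (spare 200 -> 199) no longer flips the state
	for _, spare := range []float64{200, 199, 200, 199, 230, 225, 179} {
		fmt.Printf("spare=%v activatorInPath=%v\n", spare, d.update(spare))
	}
}
```

With separate enter/exit thresholds, the one-request wobble around 200 described in this issue would no longer toggle the activator's state.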
I don't really buy that, FWIW. You might as well say that the fact we don't set min-scale, max-scale, init-scale, scale-down-delay, RPS, or even containerConcurrency by default means those features are removed.
I'm not sure how we can square saying we have an opinionated approach with the ton of other optional and non-default features, knobs, and twiddles in the autoscaler, to be fair :-). Should we remove RPS and min-scale, too?

All I'm really proposing here is that, like min-scale or RPS or scale-down-delay or containerConcurrency, the point where the activator should be removed from the path should be set based on your actual workload and latency needs. The (I'd argue really quite small number of) people with latency requirements so extreme that they can't cope with the activator in the path (but can handle an ingress/mesh and queue-proxy in the path) probably shouldn't use the default min-scale=0 either, for example. But it's nice that we can support them, so I personally wouldn't want to remove the feature.

(Having said all this, like I say, I'd be OK just bumping the number if people are super attached to keeping the automated removal. My experience, though, is that simpler and more understandable defaults are better, and explaining when, with the current defaults, the activator will or will not be affecting the load balancing, for example, is extremely tough, and that's not good.)
FWIW, the recommendation I'd suggest we make, if we make it non-default rather than just bumping it to a higher number, would be very similar to min-scale, RPS, scale-down-delay, etc.: use it when you have a workload that requires it, ideally after measuring empirically. (In this case: if your workload can't cope with the latency overhead of having the activator in the path, you can set TBC to determine when the activator, and the features that rely on it like better load balancing and buffering, will be removed from the path.)

To state the obvious, but worth saying I think: all of this is just about the out-of-the-box defaults in open-source Knative. Any platform/product/deployment can easily customise this if they know their workloads need super-low latency even at the cost of some complexity and endpoint churn, just like they can customise the default containerConcurrency etc.
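For the per-workload route suggested above, a sketch of the per-revision override (the service name is hypothetical; the annotation keys are the ones documented for the Knative autoscaler):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: latency-sensitive-svc  # hypothetical workload
spec:
  template:
    metadata:
      annotations:
        # opt this revision out of activator buffering/load balancing,
        # after measuring that the latency overhead actually matters
        autoscaling.knative.dev/target-burst-capacity: "0"
        autoscaling.knative.dev/min-scale: "2"
```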
This issue is stale because it has been open for 90 days with no activity.
/reopen
@psschwei: Reopened this issue.
* change default target burst capacity

  More discussion in: #11926 #12241

  With the current defaults, setting min-scale=2 causes the activator to flip in/out of path on each request. This PR sets the default TBC to 210 so as to not be a direct multiple of the container concurrency target default, and thus make it harder to trigger the activator flipageddon.

* typo

* use prime number for tbc to guarantee no flipping

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
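For reference, the value that ultimately shipped was a prime (211 in current releases), so that no whole-pod multiple of the default per-pod target of 100 lands exactly on the threshold the way 2 × 100 = 200 did. A sketch of what that looks like in `config-autoscaler`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # 211 is prime, so total capacity (pods x 100 by default) can never
  # sit exactly on the removal threshold
  target-burst-capacity: "211"
```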
Fixed in #12774
@psschwei: Closing this issue.
What happened?
This is kind of fun, but also causes legit problems for us (likely due to Istio not handling stuff well under load, to be fair) that took a while to track down: with the default settings for TBC, we move the activator in/out of the path when spare capacity goes over/under 200 requests. 200 requests also happens to be the exact burst capacity of the system when you have 2 pods (e.g. min-scale=2) and cc=0 or 100 (because when cc=0 we take capacity to be 100, and 2 × 100 capacity = 200). This means it takes exactly one request to push a min-scale=2 revision over burst capacity with the default settings, and as soon as that request finishes, we go back under capacity. The result for a basic test in our environment (and, unfortunately, some real workloads too that saw problems) was us rapidly swapping activators in/out of the service, and 503 errors (still chasing this bit down, but we think it's because Istio certs get out of sync with the endpoints).
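For concreteness, a sketch of the stock `config-autoscaler` values that interact here (assuming the defaults as of this report):

```yaml
# relevant config-autoscaler defaults (sketch)
container-concurrency-target-default: "100"  # capacity assumed per pod when cc=0
target-burst-capacity: "200"                 # spare capacity needed to drop activator
# with min-scale=2: total capacity = 2 pods x 100 = 200
# 0 in-flight requests -> spare = 200, threshold met   -> activator out of path
# 1 in-flight request  -> spare = 199, under threshold -> activator back in path
```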
Steps to Reproduce
Suggested resolution
cc @markusthoemmes @vagababov @psschwei