With default settings, a single request flips the activator in/out of the path with min-scale=2 #11926

Closed
julz opened this issue Sep 6, 2021 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@julz
Member

julz commented Sep 6, 2021

What happened?

This is kind of fun, but it also caused legitimate problems for us (likely because istio doesn't handle the resulting churn well under load, to be fair) that took a while to track down: with the default settings for TBC we move the activator in/out of the path when spare capacity goes over/under 200 requests. 200 also happens to be the exact burst capacity of the system when you have 2 pods (e.g. min-scale=2) and cc=0 or 100 (because when cc=0 we take capacity to be 100, and 2 x 100 = 200). This means it takes exactly 1 request to push a min-scale=2 revision over burst capacity with the default settings, and as soon as that request finishes we go back under capacity. The result, for a basic test in our environment (and unfortunately for some real workloads that saw problems too), was rapid swapping of the activator in/out of the service and 503 errors (we're still chasing this bit down, but we think it's because the istio certs get out of sync with the endpoints).
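
To spell out the arithmetic (a rough sketch of the excess-capacity check; the autoscaler's exact accounting may differ slightly), the sign flips on every single request:

  excess capacity ≈ (ready pods x per-pod target) - TBC - in-flight requests
  idle:         2 x 100 - 200 - 0 =  0   -> Serve (activator out of path)
  one request:  2 x 100 - 200 - 1 = -1   -> Proxy (activator back in path)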

Steps to Reproduce

  1. Create a service with min-scale = 2 (example manifest below)
  2. Keep everything else at the default (i.e. cc=0, but you can also set cc=100)
  3. Curl a few times in series. Notice SKS will flap back and forth between Serve and Proxy. In normal cases this won't directly cause user issues, other than unneeded churn and maybe (but I haven't proved this) some perf impact.
  4. Optional: with mesh enabled and a heavily loaded istio gateway, you will quite consistently see requests fail with 503s and cert errors in the logs as things fail to keep up (this is clearly an istio rather than a knative problem, but we probably shouldn't be causing this load - we have set TBC=-1 now, which seems to have fixed the problem)
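
A minimal manifest that should reproduce this (a sketch - the name and image are placeholders, and depending on your version the annotation may be spelled minScale):

  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: flapping-demo                        # placeholder name
  spec:
    template:
      metadata:
        annotations:
          # 2 pods x the default per-pod target of 100 = 200 total capacity, exactly the default TBC
          autoscaling.knative.dev/min-scale: "2"
      spec:
        containers:
          - image: gcr.io/knative-samples/helloworld-go   # any container will do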

Suggested resolution

  • It may make sense to tweak some of the defaults so that it's less easy to accidentally hit cases like this (where one request nudges you over/under) out of the box.
  • We might want to debounce the activator swapping, i.e. once we've put it in the path we may want to keep it there for some period, and/or require that we've been within burst capacity for some period of time before taking it out of the path again.
  • Probably controversial, but: personally I'm unconvinced that most users should need to care about TBC, or that the potential perf advantages outweigh the UX complexity (and churn, and need for scraping) for most workloads. I'd be in favour of defaulting it to -1 so people can opt in to the activator going in/out of the path if they have a specific need for it, but otherwise we just keep it in the path (example override below). I think it's great as an advanced setting that gives operators the ability to swap it out when they have perf concerns that require it, but it's not something we should be doing for the common case.
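
For anyone who wants to opt out today, the operator-level override looks roughly like this (a sketch against the config-autoscaler ConfigMap; the per-revision autoscaling.knative.dev/target-burst-capacity annotation does the same thing for a single service):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: config-autoscaler
    namespace: knative-serving
  data:
    # -1 keeps the activator in the request path permanently, so the SKS never flips between modes
    target-burst-capacity: "-1"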

cc @markusthoemmes @vagababov @psschwei

@julz julz added the kind/bug Categorizes issue or PR as related to a bug. label Sep 6, 2021
@markusthoemmes
Contributor

Funky!

I'm not opposed to defaulting to -1 in the future, though it'd be neat to quantify the impact that would have on existing envs (which will be tough, as it's very reliant on the workloads running). Maybe it's time to finally try to measure the activator's overhead?

Another observation: should we change the default TBC to 0 if containerConcurrency is 0? It seems useless to ever have the activator in the path in that case.

@sdhoward

Notice SKS will flap back and forth between Serve and Proxy.

Noob question, how can one see this?

@markwinter

@sdhoward
You can see it with kubectl get --watch sks. You can also check the autoscaler logs.

@markwinter

markwinter commented Nov 5, 2021

I experienced this same bug(?)/config-mishap

Using default autoscaler capacity options and minScale: 2, the SKS kept flipping between proxy and serve mode with intermittent single requests to a service.

This was also causing istiod to do full pushes every time, which used a lot of CPU

@julz
Member Author

julz commented Nov 5, 2021

This was also causing istiod to do full pushes every time, which used a lot of CPU

This is indeed the exact problem we were debugging, and it's what led us to track the above down.

FWIW, we set the default TBC to -1 (activator always in the path) in our environments, reserving TBC != -1 for use cases where a user knows they need the (pretty minor, tbh) latency gain of removing the activator from the path, and we have so far seen no issues with that config.

I think I might open a PR to change this default, and we can discuss whether it makes sense to make it -1 or just have it kick in at a higher, less accidentally-hit value.

@dprotaso
Member

dprotaso commented Nov 9, 2021

Should we have a follow-up issue to smooth out / provide options for how the activator transitions in/out of the data path?

@rhuss
Contributor

rhuss commented Nov 9, 2021

Would introducing hysteresis help to smooth out that kind of fluctuation? I mean not having a single value for switching between Serve and Proxy, but different values for the transitions Serve -> Proxy and Proxy -> Serve (maybe add a parameter burstHysteresis = 10 and then switch back and forth at 200-10 and 200+10).
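
Purely as an illustration of the shape of that proposal (burst-hysteresis is not a real config key, and the values are made up):

  # config-autoscaler (hypothetical)
  data:
    target-burst-capacity: "200"
    burst-hysteresis: "10"   # hypothetical: flip to Proxy only below 190 spare capacity, back to Serve only above 210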

Also (though I'm not sure it helps), avoiding cc being a divisor of the burst capacity might help to avoid this kind of situation (but this is just a gut feeling and needs to be thought through).

Setting burst capacity to -1 effectively kills a key feature (one that has been documented in detail not only on knative.dev but also e.g. in "Knative in Action"), except for insiders who know how to re-enable it. Also, as Knative is a very opinionated approach, we should have an opinion here instead of surrendering (and potentially remove the Serve mode if we think it's not helpful, instead of leaving the decision to the user). Anyway, if we keep it we need to give a recommendation on when to use a burst capacity and when not.

@julz
Member Author

julz commented Nov 9, 2021

Setting burst capacity to -1 effectively kills a key feature (one that has been documented in detail not only on knative.dev but also e.g. in "Knative in Action"), except for insiders who know how to re-enable it.

I don't really buy that, fwiw. You might as well say that the fact we don't set min-scale, max-scale, init-scale, scale-down-delay, rps or even containerConcurrency by default means the features are removed.

Also, as Knative is a very opinionated approach, we should have an opinion here instead of surrendering (and potentially remove the Serve mode if we think it's not helpful, instead of leaving the decision to the user).

I'm not sure how we can square saying we have an opinionated approach with the ton of other optional, non-default features, knobs and twiddles in the autoscaler, to be fair :-). Should we remove RPS and min-scale, too? All I'm really proposing here is that - like min-scale or RPS or scale-down-delay or containerConcurrency - the point at which the activator is removed from the path should be set based on your actual workload and latency needs.

The (I'd argue really quite small) number of people with latency requirements so extreme that they can't cope with the activator in the path (but can handle an ingress/mesh and queue-proxy in the path) probably shouldn't use the default min-scale=0 either, for example. But it's nice that we can support them, so I personally wouldn't want to remove the feature.

(Having said all this, like I say, I'd be OK with just bumping the number if people are super attached to keeping the automated removal - my experience, though, is that simpler and more understandable defaults are better, and explaining when, with the current defaults, the activator will or will not be affecting load balancing, for example, is extremely tough, and that's not good.)

Anyway, if we keep it we need to give a recommendation on when to use a burst capacity and when not.

FWIW, the recommendation I'd suggest we make, if we make it non-default rather than just bumping it to a higher number, would be very similar to min-scale, RPS, scale-down-delay etc: use it when you have a workload that requires it, ideally after measuring empirically (in this case, if your workload can't cope with the latency overhead of having the activator in the path, you can set TBC to determine when the activator - and the features that rely on it, like better load balancing and buffering - will be removed from the path).
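
Per workload, that would look something like this (a sketch using the per-revision annotation; the values are illustrative):

  spec:
    template:
      metadata:
        annotations:
          # "-1" = always keep the activator in the path; "0" = remove it as soon as any pod is up;
          # a positive number = remove it once at least that much spare capacity exists
          autoscaling.knative.dev/target-burst-capacity: "-1"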

To state the obvious - but worth saying I think - all of this is just about the out of the box defaults in open source knative. Any platform/product/deployment can easily customise this if they know their workloads need super low latency even at the cost of some complexity and endpoint churn, just like they can customise default containerConcurrency etc.

@github-actions

github-actions bot commented Feb 8, 2022

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2022
@psschwei
Contributor

/reopen
/remove-lifecycle-stale

@knative-prow-robot
Contributor

@psschwei: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle-stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2022
psschwei added a commit to psschwei/serving that referenced this issue Mar 23, 2022
More discussion in:
knative#11926
knative#12241

With the current defaults, setting min-scale=2 causes activator to flip in/out of path on each request. This PR sets the default TBC to 210 so as to not be a direct multiple of the container concurrency target default, and thus make it harder to trigger the activator flipageddon

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
knative-prow-robot pushed a commit that referenced this issue Mar 24, 2022
* change default target burst capacity

More discussion in:
#11926
#12241

With the current defaults, setting min-scale=2 causes activator to flip in/out of path on each request. This PR sets the default TBC to 210 so as to not be a direct multiple of the container concurrency target default, and thus make it harder to trigger the activator flipageddon

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* typo

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* use prime number for tbc to guarantee no flipping

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
@psschwei
Contributor

Fixed in #12774
/close

@knative-prow-robot
Contributor

@psschwei: Closing this issue.

In response to this:

Fixed in #12774
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
