With default settings, a single request flips the activator in/out of the path with min-scale=2 #11926

Closed
julz opened this issue Sep 6, 2021 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@julz
Member

julz commented Sep 6, 2021

What happened?

This is kind of fun, but it also caused legitimate problems for us (likely because istio doesn't handle the resulting churn well under load, to be fair) that took a while to track down: with the default settings for TBC we move the activator in/out of the path when spare capacity goes over/under 200 requests. 200 also happens to be the exact burst capacity of the system when you have 2 pods (e.g. min-scale=2) and cc=0 or 100 (because when cc=0 we take capacity to be 100, and 2 x 100 = 200). This means it takes exactly 1 request to push a min-scale=2 revision over burst capacity with the default settings, and as soon as that request finishes we go back under capacity. The result, for a basic test in our environment (and unfortunately for some real workloads that saw problems too), was rapid swapping of the activator in/out of the service and 503 errors (we're still chasing this bit down, but we think it's because the istio certs get out of sync with the endpoints).
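
To spell out the arithmetic (a rough sketch of the excess-capacity check; the autoscaler's exact accounting may differ slightly), the sign flips on every single request:

  excess capacity ≈ (ready pods x per-pod target) - TBC - in-flight requests
  idle:         2 x 100 - 200 - 0 =  0   -> Serve (activator out of path)
  one request:  2 x 100 - 200 - 1 = -1   -> Proxy (activator back in path)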

Steps to Reproduce

  1. Create a service with min-scale = 2 (example manifest below)
  2. Keep everything else at the default (i.e. cc=0, but you can also set cc=100)
  3. Curl a few times in series. Notice SKS will flap back and forth between Serve and Proxy. In normal cases this won't directly cause user issues, other than unneeded churn and maybe (but I haven't proved this) some perf impact.
  4. Optional: with mesh enabled and a heavily loaded istio gateway, you will quite consistently see requests fail with 503s and cert errors in the logs as things fail to keep up (this is clearly an istio rather than a knative problem, but we probably shouldn't be causing this load - we have set TBC=-1 now, which seems to have fixed the problem)
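
A minimal manifest that should reproduce this (a sketch - the name and image are placeholders, and depending on your version the annotation may be spelled minScale):

  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: flapping-demo                        # placeholder name
  spec:
    template:
      metadata:
        annotations:
          # 2 pods x the default per-pod target of 100 = 200 total capacity, exactly the default TBC
          autoscaling.knative.dev/min-scale: "2"
      spec:
        containers:
          - image: gcr.io/knative-samples/helloworld-go   # any container will do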

Suggested resolution

  • It may make sense to tweak some of the defaults so that it's less easy to accidentally hit cases like this (where one request nudges you over/under) out of the box.
  • We might want to debounce the activator swapping, i.e. once we've put it in the path we may want to keep it there for some period, and/or require that we've been within burst capacity for some period of time before taking it out of the path again.
  • Probably controversial, but: personally I'm unconvinced that most users should need to care about TBC, or that the potential perf advantages outweigh the UX complexity (and churn, and need for scraping) for most workloads. I'd be in favour of defaulting it to -1 so people can opt in to the activator going in/out of the path if they have a specific need for it, but otherwise we just keep it in the path (example override below). I think it's great as an advanced setting that gives operators the ability to swap it out when they have perf concerns that require it, but it's not something we should be doing for the common case.
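
For anyone who wants to opt out today, the operator-level override looks roughly like this (a sketch against the config-autoscaler ConfigMap; the per-revision autoscaling.knative.dev/target-burst-capacity annotation does the same thing for a single service):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: config-autoscaler
    namespace: knative-serving
  data:
    # -1 keeps the activator in the request path permanently, so the SKS never flips between modes
    target-burst-capacity: "-1"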

cc @markusthoemmes @vagababov @psschwei

@julz julz added the kind/bug Categorizes issue or PR as related to a bug. label Sep 6, 2021
@markusthoemmes
Contributor

Funky!

I'm not opposed to defaulting to -1 in the future, though it'd be neat to quantify the impact that would have on existing envs (which will be tough, as it's very reliant on the workloads running). Maybe it's time to finally try to measure the activator's overhead?

Another observation: should we change the default TBC to 0 if containerConcurrency is 0? It seems useless to ever have the activator in the path in that case.

@sdhoward

Notice SKS will flap back and forth between Serve and Proxy.

Noob question, how can one see this?

@markwinter

@sdhoward
You can see it with kubectl get --watch sks. You can also check the autoscaler logs.

@markwinter

markwinter commented Nov 5, 2021

I experienced this same bug(?)/config-mishap

Using default autoscaler capacity options and minScale: 2, the SKS kept flipping between proxy and serve mode with intermittent single requests to a service.

This was also causing istiod to do full pushes every time, which used a lot of CPU

@julz
Member Author

julz commented Nov 5, 2021

This was also causing istiod to do full pushes every time, which used a lot of CPU

This is indeed the exact problem we were debugging, and it's what led us to track the above down.

FWIW, we set the default TBC to -1 (activator always in the path) in our environments, reserving TBC != -1 for use cases where a user knows they need the (pretty minor, tbh) latency gain of removing the activator from the path, and we have so far seen no issues with that config.

I think I might open a PR to change this default, and we can discuss whether it makes sense to make it -1 or just have it kick in at a higher, less accidentally-hit value.

@dprotaso
Member

dprotaso commented Nov 9, 2021

Should we have a follow-up issue to smooth out / provide options for how the activator transitions in/out of the data path?

@rhuss
Contributor

rhuss commented Nov 9, 2021

Would introducing hysteresis help to smooth out that kind of fluctuation? I mean not having a single value for switching between Serve and Proxy, but different values for the transitions Serve -> Proxy and Proxy -> Serve (maybe add a parameter burstHysteresis = 10 and then switch back and forth at 200-10 and 200+10).
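
Purely as an illustration of the shape of that proposal (burst-hysteresis is not a real config key, and the values are made up):

  # config-autoscaler (hypothetical)
  data:
    target-burst-capacity: "200"
    burst-hysteresis: "10"   # hypothetical: flip to Proxy only below 190 spare capacity, back to Serve only above 210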

Also (though I'm not sure it helps), avoiding cc being a divisor of the burst capacity might help to avoid this kind of situation (but this is just a gut feeling and needs to be thought through).

Setting burst capacity to -1 effectively kills a key feature (one that has been documented in detail not only on knative.dev but also e.g. in "Knative in Action"), except for insiders who know how to re-enable it. Also, as Knative is a very opinionated approach, we should have an opinion here instead of surrendering (and potentially remove the Serve mode if we think it's not helpful, instead of leaving the decision to the user). Anyway, if we keep it we need to give a recommendation on when to use a burst capacity and when not.

@julz
Member Author

julz commented Nov 9, 2021

Setting burst capacity to -1 effectively kills a key feature (one that has been documented in detail not only on knative.dev but also e.g. in "Knative in Action"), except for insiders who know how to re-enable it.

I don't really buy that, fwiw. You might as well say that the fact we don't set min-scale, max-scale, init-scale, scale-down-delay, rps or even containerConcurrency by default means the features are removed.

Also, as Knative is a very opinionated approach, we should have an opinion here instead of surrendering (and potentially remove the Serve mode if we think it's not helpful, instead of leaving the decision to the user).

I'm not sure how we can square saying we have an opinionated approach with the ton of other optional, non-default features, knobs and twiddles in the autoscaler, to be fair :-). Should we remove RPS and min-scale, too? All I'm really proposing here is that - like min-scale or RPS or scale-down-delay or containerConcurrency - the point at which the activator is removed from the path should be set based on your actual workload and latency needs.

The (I'd argue really quite small) number of people with latency requirements so extreme that they can't cope with the activator in the path (but can handle an ingress/mesh and queue-proxy in the path) probably shouldn't use the default min-scale=0 either, for example. But it's nice that we can support them, so I personally wouldn't want to remove the feature.

(Having said all this, like I say, I'd be OK with just bumping the number if people are super attached to keeping the automated removal - my experience, though, is that simpler and more understandable defaults are better, and explaining when, with the current defaults, the activator will or will not be affecting load balancing, for example, is extremely tough, and that's not good.)

Anyway, if we keep it we need to give a recommendation on when to use a burst capacity and when not.

FWIW, the recommendation I'd suggest we make, if we make it non-default rather than just bumping it to a higher number, would be very similar to min-scale, RPS, scale-down-delay etc: use it when you have a workload that requires it, ideally after measuring empirically (in this case, if your workload can't cope with the latency overhead of having the activator in the path, you can set TBC to determine when the activator - and the features that rely on it, like better load balancing and buffering - will be removed from the path).
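
Per workload, that would look something like this (a sketch using the per-revision annotation; the values are illustrative):

  spec:
    template:
      metadata:
        annotations:
          # "-1" = always keep the activator in the path; "0" = remove it as soon as any pod is up;
          # a positive number = remove it once at least that much spare capacity exists
          autoscaling.knative.dev/target-burst-capacity: "-1"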

To state the obvious - but worth saying I think - all of this is just about the out of the box defaults in open source knative. Any platform/product/deployment can easily customise this if they know their workloads need super low latency even at the cost of some complexity and endpoint churn, just like they can customise default containerConcurrency etc.

@github-actions

github-actions bot commented Feb 8, 2022

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2022
@psschwei
Contributor

/reopen
/remove-lifecycle-stale

@knative-prow-robot
Contributor

@psschwei: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle-stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2022
psschwei added a commit to psschwei/serving that referenced this issue Mar 23, 2022
More discussion in:
knative#11926
knative#12241

With the current defaults, setting min-scale=2 causes activator to flip in/out of path on each request. This PR sets the default TBC to 210 so as to not be a direct multiple of the container concurrency target default, and thus make it harder to trigger the activator flipageddon

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
knative-prow-robot pushed a commit that referenced this issue Mar 24, 2022
* change default target burst capacity

More discussion in:
#11926
#12241

With the current defaults, setting min-scale=2 causes activator to flip in/out of path on each request. This PR sets the default TBC to 210 so as to not be a direct multiple of the container concurrency target default, and thus make it harder to trigger the activator flipageddon

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* typo

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>

* use prime number for tbc to guarantee no flipping

Signed-off-by: Paul S. Schweigert <paulschw@us.ibm.com>
@psschwei
Contributor

Fixed in #12774
/close

@knative-prow-robot
Contributor

@psschwei: Closing this issue.

In response to this:

Fixed in #12774
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
