Handle Bursty Load by optionally using "Max" rather than "Average" of buckets, or via an all-pod scale down retention window #9092
Thinking out loud, another potential approach here would be to implement something similar to the existing `lastPodRetentionTime`, but for all pods, not just the last one. This would mean we don't scale down until we've observed a lower concurrency for at least N seconds. That way you can set the window so we correctly scale up for the burst (e.g. you could set a 1s window for a very bursty workload), and then avoid accidentally scaling down too quickly afterwards because of the averaging, by setting an idleScaleDownTime of, e.g., 60 seconds.
Is there a way to avoid this? While some auto-scaling flags make sense for the ksvc owner to touch (e.g. `cc`, because if their code just can't handle certain values we need a way to know that), I'm not sure the user can know what kind of load (bursty or not) will hit their ksvc. The actions that cause the load are often out of their control.
I think the second idea above, of extending the existing `lastPodRetentionTime` […]
Well, it still requires some configuration, but @duglin, I don't think there's a magic bullet, exactly due to the fact that you mentioned: we can't really predict all possible shapes of traffic, so there's no one-size-fits-all.
(note: edited since the first comment became the actual proposal here).
Background
Currently we scale based on an average of all the "buckets" (1-second periods) over the previous window (60 seconds by default; I'll ignore panic mode in this discussion for simplicity, as I don't think it changes anything important).
This works well in a web-app scenario where load is roughly constant (to be a bit mathematical, where load is essentially continuous rather than discrete), but it can lead to under-provisioning for more bursty workloads, like the one reported in #8390. In that case we have a workload with a natural concurrency (of e.g. 10), but the trigger fires every 30 seconds. Because we average over the window, we see one bucket with a concurrency of 10 and 59 buckets with a concurrency of 0. The average therefore ends up way lower than the number of workers (i.e. 10) we'd ideally want for this workload.
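To make the arithmetic concrete, here's an illustrative sketch (plain Python, not Knative's actual autoscaler code; the target of 1 request per pod is an assumed value) showing how window-averaging under-provisions the bursty case while a per-window max would not:

```python
# Illustrative sketch of the averaging problem described above.
# One 1s bucket sees the burst (concurrency 10); the other 59 see 0.
import math

window = 60                     # stable window, in 1-second buckets
buckets = [0.0] * window
buckets[0] = 10.0               # the burst lands in a single bucket

target_per_pod = 1.0            # assumed target concurrency per pod

avg = sum(buckets) / len(buckets)                 # 10 / 60 ~= 0.17
desired_avg = math.ceil(avg / target_per_pod)     # averaging: 1 pod

peak = max(buckets)
desired_max = math.ceil(peak / target_per_pod)    # max: 10 pods

print(desired_avg)   # 1  -- far fewer than the 10 workers the burst needs
print(desired_max)   # 10
```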
Additionally, many workloads (not just bursty ones) want to keep warm containers around for a while in case more requests come in, to avoid paying cold-start penalties, but without having to keep them around forever, as minScale requires.
Proposal (hoisted from first comment):
Add a `scale-down-delay` which works like `lastPodRetentionTime`, but for all pods, not just the last one.

Proposal Doc: https://docs.google.com/document/d/1ECm1Ervw6DxV6__i71NfUsRjO7l6-RYlhYPhcDqPx3A/edit#.
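One way to sketch the proposed behaviour (a hypothetical illustration, not the actual implementation; the class and method names are made up): scale up immediately, but only scale down to the maximum recommendation seen over the delay window, so a drop in traffic must persist for the whole delay before pods go away.

```python
# Sketch of a scale-down delay: take the max of recent scale
# recommendations so scale-ups apply immediately, while scale-downs
# only take effect once every recommendation in the window agrees.
from collections import deque

class ScaleDownDelayer:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.history = deque()   # (timestamp, recommended_scale) pairs

    def observe(self, now, recommended):
        self.history.append((now, recommended))
        # Drop recommendations older than the delay window.
        while self.history and self.history[0][0] <= now - self.delay:
            self.history.popleft()
        # The effective scale is the max over the retained window.
        return max(scale for _, scale in self.history)

delayer = ScaleDownDelayer(delay_seconds=60)
print(delayer.observe(0, 10))    # burst: scale to 10 immediately
print(delayer.observe(30, 0))    # recommendation drops, but we hold at 10
print(delayer.observe(61, 0))    # burst aged out of the window: drop to 0
```

This generalizes the `lastPodRetentionTime` idea from the last pod to all pods, which is exactly the shape of the comment quoted at the top of this issue.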
Previous Proposal (for Posterity):
For this type of workload, the Simplest Solution That Could Work may be to take the largest observed 1s bucket concurrency over the stable window, rather than the average. This handles the problem of huge over-provisioning when lots of very small requests happen to overlap, because we're still averaging inside the 1-second buckets, but it does over-fit peaks in the data more than you'd want for a web-app workload. Therefore, the proposal here is to add an annotation that lets users opt in to this behaviour when they have a bursty workload.
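The within-bucket averaging point can be shown with a small back-of-the-envelope sketch (illustrative numbers only): concurrency inside a 1s bucket is the time-weighted average of in-flight requests, so many tiny overlapping requests still produce a small bucket value.

```python
# 100 requests of 10ms each landing inside the same 1-second bucket
# contribute a time-weighted bucket concurrency of ~1, not 100, so
# taking the max over buckets doesn't explode for tiny requests.
requests = 100
duration_ms = 10                  # each request lasts 10 milliseconds
bucket_concurrency = requests * duration_ms / 1000.0
print(bucket_concurrency)         # 1.0
```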
(Note: simply setting the window smaller doesn't do what you'd want - e.g. a 1s window would avoid the averaging over 60 seconds, but would be even worse because most of the time the average would be 0).
/assign @vagababov @markusthoemmes @duglin