
Handle Bursty Load by optionally using "Max" rather than "Average" of buckets, or via an all-pod scale down retention window #9092

Closed
julz opened this issue Aug 18, 2020 · 4 comments · Fixed by #9626
Labels: kind/feature (Well-understood/specified features, ready for coding.)

Comments

julz (Member) commented Aug 18, 2020

(Note: this issue has been edited, since the idea from the first comment below became the actual proposal here.)

Background

Currently we scale based on an average of all of the "buckets" (1-second periods) over the previous window (60 seconds by default; I'll ignore panic mode for this discussion for simplicity, since I don't think it changes anything important).

This works well in a web-app scenario where load is roughly constant (to be a bit mathematical, where load is essentially continuous rather than discrete), but can lead to under-provisioning for more bursty workloads, like the one reported in #8390. In this case we have a workload with a natural concurrency (of e.g. 10), but the trigger fires every 30 seconds. Because we average over the window, we see one bucket with a concurrency of 10, and 59 buckets with a concurrency of 0. The average therefore ends up way way lower than the number of workers (i.e. 10) we'd ideally want for this workload.
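
To make the arithmetic concrete, here is a minimal sketch of the dilution effect (illustrative only, not the actual Knative autoscaler code; the window length, bucket values and per-pod target are assumptions):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		windowSeconds     = 60  // stable window: one bucket per second
		targetConcurrency = 1.0 // per-pod concurrency target (illustrative)
	)

	// Bursty workload: one bucket sees concurrency 10, the other 59 see 0.
	buckets := make([]float64, windowSeconds)
	buckets[0] = 10

	sum := 0.0
	for _, b := range buckets {
		sum += b
	}
	avg := sum / float64(windowSeconds)

	desired := math.Ceil(avg / targetConcurrency)
	fmt.Printf("average concurrency: %.2f, desired pods: %.0f\n", avg, desired)
	// Prints: average concurrency: 0.17, desired pods: 1
	// even though the burst itself would ideally be served by ~10 pods.
}
```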

Additionally, many workloads - not just bursty ones - want to keep warm containers around for a while in case more requests come in, to avoid paying cold-start penalties, but without having to keep them around forever, as minScale requires.

Proposal (hoisted from first comment):

Add a scale-down-delay which works like lastPodRetentionTime but for all pods, not just the last one.

Proposal Doc: https://docs.google.com/document/d/1ECm1Ervw6DxV6__i71NfUsRjO7l6-RYlhYPhcDqPx3A/edit#.
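
As a rough illustration of the shape of this (my own sketch, not the design in the proposal doc or the eventual implementation): one simple way to realise a scale-down delay is to feed the autoscaler's raw desired-pod recommendation through a sliding max window, so increases take effect immediately while decreases only apply once they have persisted for the whole delay. The type names and the one-Record-call-per-second granularity below are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// delayWindow keeps one raw recommendation per second of delay and
// reports the max over that window, so scale-downs are deferred until
// the lower recommendation has been seen for the whole delay.
type delayWindow struct {
	buckets []int32
	next    int
}

func newDelayWindow(delay time.Duration) *delayWindow {
	n := int(delay.Seconds())
	if n < 1 {
		n = 1
	}
	return &delayWindow{buckets: make([]int32, n)}
}

// Record is assumed to be called once per second with the autoscaler's
// raw recommendation; it returns the value to actually act on.
func (w *delayWindow) Record(raw int32) int32 {
	w.buckets[w.next] = raw
	w.next = (w.next + 1) % len(w.buckets)
	max := w.buckets[0]
	for _, v := range w.buckets[1:] {
		if v > max {
			max = v
		}
	}
	return max
}

func main() {
	w := newDelayWindow(5 * time.Second)
	// A burst asks for 10 pods, then the raw recommendation drops to 0:
	// the delayed recommendation holds at 10 for the length of the delay.
	for _, raw := range []int32{10, 0, 0, 0, 0, 0, 0} {
		fmt.Printf("%d ", w.Record(raw))
	}
	fmt.Println() // Output: 10 10 10 10 10 0 0
}
```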

Previous Proposal (for Posterity):

For this type of workload, the Simplest Solution That Could Work may be to take the largest observed 1s bucket concurrency over the stable window, rather than the average. This handles the problem of huge over-provisioning if lots of very small requests happen to overlap, because we're still averaging inside the 1-second buckets, but it does over-fit peaks in the data more than you'd want for a web-app workload. Therefore, the proposal here is to add an annotation that lets a user opt in to this behaviour when they have a bursty workload.
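
For contrast with the averaging sketch above, here is a small illustration (again made-up numbers, not autoscaler code) of why max-over-buckets tracks a burst but over-fits a single spike in an otherwise steady web-app load:

```go
package main

import "fmt"

func main() {
	// Steady web-app-style load hovering around concurrency 10,
	// with one brief 1-second spike to 30.
	buckets := make([]float64, 60)
	for i := range buckets {
		buckets[i] = 10
	}
	buckets[17] = 30

	sum, max := 0.0, 0.0
	for _, b := range buckets {
		sum += b
		if b > max {
			max = b
		}
	}
	// average ≈ 10.3: close to the real working set.
	// max     = 30.0: provisions for the single spike for the whole window.
	fmt.Printf("average = %.1f, max = %.1f\n", sum/float64(len(buckets)), max)
}
```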

(Note: simply setting the window smaller doesn't do what you'd want - e.g. a 1s window would avoid the averaging over 60 seconds, but would be even worse because most of the time the average would be 0).

/assign @vagababov @markusthoemmes @duglin

julz (Member, Author) commented Aug 18, 2020

Thinking out loud, another potential approach here would be to implement something similar to the existing lastPodRetentionTime, but for all pods, not just the last one. This would mean we don't scale down until we've observed a lower concurrency for at least N seconds. That way you can set the window so we correctly scale up for the burst (e.g. you could set a 1s window for a very bursty workload), and then avoid accidentally scaling down too quickly afterwards because of the averaging, by setting an idleScaleDownTime of, e.g., 60 seconds.

duglin commented Aug 18, 2020

Therefore, the proposal here is to add an annotation that lets a user opt in to this behaviour when they have a bursty workload.

Is there a way to avoid this? While some auto-scaling flags make sense for the ksvc owner to touch (e.g. cc, because if their code just can't handle certain values we need a way to know that), I'm not sure the user can know what kind of load (bursty or not) will hit their ksvc. The actions that cause the load are often out of their control.

julz (Member, Author) commented Aug 18, 2020

Is there a way to avoid this?

I think the second idea above - extending the existing scale-to-zero-pod-retention-period flag into a (configurable, defaultable) all-pods retention period, so that we wouldn't scale down any pod until we've seen reduced concurrency for that many seconds - would be a way of handling both bursty and non-bursty loads without a new flag (other than the retention-time flag itself, which seems like something an operator could reasonably set - to e.g. 60 seconds - for both types of workloads, so a user wouldn't need to care unless they want to).

vagababov (Contributor) commented:

Well, it still requires some configuration, but @duglin I don't think there's a magic bullet, exactly due to the fact that you mentioned: we can't really predict all possible shapes of traffic, so there's no one-size-fits-all.

julz changed the title from 'Handle Bursty Load by optionally using "Max" rather than "Average" of buckets' to 'Handle Bursty Load by optionally using "Max" rather than "Average" of buckets, or via an all-pod scale down retention window' on Aug 18, 2020.