Handle Bursty Load by optionally using "Max" rather than "Average" of buckets, or via an all-pod scale down retention window #9092
Thinking out loud, another potential approach here would be to implement something similar to the existing `lastPodRetentionTime`, but for all pods, not just the last one. This would mean we don't scale down until we've observed a lower concurrency for at least N seconds. That way you can set the window so we correctly scale up for the burst (e.g. you could set a 1s window for a very bursty workload), and then avoid accidentally scaling down too quickly afterwards because of the averaging, by setting an idleScaleDownTime of, e.g., 60 seconds.
Is there a way to avoid this? While some auto-scaling flags make sense for the ksvc owner to touch (e.g. `cc`, because if their code just can't handle certain values we need a way to know that), I'm not sure the user can know what kind of load (bursty or not) will hit their ksvc. The actions that cause the load are often out of their control.
I think the second idea above, of extending the existing `lastPodRetentionTime` […]
Well, it still requires some configuration, but @duglin, I don't think there's a magic bullet, exactly due to the fact that you mentioned: we can't really predict all possible shapes of traffic, so there's no one-size-fits-all.
(note: edited since the first comment became the actual proposal here).
Background
Currently we scale based on an average of all the "buckets" (1-second periods) over the previous window (60 seconds by default; I'll ignore panic mode in this discussion for simplicity, as I don't think it changes anything important).
This works well in a web-app scenario where load is roughly constant (to be a bit mathematical, where load is essentially continuous rather than discrete), but it can lead to under-provisioning for more bursty workloads, like the one reported in #8390. In that case we have a workload with a natural concurrency (of e.g. 10), but the trigger fires every 30 seconds. Because we average over the window, we see one bucket with a concurrency of 10 and 59 buckets with a concurrency of 0. The average therefore ends up way lower than the number of workers (i.e. 10) we'd ideally want for this workload.
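To make the arithmetic concrete, here's an illustrative sketch (plain Python, not Knative's actual autoscaler code; the target of 1 request per pod is an assumed value) showing how window-averaging under-provisions the bursty case while a per-window max would not:

```python
# Illustrative sketch of the averaging problem described above.
# One 1s bucket sees the burst (concurrency 10); the other 59 see 0.
import math

window = 60                     # stable window, in 1-second buckets
buckets = [0.0] * window
buckets[0] = 10.0               # the burst lands in a single bucket

target_per_pod = 1.0            # assumed target concurrency per pod

avg = sum(buckets) / len(buckets)                 # 10 / 60 ~= 0.17
desired_avg = math.ceil(avg / target_per_pod)     # averaging: 1 pod

peak = max(buckets)
desired_max = math.ceil(peak / target_per_pod)    # max: 10 pods

print(desired_avg)   # 1  -- far fewer than the 10 workers the burst needs
print(desired_max)   # 10
```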
Additionally, many workloads (not just bursty ones) want to keep warm containers around for a while in case more requests come in, to avoid paying cold-start penalties, but without having to keep them around forever, as minScale requires.
Proposal (hoisted from first comment):
Add a `scale-down-delay` which works like `lastPodRetentionTime`, but for all pods, not just the last one.

Proposal Doc: https://docs.google.com/document/d/1ECm1Ervw6DxV6__i71NfUsRjO7l6-RYlhYPhcDqPx3A/edit#.
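One way to sketch the proposed behaviour (a hypothetical illustration, not the actual implementation; the class and method names are made up): scale up immediately, but only scale down to the maximum recommendation seen over the delay window, so a drop in traffic must persist for the whole delay before pods go away.

```python
# Sketch of a scale-down delay: take the max of recent scale
# recommendations so scale-ups apply immediately, while scale-downs
# only take effect once every recommendation in the window agrees.
from collections import deque

class ScaleDownDelayer:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.history = deque()   # (timestamp, recommended_scale) pairs

    def observe(self, now, recommended):
        self.history.append((now, recommended))
        # Drop recommendations older than the delay window.
        while self.history and self.history[0][0] <= now - self.delay:
            self.history.popleft()
        # The effective scale is the max over the retained window.
        return max(scale for _, scale in self.history)

delayer = ScaleDownDelayer(delay_seconds=60)
print(delayer.observe(0, 10))    # burst: scale to 10 immediately
print(delayer.observe(30, 0))    # recommendation drops, but we hold at 10
print(delayer.observe(61, 0))    # burst aged out of the window: drop to 0
```

This generalizes the `lastPodRetentionTime` idea from the last pod to all pods, which is exactly the shape of the comment quoted at the top of this issue.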
Previous Proposal (for Posterity):
For this type of workload, the Simplest Solution That Could Work may be to take the largest observed 1s bucket concurrency over the stable window, rather than the average. This handles the problem of huge over-provisioning when lots of very small requests happen to overlap, because we're still averaging inside the 1-second buckets, but it does over-fit peaks in the data more than you'd want for a web-app workload. Therefore, the proposal here is to add an annotation that lets users opt in to this behaviour when they have a bursty workload.
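The within-bucket averaging point can be shown with a small back-of-the-envelope sketch (illustrative numbers only): concurrency inside a 1s bucket is the time-weighted average of in-flight requests, so many tiny overlapping requests still produce a small bucket value.

```python
# 100 requests of 10ms each landing inside the same 1-second bucket
# contribute a time-weighted bucket concurrency of ~1, not 100, so
# taking the max over buckets doesn't explode for tiny requests.
requests = 100
duration_ms = 10                  # each request lasts 10 milliseconds
bucket_concurrency = requests * duration_ms / 1000.0
print(bucket_concurrency)         # 1.0
```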
(Note: simply setting the window smaller doesn't do what you'd want - e.g. a 1s window would avoid the averaging over 60 seconds, but would be even worse because most of the time the average would be 0).
/assign @vagababov @markusthoemmes @duglin