Autoscaler High Availability #2930

Closed

josephburnett opened this issue Jan 17, 2019 · 26 comments

@josephburnett
Contributor

Design Proposal

The KPA-class Autoscaler implementation does a fair amount of data processing, in proportion to the number of Revisions and their relative scale. It processes summary statistics from each Pod every 1 second and computes average concurrency for every Revision every 2 seconds.

The Autoscaler also plays a critical part in the data plane (even though it's part of the control plane) because it ensures revisions are scaled up from 0. Maintaining high availability of the Autoscaler is important.

For these reasons, the Autoscaler must be horizontally scalable. Unlike the Activator, the Autoscaler is stateful because it maintains a running average of concurrency per Revision.

Proposal

I propose that we split the KPA-class PodAutoscaler reconciler into a binary separate from the HPA-class PodAutoscaler reconciler. Then we add a controller status field to the PodAutoscaler, into which the KPA autoscaler will write its Pod name.

When reconciling PodAutoscalers, the KPA will filter for only PAs whose controller field is either empty or set to its own name. As a special case, KPAs will also reconcile PAs whose lastUpdated field is more than 60 minutes in the past.

This allows the Autoscaler job to be replicated. KPAs will fight over new PAs by trying to write their name on them. One will win as a result of Kubernetes optimistic concurrency control. The autoscaler that wins will start scraping that PodAutoscaler's Pods and start scaling the Deployment. At every resync (30 minutes) KPAs will write their name again on the PA (or update the lastUpdated field somehow).
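
A minimal sketch of that claim flow, assuming a hypothetical controller/lastUpdated status field pair and a generic update callback standing in for the real PodAutoscaler client (none of these names are the actual Knative API):

```go
// Sketch only: illustrates filter + optimistic-concurrency claim, not real Knative types.
package kpa

import (
	"errors"
	"time"
)

// PodAutoscaler is a pared-down stand-in for the real resource.
type PodAutoscaler struct {
	Name        string
	Controller  string    // pod name of the owning autoscaler, "" if unclaimed
	LastUpdated time.Time // refreshed by the owner on every resync
}

// errConflict stands in for a Kubernetes 409 Conflict on update.
var errConflict = errors.New("optimistic concurrency conflict")

// shouldReconcile implements the filter: reconcile PAs that are unclaimed,
// already ours, or whose owner has gone stale (>60 minutes without an update).
func shouldReconcile(pa *PodAutoscaler, myPodName string, now time.Time) bool {
	switch {
	case pa.Controller == "" || pa.Controller == myPodName:
		return true
	case now.Sub(pa.LastUpdated) > 60*time.Minute:
		return true
	default:
		return false
	}
}

// claim writes our pod name onto the PA. If another autoscaler updated the PA
// first, the update fails with a conflict and we simply walk away from the PA.
func claim(pa *PodAutoscaler, myPodName string, now time.Time, update func(*PodAutoscaler) error) error {
	pa.Controller = myPodName
	pa.LastUpdated = now
	if err := update(pa); errors.Is(err, errConflict) {
		return nil // someone else won the race; they own this PA now
	} else if err != nil {
		return err
	}
	// We own the PA: start scraping its pods and scaling its deployment.
	return nil
}
```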

Scaling Up

When we have some idea about how many Revisions a given KPA can handle, we can surface a custom metric revisionCount. We can point a K8s HPA at the custom metric to scale up when that count gets too high on average. Each KPA will stop claiming PAs when its revisionCount has reached the predetermined maximum. The K8s HPA will target a value lower than the maximum so new KPAs will be created and will claim any leftover, unclaimed PAs.
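
A rough sketch of the claim-throttling side of this; the per-pod maximum here is an assumed placeholder, and the real value (plus the lower HPA target) would come from measuring how many Revisions a KPA can actually handle:

```go
// Sketch only: stop claiming new PAs once this pod is at its assumed capacity.
package kpa

import "sync/atomic"

const maxRevisionsPerKPA = 500 // assumed per-pod capacity, not a measured number

type claimCounter struct {
	revisionCount int64 // surfaced as the custom metric revisionCount for the K8s HPA
}

// tryReserve is called before attempting to claim a new PodAutoscaler. Once
// the pod is at capacity it stops claiming, leaving new PAs for the fresh
// replicas the HPA brings up (the HPA targets a value below the maximum).
func (c *claimCounter) tryReserve() bool {
	for {
		n := atomic.LoadInt64(&c.revisionCount)
		if n >= maxRevisionsPerKPA {
			return false
		}
		if atomic.CompareAndSwapInt64(&c.revisionCount, n, n+1) {
			return true
		}
	}
}

// release is called when a claimed PodAutoscaler is deleted or handed back.
func (c *claimCounter) release() {
	atomic.AddInt64(&c.revisionCount, -1)
}
```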

Scaling Down

When the average revisionCount gets low enough, the K8s HPA will scale down the number of KPAs.
When a KPA pod receives SIGTERM, it proactively removes its name from all the PodAutoscalers it claimed, putting them up for grabs again.
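
A minimal sketch of that shutdown path, where releaseAll is a hypothetical helper that would blank the controller field on every PodAutoscaler this pod owns:

```go
// Sketch only: release all claims on SIGTERM so peers can adopt the PAs
// immediately instead of waiting for the 60-minute staleness fallback.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	// ... normal reconcile loop runs until ctx is cancelled ...
	<-ctx.Done()

	log.Print("SIGTERM received, releasing claimed PodAutoscalers")
	if err := releaseAll(context.Background()); err != nil {
		log.Printf("failed to release some PodAutoscalers: %v", err)
	}
}

// releaseAll is a hypothetical helper that would clear the controller field
// on every PA this pod has claimed.
func releaseAll(ctx context.Context) error { return nil }
```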

@knative-prow-robot added the area/autoscale and kind/feature labels on Jan 17, 2019
@lvjing2
Contributor

lvjing2 commented Jan 24, 2019

I think there are 3 major problems we need to consider when there is more than one instance of the autoscaler:

  1. Stats from the same revision should be sent to the same autoscaler instance.
  2. Each autoscaler instance should aggregate the stats for a small and stable subset of revisions.
  3. Scaling the autoscaler up or down should not cause stats to be sent to a different instance while the previous one is still available.

Shall we consider deploying the autoscaler as a StatefulSet rather than a Deployment, and using a consistent hash ring to decide which autoscaler pod each revision sends stats to? The queue-proxy could then use the pod DNS name to set up its connection to that one autoscaler pod.
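
A rough sketch of what the queue-proxy side of this could look like, assuming a StatefulSet named autoscaler with a headless service of the same name in the knative-serving namespace (both names are assumptions). The plain modulo here would reshuffle revisions whenever the replica count changes, which is exactly what the consistent hash ring is meant to avoid (see the later sketch):

```go
// Sketch only: map a revision to a stable StatefulSet pod DNS name.
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	statefulSetName = "autoscaler"      // assumed StatefulSet / headless service name
	namespace       = "knative-serving" // assumed namespace
)

// autoscalerHost picks one replica's headless-service DNS name for a revision,
// e.g. autoscaler-2.autoscaler.knative-serving.svc.cluster.local.
func autoscalerHost(revisionKey string, replicas uint32) string {
	h := fnv.New32a()
	h.Write([]byte(revisionKey))
	ordinal := h.Sum32() % replicas
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local",
		statefulSetName, ordinal, statefulSetName, namespace)
}

func main() {
	fmt.Println(autoscalerHost("default/hello-world-00001", 3))
}
```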

@markusthoemmes
Contributor

@lvjing2 Note that @josephburnett mentioned that the Autoscaler will "start scraping" the deployment. I think the assumption of the proposal is based on the inversion of control the metrics flow is undergoing.

Soonish, we won't be pushing metrics actively anymore; they will be scraped from the autoscaler's side instead.

@lvjing2
Contributor

lvjing2 commented Jan 25, 2019

Oh, that's nice. If so, high availability of the autoscaler will be much easier than before. The problems then become:

  1. Each autoscaler instance should scrape stats from a small, distinct subset of all the deployments.
  2. Scaling the autoscaler up or down should not cause existing autoscaler instances to scrape a different subset of deployments.

I think we don't need to introduce a leader elector with duplicated autoscalers for the same subset of deployments at first.

  1. We can try to deploy the autoscaler as a StatefulSet.
  2. Hash the user service FQDN to find the ordinal of the autoscaler instance.
  3. Each autoscaler instance only scrapes stats from the deployments hashed to it.
  4. Introduce a consistent hashing algorithm so that scaling the autoscaler up or down does not disturb the mapping between autoscalers and their subsets of deployments (see the sketch below).
    If the StatefulSet approach works well, then maybe we don't need multiple autoscaler instances for the same deployment.
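
A toy consistent-hash-ring sketch for point 4, just to show that adding a replica only moves the revisions that land next to the new virtual nodes while everything else keeps its existing owner; it is not the ring the project would actually ship:

```go
// Sketch only: a minimal consistent hash ring with virtual nodes.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> autoscaler pod name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(pods []string, virtualNodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, p := range pods {
		for v := 0; v < virtualNodes; v++ {
			h := hashOf(fmt.Sprintf("%s#%d", p, v))
			r.hashes = append(r.hashes, h)
			r.owner[h] = p
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// lookup returns the autoscaler pod responsible for a revision key.
func (r *ring) lookup(revisionKey string) string {
	h := hashOf(revisionKey)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	before := newRing([]string{"autoscaler-0", "autoscaler-1"}, 16)
	after := newRing([]string{"autoscaler-0", "autoscaler-1", "autoscaler-2"}, 16)
	moved := 0
	for i := 0; i < 100; i++ {
		key := fmt.Sprintf("default/rev-%05d", i)
		if before.lookup(key) != after.lookup(key) {
			moved++
		}
	}
	fmt.Printf("revisions that changed owner after adding a pod: %d/100\n", moved)
}
```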

@josephburnett
Contributor Author

josephburnett commented Jan 28, 2019

@lvjing2 exactly.

  1. Each autoscaler instance should scrape stats from a small, distinct subset of all the deployments.
  2. Scaling the autoscaler up or down should not cause existing autoscaler instances to scrape a different subset of deployments.

My proposal above should satisfy these two requirements. What do you think?

This is why I think it satisfies the requirements: only one Autoscaler will be handling each Revision (by winning the race to write its name on the PodAutoscaler). When new Autoscalers are added, this doesn't disrupt the set of Revisions the previous Autoscalers are responsible for. They will be less inclined to grab more because they will be closer to their threshold than the new Autoscaler. When an existing Autoscaler goes away, it will release all its Revisions and they will be up for grabs again.

@yanweiguo
Contributor

I think we want to do autoscaling for the HPA controller as well, after we support scaling to 0 for HPA (#3064). We can also use the idea in this proposal for the HPA-class PodAutoscaler.

@ZhiminXiang

How do we handle the case of scaling from 0 to 1? Currently the Activator sends metrics to the Autoscaler service over a websocket. After adding more pods behind the Autoscaler service, metrics will be distributed across different pods. Will this affect scaling from 0?

@k4leung4
Contributor

k4leung4 commented Feb 19, 2019 via email

@fatkun
Contributor

fatkun commented Feb 28, 2019

If one autoscaler crashes (e.g. kill -9 or a panic), the revisions managed by that autoscaler can't scale for up to 60 minutes.

I suggest each autoscaler watch all the autoscaler pods. If one autoscaler exits, the others can filter for PodAutoscalers carrying the exited pod's name and take them over (rough sketch below).
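
A rough sketch of that idea using a plain client-go pod informer; the app=autoscaler label, the knative-serving namespace, and releaseClaimsOf are all assumptions standing in for whatever the real implementation would use:

```go
// Sketch only: each autoscaler watches its peer pods and releases the claims
// of any peer that disappears.
package kpa

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func watchAutoscalerPods(stopCh <-chan struct{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Watch only the autoscaler pods in our namespace (label and namespace assumed).
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithNamespace("knative-serving"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "app=autoscaler"
		}))

	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Tombstones (DeletedFinalStateUnknown) are skipped for brevity.
			if pod, ok := obj.(*corev1.Pod); ok {
				// This peer is gone: clear its name from the PAs it had claimed.
				releaseClaimsOf(pod.Name)
			}
		},
	})

	factory.Start(stopCh)
	return nil
}

// releaseClaimsOf is a hypothetical helper that would blank the controller
// field on every PodAutoscaler whose controller equals the given pod name.
func releaseClaimsOf(podName string) {}
```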

@markusthoemmes
Contributor

@fatkun Yes I think this was part of the discussion in the autoscaling group.

We can watch the autoscaler deployment and reconcile if its state changes unexpectedly.

@tony-shi

What will happen if we have multiple autoscalers and multiple activators and we start scaling from 0?
As mentioned in #3064, it is possible to quickly set the pod count to 1 and rely on scraping user pod metrics to scale from 1 to N.

But if only one autoscaler is responsible for scraping the metrics from the user pods, will the chosen autoscaler be blind to the pending requests buffered in the other activators, which are paired over WebSocket with other autoscalers?

I think the activator acts as a stand-in deployment for the user pods, so when there are multiple activators, all of their pending requests should be added up to produce the summarized desired pod count. However, if each activator communicates with a single autoscaler over a paired WebSocket and only one autoscaler is in charge of a revision, will this influence the 1-to-N stage at the very beginning (once 0->1 has already taken place)?

@vagababov
Contributor

This seems out of scope for 0.8.
And for 1.0 we'll probably just try to add VPA to it.

@markusthoemmes
Contributor

Discussed this with @mattmoor a bit and the thought was to make sure the autoscaler behaves well when being restarted (see #2250) and then test and document different failure models of the autoscaler.

Making it truly highly available is quite a complex task, and there might be quite a bit of low-hanging fruit on the way there.

@vagababov
Contributor

Yep, I'll take a look later this quarter at better restart behavior and perhaps VPA parameters. But everything else is beyond GA.

@mattmoor
Member

We already expose a few knobs, like the annotations to avoid preemption that Victor added previously.

I still need to test that the autoscaler is sane across restarts when a ksvc is under load, but I did test yesterday that if I delete all Knative components except the activator and autoscaler, we still successfully go 0->1->0. The other test I want to run is: 0 -> 12 -> (kill the autoscaler, while under load) -> 12.

If we make a bad decision as the autoscaler comes back up, then we need to fix that. Otherwise, I think that in the short term having the autoscaler be down for short periods isn't the end of the world. There should not be data loss because the activator will still hold the request, but it may impact latency to add capacity.

I'd much rather have us focus on ELIMINATING the autoscaler from this path than hardening it just because it's there (for now). My motivation for that is that even K8s masters aren't as available as I think we ultimately want (even when "HA"), so even with an "HA" controller we're not where we need to be.

Note that cold-start latency is a nice byproduct of node-local scheduling, but it was never my primary motivation for it.

@vagababov
Contributor

The problem is that after a restart we lose all the metrics history. That means if the instantaneous observed concurrency is low we'll immediately scale down, even though over the next few iterations, once we observe more of the traffic, we might want to scale back up.

The simple solution Markus and I talked about is to launch all non-zero deployments in panic mode, so that we don't scale down until we have aggregated enough information.
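
A minimal sketch of that restart behavior, with illustrative names rather than the real autoscaler types: if the deployment is already running more than one pod when the autoscaler boots, the revision starts in panic mode pinned at the current pod count, so the empty metrics window can't trigger an immediate scale-down.

```go
// Sketch only: decide the initial panic state for a revision at autoscaler boot.
package kpa

import "time"

type autoscalerState struct {
	panicking    bool
	panicStarted time.Time
	maxPanicPods int32
}

// initialState is computed once per revision when the autoscaler starts up.
func initialState(currentPods int32, now time.Time) autoscalerState {
	if currentPods <= 1 {
		// Scaled to 0 (nothing to protect) or 1 (we won't go below 1 before
		// the next stable window anyway): no need to panic.
		return autoscalerState{}
	}
	return autoscalerState{
		panicking:    true,
		panicStarted: now,
		maxPanicPods: currentPods, // don't scale below this until real data arrives
	}
}
```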

@mattmoor
Member

It sounds very reasonable to me to treat everything as panic mode on autoscaler boot-ups.

vagababov added a commit to vagababov/serving that referenced this issue Jul 18, 2019
When the autoscaler restarts it has no memory of the previous stats and immediately scales the deployment down. Under my load test conditions, the Autoscaler goes from 7 to 1, then to 2, only then to 6 pods, before finally reaching 7 after a few cycles.
This is obviously not a desired property, so this change fixes that behavior by starting the Autoscaler in panic mode, with the panic pod count equal to the current pod count, if we have more than 1 serving pod right now.

Obviously, if the deployment is scaled to 0 there's no reason to change the logic, and if it is 1, then we won't scale below 1 anyway for the next stable window, so no need to panic either.

/assign @mattmoor @markusthoemmes

This is the GA scope of knative#2930
knative-prow-robot pushed a commit that referenced this issue Jul 19, 2019
* Start autoscaler in panic mode.

* unit tests

* updates to the imports

* merge
@yanweiguo
Contributor

The first iteration was done and released in 0.19.0.

/close

@knative-prow-robot
Contributor

@yanweiguo: Closing this issue.

In response to this:

The first iteration was done and released in 0.19.0.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/autoscale, kind/feature, P1