Autoscaler High Availability #2930

Closed

josephburnett opened this issue Jan 17, 2019 · 26 comments

@josephburnett
Contributor

Design Proposal

The KPA-class Autoscaler implementation does a fair amount of data processing, in proportion to the number of Revisions and their relative scale. It processes summary statistics from each Pod every 1 second and computes average concurrency for every Revision every 2 seconds.

The Autoscaler also plays a critical part in the data plane (even though it's part of the control plane) because it ensures revisions are scaled up from 0. Maintaining high availability of the Autoscaler is important.

For these reasons, the Autoscaler must be horizontally scalable. Unlike the Activator, the Autoscaler is stateful because it maintains a running average of concurrency per Revision.

Proposal

I propose that we split the KPA-class PodAutoscaler reconciler into a binary separate from the HPA-class PodAutoscaler reconciler. Then we add a controller status field to the PodAutoscaler, into which the KPA autoscaler will write its Pod name.

When reconciling PodAutoscalers, the KPA will filter for only PAs whose controller field is either empty or set to its own name. As a special case, KPAs will also reconcile PAs whose lastUpdated field is more than 60 minutes in the past.

This allows the Autoscaler job to be replicated. KPAs will fight over new PAs by trying to write their name on them. One will win as a result of Kubernetes optimistic concurrency control. The autoscaler that wins will start scraping that PodAutoscaler's Pods and start scaling the Deployment. At every resync (30 minutes) KPAs will write their name again on the PA (or update the lastUpdated field somehow).
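
A minimal sketch of that claim flow, assuming a hypothetical controller/lastUpdated status field pair and a generic update callback standing in for the real PodAutoscaler client (none of these names are the actual Knative API):

```go
// Sketch only: illustrates filter + optimistic-concurrency claim, not real Knative types.
package kpa

import (
	"errors"
	"time"
)

// PodAutoscaler is a pared-down stand-in for the real resource.
type PodAutoscaler struct {
	Name        string
	Controller  string    // pod name of the owning autoscaler, "" if unclaimed
	LastUpdated time.Time // refreshed by the owner on every resync
}

// errConflict stands in for a Kubernetes 409 Conflict on update.
var errConflict = errors.New("optimistic concurrency conflict")

// shouldReconcile implements the filter: reconcile PAs that are unclaimed,
// already ours, or whose owner has gone stale (>60 minutes without an update).
func shouldReconcile(pa *PodAutoscaler, myPodName string, now time.Time) bool {
	switch {
	case pa.Controller == "" || pa.Controller == myPodName:
		return true
	case now.Sub(pa.LastUpdated) > 60*time.Minute:
		return true
	default:
		return false
	}
}

// claim writes our pod name onto the PA. If another autoscaler updated the PA
// first, the update fails with a conflict and we simply walk away from the PA.
func claim(pa *PodAutoscaler, myPodName string, now time.Time, update func(*PodAutoscaler) error) error {
	pa.Controller = myPodName
	pa.LastUpdated = now
	if err := update(pa); errors.Is(err, errConflict) {
		return nil // someone else won the race; they own this PA now
	} else if err != nil {
		return err
	}
	// We own the PA: start scraping its pods and scaling its deployment.
	return nil
}
```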

Scaling Up

When we have some idea about how many Revisions a given KPA can handle, we can surface a custom metric revisionCount. We can point a K8s HPA at the custom metric to scale up when that count gets too high on average. Each KPA will stop claiming PAs when its revisionCount has reached the predetermined maximum. The K8s HPA will target a value lower than the maximum so new KPAs will be created and will claim any leftover, unclaimed PAs.
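
A rough sketch of the claim-throttling side of this; the per-pod maximum here is an assumed placeholder, and the real value (plus the lower HPA target) would come from measuring how many Revisions a KPA can actually handle:

```go
// Sketch only: stop claiming new PAs once this pod is at its assumed capacity.
package kpa

import "sync/atomic"

const maxRevisionsPerKPA = 500 // assumed per-pod capacity, not a measured number

type claimCounter struct {
	revisionCount int64 // surfaced as the custom metric revisionCount for the K8s HPA
}

// tryReserve is called before attempting to claim a new PodAutoscaler. Once
// the pod is at capacity it stops claiming, leaving new PAs for the fresh
// replicas the HPA brings up (the HPA targets a value below the maximum).
func (c *claimCounter) tryReserve() bool {
	for {
		n := atomic.LoadInt64(&c.revisionCount)
		if n >= maxRevisionsPerKPA {
			return false
		}
		if atomic.CompareAndSwapInt64(&c.revisionCount, n, n+1) {
			return true
		}
	}
}

// release is called when a claimed PodAutoscaler is deleted or handed back.
func (c *claimCounter) release() {
	atomic.AddInt64(&c.revisionCount, -1)
}
```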

Scaling Down

When the average revisionCount gets low enough, the K8s HPA will scale down the number of KPAs.
When a KPA pod receives SIGTERM, it proactively removes its name from all the PodAutoscalers it claimed, putting them up for grabs again.
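
A minimal sketch of that shutdown path, where releaseAll is a hypothetical helper that would blank the controller field on every PodAutoscaler this pod owns:

```go
// Sketch only: release all claims on SIGTERM so peers can adopt the PAs
// immediately instead of waiting for the 60-minute staleness fallback.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	// ... normal reconcile loop runs until ctx is cancelled ...
	<-ctx.Done()

	log.Print("SIGTERM received, releasing claimed PodAutoscalers")
	if err := releaseAll(context.Background()); err != nil {
		log.Printf("failed to release some PodAutoscalers: %v", err)
	}
}

// releaseAll is a hypothetical helper that would clear the controller field
// on every PA this pod has claimed.
func releaseAll(ctx context.Context) error { return nil }
```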

@knative-prow-robot added the area/autoscale and kind/feature labels on Jan 17, 2019
@lvjing2
Contributor

lvjing2 commented Jan 24, 2019

I think there are 3 major problems we need to consider when there is more than one instance of the autoscaler:

  1. Stats from the same revision should be sent to the same autoscaler instance.
  2. Each autoscaler instance should aggregate the stats for a small and stable subset of revisions.
  3. Scaling the autoscaler up or down should not cause stats to be sent to a different instance while the previous one is still available.

Shall we consider deploying the autoscaler as a StatefulSet rather than a Deployment, and using a consistent hash ring to decide which autoscaler pod each revision sends stats to? The queue-proxy could then use the pod DNS name to set up its connection to that one autoscaler pod.
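
A rough sketch of what the queue-proxy side of this could look like, assuming a StatefulSet named autoscaler with a headless service of the same name in the knative-serving namespace (both names are assumptions). The plain modulo here would reshuffle revisions whenever the replica count changes, which is exactly what the consistent hash ring is meant to avoid (see the later sketch):

```go
// Sketch only: map a revision to a stable StatefulSet pod DNS name.
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	statefulSetName = "autoscaler"      // assumed StatefulSet / headless service name
	namespace       = "knative-serving" // assumed namespace
)

// autoscalerHost picks one replica's headless-service DNS name for a revision,
// e.g. autoscaler-2.autoscaler.knative-serving.svc.cluster.local.
func autoscalerHost(revisionKey string, replicas uint32) string {
	h := fnv.New32a()
	h.Write([]byte(revisionKey))
	ordinal := h.Sum32() % replicas
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local",
		statefulSetName, ordinal, statefulSetName, namespace)
}

func main() {
	fmt.Println(autoscalerHost("default/hello-world-00001", 3))
}
```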

@markusthoemmes
Contributor

@lvjing2 Note that @josephburnett mentioned that the Autoscaler will "start scraping" the deployment. I think the assumption of the proposal is based on the inversion of control the metrics flow is undergoing.

Soonish, we won't be pushing metrics actively anymore; they will be scraped from the autoscaler's side instead.

@lvjing2
Contributor

lvjing2 commented Jan 25, 2019

Oh, that's nice. If so, high availability of the autoscaler will be much easier than before. The problems then become:

  1. Each autoscaler instance should scrape stats from a small, distinct subset of all the deployments.
  2. Scaling the autoscaler up or down should not cause existing autoscaler instances to scrape a different subset of deployments.

I think we don't need to introduce a leader elector with duplicated autoscalers for the same subset of deployments at first.

  1. We can try to deploy the autoscaler as a StatefulSet.
  2. Hash the user service FQDN to find the ordinal of the autoscaler instance.
  3. Each autoscaler instance only scrapes stats from the deployments hashed to it.
  4. Introduce a consistent hashing algorithm so that scaling the autoscaler up or down does not disturb the mapping between autoscalers and their subsets of deployments (see the sketch below).
    If the StatefulSet approach works well, then maybe we don't need multiple autoscaler instances for the same deployment.
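
A toy consistent-hash-ring sketch for point 4, just to show that adding a replica only moves the revisions that land next to the new virtual nodes while everything else keeps its existing owner; it is not the ring the project would actually ship:

```go
// Sketch only: a minimal consistent hash ring with virtual nodes.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> autoscaler pod name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(pods []string, virtualNodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, p := range pods {
		for v := 0; v < virtualNodes; v++ {
			h := hashOf(fmt.Sprintf("%s#%d", p, v))
			r.hashes = append(r.hashes, h)
			r.owner[h] = p
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// lookup returns the autoscaler pod responsible for a revision key.
func (r *ring) lookup(revisionKey string) string {
	h := hashOf(revisionKey)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	before := newRing([]string{"autoscaler-0", "autoscaler-1"}, 16)
	after := newRing([]string{"autoscaler-0", "autoscaler-1", "autoscaler-2"}, 16)
	moved := 0
	for i := 0; i < 100; i++ {
		key := fmt.Sprintf("default/rev-%05d", i)
		if before.lookup(key) != after.lookup(key) {
			moved++
		}
	}
	fmt.Printf("revisions that changed owner after adding a pod: %d/100\n", moved)
}
```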

@josephburnett
Contributor Author

josephburnett commented Jan 28, 2019

@lvjing2 exactly.

  1. Each autoscaler instance should scrape stats from a small, distinct subset of all the deployments.
  2. Scaling the autoscaler up or down should not cause existing autoscaler instances to scrape a different subset of deployments.

My proposal above should satisfy these two requirements. What do you think?

This is why I think it satisfies the requirements: only one Autoscaler will be handling each Revision (by winning the race to write its name on the PodAutoscaler). When new Autoscalers are added, this doesn't disrupt the set of Revisions the previous Autoscalers are responsible for. They will be less inclined to grab more because they will be closer to their threshold than the new Autoscaler. When an existing Autoscaler goes away, it will release all its Revisions and they will be up for grabs again.

@yanweiguo
Contributor

I think we want to do autoscaling for the HPA controller as well, after we support scaling to 0 for HPA (#3064). We can also use the idea in this proposal for the HPA-class PodAutoscaler.

@ZhiminXiang

How do we handle the case of scaling from 0 to 1? Currently the Activator sends metrics to the Autoscaler service over a websocket. After adding more pods behind the Autoscaler service, metrics will be distributed across different pods. Will this affect scaling from 0?

@k4leung4
Contributor

k4leung4 commented Feb 19, 2019 via email

@fatkun
Contributor

fatkun commented Feb 28, 2019

If one autoscaler crashes (e.g. kill -9 or a panic), the revisions managed by that autoscaler can't scale for up to 60 minutes.

I suggest each autoscaler watch all the autoscaler pods. If one autoscaler exits, the others can filter for PodAutoscalers carrying the exited pod's name and take them over (rough sketch below).
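
A rough sketch of that idea using a plain client-go pod informer; the app=autoscaler label, the knative-serving namespace, and releaseClaimsOf are all assumptions standing in for whatever the real implementation would use:

```go
// Sketch only: each autoscaler watches its peer pods and releases the claims
// of any peer that disappears.
package kpa

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func watchAutoscalerPods(stopCh <-chan struct{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Watch only the autoscaler pods in our namespace (label and namespace assumed).
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithNamespace("knative-serving"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "app=autoscaler"
		}))

	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Tombstones (DeletedFinalStateUnknown) are skipped for brevity.
			if pod, ok := obj.(*corev1.Pod); ok {
				// This peer is gone: clear its name from the PAs it had claimed.
				releaseClaimsOf(pod.Name)
			}
		},
	})

	factory.Start(stopCh)
	return nil
}

// releaseClaimsOf is a hypothetical helper that would blank the controller
// field on every PodAutoscaler whose controller equals the given pod name.
func releaseClaimsOf(podName string) {}
```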

@markusthoemmes
Contributor

@fatkun Yes I think this was part of the discussion in the autoscaling group.

We can watch the autoscaler deployment and reconcile if its state changes unexpectedly.

@tony-shi

What will happen if we have multiple autoscalers and multiple activators and we start scaling from 0?
As mentioned in #3064, it is possible to quickly set the pod count to 1 and rely on scraping user pod metrics to scale from 1 to N.

But if only one autoscaler is responsible for scraping the metrics from the user pods, will the chosen autoscaler be blind to the pending requests buffered in the other activators, which are paired over WebSocket with other autoscalers?

I think the activator acts as a stand-in deployment for the user pods, so when there are multiple activators, all of their pending requests should be added up to produce the summarized desired pod count. However, if each activator communicates with a single autoscaler over a paired WebSocket and only one autoscaler is in charge of a revision, will this influence the 1-to-N stage at the very beginning (once 0->1 has already taken place)?

@vagababov
Contributor

This seems out of scope for 0.8.
And for 1.0 we'll probably just try to add VPA to it.

@markusthoemmes
Contributor

Discussed this with @mattmoor a bit and the thought was to make sure the autoscaler behaves well when being restarted (see #2250) and then test and document different failure models of the autoscaler.

Making it truly highly available is quite a complex task, and there might be quite a bit of low-hanging fruit on the way there.

@vagababov
Contributor

Yep, I'll take a look later this quarter at better restart behavior and perhaps VPA parameters. But everything else is beyond GA.

@mattmoor
Member

We already expose a few knobs, like the annotations to avoid preemption that Victor added previously.

I still need to test that the autoscaler is sane across restarts when a ksvc is under load, but I did test yesterday that if I delete all Knative components except the activator and autoscaler, we still successfully go 0->1->0. The other test I want to run is: 0 -> 12 -> (kill the autoscaler, while under load) -> 12.

If we make a bad decision as the autoscaler comes back up, then we need to fix that. Otherwise, I think that in the short term having the autoscaler be down for short periods isn't the end of the world. There should not be data loss because the activator will still hold the request, but it may impact latency to add capacity.

I'd much rather have us focus on ELIMINATING the autoscaler from this path than hardening it just because it's there (for now). My motivation for that is that even K8s masters aren't as available as I think we ultimately want (even when "HA"), so even with an "HA" controller we're not where we need to be.

Note that cold-start latency is a nice byproduct of node-local scheduling, but it was never my primary motivation for it.

@vagababov
Contributor

The problem is that after a restart we lose all the metrics history. That means if the instantaneous observed concurrency is low we'll immediately scale down, even though over the next few iterations, once we observe more of the traffic, we might want to scale back up.

The simple solution Markus and I talked about is to launch all non-zero deployments in panic mode, so that we don't scale down until we have aggregated enough information.
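
A minimal sketch of that restart behavior, with illustrative names rather than the real autoscaler types: if the deployment is already running more than one pod when the autoscaler boots, the revision starts in panic mode pinned at the current pod count, so the empty metrics window can't trigger an immediate scale-down.

```go
// Sketch only: decide the initial panic state for a revision at autoscaler boot.
package kpa

import "time"

type autoscalerState struct {
	panicking    bool
	panicStarted time.Time
	maxPanicPods int32
}

// initialState is computed once per revision when the autoscaler starts up.
func initialState(currentPods int32, now time.Time) autoscalerState {
	if currentPods <= 1 {
		// Scaled to 0 (nothing to protect) or 1 (we won't go below 1 before
		// the next stable window anyway): no need to panic.
		return autoscalerState{}
	}
	return autoscalerState{
		panicking:    true,
		panicStarted: now,
		maxPanicPods: currentPods, // don't scale below this until real data arrives
	}
}
```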

@mattmoor
Member

It sounds very reasonable to me to treat everything as panic mode on autoscaler boot-ups.

vagababov added a commit to vagababov/serving that referenced this issue Jul 18, 2019
When the autoscaler restarts it has no memory of the previous stats and immediately scales the deployment down. Under my load test conditions, the Autoscaler goes from 7 to 1, then to 2, only then to 6 pods, before finally reaching 7 after a few cycles.
This is obviously not a desired property, so this change fixes that behavior by starting the Autoscaler in panic mode, with the panic pod count equal to the current pod count, if we have more than 1 serving pod right now.

Obviously, if the deployment is scaled to 0 there's no reason to change the logic, and if it is 1, then we won't scale below 1 anyway for the next stable window, so no need to panic either.

/assign @mattmoor @markusthoemmes

This is the GA scope of knative#2930
knative-prow-robot pushed a commit that referenced this issue Jul 19, 2019
* Start autoscaler in panic mode.

* unit tests

* updates to the imports

* merge
@yanweiguo
Contributor

The first iteration was done and released in 0.19.0.

/close

@knative-prow-robot
Contributor

@yanweiguo: Closing this issue.

In response to this:

The first iteration was done and released in 0.19.0.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/autoscale, kind/feature, P1