Autoscaler High Availability #2930
I think there are three major problems we need to consider when there is more than one instance of the autoscaler:
Shall we consider deploying the autoscaler as a StatefulSet rather than a Deployment, and use a consistent hash ring to decide which autoscaler pod each revision sends its stats to? The queue-proxy could then use the pod's DNS name to set up a connection to that one autoscaler pod. |
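The consistent-hash-ring idea above can be sketched in a few lines. This is an illustrative toy, not the Knative implementation; the class and pod names are invented for the example. The key property is that each revision maps deterministically to one autoscaler pod, and adding a pod only remaps a fraction of the revisions.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring (illustrative sketch only):
    maps each revision to exactly one autoscaler pod."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` virtual points on the ring for even spread.
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

With a StatefulSet, the stable pod names (e.g. `autoscaler-0`, `autoscaler-1`) would serve as the ring's node identifiers, which is what makes the pod-DNS connection idea workable.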
@lvjing2 Note that @josephburnett mentioned that the Autoscaler will "start scraping" the deployment. I think the proposal is based on the inversion of control the metrics flow is undergoing: soonish, we won't be sending metrics actively anymore but will scrape them from the autoscaler's side. |
Oh, that's nice. If so, high availability of the autoscaler will be much easier than before. Then the problems may change as follows:
I think we don't need to introduce a leader elector for duplicate autoscalers over the same subset of deployments at first.
|
@lvjing2 exactly.
My proposal above should satisfy these two requirements. What do you think? This is why I think it satisfies them: only one Autoscaler will be handling each Revision (by winning the race to write its name on the PodAutoscaler). When new Autoscalers are added, this doesn't disrupt the set of Revisions the previous Autoscalers are responsible for. They will be less inclined to grab more because they will be closer to their threshold than the new Autoscaler. When an existing Autoscaler goes away, it will release all its Revisions and they will be up for grabs. |
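The race-to-claim mechanism described above rests on Kubernetes optimistic concurrency: an update only succeeds if the object's `resourceVersion` hasn't changed since it was read. The following toy simulation (hypothetical field names; not the real client-go API) shows why exactly one Autoscaler can win:

```python
class PodAutoscaler:
    """Toy stand-in for the PodAutoscaler resource (fields invented
    for illustration; `controller` is the proposed status field)."""

    def __init__(self):
        self.controller = ""       # name of the claiming autoscaler
        self.resource_version = 0  # mimics Kubernetes optimistic concurrency

def try_claim(pa, observed_version, autoscaler_name):
    """Attempt to write our name, as a Kubernetes update would:
    the write only succeeds if nobody updated the object since we read it."""
    if pa.resource_version != observed_version:
        return False  # conflict: another autoscaler won the race
    if pa.controller not in ("", autoscaler_name):
        return False  # already claimed by someone else
    pa.controller = autoscaler_name
    pa.resource_version += 1
    return True
```

Two Autoscalers reading the same version and then both writing will see exactly one success; the loser observes a conflict, re-reads, finds a non-empty `controller`, and moves on.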
I think we want to do autoscaling for the HPA controller as well after we support scaling to 0 for HPA (#3064). We can also use the idea of this proposal for HPA-class PodAutoscalers. |
How do we handle the case of scaling from 0 to 1? Currently Activator sends metrics to Autoscaler service through websocket. After adding more pods under Autoscaler service, metrics will be distributed into different pods. Will this affect scaling from 0? |
here is the proposal for scale from zero
#3064
|
If one autoscaler crashes (e.g. via kill -9 or a panic), the revisions managed by that autoscaler can't scale for 60 minutes. I suggest each autoscaler watch all the other autoscalers; if one exits, the others can filter for the exited autoscaler's pod name and take over its revisions. |
@fatkun Yes I think this was part of the discussion in the autoscaling group. We can watch the autoscaler deployment and reconcile if its state changes unexpectedly. |
What will happen if we have multiple autoscalers and multiple activators when scaling from 0? If only one autoscaler is responsible for scraping metrics from the user pods, will that autoscaler be blind to the pending requests buffered in activators that are paired over WebSocket with other autoscalers? I think the activator acts as a stand-in for the user pod's deployment, so when there are multiple activators, the pending requests across all of them should be summed up to produce the desired pod count. However, if each activator communicates with a single autoscaler over WebSocket while only one autoscaler is in charge of a revision, will this affect the 1-to-N stage at the very beginning (after 0->1 has already taken place)? |
This seems out of scope for 0.8. |
Discussed this with @mattmoor a bit, and the thought was to make sure the autoscaler behaves well when being restarted (see #2250) and then test and document different failure modes of the autoscaler. Making it truly highly available is quite a complex task, and there might be quite a bit of low-hanging fruit on the way there. |
Yep, I'll take a look later this quarter for the better restart behavior and perhaps VPA parameters. But everything else is beyond GA. |
We already offer a few knobs, like the annotations to avoid pre-emption that Victor added previously. I still need to test that the autoscaler is sane across restarts. If we make a bad decision as the autoscaler comes back up, then we need to fix that. Otherwise, I think that in the short term having the autoscaler be down for short periods isn't the end of the world. There should not be data loss because the activator will still hold the request, but it may impact latency to add capacity. I'd much rather have us focus on ELIMINATING the autoscaler from this path than hardening it just because it's there (for now). My motivation for that is that even K8s masters aren't as available as what I think we ultimately want (even when "HA"), so even with an "HA" controller we're not where we need to be. Note that cold-start latency is a nice byproduct of node-local scheduling, but it was never my primary motivation for it. |
The problem is that after a restart we lose all the metrics history. That means if the instantaneous observed concurrency is low, we'll immediately scale down, even though over the next few iterations we might want to scale back up once we observe more of the traffic. The simple solution Markus and I talked about is to launch all non-zero deployments in panic mode, so we don't scale down until we've aggregated enough information. |
It sounds very reasonable to me to treat everything as panic mode on autoscaler boot-ups. |
When the autoscaler restarts, it has no memory of the previous stats and immediately scales the deployment down. Under my load test conditions, the Autoscaler goes from 7 to 1, then to 2, only then to 6 pods, before finally reaching 7 after a few cycles. This is obviously not a desired property, so this change fixes that behavior by starting the Autoscaler in panic mode, with the panic pod count equal to the current pod count, if we have more than 1 serving pod right now. Obviously, if the deployment is scaled to 0, there's no reason to change the logic, and if it is 1, then we won't scale below 1 anyway for the next stable window, so no need to panic either. /assign @mattmoor @markusthoemmes This is the GA scope of knative#2930
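The boot-time decision described above is a small amount of logic. A minimal sketch, assuming the behavior stated in the comment (the function name and mode labels are invented for illustration; the actual change lives in the Go autoscaler):

```python
def initial_mode(current_pods):
    """Boot-time mode choice for a freshly restarted autoscaler with no
    metric history: panic only when more than one pod is serving, so we
    hold the current scale instead of collapsing to 1 and climbing back."""
    if current_pods > 1:
        # Panic mode pins the floor at the current pod count until
        # enough stats accumulate to justify scaling down.
        return ("panic", current_pods)
    # 0 pods: nothing to protect. 1 pod: we won't scale below 1 during
    # the next stable window anyway, so no need to panic.
    return ("stable", current_pods)
```

Under the load test described above, this turns the 7 → 1 → 2 → 6 → 7 oscillation into simply holding at 7 until the stats window refills.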
The new feature track doc: https://docs.google.com/document/d/1FyjB1GIJW259nMaNOc7nlQVjvgAZHlQ-fD6tbfUsaz8/edit#heading=h.n8a530nnrb |
The first iteration was done and released in 0.19.0. /close |
Design Proposal
The KPA-class Autoscaler implementation does a fair amount of data processing, in proportion to the number of Revisions and their relative scale. It processes summary statistics from each Pod every 1 second and computes average concurrency for every Revision every 2 seconds.
The Autoscaler also plays a critical part in the data plane (even though it's part of the control plane) because it ensures revisions are scaled up from 0. Maintaining high availability of the Autoscaler is important.
For these reasons, the Autoscaler must be horizontally scalable. Unlike the Activator, the Autoscaler is stateful because it calculates a running average of concurrency per Revision.
Proposal
I propose that we separate the KPA-class PodAutoscaler reconciler into a separate binary from the HPA-class PodAutoscaler reconciler. Then we add a status field `controller` to the PodAutoscaler into which the KPA autoscaler will write its Pod name.
When reconciling PodAutoscalers, the KPA will filter for only PAs with either an empty `controller` or those with its own name on them. As a special case, KPAs will also reconcile PAs whose `lastUpdated` field is more than 60 minutes in the past.
This allows the Autoscaler job to be replicated. KPAs will fight over new PAs by trying to write their name on them. One will win as a result of Kubernetes optimistic concurrency control. The autoscaler that wins will start scraping that PodAutoscaler's Pods and start scaling the Deployment. At every resync (30 minutes), KPAs will write their name again on the PA (or update the `lastUpdated` field somehow).
Scaling Up
When we have some idea about how many Revisions a given KPA can handle, we can surface a custom metric `revisionCount`. We can point a K8s HPA at the custom metric to scale up when that count gets too high on average. Each KPA will stop claiming PAs when its `revisionCount` has reached the predetermined maximum. The K8s HPA will target a value lower than the maximum, so new KPAs will be created and will claim any leftover, unclaimed PAs.
Scaling Down
When the average `revisionCount` gets low enough, the K8s HPA will scale down the number of KPAs.
When a KPA pod receives SIGTERM, it proactively removes its name from all the PodAutoscalers it claimed, putting them up for grabs again.
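The reconcile filter in the proposal reduces to a single predicate: claim anything unclaimed, anything we already own, and anything whose claim looks stale. A minimal sketch, assuming the 60-minute staleness rule stated above (function and argument names are invented for illustration):

```python
import time

LEASE_EXPIRY_SECONDS = 60 * 60  # 60 minutes, per the proposal

def should_reconcile(pa_controller, pa_last_updated, my_name, now=None):
    """Claim filter from the proposal: a KPA reconciles PAs that are
    unclaimed, claimed by itself, or whose `lastUpdated` is more than
    60 minutes in the past (a dead owner's stale claim)."""
    now = time.time() if now is None else now
    if pa_controller in ("", my_name):
        return True
    return now - pa_last_updated > LEASE_EXPIRY_SECONDS
```

The 30-minute resync acts as a heartbeat against this predicate: a healthy owner refreshes `lastUpdated` twice per expiry window, while a crashed owner's PAs become claimable after at most an hour. Note that the 60-minute window is also the worst-case orphan time raised earlier in the thread, which is why watching the other autoscaler pods was suggested as a faster failover signal.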