
HPA scheduling implementation doesn't scale well with custom/external metrics - all checks are sequential by one goroutine #96242

Closed
scr-oath opened this issue Nov 5, 2020 · 33 comments · Fixed by #108501
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@scr-oath

scr-oath commented Nov 5, 2020

What happened:
I created an external metric that took roughly 100ms to serve; then, to see what would happen if there were many of them, I created 1,000 HPAs using it. Normally every HPA is checked every 15s, but this resulted in each HPA only being checked every 120 seconds - a drift/delay of over 100 seconds.

What you expected to happen:

  • Each HPA with a custom/external metric should only drift by a maximum of 100ms every 15s, or be delayed by a second every 3-5 minutes
  • Other HPAs not using custom/external metrics should not be impacted by the duration of custom/external metric calls
  • The quantity of deployments with HPAs should not degrade each HPA's ability to respond to signals requiring scaling up (e.g. spikes)

How to reproduce it (as minimally and precisely as possible):

  • Create and install https://github.com/kubernetes-sigs/custom-metrics-apiserver and alter its code to add a delay in the GetExternalMetric method of the testing provider (a sketch is below this list)
  • Also add logging with timestamps to that method (logrus supports timestamps, for example)
  • Create 1,000 HPA objects referring to that test metric - possibly 1,000 deployments as well, but a single one works fine too, even if that's not a realistic use case
  • Pick one of your HPAs and watch the time between its calls to GetExternalMetric
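
For illustration, a minimal sketch (in Go) of the kind of delay-plus-logging change meant above; the function name and wiring are hypothetical stand-ins for the body you would add to the testing provider's GetExternalMetric, not the adapter's actual code:

```go
package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

// simulateSlowExternalMetric stands in for the code you would add at the top of the
// testing provider's GetExternalMetric: log with a timestamp, then sleep to mimic a
// slow external metrics backend. Names here are illustrative.
func simulateSlowExternalMetric(metricName string, delay time.Duration) {
	logrus.WithField("metric", metricName).Info("GetExternalMetric called") // logrus timestamps each entry by default
	time.Sleep(delay)                                                       // e.g. 100 * time.Millisecond
}

func main() {
	simulateSlowExternalMetric("test-metric", 100*time.Millisecond)
}
```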

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T21:51:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Docker CE for Mac
  • OS (e.g: cat /etc/os-release):
    Mac OS Catalina 10.15.7
  • Kernel (e.g. uname -a):
    Darwin 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@scr-oath scr-oath added the kind/bug Categorizes issue or PR as related to a bug. label Nov 5, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 5, 2020
@k8s-ci-robot
Contributor

@scr-oath: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 5, 2020
@scr-oath
Author

scr-oath commented Nov 5, 2020

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 5, 2020
@itonyli

itonyli commented Nov 6, 2020

/assign

@itonyli

itonyli commented Nov 7, 2020

  • Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

In my opinion, different scenarios need different options. Could a configurable option be added to apply a timeout/degradation to external metrics, to prevent excessive metric drift? @deads2k @luxas @mtaufen

@itonyli

itonyli commented Nov 7, 2020

  • Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

In my opinion, different scenarios need different options. Could a configurable option be added to apply a timeout/degradation to external metrics, to prevent excessive metric drift? @deads2k @luxas @mtaufen

If the time to compute different metrics varies a lot, the next replica-count calculation will keep adjusting the number of pods too frequently.

@scr-oath
Author

scr-oath commented Nov 7, 2020

As I understand it, the current implementation just has a queue/chan, a timer of some sort, and a single goroutine to process the queue of work. There is a comment in the code: "start a single worker (we may wish to start more in the future)".

I would imagine that many scheduling implementations solve for requirements such as:

  • Scheduling is separate from running - this may be solved by current implementation
  • Run isolation (a run of one thing doesn't block/delay another from running)
  • Wrap-around protection - don't start another run of a recurring task if it's still running the last one

I don't think that a "timeout" is really the issue - what if you had some external metric like "what's the temperature on the moon" and it took 8 minutes to get an update (just a contrived example)? That should still be possible and not drag down all the other HPA checks in the system, right?
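
For what it's worth, here is a minimal, self-contained sketch (plain Go, not the controller's actual code) of the pattern described above - a queue drained by a single worker - showing how one slow metric call per HPA stretches a full pass well past the 15s resync period:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// In the real controller every HPA is (re-)enqueued on a resync period (~15s)
	// and a single worker goroutine drains the queue.
	queue := make(chan string, 1000)
	for i := 0; i < 1000; i++ {
		queue <- fmt.Sprintf("hpa-%d", i)
	}
	close(queue)

	start := time.Now()
	for name := range queue { // single worker: one slow reconcile delays all the rest
		time.Sleep(100 * time.Millisecond) // stands in for a slow GetExternalMetric call
		_ = name
	}
	// With 1,000 HPAs at ~100ms each, one pass takes ~100s - far longer than the 15s period.
	fmt.Printf("one pass over the queue took %s\n", time.Since(start))
}
```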

@arjunrn
Contributor

arjunrn commented Nov 17, 2020

@scr-oath As I see it there are 2 issues here:

  1. Time drift caused by sequential execution.
  2. Independent evaluation of each metric source.

With regards to 1, I suppose increasing the worker count should alleviate the problem. Maybe it could even be made configurable for larger clusters like yours.
However 2 is not possible, because the HPA always makes a scaling proposal based on the highest value from each of the metric sources. Consider the situation where one of the metrics, which currently has the highest proposed value, is down for a few seconds. The HPA would then ignore it and start scaling down the target. Subsequently, when the metric source is available again, the HPA controller would start scaling the target up again. This could be too disruptive for many workloads.
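
For context, a tiny illustrative sketch (not the controller's code) of the "highest proposal wins" behavior described above, which is why a temporarily unavailable metric can't simply be skipped:

```go
package main

import "fmt"

// desiredReplicas picks the highest replica proposal across all metric sources.
// If the source currently proposing the highest value were skipped, the max would
// drop and the target would be scaled down, then back up when the source returns.
func desiredReplicas(proposals []int32) int32 {
	var max int32
	for _, p := range proposals {
		if p > max {
			max = p
		}
	}
	return max
}

func main() {
	fmt.Println(desiredReplicas([]int32{3, 10, 5})) // 10
	fmt.Println(desiredReplicas([]int32{3, 5}))     // 5 if the highest source is missing
}
```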

@scr-oath
Author

scr-oath commented Nov 17, 2020

I don't recall describing 2 in quite those terms - I was saying more that each HPA should make its call-outs independently of the others. Sure, you could also parallelize all of the metrics within each HPA, but as you point out, you would need to join/wait for them all to return. I'm more concerned with isolating/parallelizing the work of each HPA from the others, or at least, as you've also suggested, adding more workers to drain the queue faster rather than letting it pile up.

@arjunrn
Contributor

arjunrn commented Nov 17, 2020

Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

I guess you meant "Other HPAs not using custom/external metrics should not be impacted..."

@scr-oath
Author

scr-oath commented Nov 17, 2020

ohhhh… thanks for clarifying the semantic typo - you got it! Other HPAs - I've updated the description.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 17, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scr-oath
Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Sep 21, 2021
@k8s-ci-robot
Contributor

@scr-oath: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scr-oath
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 21, 2021
@lorenzo-biava

Just faced the very same issue (and it took me a while to discover that the delay in scaling was caused by this - somehow I expected the HPA to work in parallel, and there are no debug logs to tell otherwise).

In our case we're using Keda to set up HPAs and provide external metrics, which in turn come from Prometheus. Unfortunately Prometheus got overloaded (see prometheus/prometheus#8014) and started responding with >2s delay.
So with around 130 HPAs that creates a scale-up delay of more than 4 minutes.
Of course we need to tune Prometheus on our side to get it back to reasonable performance, but it definitely should not have such a snowball effect on the scale-up time of every HPA (15s vs 260s).

@itonyli / @arjunrn Can you consider actually implementing the multiple workers approach?

@scr-oath
Author

I tried a test with an external metric that has only a 100µs response time, added 1,000 Keda metrics, and saw only 25 fetches in an 8-minute period (I would have expected 4 × 8 = 32), so that's something like a 20% loss of metric-gathering points over time. When using that external metric directly, I see about a 3% loss (one drop in 8 minutes), so clearly Keda doesn't completely work around the HPA limitations; it would be really great if the HPA itself did the right thing and didn't drift, skip, or penalize metric gathering for other HPAs because one HPA has a slow metric.

@bpinske

bpinske commented Dec 3, 2021

I've unfortunately independently discovered this very same problem of external metrics scaling very poorly, particularly with Keda querying Prometheus.

Is there any particular reason the solution cannot be a naive increase in the number of goroutines spawned to process the HPA objects concurrently on every 15-second loop execution? There was an acknowledgement above that an HPA object may itself hold multiple metrics to be calculated, so at the very least we would need to ensure that the metrics of a given HPA object are still processed serially, but each HPA object should be independent so that they can be processed concurrently (a sketch of this is below).

I'm shocked to see that the HPA loop is single-threaded on an I/O-bound operation, serially calculating each HPA in core Kubernetes.
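
To make the suggestion concrete, here is a rough sketch (plain Go, illustrative only - not the controller's code) of several workers draining the same queue: the metrics of a single HPA stay serial inside reconcileHPA, while different HPA objects are processed concurrently:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcileHPA evaluates all metrics of one HPA serially, mirroring the requirement
// that metrics within one HPA object keep being processed in order.
func reconcileHPA(name string, metrics []string) {
	for range metrics {
		time.Sleep(100 * time.Millisecond) // stand-in for one slow metrics API call
	}
	fmt.Println("reconciled", name)
}

func main() {
	const workers = 5 // hypothetically configurable; 1 reproduces today's behavior
	queue := make(chan string, 1000)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				reconcileHPA(name, []string{"external-metric-a", "cpu"})
			}
		}()
	}

	for i := 0; i < 100; i++ {
		queue <- fmt.Sprintf("hpa-%d", i)
	}
	close(queue)
	wg.Wait() // with 5 workers the pass takes roughly 1/5 of the single-worker time
}
```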

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2022
@zroubalik
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 2, 2022
@everpeace
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 3, 2022
@jjcaballero

Does anyone know if this limitation affects the vanilla CPU/memory metrics, or does it only have an effect when checking custom metrics?

@pbetkier
Contributor

@jjcaballero if you only have HPAs with CPU/memory metrics the effect should be less visible. The calls to fetch the metrics are still serial and blocking, but the response times on those calls should be lower - the metrics-server component responds immediately with what it has in its cache, as opposed to making a synchronous call to the custom metrics source (e.g. Prometheus).

@scr-oath
Author

That makes me think... it would be really interesting if there were a feature for async metrics. The idea being...

What if things aware of metrics - whether scanners running at some frequency, or components notified by events - could update some status fields of the HPA (or otherwise update the HPA cache)?

Then there would be zero latency added to the system (beyond having more metrics to consider - but at least no I/O-scale latency).

Would that be an easier pill to swallow than making the schedule run with more concurrency/parallelism?

@pbetkier
Contributor

pbetkier commented Sep 2, 2022

(...) it would be a really interesting feature if there could be a feature for async metrics (...) could update some status fields of the HPA (...)

Do you mean update with metric values directly or update with a flag saying "new metrics are available to read from metrics-server"?

I believe storing metric values would be prohibitively costly in large clusters, but theoretically we could store a map (a ConfigMap?) of HPA name to new-metric readiness. In this case the problem is IMO orthogonal to concurrent processing: async metrics would make the HPA run as soon as metrics are available to read (saving 7.5s on average, 15s worst case), while concurrent processing would help with scaling the calls to metrics-server.

WDYT @scr-oath? Would you be willing to pursue this idea further, see what would need to happen for this to work?

@scr-oath
Author

scr-oath commented Sep 2, 2022

Do you mean update with metric values directly or update with a flag saying "new metrics are available to read from metrics-server"?

I meant altering the HPA with the info... the HPA already stores its external fetch result in the status field, doesn't it? So just having a way to store that, or something like it, asynchronously would seem fine. Essentially, I'm suggesting that external metrics have a way of communicating updates - maybe by writing to a kevent-like endpoint - so that when the HPA controller considers that object next, it can look at a field rather than doing more I/O that may be slow.

One contrived example: say you have a sensor on the moon - a query would take roughly 5s there and back… but if it could send updates, you wouldn't care how much delivery overhead/latency there was.

I believe storing metric values would be prohibitively costly in large clusters

See above - isn't the data stored with the HPA object anyway, in the status field for each external metric? (I don't know the answer for sure, but I suspect it is.)

@pbetkier
Contributor

pbetkier commented Sep 9, 2022

HPA already stores it's external fetch result in the status field doesn't it?

It stores an aggregate: the sum of all metric values. If your external metric has a single series (e.g. queue size) then it's the same as what we store in status, but you can define metrics having multiple series, e.g. reported for each application instance.

Also, I thought you meant changing the architecture to async for all metric types, not only external. That would allow us to recalculate the HPA as soon as new metrics arrive, without having to wait for the 15s intervals. If we're talking just about external metrics, without affecting the reconciliation trigger, then IMO parallel HPA processing is a better way to move forward: it requires less change and helps with all metric types.
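
For reference, a small sketch of what that aggregate looks like in the autoscaling/v2 API as I understand it - a single quantity per external metric in status.currentMetrics, with no per-series breakdown (the metric name here is just an example):

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The HPA status holds one aggregated quantity per external metric,
	// not the individual series behind it.
	current := autoscalingv2.MetricStatus{
		Type: autoscalingv2.ExternalMetricSourceType,
		External: &autoscalingv2.ExternalMetricStatus{
			Metric:  autoscalingv2.MetricIdentifier{Name: "queue_size"},
			Current: autoscalingv2.MetricValueStatus{Value: resource.NewQuantity(42, resource.DecimalSI)},
		},
	}
	fmt.Println(current.External.Current.Value.String()) // "42"
}
```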

@itonyli itonyli removed their assignment Sep 29, 2022
@pbetkier
Contributor

Another idea from talking with @mwielgus: have the custom metrics adapter observe what queries it receives, pre-fetch the metrics for common queries just before the next request (e.g. 14s after the last request), and serve from cache. A bit hacky, but the HPA architecture doesn't change.
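
A rough sketch of that idea, with a hypothetical adapter-side cache (names and refresh policy are made up for illustration): remember which queries have been seen, refresh them in the background, and serve reads from memory instead of making a synchronous backend call:

```go
package main

import (
	"sync"
	"time"
)

// metricCache is a hypothetical adapter-side cache: queries the adapter has already
// seen are refreshed in the background, so the HPA controller's next request can be
// served from memory instead of a slow synchronous call to the backend.
type metricCache struct {
	mu     sync.RWMutex
	values map[string]int64
	fetch  func(query string) int64 // the slow backend call (e.g. a Prometheus query)
}

func newMetricCache(fetch func(string) int64, refresh time.Duration) *metricCache {
	c := &metricCache{values: map[string]int64{}, fetch: fetch}
	go func() {
		for range time.Tick(refresh) { // e.g. shortly before the next expected request
			c.mu.RLock()
			queries := make([]string, 0, len(c.values))
			for q := range c.values {
				queries = append(queries, q)
			}
			c.mu.RUnlock()
			for _, q := range queries {
				v := fetch(q)
				c.mu.Lock()
				c.values[q] = v
				c.mu.Unlock()
			}
		}
	}()
	return c
}

// Get serves from the cache when possible and falls back to a direct fetch the
// first time a query is seen.
func (c *metricCache) Get(query string) int64 {
	c.mu.RLock()
	v, ok := c.values[query]
	c.mu.RUnlock()
	if ok {
		return v
	}
	v = c.fetch(query)
	c.mu.Lock()
	c.values[query] = v
	c.mu.Unlock()
	return v
}

func main() {
	cache := newMetricCache(func(q string) int64 {
		time.Sleep(2 * time.Second) // simulate a slow Prometheus query
		return 42
	}, 14*time.Second)
	_ = cache.Get("queue_size") // first call is slow; subsequent calls are served from cache
}
```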

@lanshunfang

lanshunfang commented Jun 28, 2023

For those who are not aware, here is the new flag meant to solve this issue:

--concurrent-horizontal-pod-autoscaler-syncs
