
HPA scheduling implementation doesn't scale well with custom/external metrics - all checks are sequential by one goroutine #96242

Closed
scr-oath opened this issue Nov 5, 2020 · 33 comments · Fixed by #108501
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@scr-oath

scr-oath commented Nov 5, 2020

What happened:
I created an external metric that took roughly 100ms to serve; then, to see what would happen if there were many of them, I created 1,000 HPAs using it. Normally every HPA is checked every 15s, but this resulted in each HPA only being checked every 120 seconds - a drift/delay of over 100 seconds.

What you expected to happen:

  • Each HPA with a custom/external metric should only drift by a maximum of 100ms every 15s, or be delayed by a second every 3-5 minutes
  • Other HPAs not using custom/external metrics should not be impacted by the duration of custom/external metric calls
  • The quantity of deployments with HPAs should not degrade each HPA's ability to respond to signals requiring scaling up (e.g. spikes)

How to reproduce it (as minimally and precisely as possible):

  • Create and install https://github.com/kubernetes-sigs/custom-metrics-apiserver and alter its code to add a delay in the GetExternalMetric method of the testing provider (a sketch is below this list)
  • Also add logging with timestamps to that method (logrus supports timestamps, for example)
  • Create 1,000 HPA objects referring to that test metric - possibly 1,000 deployments as well, but a single one works fine too, even if that's not a realistic use case
  • Pick one of your HPAs and watch the time between its calls to GetExternalMetric
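
For illustration, a minimal sketch (in Go) of the kind of delay-plus-logging change meant above; the function name and wiring are hypothetical stand-ins for the body you would add to the testing provider's GetExternalMetric, not the adapter's actual code:

```go
package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

// simulateSlowExternalMetric stands in for the code you would add at the top of the
// testing provider's GetExternalMetric: log with a timestamp, then sleep to mimic a
// slow external metrics backend. Names here are illustrative.
func simulateSlowExternalMetric(metricName string, delay time.Duration) {
	logrus.WithField("metric", metricName).Info("GetExternalMetric called") // logrus timestamps each entry by default
	time.Sleep(delay)                                                       // e.g. 100 * time.Millisecond
}

func main() {
	simulateSlowExternalMetric("test-metric", 100*time.Millisecond)
}
```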

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T21:51:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Docker CE for Mac
  • OS (e.g: cat /etc/os-release):
    Mac OS Catalina 10.15.7
  • Kernel (e.g. uname -a):
    Darwin 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@scr-oath scr-oath added the kind/bug Categorizes issue or PR as related to a bug. label Nov 5, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 5, 2020
@k8s-ci-robot
Contributor

@scr-oath: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 5, 2020
@scr-oath
Author

scr-oath commented Nov 5, 2020

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 5, 2020
@itonyli

itonyli commented Nov 6, 2020

/assign

@itonyli

itonyli commented Nov 7, 2020

  • Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

In my opinion, different scenarios need different options. Could a configurable option be added to apply a timeout/degradation to external metrics, to prevent excessive metric drift? @deads2k @luxas @mtaufen

@itonyli

itonyli commented Nov 7, 2020

  • Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

In my opinion, different scenarios need different options. Could a configurable option be added to apply a timeout/degradation to external metrics, to prevent excessive metric drift? @deads2k @luxas @mtaufen

If the time to compute different metrics varies a lot, the next replica-count calculation will keep adjusting the number of pods too frequently.

@scr-oath
Author

scr-oath commented Nov 7, 2020

As I understand it, the current implementation just has a queue/chan, a timer of some sort, and a single goroutine to process the queue of work. There is a comment in the code: "start a single worker (we may wish to start more in the future)".

I would imagine that many scheduling implementations solve for requirements such as:

  • Scheduling is separate from running - this may be solved by current implementation
  • Run isolation (a run of one thing doesn't block/delay another from running)
  • Wrap-around protection - don't start another run of a recurring task if it's still running the last one

I don't think that a "timeout" is really the issue - what if you had some external metric like "what's the temperature on the moon" and it took 8 minutes to get an update (just a contrived example)? That should still be possible and not drag down all the other HPA checks in the system, right?
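
For what it's worth, here is a minimal, self-contained sketch (plain Go, not the controller's actual code) of the pattern described above - a queue drained by a single worker - showing how one slow metric call per HPA stretches a full pass well past the 15s resync period:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// In the real controller every HPA is (re-)enqueued on a resync period (~15s)
	// and a single worker goroutine drains the queue.
	queue := make(chan string, 1000)
	for i := 0; i < 1000; i++ {
		queue <- fmt.Sprintf("hpa-%d", i)
	}
	close(queue)

	start := time.Now()
	for name := range queue { // single worker: one slow reconcile delays all the rest
		time.Sleep(100 * time.Millisecond) // stands in for a slow GetExternalMetric call
		_ = name
	}
	// With 1,000 HPAs at ~100ms each, one pass takes ~100s - far longer than the 15s period.
	fmt.Printf("one pass over the queue took %s\n", time.Since(start))
}
```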

@arjunrn
Contributor

arjunrn commented Nov 17, 2020

@scr-oath As I see it there are 2 issues here:

  1. Time drift caused by sequential execution.
  2. Independent evaluation of each metric source.

With regards to 1, I suppose increasing the worker count should alleviate the problem. Maybe it could even be made configurable for larger clusters like yours.
However 2 is not possible, because the HPA always makes a scaling proposal based on the highest value from each of the metric sources. Consider the situation where one of the metrics, which currently has the highest proposed value, is down for a few seconds. The HPA would then ignore it and start scaling down the target. Subsequently, when the metric source is available again, the HPA controller would start scaling the target up again. This could be too disruptive for many workloads.
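
For context, a tiny illustrative sketch (not the controller's code) of the "highest proposal wins" behavior described above, which is why a temporarily unavailable metric can't simply be skipped:

```go
package main

import "fmt"

// desiredReplicas picks the highest replica proposal across all metric sources.
// If the source currently proposing the highest value were skipped, the max would
// drop and the target would be scaled down, then back up when the source returns.
func desiredReplicas(proposals []int32) int32 {
	var max int32
	for _, p := range proposals {
		if p > max {
			max = p
		}
	}
	return max
}

func main() {
	fmt.Println(desiredReplicas([]int32{3, 10, 5})) // 10
	fmt.Println(desiredReplicas([]int32{3, 5}))     // 5 if the highest source is missing
}
```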

@scr-oath
Author

scr-oath commented Nov 17, 2020

I don't recall describing 2 in quite those terms - I was saying more that each HPA should make its call-outs independently of the others. Sure, you could also parallelize all of the metrics within each HPA, but as you point out, you would need to join/wait for them all to return. I'm more concerned with isolating/parallelizing the work of each HPA from the others, or at least, as you've also suggested, adding more workers to drain the queue faster rather than letting it pile up.

@arjunrn
Contributor

arjunrn commented Nov 17, 2020

Other metrics not using custom/external metrics should not be impacted by duration of custom/external metrics

I guess you meant "Other HPAs not using custom/external metrics should not be impacted..."

@scr-oath
Author

scr-oath commented Nov 17, 2020

ohhhh… thanks for clarifying the semantic typo - you got it! Other HPAs - I've updated the description.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 17, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scr-oath
Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Sep 21, 2021
@k8s-ci-robot
Contributor

@scr-oath: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@scr-oath
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 21, 2021
@lorenzo-biava

Just faced the very same issue (and it took me a while to discover that the delay in scaling was caused by this - somehow I expected the HPA to work in parallel, and there are no debug logs to tell otherwise).

In our case we're using Keda to set up HPAs and provide external metrics, which in turn come from Prometheus. Unfortunately Prometheus got overloaded (see prometheus/prometheus#8014) and started responding with >2s delay.
So with around 130 HPAs that creates a scale-up delay of more than 4 minutes.
Of course we need to tune Prometheus on our side to get it back to reasonable performance, but it definitely should not have such a snowball effect on the scale-up time of every HPA (15s vs 260s).

@itonyli / @arjunrn Can you consider actually implementing the multiple workers approach?

@scr-oath
Author

I tried a test with an external metric that has only a 100µs response time, added 1,000 Keda metrics, and saw only 25 fetches in an 8-minute period (I would have expected 4 × 8 = 32), so that's something like a 20% loss of metric-gathering points over time. When using that external metric directly, I see about a 3% loss (one drop in 8 minutes), so clearly Keda doesn't completely work around the HPA limitations; it would be really great if the HPA itself did the right thing and didn't drift, skip, or penalize metric gathering for other HPAs because one HPA has a slow metric.

@bpinske

bpinske commented Dec 3, 2021

I've unfortunately independently discovered this very same problem of external metrics scaling very poorly, particularly with Keda querying Prometheus.

Is there any particular reason the solution cannot be a naive increase in the number of goroutines spawned to process the HPA objects concurrently on every 15-second loop execution? There was an acknowledgement above that an HPA object may itself hold multiple metrics to be calculated, so at the very least we would need to ensure that the metrics of a given HPA object are still processed serially, but each HPA object should be independent so that they can be processed concurrently (a sketch of this is below).

I'm shocked to see that the HPA loop is single-threaded on an I/O-bound operation, serially calculating each HPA in core Kubernetes.
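
To make the suggestion concrete, here is a rough sketch (plain Go, illustrative only - not the controller's code) of several workers draining the same queue: the metrics of a single HPA stay serial inside reconcileHPA, while different HPA objects are processed concurrently:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcileHPA evaluates all metrics of one HPA serially, mirroring the requirement
// that metrics within one HPA object keep being processed in order.
func reconcileHPA(name string, metrics []string) {
	for range metrics {
		time.Sleep(100 * time.Millisecond) // stand-in for one slow metrics API call
	}
	fmt.Println("reconciled", name)
}

func main() {
	const workers = 5 // hypothetically configurable; 1 reproduces today's behavior
	queue := make(chan string, 1000)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				reconcileHPA(name, []string{"external-metric-a", "cpu"})
			}
		}()
	}

	for i := 0; i < 100; i++ {
		queue <- fmt.Sprintf("hpa-%d", i)
	}
	close(queue)
	wg.Wait() // with 5 workers the pass takes roughly 1/5 of the single-worker time
}
```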

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2022
@zroubalik
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 2, 2022
@everpeace
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 3, 2022
@jjcaballero

Does anyone know if this limitation affects the vanilla CPU/memory metrics, or does it only have an effect when checking custom metrics?

@pbetkier
Contributor

@jjcaballero if you only have HPAs with CPU/memory metrics the effect should be less visible. The calls to fetch the metrics are still serial and blocking, but the response times on those calls should be lower - the metrics-server component responds immediately with what it has in its cache, as opposed to making a synchronous call to the custom metrics source (e.g. Prometheus).

@scr-oath
Author

That makes me think... it would be really interesting if there were a feature for async metrics. The idea being...

What if things aware of metrics - whether scanners running at some frequency, or components notified by events - could update some status fields of the HPA (or otherwise update the HPA cache)?

Then there would be zero latency added to the system (beyond having more metrics to consider - but at least no I/O-scale latency).

Would that be an easier pill to swallow than making the schedule run with more concurrency/parallelism?

@pbetkier
Contributor

pbetkier commented Sep 2, 2022

(...) it would be a really interesting feature if there could be a feature for async metrics (...) could update some status fields of the HPA (...)

Do you mean update with metric values directly or update with a flag saying "new metrics are available to read from metrics-server"?

I believe storing metric values would be prohibitively costly in large clusters, but theoretically we could store a map (a ConfigMap?) of HPA name to new-metric readiness. In this case the problem is IMO orthogonal to concurrent processing: async metrics would make the HPA run as soon as metrics are available to read (saving 7.5s on average, 15s worst case), while concurrent processing would help with scaling the calls to metrics-server.

WDYT @scr-oath? Would you be willing to pursue this idea further, see what would need to happen for this to work?

@scr-oath
Author

scr-oath commented Sep 2, 2022

Do you mean update with metric values directly or update with a flag saying "new metrics are available to read from metrics-server"?

I meant altering the HPA with the info... the HPA already stores its external fetch result in the status field, doesn't it? So just having a way to store that, or something like it, asynchronously would seem fine. Essentially, I'm suggesting that external metrics have a way of communicating updates - maybe by writing to a kevent-like endpoint - so that when the HPA controller considers that object next, it can look at a field rather than doing more I/O that may be slow.

One contrived example: say you have a sensor on the moon - a query would take roughly 5s there and back… but if it could send updates, you wouldn't care how much delivery overhead/latency there was.

I believe storing metric values would be prohibitively costly in large clusters

See above - isn't the data stored with the HPA object anyway, in the status field for each external metric? (I don't know the answer for sure, but I suspect it is.)

@pbetkier
Contributor

pbetkier commented Sep 9, 2022

HPA already stores it's external fetch result in the status field doesn't it?

It stores an aggregate: the sum of all metric values. If your external metric has a single series (e.g. queue size) then it's the same as what we store in status, but you can define metrics having multiple series, e.g. reported for each application instance.

Also, I thought you meant changing the architecture to async for all metric types, not only external. That would allow us to recalculate the HPA as soon as new metrics arrive, without having to wait for the 15s intervals. If we're talking just about external metrics, without affecting the reconciliation trigger, then IMO parallel HPA processing is a better way to move forward: it requires less change and helps with all metric types.
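
For reference, a small sketch of what that aggregate looks like in the autoscaling/v2 API as I understand it - a single quantity per external metric in status.currentMetrics, with no per-series breakdown (the metric name here is just an example):

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The HPA status holds one aggregated quantity per external metric,
	// not the individual series behind it.
	current := autoscalingv2.MetricStatus{
		Type: autoscalingv2.ExternalMetricSourceType,
		External: &autoscalingv2.ExternalMetricStatus{
			Metric:  autoscalingv2.MetricIdentifier{Name: "queue_size"},
			Current: autoscalingv2.MetricValueStatus{Value: resource.NewQuantity(42, resource.DecimalSI)},
		},
	}
	fmt.Println(current.External.Current.Value.String()) // "42"
}
```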

@itonyli itonyli removed their assignment Sep 29, 2022
@pbetkier
Contributor

Another idea from talking with @mwielgus: have the custom metrics adapter observe what queries it receives, pre-fetch the metrics for common queries just before the next request (e.g. 14s after the last request), and serve from cache. A bit hacky, but the HPA architecture doesn't change.
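
A rough sketch of that idea, with a hypothetical adapter-side cache (names and refresh policy are made up for illustration): remember which queries have been seen, refresh them in the background, and serve reads from memory instead of making a synchronous backend call:

```go
package main

import (
	"sync"
	"time"
)

// metricCache is a hypothetical adapter-side cache: queries the adapter has already
// seen are refreshed in the background, so the HPA controller's next request can be
// served from memory instead of a slow synchronous call to the backend.
type metricCache struct {
	mu     sync.RWMutex
	values map[string]int64
	fetch  func(query string) int64 // the slow backend call (e.g. a Prometheus query)
}

func newMetricCache(fetch func(string) int64, refresh time.Duration) *metricCache {
	c := &metricCache{values: map[string]int64{}, fetch: fetch}
	go func() {
		for range time.Tick(refresh) { // e.g. shortly before the next expected request
			c.mu.RLock()
			queries := make([]string, 0, len(c.values))
			for q := range c.values {
				queries = append(queries, q)
			}
			c.mu.RUnlock()
			for _, q := range queries {
				v := fetch(q)
				c.mu.Lock()
				c.values[q] = v
				c.mu.Unlock()
			}
		}
	}()
	return c
}

// Get serves from the cache when possible and falls back to a direct fetch the
// first time a query is seen.
func (c *metricCache) Get(query string) int64 {
	c.mu.RLock()
	v, ok := c.values[query]
	c.mu.RUnlock()
	if ok {
		return v
	}
	v = c.fetch(query)
	c.mu.Lock()
	c.values[query] = v
	c.mu.Unlock()
	return v
}

func main() {
	cache := newMetricCache(func(q string) int64 {
		time.Sleep(2 * time.Second) // simulate a slow Prometheus query
		return 42
	}, 14*time.Second)
	_ = cache.Get("queue_size") // first call is slow; subsequent calls are served from cache
}
```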

@lanshunfang

lanshunfang commented Jun 28, 2023

For those who are not aware, here is the new flag meant to solve this issue:

--concurrent-horizontal-pod-autoscaler-syncs
