HPA scheduling implementation doesn't scale well with custom/external metrics - all checks are sequential by one goroutine #96242
@scr-oath: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig autoscaling
/assign
In my opinion, different scenarios need different options. Could a configurable option be added to implement a timeout (graceful degradation) for external metric fetches, to prevent excessive metric drift? @deads2k @luxas @mtaufen
If the fetch times of different metrics vary widely, the replica count will be adjusted frequently on the next calculation.
As I understand it, the current implementation just has a queue/chan, a timer of some sort, and a single goroutine to process the queue of work. There is a comment, "start a single worker (we may wish to start more in the future)". I would imagine that many scheduling implementations solve for requirements such as…
I don't think that "timeout" is really the issue - what if you had some external metric - say, the temperature on the moon - that took 8 minutes to return an update (a contrived example)? That should still be possible, and not drag down all the other HPA checks in the system, right?
@scr-oath As I see it there are 2 issues here:
With regard to issue 1, I suppose increasing the worker count should alleviate the problem - maybe even making it configurable for larger clusters like yours.
I don't recall describing 2 in quite the same terms - I was saying that each HPA should make its call-outs independently of the others. Sure, you could also parallelize all of the metrics within each HPA, but as you point out, you would then need to join/wait for them all to return. I'm more concerned with isolating/parallelizing the work of each HPA from the others - or at least, as you've also suggested, adding more workers to drain the queue faster rather than letting it pile up.
I guess you meant "Other HPAs not using custom/external metrics should not be impacted..."
ohhhh… thanks for clarifying the semantic typo - you got it! Other HPAs - updated the description.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
/reopen |
@scr-oath: Reopened this issue.
/remove-lifecycle rotten |
Just faced the very same issue (and it took me a while to discover that the delay in scaling was caused by this - somehow I expected the HPA to work in parallel, and there are no debug logs to suggest otherwise). In our case we're using Keda to set up HPAs and provide external metrics, which in turn come from Prometheus. Unfortunately Prometheus got overloaded (see prometheus/prometheus#8014) and started responding with >2s delay. @itonyli / @arjunrn Can you consider actually implementing the multiple-workers approach?
I tried a test with an external metric that has only a ~100µs response time, added 1,000 Keda metrics, and saw only 25 fetches in an 8-minute period (I would have expected 4×8=32) - roughly a 20% loss of metric-gathering points over time. When using that external metric directly, I see about a 3% loss (one drop in 8m), so clearly Keda doesn't completely work around the HPA limitation. It would be really great if the HPA itself would do the right thing and not drift, skip, or penalize the metric gathering of other HPAs because one HPA has a slow metric.
I've unfortunately independently discovered this very same problem, with external metrics scaling very poorly - particularly under Keda querying Prometheus. Is there any particular reason the solution cannot be a naive increase in the number of goroutines spawned to process the HPA objects concurrently on every 15-second loop execution? There was an acknowledgement above that an HPA object may itself hold multiple metrics to be calculated, so at the very least we would need to ensure that each metric on a given HPA object is still processed serially; but each HPA object should be independent, so that they can be processed concurrently. I'm shocked to see that the HPA loop is single-threaded on an I/O-bound operation, serially calculating each HPA in core Kubernetes.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale |
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten |
Does anyone know if this limitation affects the vanilla CPU/memory metrics, or does it only take effect when checking custom metrics?
@jjcaballero if you only have HPAs with CPU/memory metrics the effect should be less visible. The controller still makes serial, blocking calls to fetch the metrics, but the response times on these calls should be lower - the metrics-server component responds immediately with what it has in its cache, as opposed to making a synchronous call to the custom metrics source (e.g. Prometheus).
That makes me think… it would be really interesting if there were a feature for async metrics. The idea: what if things aware of metrics - whether scanners running at some frequency, or listeners notified by events - could update some status fields of the HPA (or otherwise update the HPA's cache)? Then there would be zero latency added to the system (beyond having more metrics to consider - but at least no I/O-scale latency). Would that be an easier pill to swallow than making the scheduler run with more concurrency/parallelism?
Do you mean update with metric values directly, or update with a flag saying "new metrics are available to read from metrics-server"? I believe storing metric values would be prohibitively costly in large clusters, but theoretically we could store a map (config map?) of HPA name to new-metric readiness. In that case this problem is IMO orthogonal to concurrent processing: async metrics would make the HPA run as soon as metrics are available to read (saving 7.5s on average, 15s worst case), while concurrent processing would help with scaling the calls to metrics-server. WDYT @scr-oath? Would you be willing to pursue this idea further, and see what would need to happen for this to work?
I meant altering the HPA with the info… the HPA already stores its external fetch result in the status field, doesn't it? So just having a way to store that (or something like it) asynchronously would seem fine. Essentially, I'm suggesting that external metrics have a way of communicating updates - maybe by writing to a kevent-like endpoint - so that when the HPA considers that object next, it can look at a field rather than doing more I/O that may be slow. One contrived example: say you have a sensor on the moon - a query would take roughly 5s there and back… but if it could send updates, you wouldn't care how much delivery overhead/latency there was.
See above - isn't the data stored with the HPA object anyway, in the status field for each external metric? (I don't know the answer for sure, but I suspect it is.)
It stores an aggregate: the sum of all metric values. If your external metric has a single series (e.g. queue size) then it's the same as what we store in status, but you can define metrics having multiple series, e.g. reported for each application instance. Also, I thought you meant changing the architecture to async for all metric types, not only external. That would allow us to recalculate the HPA as soon as new metrics arrive, instead of waiting for the 15s intervals. If we're speaking of just external metrics, without affecting the reconciliation trigger, then IMO parallel HPA processing is the better way forward: it requires less change and helps with all metric types.
Another idea from talking with @mwielgus: have the custom metrics adapter observe what queries it receives, pre-fetch metrics for common queries just before the next request (e.g. 14s after the last one), and serve from cache. A bit hacky, but the HPA architecture doesn't change.
For those of you who are not aware, here is the new flag meant to solve this issue:
What happened:
I created an external metric that took roughly 100ms to fetch; then, to see what would happen if there were many of them, I created 1,000 HPAs using it. Normally every HPA is checked every 15s, but this resulted in each HPA being checked only about every 120 seconds - a drift/delay of over 100 seconds.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
…the `GetExternalMetric` method of the Testing Provider…
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): Docker CE for Mac
- OS (e.g: `cat /etc/os-release`): Mac OS Catalina 10.15.7
- Kernel (e.g. `uname -a`): Darwin 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64