Distinguish zero concurrency from slow/failed scraping when bucketing #8610
Comments
cc @markusthoemmes @vagababov for thoughts |
Actually it's the same as the @duglin issue we revisited earlier :) |
Want to find it and dupe? |
Done. |
FWIW I think this isn't totally the same as #8390. In #8390 @duglin is scraping at the correct rate, but the pockets of zero concurrency lead us to end up with fewer replicas than we actually need for peak load (because the simulated workload is a GitHub trigger firing multiple parallel events every 10 seconds or so, and we average over the full window). That one can potentially be fixed by the max-vs-average flag we've been informally chatting about: I'll pull out a top-level issue for that now. This one, I think, is slightly different. When the network is slow, or blips, we can get zero scaling data for a few seconds (or longer), and our current behaviour is to treat any gaps in data as if we'd actually seen concurrency zero. This means that if the scraper loses connectivity to the pods, or the network is temporarily congested, we can start to rapidly scale down the workload as fast as max-scale-down-rate will let us. The 'max' switch described above, which would potentially help bursty loads, would cope with this slightly better, but I think it's a cross-cutting problem we should solve in both cases: for example, by assuming the rolling average rather than 0 when we miss a scrape. Edit: Spun out #9092. |
Since #8390 was closed, I want to add my testcase to this one because I'm still seeing odd behavior even after #9092 is merged. Script:
And output I see today:
Notice how the # of pods isn't consistent, and its going below 50 doesn't seem right. But the 2x latency at times is obviously the biggest concern. |
Using:
helped w.r.t. latency - it was around 10 seconds consistently. However, I had 72 pods the entire time, which just doesn't seem right when I only have 50 requests. Yes, I know that TU (70%) is probably why I get an extra 12 pods, but from a user's POV it's hard to explain. I wonder if we need to make it clearer that this "utilization" isn't just per pod but across all pods, and really should be looked at as some kind of "over-provisioning" flag. Then it's clear that anything other than 100% means they're asking for "extra" unused space. And this space is calculated across all pods, not just within one. |
Meaning, (# of requests) * (CC/TU%) == # of pods they should see |
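For illustration only, here is a minimal sketch (not Knative code) of that back-of-the-envelope formula: with a container concurrency of CC requests per pod and a target utilization TU acting as an over-provisioning factor, the steady-state pod count is requests / (CC * TU), rounded up. The helper name `expectedPods` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// expectedPods is an illustrative helper, not actual Knative code.
// With containerConcurrency (cc) requests handled per pod and a target
// utilization (tu) treated as an over-provisioning factor, the
// steady-state pod count is requests / (cc * tu), rounded up.
func expectedPods(requests, cc int, tu float64) int {
	return int(math.Ceil(float64(requests) / (float64(cc) * tu)))
}

func main() {
	// 50 in-flight requests, CC=1, TU=70% -> 72 pods,
	// matching the observation above.
	fmt.Println(expectedPods(50, 1, 0.70)) // 72
	// With TU=95% the overhead shrinks.
	fmt.Println(expectedPods(50, 1, 0.95)) // 53
}
```

With these numbers, TU=70% accounts for exactly the 72 pods observed, which supports the "over-provisioning" framing above.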
Run with TU=95%? :)
|
Just FYI:
|
This issue is stale because it has been open for 90 days with no activity. |
/reopen |
Is this still an issue? Would this be a "good first issue" in the autoscaling area? /triage needs-user-input (I'll also point out that this bug timed out, so if it's a major issue, we may need to reconsider our priorities. If it's not a major issue, we may want to consider allowing it to time out again.) |
Unfortunately not. What sounds easy is actually a bit tricky because of how we do metric aggregation. Having said that, it's possible the new pluggable aggregation stuff @vagababov has added may make this more tractable 🤔. |
/remove-triage needs-user-input I'm not sure that Victor is going to land anything here; is this still an issue, and what priority? |
I do think it's a legit issue ("we do not distinguish failed/slow scrapes from zero concurrency, and we should, or we'll potentially scale down due to network blips") that needs more work to progress than a "good first issue" should. If we had a 'this is something someone who wants something meaty could work on' tag, I'd add that to this. |
/help |
@evankanderson: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign |
Describe the feature
Currently we do not differentiate between a scrape that actually reports zero concurrency from a replica and simply not having data for a particular bucket. This is fine when the network is fast and the autoscaler is not overloaded, because we will have data roughly every second; but on a slow or overloaded network (or, e.g., with a resource-constrained host and therefore slow queue-proxy responses to scrapes), it can cause issues: when we average over the bucket, we can think we have lower load than we do, and incorrectly scale down (or fail to scale up) replicas.
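The effect of that averaging can be sketched as follows. This is an illustrative toy, not the actual Knative aggregation code: each per-second bucket either holds a scraped concurrency value or is marked as "no data", and the window average either counts missed scrapes as zero (today's behaviour) or excludes them.

```go
package main

import "fmt"

// windowAverage is an illustrative sketch, not actual Knative code.
// buckets holds per-second concurrency samples; hasData marks which
// buckets actually received a scrape. Buckets with no data are skipped
// rather than counted as zero concurrency.
func windowAverage(buckets []float64, hasData []bool) float64 {
	sum, n := 0.0, 0
	for i, v := range buckets {
		if !hasData[i] {
			continue // skip missed scrapes instead of treating them as 0
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	// Five seconds of steady load at concurrency 10, but two scrapes
	// were lost to a network blip and recorded as 0.
	buckets := []float64{10, 10, 0, 0, 10}

	// Today's behaviour: the gaps look like real zeros, dragging the
	// average down and triggering scale-down.
	allData := []bool{true, true, true, true, true}
	fmt.Println(windowAverage(buckets, allData)) // 6

	// Excluding buckets with no data keeps the average at the true load.
	someData := []bool{true, true, false, false, true}
	fmt.Println(windowAverage(buckets, someData)) // 10
}
```

The same end result could also be reached by substituting the rolling average into missed buckets, as suggested in the earlier comment; the key point is that "no data" must be representable as distinct from "observed zero".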
(This is somewhat related to #8377, in that if we introduce a work pool, there's a greater danger of things backed up in the queue not getting stats every second.)