code_verb:apiserver_request_total:increase30d loads (too) many samples #411
Comments
Brainstormed idea:

```yaml
- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
```
Did you try the fixes I added in #403?
#403 was merged just a few seconds too late to make it into my production deploy. My idea above would actually reduce the total work done a lot, but it has the problem that you need to let it run for a month to get the result (or wait for rule backfilling…). So I guess, for now, #403 is the preferred approach. I'll try it out.
Not all of the queries in #403 work for me, specifically […]
I guess that's the problem with the approach: the "sharding" is not uniform. With my over-time idea, you cut the 30d into 720 equal hourly pieces (at the price of requiring that rule to be present for a month before you get proper results, or waiting for retroactive rule evaluation).
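A minimal sketch of that arithmetic, assuming for simplicity one increase1h sample per hour, so that a 30d range holds 24 × 30 = 720 of them:

```
# average of the ~720 hourly increases, scaled back up to a 30d total:
avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
# ≈ sum by(code, verb) (increase(apiserver_request_total[30d]))
```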
What are you suggesting? I'm not sure I can get an idea of how to improve things from your comment.
I had to go down to 3d to make our production servers not choke. (In general, I'd be happy to tweak my servers for heavy production use cases, but in this case, the heavyweight query is for a single dashboard and not even for the alerts. That's why I'm kind of keen to make it much more affordable.)
My over-time idea was up there. Quote:

```yaml
- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
```

As soon as I find a few moments, I'll play with it.
I have had success using the intermediary rules @beorn7 suggests in similar applications. Being forced to wait 30 days is a bit of a bummer, but hasn't been too big of a deal the other times I have done this.
Even applying @beorn7's suggestion didn't help much; it started to use a couple of percent less.
As an additional data point: Today, I deployed the current state of master as is (697afa2) to our beefiest production clusters, and it just worked. I couldn't even see any dramatic increase in CPU usage (but our Prometheis are fairly busy with many recording rules anyway; the dramatic increase seen by @den-is might be so visible because there seemed to be very little going on on that server before).
Indeed, mine was just an FYI. I was running Prometheus 2.17.x well before upgrading to this version of the "rules".
And another data point: Apparently, we are very close to the default limit of 50M samples loaded, because occasionally the rule evaluation fails. I'll bump up the limit for now, but the general concern remains: Should we have a fairly expensive query in the mixin by default that is only used for a few panels in one dashboard?
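For reference, the limit in question is Prometheus's per-query sample limit; a minimal sketch of raising it (the doubled value is purely illustrative):

```
# Prometheus server flag; the default is 50000000 samples per query.
--query.max-samples=100000000
```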
Hey, I'm still thinking about this problem, and sometimes I don't know whether or not to simply remove the rules for now.
Great. Let me know how it is going and where I can help.
Me too... getting the same issue even with #403
@beorn7, correct me if I'm wrong, but with your approach, we only ever get the average of requests made over 30d in 1h windows, no? An outage will show up significantly differently in terms of availability percentage, because the average over 1h windows won't be the same as the count over the whole 30d. After all, my goal is to have a rather detailed and correct measurement of availability, to be able to calculate the error budget appropriately, which is what a lot of discussions should be based on.
My suggestion is to replace

```yaml
- record: code_verb:apiserver_request_total:increase30d
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))
```

with

```yaml
- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
```

Have you seen differences in practice, or is your conclusion just based on theoretical considerations? If the latter, could you explain in more detail? The new approach is even more flexible because it's now probably cheap enough to calculate time ranges other than 30d ad hoc, e.g. you could do […]
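The truncated example above presumably follows the same pattern; a sketch, assuming a 7d window:

```
# ad-hoc weekly increase derived from the same hourly rule (illustrative):
avg_over_time(code_verb:apiserver_request_total:increase1h[7d]) * 24 * 7
```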
@metalmatze, when you ran these tests, how many kube-apiservers were there? When going beyond 3, do we pass the threshold of too many samples?
@beorn7, we ran a small experiment with @metalmatze to see if your approach was precise enough to replace the current recording rule. The goal of this experiment was to run both recording rules in a cluster where we could simulate apiserver outages, to check how the new recording rule behaved compared to the one we currently have. So far, after 3 days, we've gathered the following data: […]

Some key data points:

- After ~8h and 1 outage: the new rule reached the same availability with a precision of a hundredth.

Based on these results, the new recording rule should be precise enough to replace the current one, considering its use case. We also noticed some great improvements in the evaluation time of the recording rule. At the start of the experiment, the new rule was evaluated 30 times faster than the current one, and now, after 3 days and much more data, almost 60 times faster. We weren't too sure how to measure the number of loaded samples, but I think these numbers already prove a significant improvement.
Thank you a lot for helping with this experiment and driving it home!
I'm glad the math worked out here. Thank you very much for the research.
This is about the following rule (as quoted in the discussion above):
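```yaml
- record: code_verb:apiserver_request_total:increase30d
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))
```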
apiserver_request_total has a surprisingly high cardinality (easily into the thousands for moderate cluster sizes). The rule above therefore loads 30d of samples of thousands of series into memory, which easily breaches the default 50M limit.
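For a rough sense of that cardinality, a minimal sketch of an ad-hoc check:

```
# number of distinct apiserver_request_total series, per job:
count by (job) (apiserver_request_total)
```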
The rule is only used in the apiserver dashboard (to display the availability over the last 30d). It's not even used for the “important” part of alerting.
It would be good to find a way to work around this problem.