
code_verb:apiserver_request_total:increase30d loads (too) many samples #411

Closed

beorn7 opened this issue Apr 29, 2020 · 23 comments · Fixed by #537

Comments

@beorn7
Contributor

beorn7 commented Apr 29, 2020

This is about the following rule:

record: code_verb:apiserver_request_total:increase30d
expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))

apiserver_request_total has a surprisingly high cardinality (easily into the thousands for moderate cluster sizes). The rule above therefore loads 30d of samples of thousands of series into memory, which easily breaches the default 50M limit.

The rule is only used in the apiserver dashboard (to display the availability over the last 30d). It's not even used for the “important” part of alerting.

It would be good to find a way to work around this problem.
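
For rough intuition about why this breaches the limit (the figures below are illustrative assumptions, not measurements):

samples loaded ≈ series × (30d / scrape interval)
               ≈ 2,000 × (2,592,000 s / 30 s)
               ≈ 2,000 × 86,400
               ≈ 173,000,000

which is well above the 50,000,000 default of Prometheus's --query.max-samples flag.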

@beorn7
Contributor Author

beorn7 commented Apr 29, 2020

Brainstormed idea:

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
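
A back-of-the-envelope comparison of what each rule would have to load per evaluation (again with illustrative assumptions: ~2,000 raw series, a 30s scrape interval, a 1m rule-evaluation interval, and ~50 resulting (code, verb) combinations):

increase1h:  2,000 series × (3,600 s / 30 s)     =   240,000 raw samples
increase30d:    50 series × (2,592,000 s / 60 s) = 2,160,000 recorded samples

Both stay far below the 50M limit, instead of loading the full 30d of raw samples in a single query.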

@metalmatze
Member

Did you try the fixes I added in #403?
Even with those improvements, does the problem still persist?

@beorn7
Contributor Author

beorn7 commented Apr 30, 2020

#403 was merged just a few seconds too late to make it into my production deploy. I'll try it out.

My idea above would actually reduce the total work a lot, but it has the problem that you need to let it run for a month to get the result (or wait for rule backfilling…).

So I guess, for now #403 is the preferred approach. I'll try it out.

@csmarchbanks
Member

csmarchbanks commented Apr 30, 2020

Not all of the queries in #403 work for me, specifically sum by (code, verb) (increase(apiserver_request_total{code=~"2..", job="apiserver", verb="LIST"}[30d])) loads too many samples. That query with more than [8d] is failing on the server I tried.

@beorn7
Contributor Author

beorn7 commented May 4, 2020

I guess that's the problem with the approach: the "sharding" is not uniform. With my over-time idea, you are cutting the 30d into 720 equal hourly pieces (at the price of requiring that rule to be present for a month before you get proper results, or waiting for retroactive rule evaluation).

@metalmatze
Member

What are you suggesting? I can't quite tell from your comment how to improve things.
I guess for a lot of people 7d availability would be good enough too?
I could dedicate some time again so that the availability windows can be computed for a configurable number of days. Then people could simply run with 7d, for example.
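
A minimal sketch of what such a shorter-window variant could look like, keeping the current single-rule shape (the 7d window and rule name are just placeholders):

- record: code_verb:apiserver_request_total:increase7d
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[7d]))

Note that this still loads 7d of raw samples per evaluation, so it shrinks the problem rather than removing it.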

@beorn7
Contributor Author

beorn7 commented May 5, 2020

I had to go down to 3d to make our production servers not choke.

(In general, I'd be happy to tweak my servers for heavy production use cases, but in this case, the heavyweight query is only for a single dashboard and not even for the alerts. That's why I'm kind of keen to make it much more affordable.)

@beorn7
Contributor Author

beorn7 commented May 5, 2020

My over-time idea was up there. Quote:

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30

As soon as I find a few moments, I'll play with it.

@csmarchbanks
Member

I have had success using the intermediary rules @beorn7 suggests in similar applications. Being forced to wait 30 days is a bit of a bummer, but hasn't been too big of a deal the other times I have done this.

@den-is

den-is commented May 6, 2020

Even applying @beorn7's suggestion didn't help much; CPU usage only dropped by a couple of percent.
[Screenshot: CPU usage graph, 2020-05-06 20:51]
I'll probably have to remove those rules.

@beorn7
Contributor Author

beorn7 commented May 6, 2020

As an additional data point:

Today, I deployed the current state of master as is (697afa2) to our beefiest production clusters, and it just worked. I couldn't even see any dramatic increase in CPU usage (but our Prometheis are fairly busy with many recording rules anyway; the dramatic increase seen by @den-is might only be so visible because there seemed to be very little going on on that server before).

@den-is

den-is commented May 7, 2020

Indeed, my stable/prometheus-operator deploy is 99% pure your work, guys; I'm adding almost nothing on top, and that deploy/cluster isn't doing anything epic.
It's not only me seeing such a drastic change in CPU usage; many others report the same in the similar issues that have been opened, e.g. helm/charts#22003 (comment).

JFYI, I was running Prometheus 2.17.x well before upgrading to this version of the "rules".
Yesterday I was "forced" to upgrade to the latest version + Prometheus 2.18.0.

@beorn7
Contributor Author

beorn7 commented May 7, 2020

And another data point: Apparently, we are very close to the default limit of 50M samples loaded because occasionally the rule evaluation fails. I'll bump up the limit for now, but the general concern stays: Should we have a fairly expensive query in the mixin by default that is only used for a few panels in one dashboard?
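
For reference, the 50M figure corresponds to Prometheus's --query.max-samples flag, whose default is 50,000,000, so bumping the limit would look roughly like this (the new value is arbitrary):

prometheus --config.file=prometheus.yml --query.max-samples=100000000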

@metalmatze
Member

Hey, I'm still thinking about this problem and sometimes don't know whether or not to simply remove the rules for now.
On the other hand, I think having the availability calculated is very helpful, and only with the availability are we able to talk about our remaining error budget.
I'll try to update the rules and check if I can implement @beorn7's original idea of splitting the query into two rules with 1h and 30d windows :)

@beorn7
Contributor Author

beorn7 commented May 29, 2020

Great. Let me know how it is going and where I can help.

@songrijie

Me too... getting the same issue even with #403

@metalmatze
Member

@beorn7, correct me if I'm wrong, but with your approach we only ever get the average of requests made over 30d in 1h windows, no? An outage will show up significantly differently in the availability percentage, because the average of the 1h windows won't be the same as the total count over the full 30d.

After all, my goal is to have a rather detailed and correct measurement of availability, to be able to calculate the error budget appropriately, which is what a lot of discussions should be based on.

@beorn7
Contributor Author

beorn7 commented Jun 9, 2020

My suggestion is to replace

record: code_verb:apiserver_request_total:increase30d
expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))

with

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30

Once code_verb:apiserver_request_total:increase1h has been populated for 30d, the old and the new code_verb:apiserver_request_total:increase30d should yield very similar results. The only difference is a 1h-long "fade out" at the beginning and end of the 30d range, which shouldn't really matter in practice.
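
A sketch of why the two should agree, assuming the 1h rule is evaluated at a regular interval: every sample of code_verb:apiserver_request_total:increase1h covers the preceding hour, so each underlying increment is counted by roughly the same number of overlapping 1h windows. Averaging those samples over 30d therefore approximates the mean hourly increase, and

avg_over_time(increase1h[30d]) * 24 * 30 ≈ (increase over 30d / 720) * 720 = increase over 30d

The approximation only degrades within 1h of the start and end of the range, which is exactly the "fade out" mentioned above.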

Have you seen differences in practice, or is your conclusion just based on theoretical considerations? If the latter, could you explain in more detail?

The new approach is even more flexible because it's now probably cheap enough to calculate other time ranges than 30d ad hoc, e.g. you could do avg_over_time(code_verb:apiserver_request_total:increase1h[7d]) * 24 * 7 without a recording rule if you wanted to inspect the last week.

@metalmatze
Member

Right, I think I'm just trying to understand the different approaches in depth.
So far I have almost 2 weeks' worth of data for the new recording rule, and indeed it looks very similar 👍 The numbers are still quite different, though, and we'll probably need to adjust a bunch of the availability calculations, including the duration/latency parts. But that's a problem that can be solved.

Current recording rules:
[Graph: increase30d]
Updated recording rules (with 2 weeks of data so far):
[Graph: increase30d_test]

@dsexton

dsexton commented Jul 9, 2020

@metalmatze when you ran these tests, how many kube-apiservers were there? When going beyond 3, do we pass the threshold of too many samples?

@dgrisonnet
Contributor

@beorn7 we ran a small experiment with @metalmatze to see if your approach is precise enough to replace the current recording rule. The goal of this experiment was to run both recording rules in a cluster where we could simulate apiserver outages, to check how the new recording rule behaves compared to the one we currently have.

So far, after 3 days, we've gathered the following data:

[Graphs: apiserver availability over the experiment]
The first graph shows the apiserver availability calculated with the new recording rule, the second one with the current approach.

Some key datapoints:

Timestamp | Event | New rule | Current rule | Distance
--- | --- | --- | --- | ---
2020-12-15 13:27:00 | start of the experiment | 1 | 1 | 0
2020-12-15 14:46:00 | apiserver outage | 0.99999 | 0.99988 | 0.00011
2020-12-15 15:10:40 | apiserver available | 0.83569 | 0.84935 | 0.01366
2020-12-15 15:49:30 | min(NewRule) | 0.76717 | 0.88793 | 0.12076
2020-12-16 09:23:30 | apiserver outage | 0.98388 | 0.98532 | 0.00144
2020-12-16 10:09:30 | apiserver available | 0.97111 | 0.95558 | 0.01553
2020-12-16 11:08:00 | min(NewRule) | 0.95356 | 0.95719 | 0.00363
2020-12-17 14:29:30 | apiserver outage | 0.98020 | 0.98082 | 0.00062
2020-12-17 16:36:00 | apiserver available | 0.95340 | 0.94771 | 0.00569
2020-12-17 17:27:15 | min(NewRule) | 0.94698 | 0.94856 | 0.00157
2020-12-18 15:19:00 | end of the experiment | 0.96289 | 0.96368 | 0.00078

After ~8h and 1 outage: new rule reached the same availability with a precision of a hundredth.
After ~42h and 2 outages: new rule reached the same availability with a precision of a thousandth.
At the end of the experiment, after 3 days and 3 outages, the new rule reached a precision of a thousandth compared to the existing rule. We can expect both values to converge even more over time, considering how the new rule is built.

Based on these results, the new recording rule should be precise enough to replace the current one considering its use case.

We also noticed some great improvements regarding the evaluation time of the recording rule. At the start of the experiment, the new rule was evaluated 30 times faster than the current one, and now, after 3 days and much more data, almost 60 times faster. We weren't too sure how to measure the number of loaded samples, but I think these numbers already show a significant improvement.

[Graph: rule evaluation duration, new vs. current rule]
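
In case it helps reproduce the comparison: per-group evaluation duration is exposed by Prometheus itself, so something along these lines can be graphed (the group name regex is just a guess at how the mixin's rule groups are named):

prometheus_rule_group_last_duration_seconds{rule_group=~".*apiserver.*"}

Per-rule evaluation time is also visible in the Prometheus web UI under Status → Rules.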

@metalmatze
Member

Thank you a lot for helping with this experiment and driving it home!
Looking forward to your upcoming PR. 👍

@beorn7
Contributor Author

beorn7 commented Dec 18, 2020

I'm glad the math worked out here. Thank you very much for the research.
