
code_verb:apiserver_request_total:increase30d loads (too) many samples #411

Closed

beorn7 opened this issue Apr 29, 2020 · 23 comments · Fixed by #537

Comments

@beorn7
Contributor

beorn7 commented Apr 29, 2020

This is about the following rule:

record: code_verb:apiserver_request_total:increase30d
expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))

apiserver_request_total has a surprisingly high cardinality (easily into the thousands for moderate cluster sizes). The rule above therefore loads 30d of samples of thousands of series into memory, which easily breaches the default 50M limit.

The rule is only used in the apiserver dashboard (to display the availability over the last 30d). It's not even used for the “important” part of alerting.

It would be good to find a way to work around this problem.
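
For rough intuition about why this breaches the limit (the figures below are illustrative assumptions, not measurements):

samples loaded ≈ series × (30d / scrape interval)
               ≈ 2,000 × (2,592,000 s / 30 s)
               ≈ 2,000 × 86,400
               ≈ 173,000,000

which is well above the 50,000,000 default of Prometheus's --query.max-samples flag.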

@beorn7
Contributor Author

beorn7 commented Apr 29, 2020

Brainstormed idea:

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30
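
A back-of-the-envelope comparison of what each rule would have to load per evaluation (again with illustrative assumptions: ~2,000 raw series, a 30s scrape interval, a 1m rule-evaluation interval, and ~50 resulting (code, verb) combinations):

increase1h:  2,000 series × (3,600 s / 30 s)     =   240,000 raw samples
increase30d:    50 series × (2,592,000 s / 60 s) = 2,160,000 recorded samples

Both stay far below the 50M limit, instead of loading the full 30d of raw samples in a single query.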

@metalmatze
Member

Did you try the fixes I added in #403?
Even with those improvements, does the problem still persist?

@beorn7
Contributor Author

beorn7 commented Apr 30, 2020

#403 was merged just a few seconds too late to make it into my production deploy. I'll try it out.

My idea above would actually reduce the total work a lot, but it has the problem that you need to let it run for a month to get the result (or wait for rule backfilling…).

So I guess, for now #403 is the preferred approach. I'll try it out.

@csmarchbanks
Member

csmarchbanks commented Apr 30, 2020

Not all of the queries in #403 work for me, specifically sum by (code, verb) (increase(apiserver_request_total{code=~"2..", job="apiserver", verb="LIST"}[30d])) loads too many samples. That query with more than [8d] is failing on the server I tried.

@beorn7
Contributor Author

beorn7 commented May 4, 2020

I guess that's the problem with the approach: the "sharding" is not uniform. With my over-time idea, you are cutting the 30d into 720 equal hourly pieces (at the price of requiring that rule to be present for a month before you get proper results, or waiting for retroactive rule evaluation).

@metalmatze
Member

What are you suggesting? I can't quite tell from your comment how to improve things.
I guess for a lot of people 7d availability would be good enough too?
I could dedicate some time again so that the availability windows can be computed for a configurable number of days. Then people could simply run with 7d, for example.
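
A minimal sketch of what such a shorter-window variant could look like, keeping the current single-rule shape (the 7d window and rule name are just placeholders):

- record: code_verb:apiserver_request_total:increase7d
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[7d]))

Note that this still loads 7d of raw samples per evaluation, so it shrinks the problem rather than removing it.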

@beorn7
Contributor Author

beorn7 commented May 5, 2020

I had to go down to 3d to make our production servers not choke.

(In general, I'd be happy to tweak my servers for heavy production use cases, but in this case, the heavyweight query is only for a single dashboard and not even for the alerts. That's why I'm kind of keen to make it much more affordable.)

@beorn7
Contributor Author

beorn7 commented May 5, 2020

My over-time idea was up there. Quote:

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30

As soon as I find a few moments, I'll play with it.

@csmarchbanks
Member

I have had success using the intermediary rules @beorn7 suggests in similar applications. Being forced to wait 30 days is a bit of a bummer, but hasn't been too big of a deal the other times I have done this.

@den-is

den-is commented May 6, 2020

Even applying @beorn7's suggestion didn't help much; CPU usage only dropped by a couple of percent.
[Screenshot: CPU usage graph, 2020-05-06 20:51]
I'll probably have to remove those rules.

@beorn7
Contributor Author

beorn7 commented May 6, 2020

As an additional data point:

Today, I deployed the current state of master as is (697afa2) to our beefiest production clusters, and it just worked. I couldn't even see any dramatic increase in CPU usage (but our Prometheis are fairly busy with many recording rules anyway; the dramatic increase seen by @den-is might only be so visible because there seemed to be very little going on on that server before).

@den-is

den-is commented May 7, 2020

Indeed, my stable/prometheus-operator deploy is 99% pure your work, guys; I'm adding almost nothing on top, and that deploy/cluster isn't doing anything epic.
It's not only me seeing such a drastic change in CPU usage; many others report the same in the similar issues that have been opened, e.g. helm/charts#22003 (comment).

JFYI, I was running Prometheus 2.17.x well before upgrading to this version of the "rules".
Yesterday I was "forced" to upgrade to the latest version + Prometheus 2.18.0.

@beorn7
Contributor Author

beorn7 commented May 7, 2020

And another data point: Apparently, we are very close to the default limit of 50M samples loaded because occasionally the rule evaluation fails. I'll bump up the limit for now, but the general concern stays: Should we have a fairly expensive query in the mixin by default that is only used for a few panels in one dashboard?
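
For reference, the 50M figure corresponds to Prometheus's --query.max-samples flag, whose default is 50,000,000, so bumping the limit would look roughly like this (the new value is arbitrary):

prometheus --config.file=prometheus.yml --query.max-samples=100000000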

@metalmatze
Member

Hey, I'm still thinking about this problem and sometimes don't know whether or not to simply remove the rules for now.
On the other hand, I think having the availability calculated is very helpful, and only with the availability are we able to talk about our remaining error budget.
I'll try to update the rules and check if I can implement @beorn7's original idea of splitting the query into two rules with 1h and 30d windows :)

@beorn7
Contributor Author

beorn7 commented May 29, 2020

Great. Let me know how it is going and where I can help.

@songrijie

Me too... getting the same issue even with #403

@metalmatze
Member

@beorn7, correct me if I'm wrong, but with your approach we only ever get the average of requests made over 30d in 1h windows, no? An outage will show up significantly differently in the availability percentage, because the average of the 1h windows won't be the same as the total count over the full 30d.

After all, my goal is to have a rather detailed and correct measurement of availability, to be able to calculate the error budget appropriately, which is what a lot of discussions should be based on.

@beorn7
Contributor Author

beorn7 commented Jun 9, 2020

My suggestion is to replace

record: code_verb:apiserver_request_total:increase30d
expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))

with

- record: code_verb:apiserver_request_total:increase1h
  expr: sum by(code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
- record: code_verb:apiserver_request_total:increase30d
  expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30

Once code_verb:apiserver_request_total:increase1h has been populated for 30d, the old and the new code_verb:apiserver_request_total:increase30d should yield very similar results. The only difference is a 1h-long "fade out" at the beginning and end of the 30d range, which shouldn't really matter in practice.
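
A sketch of why the two should agree, assuming the 1h rule is evaluated at a regular interval: every sample of code_verb:apiserver_request_total:increase1h covers the preceding hour, so each underlying increment is counted by roughly the same number of overlapping 1h windows. Averaging those samples over 30d therefore approximates the mean hourly increase, and

avg_over_time(increase1h[30d]) * 24 * 30 ≈ (increase over 30d / 720) * 720 = increase over 30d

The approximation only degrades within 1h of the start and end of the range, which is exactly the "fade out" mentioned above.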

Have you seen differences in practice, or is your conclusion just based on theoretical considerations? If the latter, could you explain in more detail?

The new approach is even more flexible because it's now probably cheap enough to calculate other time ranges than 30d ad hoc, e.g. you could do avg_over_time(code_verb:apiserver_request_total:increase1h[7d]) * 24 * 7 without a recording rule if you wanted to inspect the last week.

@metalmatze
Member

Right, I think I'm just trying to understand the different approaches in depth.
So far I have almost 2 weeks' worth of data for the new recording rule, and indeed it looks very similar 👍 The numbers are still quite different, though, and we'll probably need to adjust a bunch of the availability calculations, including the duration/latency parts. But that's a problem that can be solved.

Current recording rules:
[Graph: increase30d]
Updated recording rules (with 2 weeks of data so far):
[Graph: increase30d_test]

@dsexton

dsexton commented Jul 9, 2020

@metalmatze when you ran these tests, how many kube-apiservers were there? When going beyond 3, do we pass the threshold of too many samples?

@dgrisonnet
Contributor

@beorn7 we ran a small experiment with @metalmatze to see if your approach is precise enough to replace the current recording rule. The goal of this experiment was to run both recording rules in a cluster where we could simulate apiserver outages, to check how the new recording rule behaves compared to the one we currently have.

So far, after 3 days, we've gathered the following data:

[Graphs: apiserver availability over the experiment]
The first graph shows the apiserver availability calculated with the new recording rule, the second one with the current approach.

Some key datapoints:

Timestamp | Event | New rule | Current rule | Distance
--- | --- | --- | --- | ---
2020-12-15 13:27:00 | start of the experiment | 1 | 1 | 0
2020-12-15 14:46:00 | apiserver outage | 0.99999 | 0.99988 | 0.00011
2020-12-15 15:10:40 | apiserver available | 0.83569 | 0.84935 | 0.01366
2020-12-15 15:49:30 | min(NewRule) | 0.76717 | 0.88793 | 0.12076
2020-12-16 09:23:30 | apiserver outage | 0.98388 | 0.98532 | 0.00144
2020-12-16 10:09:30 | apiserver available | 0.97111 | 0.95558 | 0.01553
2020-12-16 11:08:00 | min(NewRule) | 0.95356 | 0.95719 | 0.00363
2020-12-17 14:29:30 | apiserver outage | 0.98020 | 0.98082 | 0.00062
2020-12-17 16:36:00 | apiserver available | 0.95340 | 0.94771 | 0.00569
2020-12-17 17:27:15 | min(NewRule) | 0.94698 | 0.94856 | 0.00157
2020-12-18 15:19:00 | end of the experiment | 0.96289 | 0.96368 | 0.00078

After ~8h and 1 outage: new rule reached the same availability with a precision of a hundredth.
After ~42h and 2 outages: new rule reached the same availability with a precision of a thousandth.
At the end of the experiment, after 3 days and 3 outages, the new rule reached a precision of a thousandth compared to the existing rule. We can expect both values to converge even more over time, considering how the new rule is built.

Based on these results, the new recording rule should be precise enough to replace the current one considering its use case.

We also noticed some great improvements regarding the evaluation time of the recording rule. At the start of the experiment, the new rule was evaluated 30 times faster than the current one, and now, after 3 days and much more data, almost 60 times faster. We weren't too sure how to measure the number of loaded samples, but I think these numbers already show a significant improvement.

[Graph: rule evaluation duration, new vs. current rule]
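
In case it helps reproduce the comparison: per-group evaluation duration is exposed by Prometheus itself, so something along these lines can be graphed (the group name regex is just a guess at how the mixin's rule groups are named):

prometheus_rule_group_last_duration_seconds{rule_group=~".*apiserver.*"}

Per-rule evaluation time is also visible in the Prometheus web UI under Status → Rules.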

@metalmatze
Member

Thank you a lot for helping with this experiment and driving it home!
Looking forward to your upcoming PR. 👍

@beorn7
Contributor Author

beorn7 commented Dec 18, 2020

I'm glad the math worked out here. Thank you very much for the research.
