Getting "query processing would load too many samples" error for simple count(metric) query #7281

Closed
pstibrany opened this issue May 21, 2020 · 22 comments


@pstibrany
Contributor

pstibrany commented May 21, 2020

What did you do? Run count(metric) for a metric with a high number of series (200k) over a 1h range in the Prometheus UI.

What did you expect to see? Result.

What did you see instead? Under which circumstances? The instant query returns 208040, but trying to create a graph (using the Prometheus Graph tab) for a range as low as 10 minutes fails with the error Error executing query: query processing would load too many samples into memory in query execution.

Running Prometheus version: 2.18.1, ecee9c8, with query.max-samples=50000000.

[Screenshot from 2020-05-21: the Prometheus Graph tab showing the "query processing would load too many samples" error]

@brian-brazil
Contributor

This is all as expected: if you try to have an intermediate range vector that has over 50M samples, the limit is protecting you as designed.

@pstibrany
Contributor Author

pstibrany commented May 21, 2020

This is all as expected: if you try to have an intermediate range vector that has over 50M samples, the limit is protecting you as designed.

The problem as I see it is that it needs to load the entire intermediate range vector before processing it, where the range here is really the entire input window :-(

How difficult (or even possible) would it be to avoid this limitation/step?

@brian-brazil
Contributor

That'd be a significant rewrite of PromQL, which would likely increase memory and CPU usage for common queries, which would be somewhat self-defeating. If you need to work with high-cardinality series like this, you should use recording rules.
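For example (the rule name here is illustrative, not something from this issue), a recording rule with record: job:metric:count and expr: count(metric) evaluates the aggregation once per rule-evaluation interval, so the Graph tab then only has to query that single pre-computed series instead of loading 200k series at every step.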

@pstibrany
Contributor Author

Thanks for the reply.

@brian-brazil
Contributor

Looks like there's nothing to do here.

@gouthamve
Member

gouthamve commented May 22, 2020

Hrm, I'd disagree, @brian-brazil. While it would require a significant rewrite of PromQL, that doesn't mean it's not an issue. Maybe someone looking to rewrite PromQL in the future could look at fixing this as well.

We're not doing something(something_over_time(high_cardinality[1m])) but just a simple count(collectd_cpu_total); see the message I posted in our internal Slack:

If I understand this correctly, then we check for #(samples_in_result_set) + #(samples_in_current_step) < 50e6. As count(collectd_cpu_total) returns only one series per step, #(samples_in_result_set) becomes #steps = 12 hours with a 20-second step = (12 * 60 * 60 / 20) = 2160. So the only reason we are hitting the 50e6 samples limit is #(samples_in_current_step).
Now that doesn't seem possible, because the #series is 90K.

I would see this as a bug in PromQL, one that is not as easy to fix as it looks, but not something to be closed.
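To make the check concrete, here is a minimal Go sketch of the accounting as described in the quoted message above. This is an illustration only, not the actual PromQL engine code; the evaluator type, the addSamples helper, and the numbers are assumptions for the example.

```go
// Minimal sketch of the sample-count check described above (not the real engine).
package main

import "fmt"

const maxSamples = 50_000_000 // stand-in for --query.max-samples

type evaluator struct {
	currentSamples int // samples currently held in memory for this query
}

// addSamples models the accounting: refuse to load n more samples if the
// running total would exceed the limit.
func (ev *evaluator) addSamples(n int) error {
	if ev.currentSamples+n > maxSamples {
		return fmt.Errorf("query processing would load too many samples into memory in query execution")
	}
	ev.currentSamples += n
	return nil
}

func main() {
	ev := &evaluator{}
	const series, steps = 200_000, 300 // e.g. a 10-minute range at a 2-second resolution
	for step := 0; step < steps; step++ {
		// One sample per series per step. If samples from earlier steps are
		// never released, the running total crosses 50M well before the end.
		if err := ev.addSamples(series); err != nil {
			fmt.Printf("failed at step %d: %v\n", step, err)
			return
		}
	}
	fmt.Println("query fits within the limit")
}
```

With 200,000 series and one sample per series per step, the running total in this sketch crosses 50M around step 250, which is why even short ranges can trip the limit if samples from earlier steps are still being counted.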

@brian-brazil
Contributor

brian-brazil commented May 22, 2020

I don't see how this can be classed as a bug; this is working exactly as designed. This query does attempt to have more than 50M samples in memory, and it is correctly stopped.

@codesome
Member

This could be an enhancement in the way we count series. If it's worth it, we could have a special case for count(metric) that only counts the number of series from the index (with some additions in TSDB to do so) and does not load samples.

@brian-brazil
Contributor

brian-brazil commented May 22, 2020

That wouldn't be correct as it wouldn't handle gaps or staleness correctly, and we already process only a sliding window of one series at a time when evaluating instant vectors.

Note that count isn't the issue here; it doesn't get that far. It's effectively attempting high_cardinality[10m:2s], and 200k * 300 > 50M.
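Spelling that arithmetic out: a 10-minute range at a 2-second resolution is 600 s / 2 s = 300 evaluation steps, and with roughly 200,000 series that means 200,000 * 300 = 60,000,000 samples, which is over the 50,000,000 limit.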

@pstibrany
Contributor Author

pstibrany commented May 22, 2020

Please let's not focus on count; I used it only as an example. The real query I was investigating uses multiple sum(irate(high_cardinality_metric[5m])) expressions over 12h; I just reduced it to the smallest example I could find.

@pstibrany
Contributor Author

Note that count isn't the issue here; it doesn't get that far.

Exactly.

@roidelapluie
Member

If you play with such cardinality, you are probably counting events, and you should look at something other than Prometheus to compute that.

@pracucci
Contributor

pracucci commented May 22, 2020

If you play with such cardinality, you are probably counting events, and you should look at something other than Prometheus to compute that.

Not necessarily. We frequently see it for basic metrics, like container CPU/memory across all containers in a datacenter. It's not uncommon to have tens of thousands of containers.

@brian-brazil
Contributor

Yeah, the use case is fine - though with such high cardinality you can't expect things to work the same way as with low cardinality without doing some additional work.

@roidelapluie
Member

@gouthamve please remove the bug label.

This works exactly as designed in #4414.

That limit works for many Prometheus users. What would be your proposal to change the current behavior?

cc @tomwilkie and @cstyan who worked on setting the limit in the original issue.

@roidelapluie
Member

By rereading the issue I now understand it better; discard my previous message.

@roidelapluie
Member

I have been looking at the code, and this is a bug.

@brian-brazil
Contributor

Can you expand? The math indicates that there's just too much data.

@roidelapluie
Member

For instant vector selection, yes.

But we could do better for ranges: #7285

@pstibrany
Contributor Author

We have just realized that this also depends on the resolution (step), so increasing the step interval decreases the number of needed samples, since PromQL only needs one sample per step per series.
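For example, with the same 200,000 series over a 10-minute range (numbers illustrative): a 2-second step means 300 steps and 200,000 * 300 = 60,000,000 samples, while a 10-second step means only 60 steps and 200,000 * 60 = 12,000,000 samples, well under the 50M limit.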

@brian-brazil
Contributor

#7307 should fix the incorrect accounting. I think we can close now?

@brian-brazil
Contributor

We looked at this in the bug scrub, and the bug discovered was fixed in #7307.

The prometheus repository locked this issue as resolved and limited conversation to collaborators on Nov 22, 2021.