Getting "query processing would load too many samples" error for simple count(metric) query #7281

Closed
pstibrany opened this issue May 21, 2020 · 22 comments


@pstibrany
Contributor

pstibrany commented May 21, 2020

What did you do? Run count(metric) for a metric with a high number of series (200k) over a 1h range in the Prometheus UI.

What did you expect to see? Result.

What did you see instead? Under which circumstances? The instant query returns 208040, but trying to create a graph (using the Prometheus Graph tab) for a range as low as 10 minutes fails with the error Error executing query: query processing would load too many samples into memory in query execution.

Running Prometheus version: 2.18.1, ecee9c8, with query.max-samples=50000000.

[Screenshot from 2020-05-21: the Prometheus Graph tab showing the "query processing would load too many samples" error]

@brian-brazil
Contributor

This is all as expected: if you try to have an intermediate range vector that has over 50M samples, the limit is protecting you as designed.

@pstibrany
Contributor Author

pstibrany commented May 21, 2020

This is all as expected: if you try to have an intermediate range vector that has over 50M samples, the limit is protecting you as designed.

The problem as I see it is that it needs to load the entire intermediate range vector before processing it, where the range here is really the entire input window :-(

How difficult (or even possible) would it be to avoid this limitation/step?

@brian-brazil
Contributor

That'd be a significant rewrite of PromQL, which would likely increase memory and CPU usage for common queries, which would be somewhat self-defeating. If you need to work with high-cardinality series like this, you should use recording rules.
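For example (the rule name here is illustrative, not something from this issue), a recording rule with record: job:metric:count and expr: count(metric) evaluates the aggregation once per rule-evaluation interval, so the Graph tab then only has to query that single pre-computed series instead of loading 200k series at every step.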

@pstibrany
Contributor Author

Thanks for the reply.

@brian-brazil
Contributor

Looks like there's nothing to do here.

@gouthamve
Member

gouthamve commented May 22, 2020

Hrm, I'd disagree, @brian-brazil. While it would require a significant rewrite of PromQL, that doesn't mean it's not an issue. Maybe someone looking to rewrite PromQL in the future could look at fixing this as well.

We're not doing something(something_over_time(high_cardinality[1m])) but just a simple count(collectd_cpu_total); see the message I posted in our internal Slack:

If I understand this correctly, then we check for #(samples_in_result_set) + #(samples_in_current_step) < 50e6. As count(collectd_cpu_total) returns only one series per step, #(samples_in_result_set) becomes #steps = 12 hours with a 20-second step = (12 * 60 * 60 / 20) = 2160. So the only reason we are hitting the 50e6 samples limit is #(samples_in_current_step).
Now that doesn't seem possible, because the #series is 90K.

I would see this as a bug in PromQL, one that is not as easy to fix as it looks, but not something to be closed.
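To make the check concrete, here is a minimal Go sketch of the accounting as described in the quoted message above. This is an illustration only, not the actual PromQL engine code; the evaluator type, the addSamples helper, and the numbers are assumptions for the example.

```go
// Minimal sketch of the sample-count check described above (not the real engine).
package main

import "fmt"

const maxSamples = 50_000_000 // stand-in for --query.max-samples

type evaluator struct {
	currentSamples int // samples currently held in memory for this query
}

// addSamples models the accounting: refuse to load n more samples if the
// running total would exceed the limit.
func (ev *evaluator) addSamples(n int) error {
	if ev.currentSamples+n > maxSamples {
		return fmt.Errorf("query processing would load too many samples into memory in query execution")
	}
	ev.currentSamples += n
	return nil
}

func main() {
	ev := &evaluator{}
	const series, steps = 200_000, 300 // e.g. a 10-minute range at a 2-second resolution
	for step := 0; step < steps; step++ {
		// One sample per series per step. If samples from earlier steps are
		// never released, the running total crosses 50M well before the end.
		if err := ev.addSamples(series); err != nil {
			fmt.Printf("failed at step %d: %v\n", step, err)
			return
		}
	}
	fmt.Println("query fits within the limit")
}
```

With 200,000 series and one sample per series per step, the running total in this sketch crosses 50M around step 250, which is why even short ranges can trip the limit if samples from earlier steps are still being counted.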

@brian-brazil
Contributor

brian-brazil commented May 22, 2020

I don't see how this can be classed as a bug; this is working exactly as designed. This query does attempt to have more than 50M samples in memory, and it is correctly stopped.

@codesome
Member

This could be an enhancement in the way we count series. If it's worth it, we could have a special case for count(metric) that only counts the number of series from the index (with some additions in TSDB to do so) and does not load samples.

@brian-brazil
Contributor

brian-brazil commented May 22, 2020

That wouldn't be correct as it wouldn't handle gaps or staleness correctly, and we already process only a sliding window of one series at a time when evaluating instant vectors.

Note that count isn't the issue here; it doesn't get that far. It's effectively attempting high_cardinality[10m:2s], and 200k * 300 > 50M.
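Spelling that arithmetic out: a 10-minute range at a 2-second resolution is 600 s / 2 s = 300 evaluation steps, and with roughly 200,000 series that means 200,000 * 300 = 60,000,000 samples, which is over the 50,000,000 limit.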

@pstibrany
Contributor Author

pstibrany commented May 22, 2020

Please let's not focus on count; I used it only as an example. The real query I was investigating uses multiple sum(irate(high_cardinality_metric[5m])) expressions over 12h; I just reduced it to the smallest example I could find.

@pstibrany
Contributor Author

Note that count isn't the issue here; it doesn't get that far.

Exactly.

@roidelapluie
Member

If you play with such cardinality, you are probably counting events, and you should look at something other than Prometheus to compute that.

@pracucci
Contributor

pracucci commented May 22, 2020

If you play with such cardinality, you are probably counting events, and you should look at something other than Prometheus to compute that.

Not necessarily. We frequently see it for basic metrics, like container CPU/memory across all containers in a datacenter. It's not uncommon to have tens of thousands of containers.

@brian-brazil
Contributor

Yeah, the use case is fine - though with such high cardinality you can't expect things to work the same way as with low cardinality without doing some additional work.

@roidelapluie
Member

@gouthamve please remove the bug label.

This works exactly as designed in #4414.

That limit works for many Prometheus users. What would be your proposal to change the current behavior?

cc @tomwilkie and @cstyan who worked on setting the limit in the original issue.

@roidelapluie
Member

By rereading the issue I now understand it better; discard my previous message.

@roidelapluie
Member

I have been looking at the code, and this is a bug.

@brian-brazil
Contributor

Can you expand? The math indicates that there's just too much data.

@roidelapluie
Member

For instant vector selection, yes.

But we could do better for ranges: #7285

@pstibrany
Contributor Author

We have just realized that this also depends on the resolution (step), so increasing the step interval decreases the number of needed samples, since PromQL only needs one sample per step per series.
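For example, with the same 200,000 series over a 10-minute range (numbers illustrative): a 2-second step means 300 steps and 200,000 * 300 = 60,000,000 samples, while a 10-second step means only 60 steps and 200,000 * 60 = 12,000,000 samples, well under the 50M limit.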

@brian-brazil
Contributor

#7307 should fix the incorrect accounting. I think we can close now?

@brian-brazil
Contributor

We looked at this in the bug scrub, and the bug discovered was fixed in #7307.

The prometheus repository locked this issue as resolved and limited conversation to collaborators on Nov 22, 2021.