"api/series" uses too much memory #5547
Comments
What version of Prometheus is this?
How many series does that match, and how many do you have in total? Have you tried 2.9.2?
@brian-brazil We don't know how many metrics match the source query (because it can't finish). But if we pass start and end (a 100-second range):
That's for only a 100-second range. If start/end is omitted, the query runs over the whole retention period (2 weeks). I assume the source query wouldn't return many more metrics, but it scans a lot of metrics.
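For anyone reproducing this, the time range is passed as `start` and `end` parameters on the `/api/v1/series` request. A minimal sketch of such a bounded request in Go, assuming a Prometheus server at `localhost:9090` and arbitrary Unix-timestamp bounds:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed server address; adjust for your deployment.
	base := "http://localhost:9090/api/v1/series"

	params := url.Values{}
	params.Add("match[]", `{le!=""}`)
	// Restricting start/end to a 100-second window bounds how much of the
	// index the query has to walk, as described in the comment above.
	params.Set("start", "1556700000")
	params.Set("end", "1556700100")

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

Without `start`/`end`, the request falls back to the whole retention window, which is what triggers the behaviour reported here.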
Sounds like what I fixed in prometheus-junkyard/tsdb#595, which will be in 2.10.
Hey, @brian-brazil
Given how the TSDB is currently implemented, it wouldn't make a difference.
we're also experiencing this issue in 2.15.2
We are hitting this issue in production. It seems like the /series endpoint can run for a very long time for certain queries, eventually OOMing the database (most of the memory is allocated for the labels that would be returned by the endpoint). Based on the current implementation (v2.16), it looks like there is no timeout mechanism for the /series endpoint. What is the recommended solution for this problem?
Top 25 cumulative allocations: https://gist.github.com/mateuszlitwin/523be876fb8c568df6f0472829a3ec1d
This is the loop that is doing the allocations:

```go
metrics := []labels.Labels{}
for set.Next() {
	metrics = append(metrics, set.At().Labels())
}
```
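Not the fix that eventually went in, but as an illustration of the missing-timeout point: checking the request context inside that loop would bound the allocations by the configured query timeout. The `seriesSet`/`series` interfaces below are simplified stand-ins for the storage types iterated above, and the `pkg/labels` import path is the one Prometheus 2.x used at the time:

```go
package api

import (
	"context"

	"github.com/prometheus/prometheus/pkg/labels"
)

// series is the small slice of the storage series interface the loop needs.
type series interface {
	Labels() labels.Labels
}

// seriesSet is a simplified stand-in for the set iterated in the snippet above.
type seriesSet interface {
	Next() bool
	At() series
}

// collectLabels gathers label sets like the loop above, but checks the request
// context on every iteration so a timeout or a cancelled request stops the
// allocations instead of letting them grow until the process is OOM killed.
func collectLabels(ctx context.Context, set seriesSet) ([]labels.Labels, error) {
	metrics := []labels.Labels{}
	for set.Next() {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		metrics = append(metrics, set.At().Labels())
	}
	return metrics, nil
}
```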
I looked into this recently, and as it stands metadata queries are handled the same as normal queries, which results in creating a chunk iterator that pages in the first chunk - taking a long time and potentially trashing the page cache. https://github.com/prometheus/prometheus/tree/series-hint has a potential fix.
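The branch name suggests the fix is to pass a hint down to the querier so it knows the caller only needs label sets, not samples. A rough sketch of that idea, using hypothetical `selectHints`/`indexReader` types (the names are illustrative, not the actual API of that branch):

```go
package sketch

import "github.com/prometheus/prometheus/pkg/labels"

// selectHints is a hypothetical struct carried from the API layer down to the
// querier. OnlyLabels means "metadata query: return label sets only, do not
// create chunk iterators".
type selectHints struct {
	Start, End int64
	OnlyLabels bool
}

// indexReader is a simplified stand-in for the TSDB index: it resolves
// matchers to series references and references to label sets, without ever
// touching chunk data.
type indexReader interface {
	MatchingSeries(matchers ...*labels.Matcher) ([]uint64, error)
	LabelsFor(ref uint64) (labels.Labels, error)
}

// selectSeries answers a /series-style query from the index alone. Because no
// chunk iterator is created, the first chunk of each matched series is never
// paged in from disk, which is the cost described in the comment above.
func selectSeries(ix indexReader, hints *selectHints, matchers ...*labels.Matcher) ([]labels.Labels, error) {
	if hints == nil || !hints.OnlyLabels {
		// A real implementation would fall through to the normal sample
		// path here; omitted in this sketch.
		return nil, nil
	}
	refs, err := ix.MatchingSeries(matchers...)
	if err != nil {
		return nil, err
	}
	out := make([]labels.Labels, 0, len(refs))
	for _, ref := range refs {
		lset, err := ix.LabelsFor(ref)
		if err != nil {
			return nil, err
		}
		out = append(out, lset)
	}
	return out, nil
}
```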
Thanks @brian-brazil for taking a look at this. I'm wondering if there's any specific reason why a PR (based on your potential fix) hasn't been opened yet (e.g. you found an issue with it). Thanks!
I've been looking into it and the latest code in master doesn't page in the chunks anymore. I think we might be in the clear and this can be closed. I'll write a benchmark soon for this.
I didn't get around to it.
I'm not sure a typical benchmark will catch this, as it's about the page cache.
Also, even if this happens to have been fixed in TSDB, we'll still want something like this for the sake of remote read.
Fixes #5547 Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Bug Report
What did you do?
Sent a series query like:
/api/v1/series?match[]={le!=""}
What did you expect to see?
Well, this is a very heavy query, so I expected Prometheus to hit a timeout or other limits.
What did you see instead? Under which circumstances?
The query executes for ~30 minutes until Prometheus is OOM killed.
Normally Prometheus uses ~25 GB of memory; this query forces it to use more than 90 GB.
Here is the memory usage graph:
Disk read:
Environment
We deploy Prometheus to Kubernetes using the Prometheus Operator.
@brian-brazil