"api/series" uses too much memory #5547

Closed
vitkovskii opened this issue May 8, 2019 · 14 comments · Fixed by #8050

@vitkovskii

Bug Report

What did you do?
Send a series query like: /api/v1/series?match[]={le!=""}

What did you expect to see?
Well, this is a very heavy query, so I expect Prometheus to hit a timeout or other limits.

What did you see instead? Under which circumstances?
The query runs for ~30 minutes until Prometheus is OOM-killed.
Normally Prometheus uses ~25 GB of memory; this query forces it to use more than 90 GB.

Here is the memory usage graph: [image]

Disk read: [image]

Environment
We deploy Prometheus to Kubernetes using the Prometheus Operator.

@brian-brazil
Contributor

What version of Prometheus is this?

@d-ulyanov

2.9.1, built with go1.12.4

@brian-brazil
Contributor

How many series does that match, and how many do you have in total? Have you tried 2.9.2?

@vitkovskii
Author

@brian-brazil We don't know how many metrics match the original query (because it never finishes). But if we pass start and end (a 100-second range): /api/v1/series?match[]={le!=""}&start=1557303607&end=1557303707, the query will:

  1. Take ~20 seconds to return the first byte;
  2. Take a further ~2 minutes 20 seconds to output 497 MB of data;
  3. Return a result with 1,357,676 metrics.

That's for only a 100-second range. If start/end are omitted, the query runs over the whole retention period (2 weeks). I assume the original query wouldn't return many more metrics, but it scans many metrics (le!="") over a long time range, and there are no memory or time limits for the query.
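
For reference, this is roughly how the bounded query above can be issued from a small Go client (the localhost address is a placeholder; the snippet is only an illustration, not part of the report):

package main

import (
  "fmt"
  "io"
  "net/http"
  "net/url"
)

func main() {
  // Build the bounded /api/v1/series request; the narrow start/end window is
  // what keeps the result set (and Prometheus' memory usage) manageable.
  params := url.Values{}
  params.Add("match[]", `{le!=""}`)
  params.Add("start", "1557303607")
  params.Add("end", "1557303707")

  resp, err := http.Get("http://localhost:9090/api/v1/series?" + params.Encode())
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()

  body, _ := io.ReadAll(resp.Body)
  fmt.Printf("got %d bytes of series metadata\n", len(body))
}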

@brian-brazil
Contributor

Sounds like what I fixed in prometheus-junkyard/tsdb#595, which will be in 2.10.

@d-ulyanov

Hey @brian-brazil,
It seems that the /series endpoint implementation has no context propagation. Maybe it would be better to add a ctx there and enforce timeouts? What do you think?
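
For illustration, a minimal sketch of what that suggestion could look like (this is not actual Prometheus code; the handler shape, the 2-minute limit, and the next() stand-in for SeriesSet.Next() are assumptions):

package api

import (
  "context"
  "net/http"
  "time"
)

// seriesWithTimeout derives a context with a deadline from the incoming
// request and checks it while iterating, so a runaway /series scan is cut
// off instead of running until the process is OOM-killed.
func seriesWithTimeout(w http.ResponseWriter, r *http.Request, next func() bool) {
  ctx, cancel := context.WithTimeout(r.Context(), 2*time.Minute) // illustrative limit
  defer cancel()

  for next() { // next() stands in for advancing the series iterator
    select {
    case <-ctx.Done():
      // Deadline exceeded or request cancelled: stop scanning and report it.
      http.Error(w, ctx.Err().Error(), http.StatusServiceUnavailable)
      return
    default:
    }
  }
  w.WriteHeader(http.StatusOK)
}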

@brian-brazil
Contributor

Given how the TSDB is currently implemented, it wouldn't make a difference.

@zemek
Contributor

zemek commented Jan 23, 2020

We're also experiencing this issue in 2.15.2.
(I think this is actually what was causing our issues in #6595.)

@mateuszlitwin

mateuszlitwin commented Feb 29, 2020

We are hitting this issue in production. It seems like the /series endpoint can run for a very long time for certain queries, eventually OOMing the database (most of the memory is allocated for the labels that would be returned by the endpoint). Based on the current implementation (v2.16), it looks like there is no timeout mechanism for the /series endpoint. What is the recommended solution for this problem?

Top 25 cumulative allocations https://gist.github.com/mateuszlitwin/523be876fb8c568df6f0472829a3ec1d

This is the loop that is doing the allocations (func (api *API) series(r *http.Request) apiFuncResult).

metrics := []labels.Labels{}
for set.Next() {
  // The label set of every matching series is accumulated in memory before
  // anything is written to the response, with no limit and no context check.
  metrics = append(metrics, set.At().Labels())
}

@brian-brazil
Contributor

I looked into this recently, and as it stands metadata queries are handled the same as normal queries, which results in creating a chunk iterator that pages in the first chunk - taking a long time and potentially trashing the page cache. https://github.com/prometheus/prometheus/tree/series-hint has a potential fix.
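
For context, here is a rough sketch of the hint idea described above, using the storage.SelectHints/Querier.Select API and the pkg/labels import path of the 2.x series at the time; the function shape and the "series" value for Func are illustrative assumptions, not the contents of the series-hint branch:

package api

import (
  "github.com/prometheus/prometheus/pkg/labels"
  "github.com/prometheus/prometheus/storage"
)

// seriesLabels passes a hint telling the querier that only label sets are
// needed, so the TSDB can skip building chunk iterators (and paging chunks
// in) for a pure metadata query.
func seriesLabels(q storage.Querier, startMs, endMs int64, matchers ...*labels.Matcher) ([]labels.Labels, error) {
  hints := &storage.SelectHints{
    Start: startMs,
    End:   endMs,
    Func:  "series", // metadata-only: no samples will be read
  }

  var metrics []labels.Labels
  set := q.Select(false, hints, matchers...)
  for set.Next() {
    metrics = append(metrics, set.At().Labels())
  }
  return metrics, set.Err()
}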

@pracucci
Contributor

I looked into this recently, and as it stands metadata queries are handled the same as normal queries, which results in creating a chunk iterator that pages in the first chunk - taking a long time and potentially trashing the page cache. https://github.com/prometheus/prometheus/tree/series-hint has a potential fix.

Thanks @brian-brazil for taking a look at this. I'm wondering if there's any specific reason why a PR (based on your potential fix) hasn't been opened yet (e.g. you found an issue). Thanks!

@gouthamve
Member

I've been looking into it and the latest code in master doesn't page in the chunks anymore. I think we might be in the clear and this can be closed. I'll write a benchmark soon for this.

@brian-brazil
Contributor

I'm wondering if there's any specific reason why a PR (based on your potential fix) hasn't been opened yet (e.g. you found an issue).

I didn't get around to it.

I'll write a benchmark soon for this.

I'm not sure a typical benchmark will catch this, as it's about page cache.

@brian-brazil
Contributor

Also, even if this happens to have been fixed in TSDB, we'll still want something like this for the sake of remote read.

gouthamve added a commit to gouthamve/prometheus that referenced this issue Oct 13, 2020
Fixes prometheus#5547

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
gouthamve added a commit to gouthamve/prometheus that referenced this issue Oct 14, 2020
Fixes prometheus#5547

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
codesome pushed a commit that referenced this issue Oct 14, 2020
Fixes #5547

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
prometheus locked as resolved and limited conversation to collaborators Nov 18, 2021