Increased memory usage when querying Prometheus datasources since 8.3.x #43369
@gabor as discussed, here's the issue. Let me know if you need further information.
In testing this, the memory usage seems to scale linearly with the number of active sessions, so this could cause significant memory usage in some circumstances.
I did some measurements using a large Prometheus JSON response (4MB).
Possible improvements to the codebase:
@gabor I think ultimately we'd want something like [2] or [3], because that's the only way to make memory usage bounded without completely breaking large dataset results like in [4]. However, that would require us to refactor a significant portion of the code, because AFAIK our current datasource API is not streaming-friendly. Another thing we could do short-term is to verify our resolution calculation logic (the one that calculates the query step; see grafana/pkg/tsdb/prometheus/time_series_query.go, lines 235 to 265 in 50fabe8).
I have a hunch that we might find some improvements there (i.e. make sure that, no matter the time range, we always return the same number of data points). That way we could at least solve the issue for queries with too high a resolution.
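For illustration, here is a minimal Go sketch of that kind of bounded step calculation, assuming the cap works by widening the query step; the function names are made up and the 11000 constant mirrors the limit discussed below — this is not the actual code from time_series_query.go:

```go
package main

import (
	"fmt"
	"time"
)

// maxDataPoints mirrors the 11000-point safeguard discussed in this
// thread; the exact value and names here are illustrative.
const maxDataPoints = 11000

// calcStep widens the query step until the requested time range
// resolves to at most maxDataPoints points, so the result size stays
// bounded regardless of how wide the range is.
func calcStep(start, end time.Time, minStep time.Duration) time.Duration {
	step := minStep
	if int64(end.Sub(start)/step) > maxDataPoints {
		// Integer division may undershoot slightly; fine for a sketch.
		step = end.Sub(start) / time.Duration(maxDataPoints)
	}
	return step
}

func main() {
	end := time.Now()
	start := end.Add(-30 * 24 * time.Hour) // a 30-day range
	// 15s over 30 days would be ~172800 points; the cap widens it.
	fmt.Println(calcStep(start, end, 15*time.Second))
}
```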
I agree that [2] and [3] are larger-scale changes. About modifying the resolution calculation: sometimes the problem is the cardinality. For example, if the Prometheus response returns 300 separate time-series blocks, the response can be quite big even if the number of data points per time series is small.
Yeah, this sounds like a good first step to me.
How about making said limit configurable and set to 11000 by default? That way we could look into fine-tuning it while maintaining backward compatibility.
Yup, I understand, but I don't see any low-hanging meaningful improvements that we could make here. At the very least, having the ability to bound the dataset temporally is a good start. Let me know if you'd like me to work on the changes to the data points limit.
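A sketch of what a configurable limit could look like, using gopkg.in/ini.v1 (the library Grafana uses for its .ini config files); the "prometheus" section and "max_data_points" key are hypothetical names, not real Grafana options:

```go
package main

import (
	"fmt"

	"gopkg.in/ini.v1"
)

// defaultMaxDataPoints is the long-standing 11000-point limit.
const defaultMaxDataPoints = 11000

// loadMaxDataPoints reads a hypothetical override from an .ini file,
// falling back to the current hard-coded behavior when unset.
func loadMaxDataPoints(path string) (int, error) {
	cfg, err := ini.Load(path)
	if err != nil {
		return 0, err
	}
	return cfg.Section("prometheus").Key("max_data_points").MustInt(defaultMaxDataPoints), nil
}

func main() {
	limit, err := loadMaxDataPoints("custom.ini")
	if err != nil {
		limit = defaultMaxDataPoints
	}
	fmt.Println("max data points:", limit)
}
```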
@radiohead Sorry, I probably wrote that in an ambiguous way about the 11000-limit. The 11000-limit is currently in the code; it is live. This has been the behavior for a long time.
@radiohead Hmm.. reading the discussion again, maybe there was no misunderstanding, sorry 😄 Anyway, if you think making that limit configurable is worth the effort, please contact the @grafana/observability-metrics squad; they are currently responsible for the Prometheus data source (I am moving more to Loki these days).
Not sure if this is a useful alternative, but in case you're not aware, you can configure a global response limit to cap the size of responses from outgoing HTTP requests.
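Conceptually, such a limit amounts to refusing to buffer a body past a byte cap. A minimal sketch, assuming a plain net/http client and an example Prometheus endpoint — this is not Grafana's actual middleware:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// readAtMost reads a response body but rejects payloads larger than
// limit bytes, so one huge datasource response cannot exhaust memory.
func readAtMost(r io.Reader, limit int64) ([]byte, error) {
	// Read one extra byte: if we get it, the body exceeded the cap.
	body, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > limit {
		return nil, fmt.Errorf("response exceeded %d-byte limit", limit)
	}
	return body, nil
}

func main() {
	// Example endpoint; any outgoing HTTP request works the same way.
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=up")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := readAtMost(resp.Body, 64<<20) // e.g. a 64 MiB cap
	if err != nil {
		panic(err)
	}
	fmt.Println("read", len(body), "bytes")
}
```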
Based on some discussions with @ryantxu, I created this discussion. Feel free to provide any feedback/thoughts/ideas there. Thanks.
@marefr does this apply to requests to external plugins as well?
@toddtreece no, we have issue #39096, where the idea is to enforce a max limit on data frame rows.
This would prevent instances from being OOMKilled, but unfortunately it doesn't solve the underlying problem of large query results not fitting in memory.
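As a sketch of the row-cap idea from #39096 — using a deliberately simplified, hypothetical frame type rather than the real grafana-plugin-sdk-go data.Frame:

```go
package main

import "fmt"

// Frame is a stand-in for a Grafana data frame: one slice of values
// per field/column. A hypothetical type for illustration only.
type Frame struct {
	Fields map[string][]float64
}

// truncateRows caps every column at maxRows, bounding how much memory
// a single query result can occupy. Note that slicing keeps the
// original backing array alive; a real implementation would instead
// stop appending rows once the cap is reached during decoding.
func truncateRows(f *Frame, maxRows int) {
	for name, col := range f.Fields {
		if len(col) > maxRows {
			f.Fields[name] = col[:maxRows]
		}
	}
}

func main() {
	f := &Frame{Fields: map[string][]float64{
		"time":  make([]float64, 20000),
		"value": make([]float64, 20000),
	}}
	truncateRows(f, 11000)
	fmt.Println(len(f.Fields["value"])) // 11000
}
```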
The Metrics squad is not currently working on this, so we're moving it to the backlog.
@bohandley will reach out to @toddtreece / @ryantxu to gather context / state on this issue.
My updated status is now at the top of this issue. Once we safely and responsibly remove the old client, this will help with memory usage. @toddtreece and @ryantxu put in a lot of work on this, as did @aocenas, and with the help of @obetomuniz and @itsmylife we have continued that work. This is a Q3 goal for Observability Metrics. Thank you!
I am happy to say that, due to the hard work of @toddtreece, @itsmylife, and many other people implementing the streaming parser, the memory usage of the Prometheus datasource plugin has dropped significantly. See the following queries on
@bohandley update September 12, 2022
Description: Memory usage increased with Prometheus queries
Acceptance Criteria: Improve performance of Prometheus query memory usage by successfully implementing the streaming parser.
Status:
@toddtreece introduced the streaming parser to Prometheus and began working on bringing it to parity with the old Prometheus client.
#49858
#50206
@aocenas helped our squad with a plan to bring the streaming parser to parity by comparing it with the old client.
#52738
@ismail is currently assigned the tasks to bring it to parity and remove the old client.
@toddtreece and @ryantxu have a plan to test the memory usage for Prometheus queries using real-world testing as well as testing in staging and ops using conprof/parca (and now Pyroscope?). This work is in progress, and we are working to align everyone so that we can improve memory usage for Prometheus queries.
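To make the streaming-parser idea concrete, here is a minimal sketch: walking a Prometheus-style query_range response token by token with encoding/json's Decoder, so the multi-megabyte payload is never held in memory as one parsed tree. It is illustrative only — not the parser that actually shipped — and it assumes "values" appears only as an object key:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// countValues streams over a query_range response and counts samples,
// decoding one [timestamp, value] pair at a time instead of
// unmarshalling the whole body at once.
func countValues(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		// Caveat: a plain token scan cannot distinguish keys from
		// string values, so this assumes "values" only occurs as a key.
		if key, ok := tok.(string); ok && key == "values" {
			if _, err := dec.Token(); err != nil { // consume '['
				return n, err
			}
			for dec.More() {
				var pair [2]interface{}
				if err := dec.Decode(&pair); err != nil {
					return n, err
				}
				n++
			}
			if _, err := dec.Token(); err != nil { // consume ']'
				return n, err
			}
		}
	}
}

func main() {
	body := `{"data":{"result":[{"metric":{"job":"x"},"values":[[1,"2"],[2,"3"]]}]}}`
	n, err := countValues(strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	fmt.Println("samples:", n) // samples: 2
}
```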
What happened:
When querying Prometheus datasources, the memory usage of the Grafana server has increased since Grafana 8.3.x when compared to 8.2.x. Depending on the size of the result set, the memory usage has increased by 1.5x to 3x, comparing 8.3.3 to 8.2.7.

What you expected to happen:

Memory usage to not increase, or to not increase as sharply.
How to reproduce it (as minimally and precisely as possible):

Compare go_memstats_alloc_bytes for instance A and instance B.

Anything else we need to know?:
The issue has been caused by the fact that the Prometheus datasource was refactored from a frontend datasource to a backend datasource, and since 8.3 all queries have to be processed in the Grafana server:
Environment: