Proposal: Add ability to (approximately) limit memory use by queries #4383
Comments
Related: #4513 (in v2.5.0-rc.0). Can probably achieve a sufficient approximation of this behaviour using this config setting.
Yeah, this is basically done.
brian-brazil closed this Dec 7, 2018
roganartu commented Jul 13, 2018
Proposal
Add the ability to set an (approximate) per-query limit on the amount of memory a query can consume, kill queries that exceed it, and return an error to the caller.
Problem
Prometheus queries are currently able to consume as much memory as the system has available. It is possible for uneducated (or experimenting) users to issue a pathological query that takes down an entire node. Since there is no separation between the querying and scraping components, a query that OOMs Prometheus results in scrape downtime. With a large WAL this may be on the order of 5-10 minutes or longer while the service restarts and replays the WAL.
This is operationally difficult to support when designing for high availability while simultaneously encouraging new users to explore available metrics and create dashboards for their own use. Even if you run multiple prometheus nodes scraping the same targets with a load balancer in front, a pathological query being refreshed by the user can easily take down multiple nodes. The likelihood of this is perhaps increased by the fact that pathological queries are slow, making them more likely to be refreshed by the end user while the old query continues running until it times out.
The existing maximum query count protection doesn't help against this type of query, and other system-level protections like cgroups will still result in the process being OOM-killed. The hard limit on the number of points per series in a result (11k at the moment, I believe?) helps somewhat, but it is still possible to load significant amounts of data in a subquery with, e.g., large ranges, since this depends more on scrape intervals than on anything configurable at the query level. The query timeout helps a little insofar as the default 2-minute timeout makes it somewhat harder to load in large amounts of data, but it is still very much possible to OOM a 100GB server inside that window.
Example
A common per-CPU metric queried over a period of a day or so is a good example of high cardinality (and hence probably a lot of data). If you have thousands of machines with many CPUs each, you could easily have 100k unique labelsets per time point, which quickly adds up to tens of GB of data or more depending on your scrape interval. There are many others, and since a common process for metric exploration is to use the Prometheus query page for ad-hoc querying before moving to Grafana and refining, it is very easy to accidentally issue a bad query against these kinds of inherently high-cardinality metrics. If you have the default query timeout configured (2 minutes) and any reasonable number of multi-CPU machines, such a query will very likely time out, but depending on how big your Prometheus node is it might OOM first. I've seen it take upwards of 40GB of RAM before timing out on one of our nodes.
Proposed Implementation
Summary
Reflection would allow a very accurate measure of size, but has the downside that it is slow. Instead, my proposed implementation maintains a counter of the approximate current query size in bytes, incrementing it as samples are loaded and/or added to the query result. This uses hardcoded size approximations of the different value types: Point, Sample, Vector, Series, and Matrix, based on the expected memory size of their elements.
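For concreteness, here's a rough sketch of the idea in Go, using simplified stand-ins for the promql value types. The names (`sizeTracker`, `Size()`, the hardcoded constants, and the error text) are illustrative rather than the actual identifiers from my branch:

```go
package main

import (
	"errors"
	"fmt"
	"unsafe"
)

// Simplified stand-ins for the promql value types; the real change would hang
// Size() methods off the engine's own Point/Sample/Series/Vector/Matrix.
type Point struct {
	T int64
	V float64
}

type Label struct{ Name, Value string }

type Series struct {
	Metric []Label
	Points []Point
}

type Matrix []Series

// Hardcoded approximations rather than reflection: a Point is a fixed-size
// struct, and a Label costs two string headers plus the string contents.
const pointSize = int64(unsafe.Sizeof(Point{})) // 16 bytes on 64-bit

func labelsSize(ls []Label) int64 {
	n := int64(0)
	for _, l := range ls {
		n += int64(unsafe.Sizeof(l)) + int64(len(l.Name)+len(l.Value))
	}
	return n
}

// Size returns the approximate in-memory footprint of the series.
func (s Series) Size() int64 {
	return labelsSize(s.Metric) + pointSize*int64(len(s.Points))
}

// sizeTracker is the per-query counter: every time samples are loaded or
// appended to a result, their approximate size is added and checked against
// the configured limit.
type sizeTracker struct {
	limit, used int64
}

var errQueryTooLarge = errors.New("query exceeded the configured size limit")

func (t *sizeTracker) Add(n int64) error {
	t.used += n
	if t.limit > 0 && t.used > t.limit {
		return errQueryTooLarge
	}
	return nil
}

func main() {
	t := &sizeTracker{limit: 1 << 20} // 1MiB limit, just for demonstration
	s := Series{
		Metric: []Label{{Name: "__name__", Value: "node_cpu"}},
		Points: make([]Point, 100000), // ~1.6MB of points alone
	}
	if err := t.Add(s.Size()); err != nil {
		fmt.Println("aborting query:", err) // the engine would return this to the caller
	}
}
```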
Performance
I haven't done any kind of benchmarking before/after, so I am unsure of the performance implications of this approach. The main areas of concern would be the Series, Vector, and Matrix Size() methods, which iterate over their Labels, Samples, and Series respectively. Calculating sizes for large objects may present performance issues here, but I am unsure how best to create a new benchmark of this worst case to confirm. I have attempted to otherwise include size calculations in existing loops instead of creating new ones after the fact.
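If it helps, one way to exercise that worst case might be a benchmark along these lines (reusing the simplified stand-in types from the sketch above, in a _test.go file alongside it; the series and point counts are made up and would need tuning to reflect a realistic worst case):

```go
package main

import "testing"

// Worst case for the size bookkeeping: a Matrix with many series, each
// carrying a few labels and a large number of points, so that summing the
// approximate sizes has to walk every labelset.
func BenchmarkMatrixSize(b *testing.B) {
	m := make(Matrix, 10000)
	for i := range m {
		m[i] = Series{
			Metric: []Label{{Name: "__name__", Value: "node_cpu"}, {Name: "mode", Value: "idle"}},
			Points: make([]Point, 1440), // roughly 2h at a 5s scrape interval
		}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var total int64
		for _, s := range m {
			total += s.Size()
		}
		_ = total
	}
}
```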
Known Shortcomings
Aggregations
Aggregations are likely to take up significantly more space than non-aggregations due to label.Builder. There is a PR about this #4248, but without it you might find query size limiting too restrictive.
For example, while experimenting with poison-pill queries as examples for this proposal, I did some profiling and found that aggregations with a large number of labelsets do indeed end up with a very big label.Builder instance. This happened for an aggregation over the increase of a per-CPU metric (a few thousand instances with tens of CPUs each), evaluated over a two-hour window. Removing the increase avoids the large allocation in label.Builder. Query size limiting here works as expected, but may be a bit overzealous for some until #4248 lands. On the plus side, this might help highlight some places where promql is a bit heavy on memory.
Large ranges with short scrape intervals
If you query a large range (where the definition of large varies greatly depending on what your scrape interval is), the input+output vectors are likely to take significant space. This may make queries over long time frames fail.
For example, using Grafana to graph the query size histogram with an auto interval variable, and bumping the time range out to 7 days, yields a query whose input is a little more than 1GB for a node with a 5s scrape interval, but only ~20MB for a node with a 5m scrape interval. I've found that a ~5GB limit works for almost everything we genuinely need to query while still giving us sufficient protection from bad queries, but this is going to be very subjective. Queries that hit the limit tend to do so quite quickly, at which point their resources are released anyway, so even with a theoretical upper bound of 100GB total query memory configured (20x max connections, 5GB max query size) we haven't seen aggregate usage higher than 5-10GB at any time in practice.
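As a rough sanity check on those numbers, here's the back-of-envelope arithmetic. The 16 bytes/sample figure assumes a float64 value plus an int64 timestamp, and the ~550-series count is simply chosen so the 5s case lands near the ~1GB mentioned above:

```go
package main

import "fmt"

// Samples per series scale inversely with the scrape interval, so the same
// 7d range query is ~60x larger at a 5s interval than at a 5m one.
func main() {
	const (
		rangeSeconds   = 7 * 24 * 3600 // 7 days
		bytesPerSample = 16            // float64 value + int64 timestamp
		series         = 550           // illustrative series count
	)
	for _, interval := range []int{5, 300} { // 5s vs 5m scrape interval
		samplesPerSeries := rangeSeconds / interval
		total := samplesPerSeries * bytesPerSample * series
		fmt.Printf("%4ds interval: %6d samples/series, ~%7.1f MB total\n",
			interval, samplesPerSeries, float64(total)/1e6)
	}
	// prints roughly ~1064 MB for 5s and ~18 MB for 5m, in line with the
	// 1GB vs ~20MB figures above.
}
```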
Alternative Approach
It looks like InfluxDB is planning on decoupling querying from ingestion: https://www.dotconferences.com/blog/interview-paul-dix-2018
Additional Benefits
This approach incidentally also gives better visibility into query size distributions through three new metrics:
- prometheus_engine_queries_size_max_bytes: gauge indicating the currently configured query size limit
- prometheus_engine_query_size_bytes_bucket: histogram of query sizes, up to and including the configured limit
- prometheus_engine_queries_oversize_total: counter of the number of queries that returned an error due to exceeding the configured limit

The long tail of this histogram is far more interesting than the 90th percentile or below, and it has helped us identify several unnecessarily resource-intensive queries.
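For reference, registering these with client_golang might look roughly like the following. This is a sketch, not the actual code: the histogram is registered under the base name prometheus_engine_query_size_bytes (which produces the _bucket series above), the package name is arbitrary, and the bucket boundaries are purely illustrative:

```go
package engine

import "github.com/prometheus/client_golang/prometheus"

var (
	// Currently configured per-query size limit in bytes.
	querySizeMaxBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "prometheus_engine_queries_size_max_bytes",
		Help: "Currently configured per-query size limit in bytes.",
	})
	// Approximate per-query sizes, up to and including the configured limit.
	querySizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "prometheus_engine_query_size_bytes",
		Help:    "Approximate size of queries in bytes, up to and including the configured limit.",
		Buckets: prometheus.ExponentialBuckets(1<<20, 4, 8), // 1MiB .. 16GiB
	})
	// Queries aborted for exceeding the configured size limit.
	queriesOversizeTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_engine_queries_oversize_total",
		Help: "Number of queries that returned an error due to exceeding the configured size limit.",
	})
)

func init() {
	prometheus.MustRegister(querySizeMaxBytes, querySizeBytes, queriesOversizeTotal)
}
```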
Another benefit is that large queries which exceed the configured query size limit, but which may not individually have OOM'd the server, are often short-circuited instead of timing out while attempting to load large volumes of data. This is especially true for queries that load large range vectors of metrics with short scrape intervals, since the size required for the points (without labels) is asserted before attempting to load all of the points and labels.
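A sketch of that short-circuit, reusing the hypothetical sizeTracker and pointSize names from the earlier example: the tracker is charged for the points alone before any data is materialised, so an oversized range selection fails immediately rather than after loading data and timing out:

```go
// selectRange would be called before materialising a range selection: charge
// the tracker for the points alone (no labels yet) and bail out early if even
// that would exceed the configured limit.
func selectRange(t *sizeTracker, numSeries, samplesPerSeries int) error {
	if err := t.Add(pointSize * int64(numSeries) * int64(samplesPerSeries)); err != nil {
		return err // fail fast instead of loading the data and then timing out
	}
	// ... iterate storage, appending points and labels (and charging their
	// sizes to the tracker) only once the up-front check has passed ...
	return nil
}
```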
What's Missing
Documentation. I think there should at least be some mention of this feature at the bottom of https://prometheus.io/docs/prometheus/latest/querying/basics/, but also definitely in the README.
The info about the query size could be added to the query stats that are currently available by setting stats=true in the URL query params. This could also be included in the UI.