Proposal: Add ability to (approximately) limit memory use by queries #4383

Closed
roganartu opened this Issue Jul 13, 2018 · 2 comments

roganartu commented Jul 13, 2018

Proposal

Add the ability to set an (approximate) per-query limit on the amount of memory a query may consume, killing queries that exceed it and returning an error to the caller.

Problem

Prometheus queries are currently able to consume as much memory as the system has available, so an inexperienced (or merely experimenting) user can issue a pathological query that takes down an entire node. Since querying and scraping are not separated into distinct components, a query that OOMs Prometheus also causes scrape downtime. With a large WAL this can last on the order of 5-10 minutes or longer while the service restarts and the WAL is replayed.

This is operationally difficult to support when designing for high availability while simultaneously encouraging new users to explore the available metrics and build dashboards for their own use. Even if you run multiple Prometheus nodes scraping the same targets behind a load balancer, a pathological query that the user keeps refreshing can easily take down several nodes. This is made more likely by the fact that pathological queries are slow, so users are more inclined to refresh them while the old query keeps running until it times out.

The existing maximum query count protection doesn't help against this type of query, and system-level protections like cgroups still end with the process being OOM-killed. The hard limit on the number of points per series in a result (11k at the moment, I believe) helps somewhat, but it is still possible to load significant amounts of data in a sub-expression with, e.g., large ranges, since the data volume depends more on scrape intervals than on anything configurable at the query level. The query timeout helps a little insofar as the default 2-minute timeout makes it somewhat harder to load large amounts of data, but it is still very much possible to OOM a 100GB server inside that window.

Example

increase(node_interrupts_total[1h])

Evaluated over a period of a day or so, this is a good example of a common metric with high cardinality (and hence a lot of data). If you have thousands of machines with many CPUs each, you can easily have 100k unique labelsets per timestamp, which quickly adds up to tens of GB of data or more depending on your scrape interval. There are many metrics like it, and since a common workflow for metric exploration is to use the Prometheus query page for ad-hoc querying before moving to Grafana and refining, it is very easy to accidentally issue a bad query against these kinds of inherently high-cardinality metrics. With the default query timeout (2 minutes) and any reasonable number of multi-CPU machines this query will very likely time out, but depending on how big your Prometheus node is it might OOM first. I've seen it take upwards of 40GB of RAM before timing out on one of our nodes.
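For a rough sense of scale (my own back-of-envelope numbers, assuming ~16 bytes per point for an int64 timestamp plus a float64 value, and ignoring label overhead): at a 15s scrape interval a 1h range selector pulls in roughly 240 points per series, or about 3.8KB; multiplied by 100k labelsets that is already around 380MB of raw samples for a single evaluation step, before counting labels, intermediate results, or the many steps of a range query spanning a day.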

Proposed Implementation

Summary

Reflection would allow a very accurate measure of size, but has the downside that it is slow. Instead, my proposed implementation maintains a counter of the approximate current query size in bytes, incrementing it as samples are loaded and/or added to the query result. It uses hardcoded size approximations for the different value types (Point, Sample, Vector, Series, and Matrix) based on the expected memory size of their elements.
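To make the idea concrete, here is a minimal sketch of the accounting rather than the actual patch; the constant, type, and method names below are illustrative, and the per-point size is my assumption of 16 bytes (int64 timestamp + float64 value).

```go
package promqlsize

import "errors"

// pointSize is a hardcoded approximation of one promql Point:
// an int64 timestamp plus a float64 value.
const pointSize = 16

var errQueryTooLarge = errors.New("query exceeded the configured size limit")

// sizeTracker is held per query. The engine would call Add as samples are
// loaded or appended to the result, and abort the query when Add errors.
type sizeTracker struct {
	maxBytes int64 // 0 means unlimited
	curBytes int64
}

func (t *sizeTracker) Add(bytes int64) error {
	t.curBytes += bytes
	if t.maxBytes > 0 && t.curBytes > t.maxBytes {
		return errQueryTooLarge
	}
	return nil
}

// addPoints is an example call site: account for n newly loaded points
// before they are materialised into the result.
func (t *sizeTracker) addPoints(n int) error {
	return t.Add(int64(n) * pointSize)
}
```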

Performance

I haven't done any benchmarking before/after, so I am unsure of the performance implications of this approach. The main areas of concern are the Series, Vector, and Matrix Size() methods, which iterate over their Labels, Samples, and Series respectively. Calculating sizes for large objects may present performance issues here, but I am unsure how best to write a benchmark of this worst case to confirm it; a rough sketch of what I have in mind follows. I have otherwise tried to fold the size calculations into existing loops instead of adding new ones after the fact.
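Something along these lines is what I mean by a worst-case benchmark (to be saved as a `_test.go` file); the simplified Matrix/Series/Point types and the Size() method here only stand in for the real promql ones, and the series and label counts are arbitrary.

```go
package promqlsize

import (
	"fmt"
	"testing"
)

// Simplified stand-ins for the promql value types referenced above.
type Point struct {
	T int64
	V float64
}

type Series struct {
	Labels map[string]string
	Points []Point
}

type Matrix []Series

// Size walks every series and every label, which is the slow path the
// benchmark is meant to exercise.
func (m Matrix) Size() int64 {
	var n int64
	for _, s := range m {
		n += int64(len(s.Points)) * 16 // 16 bytes per point, as approximated above
		for k, v := range s.Labels {
			n += int64(len(k) + len(v))
		}
	}
	return n
}

func BenchmarkMatrixSize(b *testing.B) {
	// 10k series with 240 points and a handful of labels each, loosely
	// modelled on the node_interrupts_total example above.
	m := make(Matrix, 10000)
	for i := range m {
		m[i] = Series{
			Labels: map[string]string{
				"__name__": "node_interrupts_total",
				"instance": fmt.Sprintf("host-%d:9100", i),
				"cpu":      fmt.Sprintf("%d", i%32),
			},
			Points: make([]Point, 240),
		}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = m.Size()
	}
}
```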

Known Shortcomings

Aggregations

Aggregations are likely to take up significantly more space than non-aggregations due to labels.Builder. There is a PR about this (#4248); without it, you may find query size limiting too restrictive.

For example, while experimenting with poison-pill queries as examples for this proposal, I did some profiling and found that aggregations over a large number of labelsets do indeed end up with a very large labels.Builder instance:

[screenshot from 2018-07-11: heap profile showing a large labels.Builder allocation]

This happened for the following query (where there are a few thousand instances with tens of CPUs each):

increase(node_interrupts_total[1h])

over a two-hour window. Removing the increase avoids the large labels.Builder allocation. Query size limiting works as expected here, but may be a bit overzealous for some users until #4248 lands. On the plus side, this might help surface places where PromQL is unnecessarily heavy on memory.

Large ranges with short scrape intervals

If you query a large range (where the definition of "large" varies greatly with your scrape interval), the input and output vectors are likely to take up significant space. This may cause queries over long time frames to fail.

For example, using Grafana to query the query size histogram with an auto interval variable and bumping the time range out to 7 days might yield a query like:

histogram_quantile(
  0.99,
  increase(prometheus_engine_query_size_bytes_bucket{instance="example.com:9090"}[6h])
)

which is a little more than 1GB for a node with a 5s scrape interval, but only ~20MB for a node with a 5m scrape interval. I've found that a ~5GB limit covers almost everything we genuinely need to query while still giving us sufficient protection from bad queries, but this will be very subjective. Queries that hit the limit tend to do so quite quickly, at which point their resources are released anyway, so even with a theoretical upper bound of 100GB of total query memory configured (20x max concurrent queries at a 5GB max query size) we haven't seen aggregate usage above 5-10GB at any time in practice.
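For a rough sense of where that difference comes from (again assuming ~16 bytes per point and ignoring labels): a 6h range at a 5s scrape interval is 4,320 points per series, versus 72 points at a 5m interval, a 60x difference that then gets multiplied by the number of bucket series and by every evaluation step of the range query.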

Alternative Approach

It looks like InfluxDB is planning to decouple querying from ingestion: https://www.dotconferences.com/blog/interview-paul-dix-2018

"We’ve started the work to decouple storage from compute (query processing) to improve reliability and scalability."

I initially tried a similar approach, attempting to add a read-only flag that would allow running two binaries (one for ingestion and one for querying) and hence using cgroups or other system-level restrictions to achieve the same robustness, but it quickly became apparent that this was either not possible or (far more likely) that I didn't possess the requisite knowledge to make it happen.

Additional Benefits

This approach incidentally also gives better visibility into query size distributions through three new metrics:

  • prometheus_engine_queries_size_max_bytes: gauge indicating the currently configured query size limit
  • prometheus_engine_query_size_bytes_bucket: histogram of query sizes, up to and including the configured limit
  • prometheus_engine_queries_oversize_total: counter of the number of queries that returned an error due to exceeding the configured limit

[screenshot from 2018-07-13: Grafana graph of the query size histogram metrics]

The long tail of this histogram is far more interesting than the 90th percentile and below, and it has helped us identify several unnecessarily resource-intensive queries.
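For reference, here is a sketch of how metrics like these could be declared with client_golang; this is illustrative rather than the actual patch, and the bucket boundaries and help strings in particular are my assumptions.

```go
package promqlsize

import "github.com/prometheus/client_golang/prometheus"

var (
	// Currently configured query size limit.
	maxQuerySize = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "prometheus_engine_queries_size_max_bytes",
		Help: "Currently configured maximum query size in bytes.",
	})

	// Distribution of approximate query sizes, exported with the usual
	// _bucket/_sum/_count suffixes.
	querySize = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "prometheus_engine_query_size_bytes",
		Help:    "Approximate size of queries in bytes.",
		Buckets: prometheus.ExponentialBuckets(1024, 4, 12), // 1KiB up to ~4GiB
	})

	// Queries aborted for exceeding the configured limit.
	oversizeQueries = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_engine_queries_oversize_total",
		Help: "Number of queries rejected for exceeding the configured size limit.",
	})
)

func init() {
	prometheus.MustRegister(maxQuerySize, querySize, oversizeQueries)
}
```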

Another benefit is that large queries which exceed the configured query size limit, but would not individually have OOM'd the server, are often short-circuited instead of timing out while attempting to load large volumes of data. This is especially true for queries that load large range vectors of metrics with short scrape intervals, since the size required for the points (without labels) is asserted before all of the points and labels are actually loaded.

What's Missing

Documentation. I think there should at least be some mention of this feature at the bottom of https://prometheus.io/docs/prometheus/latest/querying/basics/, but also definitely in the README.

The query size information could be added to the query stats that are currently available by setting stats=true in the URL query parameters. It could also be surfaced in the UI.

roganartu commented Oct 29, 2018

Related: #4513 (in v2.5.0-rc.0). A sufficient approximation of this behaviour can probably be achieved with that config setting.

brian-brazil commented Dec 7, 2018

Yeah, this is basically done.
