[Feature request] Ability to kill queries that will OOM Prometheus #3922

Closed
zemek opened this Issue Mar 7, 2018 · 12 comments

@zemek (Contributor) commented Mar 7, 2018

It would be nice if Prometheus could kill a query that ends up using a lot of memory, perhaps more than some configured limit.

@codesome (Member) commented Mar 8, 2018

+1. An option to kill a query would be very useful in general, irrespective of how much memory it takes.

@brian-brazil (Member) commented Mar 8, 2018

Queries already have timeouts, and will be terminated when the TCP connection goes away. We can't know which queries will OOM, as that's a matter for the kernel.

@codesome (Member) commented Mar 8, 2018

So an option to end the TCP connection in the UI, then? I can work on that :) Do I have approval?

@brian-brazil (Member) commented Mar 8, 2018

The query will already have timed out by the time you get to that; I don't think the complexity would be worth it.

@beorn7 (Member) commented Mar 8, 2018

Note that there is also #3480, so there is apparently still a sequence of events where even the existing timeout doesn't stop a query gone wild.

Having said that, I see uses for both: a manual way of killing queries, and some attempt to kill memory-intensive queries before the server OOMs. I had put some thought into it for 1.x, but before doing anything concrete, I learned that Prom 2 was already being implemented.

The manual kill would be helpful in scenarios where the timeout has been configured to be fairly long on a server that routinely has to run complex queries. If you already know you have mistyped a query, you don't want to wait 10 minutes for the resources to be freed again.

The memory-kill is much harder, of course. I have OOM'd quite a few Prom2 servers already, especially as there are quite a few query types that need more RAM on Prom2 than on Prom1.

The most naive implementation would be to let the user configure a max heap size (again). We could then observe the heap size at regular intervals and, once it is exceeded, kill random (or all) running queries, followed by an enforced GC run.

Obvious drawbacks are that this approach has a certain reaction time (bounded by how often the heap size is polled) and will just randomly kill queries (or kill all of them), which might not hit the worst offender (or will hit it only by hitting everybody).
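A minimal sketch of what such a watchdog could look like, assuming a hypothetical registry of cancellable queries and a hypothetical `WatchHeap` entry point; none of this exists in Prometheus itself:

```go
// Naive heap watchdog: poll the Go heap and kill all running queries once
// a configured limit is exceeded. All names here are hypothetical.
package watchdog

import (
	"context"
	"runtime"
	"runtime/debug"
	"sync"
	"time"
)

var (
	mu            sync.Mutex
	activeQueries = map[int]context.CancelFunc{} // hypothetical bookkeeping of running queries
)

// WatchHeap polls the heap size every `every` and, once it crosses
// maxHeapBytes, cancels every running query and forces a GC run that
// returns memory to the OS. Reaction time is bounded by the poll interval.
func WatchHeap(maxHeapBytes uint64, every time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(every) {
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc < maxHeapBytes {
			continue
		}
		mu.Lock()
		for id, cancel := range activeQueries {
			cancel() // query evaluation must honour ctx.Done() for this to help
			delete(activeQueries, id)
		}
		mu.Unlock()
		debug.FreeOSMemory() // enforced GC run
	}
}
```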

Much better would be to have some way of estimating the memory consumption of individual queries. Then we could kill the worst offender once the aggregate of all running queries passes a configured threshold. For that, we would need to understand the ways queries consume memory (which would also solve the puzzle of why some queries consume so much more memory on Prom2 than on Prom1).
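A sketch of that "kill the worst offender" variant, assuming each query could report an estimate of its own footprint (for example, samples loaded so far times an average per-sample cost); again, all names are hypothetical:

```go
// Worst-offender enforcement: cancel the single largest query once the
// aggregate estimate across all running queries exceeds a configured budget.
package watchdog

import "context"

type trackedQuery struct {
	estimatedBytes uint64
	cancel         context.CancelFunc
}

// EnforceBudget scans the running queries, sums their estimates, and
// cancels the largest one if the total exceeds budgetBytes.
func EnforceBudget(running []trackedQuery, budgetBytes uint64) {
	var total uint64
	worst := -1
	for i, q := range running {
		total += q.estimatedBytes
		if worst < 0 || q.estimatedBytes > running[worst].estimatedBytes {
			worst = i
		}
	}
	if total > budgetBytes && worst >= 0 {
		running[worst].cancel()
	}
}
```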

@brian-brazil (Member) commented Mar 8, 2018

I'm already working on the memory component for PromQL; I'd rather get our memory usage down than add potentially complex management features. I'll probably fix #3480 in passing.

@kzadorozhny commented Mar 22, 2018

I agree that reducing query memory requirements will be an improvement, but there will always be a query that OOMs the entire server. Right now there are zero protections against user error. You can't ask your users not to query Prometheus, or not to collect the metrics they need.

Here is our scenario:

  1. User opens the Prometheus UI. Graph mode is the default.
  2. User starts typing very_popular_metric and hits Enter.
  3. Prometheus tries to graph 10,000+ time series.
  4. Game over. The server OOMs well before the 30-second timeout kicks in.
  5. Ten minutes later, after crash recovery completes, the server is up again.

Having some sort of limit on the number of time series scanned per query would do the trick.

@zemek (Contributor, Author) commented Oct 19, 2018

Any update on this? 🙂

@aylei commented Dec 24, 2018

Isn't this already covered by #4513?

@brian-brazil (Member) commented Dec 24, 2018

Yip.

@beorn7 (Member) commented Dec 24, 2018

#4513 is huge progress, and I think it will solve most of the typical cases of “enter an expensive query → OOM the server”. However, it is still different from observing the heap size and doing something if it grows close to a configurable limit. While the number of loaded samples should correlate closely with memory usage, there are other things consuming memory during query evaluation, and excessive heap growth could also be caused by ingesting too many different series at the same time, compactions resulting in blocks that are too large, or whatever…

In other words, what's still missing is a general watchdog for the heap size. As we know from Prometheus 1.x, that is currently hard to implement in Go; golang/go#16843 is relevant here, quite old and still unresolved… One might argue that such a watchdog should also do things other than just killing queries, like stopping compactions or even stopping ingestion (as happened in Prometheus 1.x, where that also turned out to be a double-edged sword). From that perspective, the title of this issue should be changed, or a new issue could be created and this one closed.

In any case, I'd say that more work towards OOM-protection will get quite complicated while there is not much left to gain at this point.
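For reference, the limit from #4513 is exposed as a command-line flag alongside the existing query safeguards. A sketch of how they might be set; the exact flag names and defaults below may differ between Prometheus versions, so treat them as assumptions rather than a definitive reference:

```
# Query safeguards on a Prometheus 2.x server
# (--query.max-samples is the per-query sample limit introduced by #4513)
prometheus \
  --query.timeout=2m \
  --query.max-concurrency=20 \
  --query.max-samples=50000000
```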

@brian-brazil (Member) commented Dec 24, 2018

In any case, I'd say that more work towards OOM-protection will get quite complicated while there is not much left to gain at this point.

Agreed; at this point I think all the main variables are under control.
