[Feature request] Ability to kill queries that will OOM Prometheus #3922
Comments
+1. An option to kill a query is very useful in general, irrespective of how much memory it takes.
Queries already have timeouts, and will be terminated when the TCP connection goes away. We can't know which queries will OOM, as that's a matter for the kernel.
So an option to end the TCP connection in the UI, then? I can work on that :) Do I have approval?
The query will already have timed out by the time you get to that; I don't think that complexity would be worth it.
Note that there is also #3480. So there is apparently still a sequence of events where even the existing timeout doesn't stop a query from going wild. Having said that, I see uses for both: a manual way of killing queries, as well as some attempt to kill memory-intensive queries before the server OOMs. I had put some thought into this for 1.x, but before doing anything concrete, I learned that Prom 2 was already being implemented.

The manual kill would be helpful in scenarios where the timeout has been configured to be fairly long on a server that routinely has to run complex queries. If you already know you have mistyped a query, you don't want to wait 10m for the resources to be freed again.

The memory-kill is much harder, of course. I have OOM'd quite a few Prom2 servers already, especially as there are quite a few query types that need more RAM on Prom2 than on Prom1. The most naive implementation would be to let the user configure a max heap size (again). Then we could observe the heap size at regular intervals and kill random (or all) running queries, followed by an enforced GC run. Obvious drawbacks are that this approach has a certain reaction time (bounded by the heap-size polling frequency) and will just randomly kill queries (or kill all of them), which might not hit the worst offender (or will hit it only by hitting everybody).

Much better would be some way of estimating the memory consumption of individual queries. Then we could kill the worst offender once the aggregate of all running queries passes a configured threshold. For that, we would need to understand the ways queries consume memory (which would also solve the puzzle of why some queries consume so much more memory on Prom2 than on Prom1).
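To make the "naive implementation" above concrete, here is a minimal sketch (my own illustration with made-up names such as `queryTracker`, not Prometheus internals), assuming every query runs under a cancellable `context.Context`:

```go
// Rough sketch of the naive heap watchdog described above (illustrative only):
// poll the Go heap and, once it crosses a configured soft limit, cancel every
// in-flight query and force a GC run.
package watchdog

import (
	"context"
	"runtime"
	"sync"
	"time"
)

// queryTracker remembers the cancel function of every running query.
type queryTracker struct {
	mu      sync.Mutex
	cancels map[uint64]context.CancelFunc
	nextID  uint64
}

func newQueryTracker() *queryTracker {
	return &queryTracker{cancels: make(map[uint64]context.CancelFunc)}
}

// register is called when a query starts; the returned id is passed to done().
func (t *queryTracker) register(cancel context.CancelFunc) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextID++
	t.cancels[t.nextID] = cancel
	return t.nextID
}

func (t *queryTracker) done(id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.cancels, id)
}

// cancelAll is the blunt "kill all queries" variant; it only hits the worst
// offender by hitting everybody, which is exactly the drawback noted above.
func (t *queryTracker) cancelAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, cancel := range t.cancels {
		cancel()
		delete(t.cancels, id)
	}
}

// watchHeap polls HeapAlloc; the reaction time is bounded by the poll interval.
func watchHeap(t *queryTracker, maxHeapBytes uint64, interval time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc > maxHeapBytes {
			t.cancelAll()
			runtime.GC() // enforced GC run after killing the queries
		}
	}
}
```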
I'm already working on the memory component for PromQL; I'd rather get our memory usage down than add potentially complex management features. I'll probably fix #3480 in passing.
kzadorozhny commented Mar 22, 2018
I agree that reducing query memory requirements will be an improvement, but there will always be a query that OOMs the entire server. Right now there are zero protections against user error. You can't ask your users not to query Prometheus, or not to collect the metrics they need. Here is our scenario:
Having some sort of limit on the number of time series scanned per query would do the trick.
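As a rough sketch of that suggestion (my own illustration under assumed names like `SampleBudget`, not the actual Prometheus code): the engine would call `Add` for every series or sample it loads and abort the query with an error once a configured ceiling is exceeded, turning a would-be OOM into an explicit query failure — the direction #4513 (referenced further down) later took for samples.

```go
// Illustrative per-query limit: give each query evaluation a budget of
// samples/series it may touch, and abort once the budget is spent.
package limiter

import "fmt"

// LimitError is returned when a query exceeds its budget.
type LimitError struct {
	Limit int
}

func (e LimitError) Error() string {
	return fmt.Sprintf("query would load too many samples into memory (limit %d)", e.Limit)
}

// SampleBudget is attached to a single query evaluation.
type SampleBudget struct {
	limit int
	used  int
}

// NewSampleBudget creates a budget; limit <= 0 means "no limit".
func NewSampleBudget(limit int) *SampleBudget {
	return &SampleBudget{limit: limit}
}

// Add records n more samples scanned; the caller aborts the query on error
// instead of letting memory grow without bound.
func (b *SampleBudget) Add(n int) error {
	b.used += n
	if b.limit > 0 && b.used > b.limit {
		return LimitError{Limit: b.limit}
	}
	return nil
}
```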
matthiasr referenced this issue Jul 12, 2018: Instrumenting "time series examined" for queries and rules #4374 (open)
Any update on this?
aylei commented Dec 24, 2018
Isn't this already covered by #4513?
Yip.
brian-brazil closed this Dec 24, 2018
#4513 is huge progress, and I think it will solve most of the typical cases of "enter an expensive query → OOM the server". However, it is still different from observing the heap size and doing something if it grows close to a configurable limit. While loaded samples should be closely correlated with memory usage, there are other things taking memory during query evaluation, and excessive heap growth can also happen because of ingesting too many different series at the same time, or compactions resulting in blocks that are too large, or whatever…

In other words: what's still missing is a general watchdog for the heap size. As we know from Prometheus 1.x, that's currently hard to implement in Go; golang/go#16843 is relevant here. Quite old and still unresolved…

One might argue that such a watchdog should also do other things than just killing queries, like stopping compactions or even stopping ingestion (as happened in Prometheus 1.x, where that also turned out to be a double-edged sword). From that perspective, the title of this issue should be changed, or a new issue could be created and this one closed. In any case, I'd say that more work towards OOM protection will get quite complicated, while there is not much left to gain at this point.
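For reference, the sample limit from #4513 is exposed as a command-line flag in Prometheus 2.x. The flag names and defaults below should be verified against `prometheus --help` for your version, but combining the sample limit with a tighter query timeout is the practical way to bound the damage a single query can do:

```
# Hedged example; check `prometheus --help` on your version for exact names/defaults.
prometheus \
  --query.max-samples=50000000 \
  --query.timeout=2m \
  --query.max-concurrency=20
```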
Agreed, at this point I think all the main variables are under control. |
zemek commented Mar 7, 2018
It would be nice if Prometheus could kill a query that ends up using a ton of memory. Maybe more than some configured amount?