Implement query time-out in local storage. #454
Comments
Am I correct in assuming that long-running queries in general are caused by a high number of considered series? Would it then be sufficient to check for a timeout before each preload of a single series, or should it be done at a lower level?
@fabxc Yes, that should be true for most queries at least. A query consists of multiple stages that can take significant time. Roughly speaking:
For most expensive queries, step 2 (the preloading) will be by far the most time-intensive. Within that step, it should be OK to check between individual time series loads whether a timeout should occur, and I wouldn't make it more granular (like checking between individual chunks of a single time series) before seeing a concrete need. Not sure if we need to check somewhere within 1 and 3, but at least we can check between 1, 2, and 3. Anyway, it should still be easy to check more granularly if needed.
Thanks for the clarification. Based on the title, I assumed we only wanted to check 2 and embed the check in the local storage implementation. If we want an absolute limit, just using the TotalEvalTimer would be fine, but in 3 for range queries you set a TODO for a watchdog timer. Edit: Thinking about it, you probably just meant a) by watchdog timer.
Right, I think it's fair to start out by simply timing out only in the query preloading part, which should be the most relevant part for the majority of expensive queries. For that, we would have to check for timeouts in these loops that contain calls to https://github.com/prometheus/prometheus/blob/master/rules/ast/query_analyzer.go#L121-L132 Currently the existing stats timers (like Does that sound like a good starter project? ;)
Thanks, sounds good to me.
fabxc added a commit to fabxc/prometheus that referenced this issue Feb 3, 2015
Fun fact: I've seen some recent real-life queries (just discussed with @mweiden) where most of the time was burnt in (3). So yeah, I guess it makes sense to implement a proper timeout. And as I hear, @fabxc is already working on it. A similar issue, however, is limiting the number of concurrently running queries. Too many overlapping queries might easily cause a death spiral (where every query takes longer, so long that every query runs into a timeout, and then the server will eventually only be busy running gazillions of queries in parallel that all time out...).
Is this a general observation for all operations, or are there specific ones that turned out to be time-consuming? If my changes are accepted in roughly their current form, adding another timeout checkpoint is a one-liner.
It was a rate over a day (rate(something[1d])), i.e. many values had to be touched, and most of the time the CPU was in the code that reads sample values out of a chunk. That should happen in step 3 above, if I'm not mistaken. (We'll also make chunk iterators more efficient soon.)
mweiden commented Mar 18, 2015
Timeout handling has improved with #676.
brian-brazil added the enhancement label Dec 16, 2015
@fabxc Is there anything to do here except perhaps a custom timeout per query? (If it's only the latter, I'd prefer to have a separate feature request for it.)
A custom timeout has been on the table forever. It's something we can do, but strong use cases haven't come up so far.
fabxc closed this Feb 3, 2016
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
beorn7 commented Jan 21, 2015
We need a (configurable) time-out after which a query is killed and doesn't drain resources anymore. Currently, a long-running query will go on forever, which easily leads to a death spiral if the pile-up of queries slows things down even more (especially if the growing RAM usage causes the machine to start paging).