Implement query time-out in local storage. #454

Closed
beorn7 opened this Issue Jan 21, 2015 · 13 comments

beorn7 commented Jan 21, 2015

We need a (configurable) time-out after which a query is killed and doesn't drain resources anymore. Currently, a long-running query will go on forever, which easily leads to a death spiral if the pile-up of queries slows things down even more (especially if the growing RAM usage causes the machine to start paging).

fabxc commented Feb 1, 2015

Am I correct in assuming that long-running queries in general are caused by a high number of considered series? Would it then be sufficient to check for a timeout before each preload of a single series or should it be done at a lower level?

juliusv commented Feb 1, 2015

@fabxc Yes, that should be true for most queries at least. A query consists of multiple stages that can take significant time. Roughly speaking:

  1. query analysis and doing the index lookups to find out which time series are requested
  2. preloading the required time series samples from memory / disk
  3. running the AST evaluation over these series, e.g. summing, function calls, etc.

For most expensive queries, 2 (the preloading) will be by far the most time-intensive step. Within that step, it should be ok to check between individual time series loads whether a timeout should occur, and I wouldn't make it more granular before seeing a concrete need (like checking between individual chunks of a single time series).

Not sure if we need to check within 1 and 3 somewhere, but at least we can check between 1, 2, and 3. Anyways, it should still be easy to check more granularly if needed.
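As a very rough illustration of the idea (not the actual Prometheus code; all names below are hypothetical), checking a per-query deadline between individual series preloads in stage 2 could look roughly like this:

```go
// Minimal sketch, assuming a per-query deadline; preloadSeries and the series
// IDs are placeholders, not real Prometheus types or functions.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errQueryTimeout = errors.New("query timed out")

// preloadSeries stands in for loading one series' samples from memory/disk.
func preloadSeries(id int) { time.Sleep(10 * time.Millisecond) }

// preloadAll checks the deadline between individual series loads, which is
// where most expensive queries spend their time.
func preloadAll(seriesIDs []int, deadline time.Time) error {
	for _, id := range seriesIDs {
		if time.Now().After(deadline) {
			return errQueryTimeout
		}
		preloadSeries(id)
	}
	return nil
}

func main() {
	err := preloadAll([]int{1, 2, 3, 4, 5}, time.Now().Add(25*time.Millisecond))
	fmt.Println(err) // prints "query timed out" once the budget is exceeded
}
```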

fabxc commented Feb 1, 2015

Thanks for the clarification. Based on the title, I assumed we only wanted to check 2 and embed the check in the local storage implementation.

If we want an absolute limit, just using the TotalEvalTimer would be fine, but in stage 3 you set a TODO for a watchdog timer for range queries.
So we could either a) check for the absolute timeout within that loop, or b) extend the timeout as long as the query is still 'alive'. Since stage 3 is unlikely to be the bottleneck, a) might be sufficient until proven otherwise, yes?

Edit: Thinking about it, you probably just meant a) by watchdog timer.

juliusv commented Feb 1, 2015

Right, I think it's fair to start out by simply timing out only in the query preloading part, which should be the most relevant part for the majority of expensive queries.

For that, we would have to check for timeouts in these loops that contain calls to PreloadRange:

https://github.com/prometheus/prometheus/blob/master/rules/ast/query_analyzer.go#L121-L132
https://github.com/prometheus/prometheus/blob/master/rules/ast/query_analyzer.go#L152-L176

Currently the existing stats timers (like preloadTimer) can't be queried for their current elapsed time, because they are only used to log-print bulk stats at the end of a query. It might make sense to add an ElapsedTime() method to them so we can check for a timeout and then return a timeout error from the prepare*Query() methods.
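As a hedged sketch of what such an ElapsedTime() accessor and timeout check might look like (illustrative only; Timer here is not the real stats timer type):

```go
// Sketch of a stats timer with an ElapsedTime() accessor used for a timeout
// check; this is an assumption about the approach, not the actual code.
package main

import (
	"fmt"
	"time"
)

type Timer struct {
	start time.Time
}

func (t *Timer) Start()                     { t.start = time.Now() }
func (t *Timer) ElapsedTime() time.Duration { return time.Since(t.start) }

func main() {
	preloadTimer := &Timer{}
	preloadTimer.Start()

	time.Sleep(20 * time.Millisecond) // stands in for a PreloadRange call

	if preloadTimer.ElapsedTime() > 10*time.Millisecond {
		// In the real code this would become a timeout error returned
		// from the prepare*Query() methods.
		fmt.Println("query timed out during preloading")
	}
}
```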

Does that sound like a good starter project? ;)

fabxc commented Feb 1, 2015

Thanks, sounds good to me.

fabxc added a commit to fabxc/prometheus that referenced this issue Feb 3, 2015

Query timeout added.
This is related to prometheus#454. Queries now time out after a duration set by
the -query.timeout flag. The TotalEvalTimer is now started/stopped
inside any of the ast.Eval* functions.
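For context, a minimal sketch of how such a duration flag could be declared in Go (the flag name is taken from the commit above; the default value and everything else are illustrative, not the actual Prometheus wiring):

```go
// Sketch only: declares a -query.timeout duration flag.
package main

import (
	"flag"
	"fmt"
	"time"
)

var queryTimeout = flag.Duration("query.timeout", 2*time.Minute,
	"Maximum time a query may run before it is aborted.")

func main() {
	flag.Parse()
	fmt.Println("queries will be aborted after", *queryTimeout)
}
```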

beorn7 commented Mar 17, 2015

Fun fact: I've seen some recent real-life queries (just discussed with @mweiden), where most of the time was burnt in (3). So yeah, I guess it makes sense to implement a proper timeout. And as I hear, @fabxc is already working on it.

A related issue, however, is limiting the number of concurrently running queries. Too many overlapping queries can easily cause a death spiral: every query takes longer, so long that every query runs into the timeout, and eventually the server is only busy running gazillions of parallel queries that all time out...
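One common way to cap the number of in-flight queries is a semaphore. A rough sketch (an assumption about how it could be done, not the actual implementation; the limit and names are hypothetical):

```go
// Sketch: cap concurrently running queries with a buffered channel used as a
// semaphore; maxConcurrentQueries and runQuery are placeholders.
package main

import (
	"fmt"
	"sync"
)

const maxConcurrentQueries = 20 // hypothetical limit

var querySlots = make(chan struct{}, maxConcurrentQueries)

func runQuery(id int) {
	querySlots <- struct{}{}        // blocks while too many queries are in flight
	defer func() { <-querySlots }() // release the slot when done
	fmt.Println("running query", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			runQuery(id)
		}(i)
	}
	wg.Wait()
}
```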

fabxc commented Mar 17, 2015

Is this a general observation for all operations or are there specific ones that turned out to be time-consuming?

If my changes are accepted in roughly their current form, setting another timeout checkpoint is a one-liner.
Limiting the number of concurrent queries is also easy to add, once the main changes are through.

beorn7 commented Mar 17, 2015

It was a rate over a day (rate(something[1d])), i.e. many values had to be touched, and most of the time the CPU was in the code that reads sample values out of a chunk. That should happen in step 3 above, if I'm not mistaken.

(We'll also make chunk iterators more efficient soon.)

mweiden commented Mar 18, 2015

@beorn7 @fabxc @juliusv thanks for your work on this!

fabxc commented May 7, 2015

Timeout handling has improved with #676.
We should revisit where exactly we want to check timeouts and whether we want to support custom timeouts (e.g. via the API).

beorn7 commented Feb 3, 2016

@fabxc Is there anything to do here except perhaps a custom timeout per query? (If it's only the latter, I'd prefer to have a separate feature request for it.)

fabxc commented Feb 3, 2016

A custom timeout has been on the table forever. It's something we can do, but strong use cases haven't come up so far.
If they do, we can indeed open a more specific issue.

@fabxc fabxc closed this Feb 3, 2016

simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
