Prometheus seemingly not timing out queries once they've hit the storage layer #3480
Comments
The issue is that we essentially have a lot of tight loops and don't want to check the context in each one. We really want this but don't have a clear answer as to how to implement it cleanly.
My thoughts on that issue so far are that making the context handle the full lifecycle is probably not feasible, i.e. clients should still take care of cleanup themselves. However, passing down the context and occasionally checking it still seems feasible.

I'm wondering, though: for things to actually time out in TSDB, something quite bad must happen. All evaluation is basically deferred/lazy and only happens once the query layer, i.e. PromQL, actually runs over the data. The query engine does have a context and respects it -- that should[tm] be enough. So I think it would be valuable to figure out where exactly in TSDB things were taking so long.
@gouthamve If the concern is checking the context a lot -- the overhead is ~4ns, using the following basic test to check:
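A minimal sketch of such a benchmark (the exact snippet isn't shown in this thread, so the package and function names here are illustrative):

```go
package ctxcheck

import (
	"context"
	"testing"
)

// BenchmarkContextCheck measures the per-iteration cost of a non-blocking
// context check, i.e. the kind of check being proposed for the storage loops.
func BenchmarkContextCheck(b *testing.B) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	for i := 0; i < b.N; i++ {
		select {
		case <-ctx.Done():
			b.Fatal("context unexpectedly done")
		default:
		}
	}
}
```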
Output is as follows:
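Illustrative `go test -bench .` output for the sketch above, using the ~4ns figure quoted in this thread rather than a real measurement:

```
goos: linux
goarch: amd64
BenchmarkContextCheck-8    300000000    4.1 ns/op
PASS
```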
So I don't think the overhead is going to be a problem, and if we are still concerned with an 18ns overhead we can do the check only every N iterations. Checking every 10 iterations (or 100) would still be better than never, which is what happens currently (see the sketch after this comment).

@fabxc To be clear, my primary concern is with the 1.8 line of Prometheus, although I do notice the same issue on the 2.x line. I expect the primary timeout cases to be when (1) the HTTP query is killed (an LB timeout or something similar) or (2) the query is simply taking too long. As for the context, since it is derived from the HTTP request, cancellation covers both of those cases.

Summary: it seems like the hang-up is that people don't see how to implement an immediate stop on context close, and therefore haven't implemented anything. IMO an 18ns cost per time series we touch seems like a reasonable tradeoff for being able to stop processing, but if we aren't okay with that we can scale it back. My primary goal is to stop executing very bad queries before they complete; otherwise we end up taking Prometheus down because someone wrote a bad query (which has happened a few times to me already). So, with that, I'll work on putting together a PR for 1.8 and 2.0 that does some basic context cancellation -- and we can work from that.
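Here is a minimal sketch of the "check every N iterations" compromise described above; the function and its names are hypothetical, not actual Prometheus code:

```go
package storagecheck

import "context"

// iterateSeries processes each series, consulting the context only once per
// checkEvery iterations: the per-series overhead stays negligible, while a
// canceled query still stops promptly instead of running to completion.
func iterateSeries(ctx context.Context, series []string, process func(string)) error {
	const checkEvery = 100
	for i, s := range series {
		if i%checkEvery == 0 {
			select {
			case <-ctx.Done():
				return ctx.Err() // abort the scan once the query is canceled
			default:
			}
		}
		process(s)
	}
	return nil
}
```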
jacksontj added a commit to jacksontj/prometheus that referenced this issue on Nov 17, 2017
I have some patches for both 1.8 and 2.0. The 1.8 one is very straightforward and IMO should be merged -- jacksontj@2c834ae. The 2.0 one is a bit more complex (4be527f), and I definitely think there is room for discussion on it. The goal there isn't to stop processing immediately, but rather quickly. It also seems that in 2.0 there isn't as strict a usage of the interface (the 2.0 API calls some storage things directly on a tsdb.db instead of through a shared interface). That one needs more work; the basic goal is to have the "expensive" things honor cancellation where it makes sense.

BTW: if we want the 2.0 patch I'll submit it as actual PRs against the real repos -- I made the changes locally in the vendor dir for easier review (and testing).
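For a flavor of what "honor cancellation where it makes sense" could look like -- purely a hypothetical sketch with made-up names, not the actual patch -- an expensive storage call can check the query's context before doing any work:

```go
package storagecheck

import "context"

// selectWithCancel wraps an expensive storage call so that a query which has
// already been canceled (or has exceeded its deadline) skips the work entirely.
func selectWithCancel(ctx context.Context, doSelect func() ([]string, error)) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // canceled or deadline exceeded before we started
	}
	return doSelect()
}
```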
jacksontj added a commit to jacksontj/prometheus that referenced this issue on Nov 17, 2017
brian-brazil referenced this issue on Nov 22, 2017: volume bind not happening between local volume and /data (running in the container) #3503 (closed)
ping @gouthamve @fabxc
jacksontj added a commit to jacksontj/prometheus that referenced this issue on Jan 3, 2018
jacksontj added a commit to jacksontj/prometheus that referenced this issue on Jan 3, 2018
jacksontj added a commit to jacksontj/prometheus that referenced this issue on Jan 5, 2018
beorn7 referenced this issue on Mar 8, 2018: [Feature request] Ability to kill queries that will OOM Prometheus #3922 (closed)
krasi-georgiev referenced this issue on Oct 16, 2018: Read-path in TSDB doesn't honor context #411 (closed)
@jacksontj since Brian fixed the underlying issue in prometheus/tsdb#485, can we close this?
I still think it would be good to implement this properly, but prometheus/tsdb#485 is a workaround to mitigate the issue. As it stands it still doesn't "time out", but the steps that don't cancel now finish within seconds, so it's not as terrible as before (when it was on the order of tens of minutes).
Yep I see your point. |
AFAICT the API change here is on internally used methods (ones that we don't guarantee compatibility on?). The main Querier API remains unchanged (it actually uses the context instead of discarding it). IMO getting the context into this flow is still useful at some point, so if this is considered a "breaking change", an understanding of what it would take to make such a change would help for whenever we revisit this.
Sorry, I am a bit lost. How can adding a context avoid changing exported methods? The issue for me is not avoiding changes to exported methods, but trying to avoid adding a context altogether. If it is unavoidable, then of course we can discuss it, but for now I don't see any reason to keep this one open.
I agree with Krasi. If we need it in future, it should be added. There's not really a notion of breaking change here, as this is an internal library. |
@jacksontj are you still against closing this one and revisiting when/if needed?
At this point I'm more confused. Initially the answer was that this wasn't wanted because it was a large API change; now it seems that we are okay with it since this is an internal API. In addition, I'm a bit confused by this:

> trying to avoid adding a context altogether

Why do we want to avoid adding a context? Is there some reason besides the change itself? I'm not aware of any security or performance considerations that would make this something we don't want. As I said before, I think a context should be added in this flow, but if there are reasons not to put it in I'm open to doing it later; this just seems like a pretty easy win for better cancellation.
What do you mean by "an internal API"? The main reason I am against it is that once it is in, it will spread like a virus. And again, if it is unavoidable it will be added, but so far it looks like the optimisations are the right decision. All I am saying is let's try to avoid adding it as long as possible and try to solve any problems with optimisations. Is there any reason why you think it needs to be added right away?
All that said, here is a PR I have been working on that adds a context for cancelling long-running compaction.
@jacksontj I hope you don't mind, but I will close this for now and will reopen or revisit if we face another problem that can't be solved without adding a context. Thanks everyone for all the input.
jacksontj commented on Nov 16, 2017
Seems like I found a bug. We had an incident where Prometheus started using large amounts of memory, which eventually broke the box. From looking at the metrics from Prometheus during the degradation I see some queries which ran for >4m -- but the query timeout is set to 2m.

From looking at the code (https://github.com/prometheus/prometheus/blob/release-1.8/storage/local/storage.go#L577) it seems that once you get to the storage layer, nothing checks whether the context is done. Meaning that if a query needed to (as a crazy scenario) spin over 10M time series, it would do so until it (1) finishes or (2) the process dies.

It seems like this code should be checking the context on some interval to see if it should continue. I looked, and it doesn't seem that this is fixed in Prometheus 2.0 either, and I was unable to find any open issues for this problem.
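To make the failure mode concrete, here is a minimal, self-contained sketch (not Prometheus code; all names are made up) of a loop that outlives its context's deadline because nothing ever consults the context:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scanAllSeries stands in for the storage-layer loop described above: it
// receives a context but never checks it, so cancellation and the deadline
// have no effect on how long it runs.
func scanAllSeries(ctx context.Context, n int) int {
	matched := 0
	for i := 0; i < n; i++ {
		matched++ // stand-in for real per-series work
	}
	return matched
}

func main() {
	// The query layer sets a timeout, analogous to a 2m query timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Runs to completion regardless of ctx; only the caller ever notices
	// that the deadline passed.
	fmt.Println(scanAllSeries(ctx, 10_000_000))
}
```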