Prometheus seemingly not timing out queries once they've hit the storage layer #3480

Closed
jacksontj opened this Issue Nov 16, 2017 · 17 comments

jacksontj (Contributor) commented Nov 16, 2017

Seems like I found a bug. We had an incident where Prometheus started using large amounts of memory, which eventually took the box down. Looking at the metrics from Prometheus during the degradation, I see some queries that ran for >4m -- but the query timeout is set to 2m.

From looking at the code (https://github.com/prometheus/prometheus/blob/release-1.8/storage/local/storage.go#L577) it seems that once you get to the storage layer nothing checks whether the context is done. That means if there were a query that needed to (as a crazy scenario) spin over 10M time series, it would do so until either (1) it finishes or (2) the process dies.

It seems like this code should be checking the context at some interval to decide whether it should continue. I looked and it doesn't seem that this is fixed in Prometheus 2.0 either, and I was unable to find any open issues about it.
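To illustrate the failure mode, here is a hypothetical sketch (not the actual storage.go code): the context is in scope but never consulted, so the loop keeps running long after the 2m query timeout fires.

package main

import "context"

// selectSeries is a hypothetical stand-in for the storage-layer iteration:
// ctx is available but never checked, so a cancelled or timed-out query
// still walks every matching series until it finishes or the process dies.
func selectSeries(ctx context.Context, series []string) []string {
	matched := make([]string, 0, len(series))
	for _, s := range series {
		matched = append(matched, s) // no ctx.Done()/ctx.Err() check anywhere
	}
	return matched
}

func main() {
	_ = selectSeries(context.TODO(), make([]string, 10000000))
}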

  • Prometheus version: 1.8
jacksontj (Contributor, Author) commented Nov 16, 2017

To be clear, I'm 100% happy with doing the implementation-- I just want to make sure I'm not missing something obvious here before I do the work.

cc: @fabxc @juliusv

gouthamve (Member) commented Nov 17, 2017

@jacksontj prometheus/tsdb#84

The issue is that we essentially have a lot of tight loops and don't want to check the context in each one. We really want this but don't have a clear answer as to how to implement it cleanly.

fabxc (Member) commented Nov 17, 2017

My thinking on that issue so far is that making the context handle the full lifecycle is probably not feasible, i.e. clients should take care of calling Close() regardless. It just gets unnecessarily complex otherwise.

However, passing down the context and occasionally checking it still seems feasible. I'm wondering, though: for things to actually time out in TSDB, something quite bad must happen, because all evaluation is basically deferred/lazy and only happens once the query layer, i.e. PromQL, actually runs over the data.

The query engine does have a context and respects it – that should[tm] be enough. So I think it would be valuable to figure out where exactly in TSDB things were taking so long.
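For reference, the usual shape of that arrangement (illustrative only, not the exact engine code): the timeout lives in a context created at the query layer, and it only has an effect where the layers underneath actually observe it.

package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative only: the query layer creates a context with the configured
// timeout and hands it down; whether the deadline has any effect depends on
// the storage code underneath actually checking ctx.
func runQuery(parent context.Context, timeout time.Duration, eval func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return eval(ctx)
}

func main() {
	err := runQuery(context.Background(), 10*time.Millisecond, func(ctx context.Context) error {
		<-ctx.Done() // a well-behaved evaluation notices the deadline...
		return ctx.Err()
	})
	fmt.Println(err) // ...and surfaces context.DeadlineExceeded
}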

jacksontj (Contributor, Author) commented Nov 17, 2017

@gouthamve If the concern is the cost of checking the context a lot -- the overhead is ~18ns per check, using the following basic benchmark:

package ctxcheck

import (
	"context"
	"fmt"
	"testing"
)

// BenchmarkCtxCheck measures the cost of a non-blocking check of ctx.Done()
// in a tight loop.
func BenchmarkCtxCheck(b *testing.B) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	dead := false
	for n := 0; n < b.N; n++ {
		select {
		case <-ctx.Done():
			dead = true
		default:
			dead = false
		}
	}
	if dead {
		fmt.Println(dead) // keep the result live so the loop isn't optimized away
	}
}

Output is as follows:

ontj/test/ctxCheck$ go test -run=x -bench=.
goos: linux
goarch: amd64
pkg: github.com/jacksontj/test/ctxCheck
BenchmarkCtxCheck-4   	100000000	        17.9 ns/op
PASS
ok  	github.com/jacksontj/test/ctxCheck	1.808s

So I don't think the overhead is going to be a problem, and if we're still concerned about an 18ns overhead we can check only every N iterations. Checking every 10 (or 100) iterations would still be better than never, which is what happens currently.
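As a rough sketch of that "every N iterations" idea (a hypothetical loop, not actual Prometheus code), with checkInterval assumed to be 100:

package main

import (
	"context"
	"fmt"
)

const checkInterval = 100 // only pay the context-check cost once per 100 series

// scanSeries walks a (hypothetical) list of series and bails out promptly
// once the query's context is cancelled or its deadline has passed.
func scanSeries(ctx context.Context, series []string) ([]string, error) {
	matched := make([]string, 0, len(series))
	for i, s := range series {
		if i%checkInterval == 0 {
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			default:
			}
		}
		matched = append(matched, s) // real matching/decoding work elided
	}
	return matched, nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a query that has already been cancelled
	_, err := scanSeries(ctx, make([]string, 1000000))
	fmt.Println(err) // context canceled
}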

@fabxc To be clear, my primary concern is with the 1.8 line of Prometheus, although I do notice the same issue on the 2.x line.

I expect the primary timeout cases to be (1) the HTTP query is killed (LB timeout or something similar) or (2) the query is taking too long. As for the context: since the Queryable interface passes one in, I would 100% expect that any querier implementation would stop if the context is done. That being said, there isn't a strict SLA for how quickly it needs to exit. If there are concerns about performance overhead, I think that's fine to leave up to the storage engine authors, but the expectation should be that it will exit. Thinking specifically about the case I brought up previously (a query which spans 1M time series), it would be unreasonable for the storage engine to keep spinning over time series if the ctx is closed.
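To make that expectation concrete, here is a minimal sketch of a querier honoring the context; the interface below is a simplified stand-in, not the actual Prometheus Queryable/Querier API (which differs between 1.8 and 2.x):

package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for illustration only.
type Series struct{ Name string }

type Querier interface {
	Select(ctx context.Context, matchers ...string) ([]Series, error)
}

type memQuerier struct{ series []Series }

// Select gives up as soon as it notices the caller's context is done,
// rather than spinning over every remaining series.
func (q *memQuerier) Select(ctx context.Context, matchers ...string) ([]Series, error) {
	var out []Series
	for _, s := range q.series {
		if err := ctx.Err(); err != nil {
			return nil, err // no strict SLA on how fast, but it must exit
		}
		out = append(out, s) // matcher evaluation elided
	}
	return out, nil
}

func main() {
	var q Querier = &memQuerier{series: []Series{{Name: "up"}}}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // e.g. the LB killed the HTTP request or the query timed out
	_, err := q.Select(ctx, `__name__="up"`)
	fmt.Println(err) // context canceled
}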

Summary: It seems like the hang-up is that people don't see how to implement an immediate stop on context close, and therefore haven't implemented it at all. IMO an 18ns cost per time series we touch seems like a reasonable tradeoff for being able to stop processing, but if we aren't okay with that we can scale it back. My primary goal is to stop executing very bad queries before they complete; otherwise we end up taking Prometheus down because someone wrote a bad query (which has already happened to me a few times).

So, with that I'll work on putting together a PR for 1.8 and 2.0 that does some basic context cancellations -- and we can work from that.

jacksontj added a commit to jacksontj/prometheus that referenced this issue Nov 17, 2017

Add context cancellation support to local storage
Now if the context is cancelled during fetching of the time-series the
query will exit

Fixes prometheus#3480 for 1.8
jacksontj (Contributor, Author) commented Nov 17, 2017

I have some patches for both 1.8 and 2.0

The 1.8 one is very straightforward and IMO should be merged -- jacksontj@2c834ae

The 2.0 one is a bit more complex (4be527f), and I definitely think there is room for discussion on it. The goal there isn't to stop processing immediately, but rather quickly. It also seems that in 2.0 the interface usage isn't as strict (the 2.0 API calls some storage code directly on a tsdb.DB instead of going through a shared interface). This one needs more work; the basic goal is to have the "expensive" things honor cancellation where it makes sense.

BTW: if we want the 2.0 patch I'll submit it as actual PRs against the real repos -- I did the changes locally in the vendor dir for easier review (and testing).

jacksontj added a commit to jacksontj/prometheus that referenced this issue Jan 3, 2018

jacksontj added a commit to jacksontj/prometheus that referenced this issue Jan 3, 2018

jacksontj added a commit to jacksontj/prometheus that referenced this issue Jan 5, 2018
krasi-georgiev (Member) commented Jan 28, 2019

@jacksontj since Brian fixed the underlying issue in prometheus/tsdb#485 can we close this?

jacksontj (Contributor, Author) commented Feb 3, 2019

I still think it would be good to implement this properly, but prometheus/tsdb#485 is a workaround to mitigate the issue. As it stands now it still doesn't "time out", but it finishes the steps that don't cancel "within seconds", so it's not as terrible as before (when it was on the order of tens of minutes).

krasi-georgiev (Member) commented Feb 3, 2019

> I still think it would be good to implement this properly, but prometheus/tsdb#485 is a workaround to mitigate the issue.

Yep, I see your point.
Since we have discussed that adding a context is a rather drastic change to the API, what do you say we close this now and revisit or reopen it if we hit an issue again that can't be solved by internal optimisations?

jacksontj (Contributor, Author) commented Feb 5, 2019

AFAICT the API change here is on internally used methods (ones we don't guarantee compatibility on?). The main Querier API remains unchanged (it actually uses the context instead of discarding it). IMO getting the context into this flow is still useful at some point, so if this is considered a "breaking change", an understanding of what it would take to make such a change would help whenever we revisit this.

krasi-georgiev (Member) commented Feb 5, 2019

Sorry I am a bit lost. How can adding a context avoid changing exported methods?

The issue for me is not to avoid changing exported methods, but actually trying to avoid adding context altogether. If it is unavoidable, then of course we can discuss it, but for now I don't see any reason to keep this one open.

brian-brazil (Member) commented Feb 5, 2019

I agree with Krasi. If we need it in the future, it should be added. There's not really a notion of breaking change here, as this is an internal library.

krasi-georgiev (Member) commented Feb 8, 2019

@jacksontj are you still against closing this one and revisiting when/if needed?

jacksontj (Contributor, Author) commented Feb 9, 2019

At this point I'm more confused. Initially the answer was that this wasn't wanted because it was a large API change; now it seems that we are okay since this is an internal API. In addition I'm a bit confused by this:

> The issue for me is not to avoid changing exported methods, but actually trying to avoid adding context altogether.

Why do we want to avoid adding a context? Is there some reason besides the change itself? I'm not aware of any security or performance considerations that would make this something we don't want.

As I said before, I think a context should be added in this flow, but if there are reasons not to put it in, I'm open to doing it later; this just seems like a pretty easy win for better cancellation.

krasi-georgiev (Member) commented Feb 9, 2019

> now it seems that we are okay since this is an internal API

What do you mean by "an internal API"?

The main reason I am against it is that once it is in, it will spread like a virus 😄
If one API starts taking a context, soon enough we would add it to most if not all of them.
https://faiface.github.io/post/context-should-go-away-go2/

And again, if it is unavoidable it will be added, but so far it looks like the optimisations are the right decision.

All I am saying is let's try to avoid adding it for as long as possible and try to solve any problems through optimisations.

Is there any reason why you think it needs to be added right away?

krasi-georgiev (Member) commented Feb 9, 2019

All that said, here is a PR I have been working on that adds a context for cancelling long-running compaction.
I don't think that is a problem which can be solved any other way, so there adding a context is unavoidable.
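For illustration, the cancellation hook has roughly this shape (hypothetical names, not the actual tsdb compactor API): the compactor checks the context between blocks so a long-running compaction can be interrupted.

package main

import (
	"context"
	"fmt"
)

// compact is a hypothetical stand-in for a long-running compaction: it works
// through one block at a time and stops between blocks once ctx is done.
func compact(ctx context.Context, blocks []string) error {
	for _, b := range blocks {
		select {
		case <-ctx.Done():
			return ctx.Err() // abort between blocks rather than finishing them all
		default:
		}
		fmt.Println("compacting block", b) // real compaction work elided
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // e.g. shutdown requested mid-compaction
	fmt.Println(compact(ctx, []string{"block-1", "block-2"}))
}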

krasi-georgiev (Member) commented Feb 27, 2019

@jacksontj I hope you don't mind, but I will close it for now and will reopen or revisit if we face another problem that can't be solved without adding a context.

Thanks everyone for all the input.
