
SIGSEGV in storage/metric tests #135

Closed · bernerdschaefer opened this issue Apr 12, 2013 · 9 comments

bernerdschaefer (Contributor) commented Apr 12, 2013

It's intermittent, but I do see this every few times I run `make` in the root directory or `go test` from `storage/metric`. Here's the trace: https://gist.github.com/bernerdschaefer/5371918

juliusv (Member) commented Apr 12, 2013

This has been bugging us for a while. I can never reproduce it if I just run the single crashing test:

go test ./... -test.run=TestGetFingerprintsForLabelSet

But if I run all the tests, it crashes in this one. So I suspect it has something to do with the storage not being closed correctly, or some kind of side effects between tests. Need to dig deeper.

bernerdschaefer (Contributor, author) commented Apr 15, 2013

I think I've figured out what's causing this!

tieredStorage.Close() (https://github.com/prometheus/prometheus/blob/master/storage/metric/tiered.go#L220) calls Drain and then closes the underlying data store. However, if you check out tieredStorage.Serve() (https://github.com/prometheus/prometheus/blob/master/storage/metric/tiered.go#L182), you'll see that calling Drain doesn't break the event loop. This leads to a case where Prometheus tries to write data after the storage is closed: we call Drain, close the database, and then one of the flush/write-memory tickers fires.

Note: this also means that calling Close() on tieredStorage will never shut down its Serve() goroutine.
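
A minimal sketch of the failure mode described above, with simplified, hypothetical names (the real code lives in storage/metric/tiered.go; this is not its actual structure). The `closed` flag stands in for the closed LevelDB handle:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the real tiered storage types.
// Synchronization is omitted for brevity.
type tieredStorage struct {
	draining chan chan bool
	closed   bool // stands in for the closed LevelDB handle
}

func (t *tieredStorage) flush() {
	if t.closed {
		// In the real code this is a write to an already-closed
		// LevelDB, which is what produced the SIGSEGV.
		panic("write after close")
	}
	fmt.Println("flushed")
}

// Serve is the event loop. Note the bug: `break` only exits the
// select, so the for loop keeps servicing ticker events.
func (t *tieredStorage) Serve() {
	writeTicker := time.NewTicker(10 * time.Millisecond)
	defer writeTicker.Stop()
	for {
		select {
		case <-writeTicker.C:
			t.flush()
		case drained := <-t.draining:
			drained <- true
			break // exits only the select; the loop continues
		}
	}
}

// Close drains and then "closes" the store, but Serve's loop is
// still alive and will tick again.
func (t *tieredStorage) Close() {
	drained := make(chan bool)
	t.draining <- drained
	<-drained
	t.closed = true
}

func main() {
	t := &tieredStorage{draining: make(chan chan bool)}
	go t.Serve()
	time.Sleep(25 * time.Millisecond)
	t.Close()
	time.Sleep(25 * time.Millisecond) // next tick fires and panics
}
```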

matttproud (Member) commented Apr 15, 2013

Let me just say this: all of the extra usage coverage you are providing is thoroughly excellent. I'm ecstatic about the problems we've been uncovering and fixing, as well as the interface improvements we're making.

A side note: the tiered storage code is immature and was the result of an exhaustive sprint, so don't be surprised to find warts in it. The old storage code was simple and easy to reason about, but could never be performant. I'm hoping that with some additional refactoring the tiered components will one day take on some of these more intuitive qualities. A major improvement would be to break metric indexing out into a separate service that happens out-of-band of time series sample writing.


juliusv (Member) commented Apr 15, 2013

Hm, from what I can see, Drain() blocks until the storage is fully drained (https://github.com/prometheus/prometheus/blob/master/storage/metric/tiered.go#L111), breaks the Serve() event loop (https://github.com/prometheus/prometheus/blob/master/storage/metric/tiered.go#L182), and only then closes the underlying data store. Am I missing something?


bernerdschaefer (Contributor, author) commented Apr 15, 2013

@juliusv `break` in that context breaks the `select` (which, in this case, does nothing), not the `for` loop.
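
For reference, a standalone demonstration of the Go rule in play here: an unlabeled break inside a select case terminates only the select statement, not the enclosing for loop; exiting the loop requires a labeled break or a return.

```go
package main

import "fmt"

func main() {
	done := make(chan struct{})
	close(done) // a closed channel is always ready to receive

	for i := 0; i < 3; i++ {
		select {
		case <-done:
			fmt.Println("iteration", i)
			break // terminates only the select statement
		}
	}
	// Prints "iteration 0", "iteration 1", "iteration 2":
	// the for loop ran to completion despite the break.

	// A labeled break (or a return) is needed to leave the loop:
loop:
	for {
		select {
		case <-done:
			break loop // exits the labeled for loop
		}
	}
	fmt.Println("exited the loop")
}
```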

juliusv (Member) commented Apr 15, 2013

Ohh... you're right, my eyes didn't even see the for loop, but of course it has to be there :)

So I'll just change it to return.

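In terms of the simplified sketch above, the intended fix replaces the no-op break with a return, so that Serve actually exits once the drain is acknowledged:

```go
// Serve with the fix applied: return terminates both the loop and
// the Serve goroutine, instead of just the enclosing select.
func (t *tieredStorage) Serve() {
	writeTicker := time.NewTicker(10 * time.Millisecond)
	defer writeTicker.Stop()
	for {
		select {
		case <-writeTicker.C:
			t.flush()
		case drained := <-t.draining:
			drained <- true
			return // actually shuts down the event loop
		}
	}
}
```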

juliusv (Member) commented Apr 15, 2013

Damn, "break" -> "return" just makes the tests hang forever. Trying to find
out why. I'm suspecting that something is still trying to write to the
storage after telling it to drain and then gets blocked on the channels.


juliusv (Member) commented Apr 15, 2013

Found the problem: a test called Close() (indirectly) and then afterwards Drain(). Fix in #143.

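Continuing the simplified sketch, this is the shape of the hang: once Serve has returned, nothing receives on the draining channel, so a later Drain() blocks forever. This is an illustration of the described test bug, not the code from #143.

```go
// Drain blocks until Serve acknowledges the drain request. If Serve
// has already returned (because Close drained it first), the send
// below has no receiver and blocks forever.
func (t *tieredStorage) Drain() {
	drained := make(chan bool)
	t.draining <- drained // hangs if Serve already exited
	<-drained
}
```

The failing test did, in effect:

```go
t.Close() // drains, Serve returns, store is closed
t.Drain() // sends on t.draining with no receiver: the test hangs
```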

juliusv closed this Apr 17, 2013

simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017
