
Prometheus hangs after interrupt #214

Closed
bernerdschaefer opened this Issue May 3, 2013 · 7 comments


bernerdschaefer commented May 3, 2013

Here's a transcript with a goroutine dump: https://gist.github.com/bernerdschaefer/5508760

It appears to be a problem with the ordering of calls to levigo -- see this gist which shows that the following two cases produce indefinitely blocked goroutines:

  1. Calling db.Close() twice. For example, closing explicitly in one place while also having a deferred close elsewhere can trigger this.
  2. Calling db.Get() after db.Close(). This can happen if the ordering of calls (e.g., from different goroutines) is not synchronized.

The way we capture interrupts in main.go, if the graceful shutdown is blocked by something like this, any further interrupts are simply ignored, and the process needs to be forcibly killed in some way.

Maybe this is related to #31?

@ghost ghost assigned matttproud May 3, 2013

@ghost ghost assigned bernerdschaefer May 23, 2013

matttproud-soundcloud commented May 23, 2013

I think both this and #31 are effectively fixed now, though the Gist about lifecycle management that you proposed to me in person a few weeks back would probably fix this. Are you OK if I assign this to you, since you seem to have made good headway on the design?

typingduck commented Mar 7, 2014

+1

Finding that I have to manually kill prometheus these days as runit cannot do so.

juliusv commented Mar 7, 2014

@typingduck the shutdown has actually been working reliably for a long while now (I fixed it), so I'll close this ticket. What you're seeing with runit is probably just that it times out as Prometheus is still flushing its in-memory state to disk. If you tail the log, you'll see that at some point, it says, "Done flushing" and then exits. Maybe there's a way to increase runit timeouts, but that's not a Prometheus bug then.

@juliusv juliusv closed this Mar 7, 2014

typingduck commented Mar 7, 2014

Do you have an estimate of how long the timeout should be? I have increased the runit timeout to 20 seconds, but that didn't work.

juliusv commented Mar 7, 2014

@typingduck You can look at your Prometheus log file to see the time elapsed between the lines "Flushing samples to disk..." and "Done flushing.". It seems to take approximately 2 minutes in your case, so you might want to set the timeout to 5 minutes or so.

typingduck commented Mar 7, 2014

tks.

lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 25, 2019
