Prometheus 2: tsdb doesn't clean up locks on unclean shutdown #2689
Interesting. So I tried out several file locking options. Ultimately, this post http://0pointer.de/blog/projects/locking.html sums it up quite well, and I went back to a simple PID-file lock. We are using github.com/nightlyone/lockfile, which reads the PID from a left-behind lock file. If it is equal to the current process's PID, it continues as usual. If it belongs to another process that is dead, it deletes the lock and proceeds. If it belongs to another live process, it errors.
So we use K8s PVCs (persistent volumes backed by GCE persistent SSDs) for our Prometheus instances. This means that even if we need to redeploy a Prometheus instance (e.g. to change pod affinity rules or process requirements), we do not lose the data. As such, it is completely normal for a different pod run (and hence PID) to re-attach to the previous instance's data. Can we make the check:
That's exactly what the library in use does, hence my confusion.
This has happened 3 times already, and the PIDs are not the same since we're using a complicated
Okay, so I think the PID lock package does what we want overall: https://github.com/nightlyone/lockfile/blob/master/lockfile.go#L141-L146 The only failure case I can think of is a live process in your PID namespace that has the same PID Prometheus had before. This is a lot more likely to happen in containers, I think – OTOH, typically not, because the PIDs tend to be the same. FWIW, of all the file locking approaches, this seemed the sanest choice, and I don't see an immediate way to work around this one other than having an init script or similar delete the lock file. Or just not locking at all anymore from the Prometheus side... I'm a bit reluctant on that one.
@fabxc can we have a flag that disables it, even if it is just a temporary workaround? Also, you're right that it is highly likely that another process will take the previous PID. That will happen a lot if you have sidecars.
@fabxc this fixes it for us; from our side it's no longer an issue. Cheers!
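For readers landing here later: Prometheus 2.x did grow a flag that skips creating the data-directory lock file entirely, which addresses the request above. The invocation below is a sketch; the data path is a placeholder, and you should confirm the flag against the docs of your Prometheus version.

```shell
# Start Prometheus without a TSDB lock file (path is a placeholder).
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.no-lockfile
```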
I just stumbled across this issue without a resolution, having seen the same thing in our Mesos/Marathon deployment. I am starting Prometheus with
I just hit this issue on a container after reboot. There is a relevant open bug against the locking library, where I've suggested a more robust fix: |
@fabxc I think we should reopen this. I can reproduce it every time on Linux, NOT in a container (I hit it by accident). At this point the only way I found to stop Prometheus is … I don't have an idea how we can tackle this.
Try to just delete the lock file in data directory. |
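The manual recovery suggested above is a one-liner. Sketch below, assuming the lock file sits at the top of the TSDB data directory; the `/tmp/prom-data` path is a stand-in for your actual `--storage.tsdb.path`, and the `mkdir`/`touch` lines only fabricate a stale lock so the example is self-contained.

```shell
DATA_DIR=/tmp/prom-data                        # placeholder for your data directory
mkdir -p "$DATA_DIR" && touch "$DATA_DIR/lock" # fabricate a stale lock for the demo
rm "$DATA_DIR/lock"                            # remove the stale lock, then restart Prometheus
```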
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
What did you do?
Ran Prom2 in a Kubernetes stateful set pod with a PVC. Re-created the stateful pod pointing it at the same PVC. This means that the pod got rescheduled and the process was restarted (with a new PID), while the data on the PVC was retained.
What did you expect to see?
Upon restarts of the pod, I expected Prom2 to come back up using the same data.
What did you see instead? Under which circumstances?
Manually built from commit:
8c483e2
@fabxc