Prometheus 2: tsdb doesn't clean up locks on unclean shutdown #2689
Comments
Interesting. So I tried out several file locking options. Ultimately, this post http://0pointer.de/blog/projects/locking.html sums it up quite well, and I went back to a simple PID file lock. We are using github.com/nightlyone/lockfile, which reads the PID from a left-behind lock file. If it is equal to the current process's PID, it continues as usual. If it belongs to another process that is dead, it deletes the lock and proceeds. If it belongs to another live process, it errors.
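For illustration, the scheme described above boils down to roughly the following sketch (this is not the library's actual code; the file path, helper name, and error handling are made up for the example):

```go
// Sketch of the PID-file locking scheme described above (illustrative only,
// not the actual nightlyone/lockfile implementation). It reads the PID stored
// in the lock file, treats a dead owner as a stale lock, and refuses to start
// if another live process owns it.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

func acquirePIDLock(path string) error {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		// No lock yet: claim it by writing our own PID.
		return os.WriteFile(path, []byte(strconv.Itoa(os.Getpid())), 0o644)
	}
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("corrupt lock file %s: %w", path, err)
	}
	if pid == os.Getpid() {
		// The lock already belongs to this process: continue as usual.
		return nil
	}
	// Signal 0 probes whether the owning process still exists without
	// actually signalling it.
	if err := syscall.Kill(pid, 0); err == syscall.ESRCH {
		// Owner is dead: the lock is stale, so take it over.
		return os.WriteFile(path, []byte(strconv.Itoa(os.Getpid())), 0o644)
	}
	return fmt.Errorf("lock %s held by live process %d", path, pid)
}

func main() {
	// Example path only.
	if err := acquirePIDLock("/tmp/prom-example.lock"); err != nil {
		fmt.Fprintln(os.Stderr, "cannot start:", err)
		os.Exit(1)
	}
	fmt.Println("lock acquired by pid", os.Getpid())
}
```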
So we use K8S PVCs, persistent volumes which are GCE persistent SSDs, for our Prometheus instances. This means that even if we need to redeploy a Prometheus instance (e.g. to change pod affinity rules or process requirements), we do not lose the data. As such, it is completely normal for a different pod (and hence a different PID) to re-attach to the previous instance's data. Can we make the check:
That's exactly what the library in use does. Hence my confusion.
This has happened 3 times already. And the PIDs are not the same since we're using a complicated
Okay, so I think the PID lock package does what we want overall: https://github.com/nightlyone/lockfile/blob/master/lockfile.go#L141-L146 The only case I can think of is a live process in your PID namespace holding the same PID that Prometheus had before. This is a lot more likely to happen in containers, I think – on the other hand, it typically isn't a problem there because the PIDs tend to be the same. FWIW, of all the file locking approaches, this seemed to be the sanest choice, and I don't see an immediate way to work around this one other than having an init script or something delete the lock file. Or just not locking at all anymore from the Prometheus side... I'm a bit reluctant about that one.
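For context, a minimal usage sketch of that library might look like the following; the data-directory path is an assumption, and the error handling is simplified:

```go
// Minimal sketch of using github.com/nightlyone/lockfile, the library
// discussed above. The data-directory path is made up for the example.
package main

import (
	"log"
	"path/filepath"

	"github.com/nightlyone/lockfile"
)

func main() {
	// The library requires an absolute path to the lock file.
	path, err := filepath.Abs(filepath.Join("data", "lock"))
	if err != nil {
		log.Fatal(err)
	}
	lock, err := lockfile.New(path)
	if err != nil {
		log.Fatal(err)
	}
	// TryLock succeeds if the lock file is absent, stale (dead owner),
	// or already owned by this process; it fails if another live process
	// holds it -- the case discussed in this thread.
	if err := lock.TryLock(); err != nil {
		log.Fatalf("data directory is locked: %v", err)
	}
	defer lock.Unlock()

	// ... run the rest of the process while holding the lock ...
}
```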
@fabxc can we have a flag that disables it, even if it is just a temporary workaround? Also, you're right that it is highly likely that another process will take the previous PID. That will happen a lot if you have sidecars.
@fabxc this fixes it for us, from our side it's no longer an issue. Cheers!
mwitkow closed this May 10, 2017
leoromanovsky commented Dec 18, 2017
I just stumbled across this issue without a resolution, having seen the same thing in our Mesos/Marathon deployment. I am starting Prometheus with
travisn referenced this issue Jan 3, 2018: Prometheus pod failed to start with db "Locked by other process" #854 (closed)
tim-seoss commented Feb 15, 2018
I just hit this issue on a container after a reboot. There is a relevant open bug against the locking library, where I've suggested a more robust fix:
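For readers who can't follow the truncated link: one technique that avoids stale PID files altogether is flock(2), which the kernel releases automatically when the owning process exits. The sketch below only illustrates that general approach and is not necessarily the fix proposed in the linked bug:

```go
// Sketch of flock(2)-based locking on Linux (illustrative; not necessarily
// the fix proposed in the linked lockfile bug). The kernel releases the lock
// automatically when the process exits, even on an unclean shutdown.
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	// The lock-file path is made up for the example.
	f, err := os.OpenFile("/tmp/prom-example.flock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	// LOCK_NB makes this fail immediately instead of blocking if another
	// process already holds the exclusive lock.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		log.Fatalf("data directory is locked by another process: %v", err)
	}
	defer f.Close() // closing the descriptor releases the lock

	// ... run while holding the lock ...
}
```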
@fabxc I think we should reopen this. I can reproduce it every time on Linux – NOT in a container. I hit it by accident; at this point the only way I found is to stop Prometheus, and I don't have an idea how we can tackle this.
omerfarukz commented Mar 12, 2018
Try just deleting the lock file in the data directory.
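To expand on that workaround: the tsdb lock file sits directly in the data directory (a file named lock), so an init script or init container can remove it before Prometheus starts, and newer Prometheus 2.x releases also offer a --storage.tsdb.no-lockfile flag to skip the lock file entirely (check your version's --help). A pre-start cleanup might look like this sketch; the data-directory path is an assumption:

```go
// Sketch of a pre-start cleanup that removes a leftover tsdb lock file.
// Assumes the layout where the lock file is <data-dir>/lock; adjust the
// path for your deployment. Only do this when you are sure no other
// Prometheus process is using the data directory.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/prometheus/data" // assumption: your --storage.tsdb.path
	lockPath := filepath.Join(dataDir, "lock")
	if err := os.Remove(lockPath); err != nil && !os.IsNotExist(err) {
		log.Fatalf("could not remove stale lock %s: %v", lockPath, err)
	}
	log.Printf("removed stale lock (if any) at %s", lockPath)
	// exec or start Prometheus afterwards
}
```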
ntindall referenced this issue Jul 1, 2018: [Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0 #4292 (closed)
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
mwitkow commented May 8, 2017 (edited)
What did you do?
Ran Prom2 in a Kubernetes StatefulSet pod with a PVC. Re-created the stateful pod, pointing it at the same PVC. This means the pod got rescheduled (and hence the process restarted with a new PID), but it re-attached to the same data.
What did you expect to see?
Upon restarts of the pod, I expected Prom2 to come back up using the same data.
What did you see instead? Under which circumstances?
Manually built from commit 8c483e2
@fabxc