
Prometheus 2: tsdb doesn't clean up locks on unclean shutdown #2689

Closed
mwitkow opened this issue May 8, 2017 · 12 comments

@mwitkow

mwitkow commented May 8, 2017

What did you do?
Ran Prom2 in a Kubernetes StatefulSet pod with a PVC. Re-created the stateful pod pointing it at the same PVC. This means that the pod got rescheduled and the Prometheus process was restarted against the same data volume.

What did you expect to see?
Upon restarts of the pod, I expected Prom2 to come back up using the same data.

What did you see instead? Under which circumstances?

2017-05-08T15:11:26.499973000Z time="2017-05-08T15:11:26Z" level=error msg="Opening storage failed: open DB in /prometheus-data: Locked by other process" source="main.go:83" 
2017-05-08T15:11:26.501451000Z 2017/05/08 15:11:26 dinit: pid 35 finished: [/prometheus/prometheus -config.file=/prometheus.yml -log.level=info -query.staleness-delta=60s -query.max-concurrency=100 -query.timeout=2m -storage.local.path=/prometheus-data -storage.tsdb.appendable-blocks=2 -storage.tsdb.max-block-duration=36h -storage.tsdb.min-block-duration=2h -storage.tsdb.retention=360h] with error: exit status 1
  • Prometheus version:

Manually built from commit:
8c483e2

@fabxc

@fabxc
Contributor

fabxc commented May 8, 2017

Interesting. So I tried out several file locking options. Ultimately, this post http://0pointer.de/blog/projects/locking.html sums it up quite well, and I went back to a simple PID file lock.

We are using github.com/nightlyone/lockfile, which reads the PID from a left-behind lock file. If it is equal to the current process's PID, it continues as usual. If the PID belongs to another process that is dead, it deletes the lock and proceeds. If it belongs to another live process, it errors.
Especially in a container, shouldn't the first case always apply?
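
For illustration, here is a minimal sketch in Go of the check described above. It is not the nightlyone/lockfile implementation – checkLock, the signal-0 probe, and the lock path (reusing /prometheus-data from the logs above plus an assumed file name "lock") are assumptions for this example:

    // Sketch of a PID-file lock check: same PID -> continue, dead PID -> remove
    // the stale lock, other live PID -> refuse to start. Not the
    // nightlyone/lockfile code, just the idea described above.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
        "syscall"
    )

    func checkLock(path string) error {
        b, err := os.ReadFile(path)
        if os.IsNotExist(err) {
            return nil // no lock file, nothing to do
        }
        if err != nil {
            return err
        }
        pid, err := strconv.Atoi(strings.TrimSpace(string(b)))
        if err != nil {
            return fmt.Errorf("malformed lock file %s: %v", path, err)
        }
        switch {
        case pid == os.Getpid():
            return nil // first case: our own PID, continue as usual
        case syscall.Kill(pid, 0) != nil:
            // second case: no such process, treat the lock as stale
            return os.Remove(path)
        default:
            // third case: another live process holds the lock
            return fmt.Errorf("locked by other process (pid %d)", pid)
        }
    }

    func main() {
        if err := checkLock("/prometheus-data/lock"); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }

Note that syscall.Kill(pid, 0) also fails with EPERM for a live process owned by another user, so a more careful check would distinguish that from ESRCH (process does not exist).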

@mwitkow
Author

mwitkow commented May 8, 2017

So we use K8s PVCs (persistent volumes, which are GCE persistent SSDs) for our Prometheus instances. This means that even if we need to redeploy the Prometheus instance (e.g. to change pod affinity rules or process requirements), we do not lose the data. As such it is completely normal for a different pod run (and hence a different PID) to re-attach to the previous instance's data.

Can we make the check:

  • check if lock file exists
  • if it exists, check for the PID in the local process namespace; if the PID doesn't exist, ignore the lock file?

@fabxc
Contributor

fabxc commented May 8, 2017

That's exactly what the used library does. Hence my confusion.
In general, with pods I think it will always have the same PID – but that case should also be handled properly.

@mwitkow
Author

mwitkow commented May 9, 2017

This has happened 3 times already. And the PIDs are not the same since we're using a complicated dinit (docker init system) inside the container (for legacy reasons):

2017-05-09T10:09:35.159223000Z time="2017-05-09T10:09:35Z" level=error msg="Opening storage failed: open DB in /prometheus-data: Locked by other process" source="main.go:83" 
2017-05-09T10:09:35.159543000Z 2017/05/09 10:09:35 dinit: pid 47 finished: [/prometheus/prometheus -config.file=/prometheus.yml -log.level=info -query.staleness-delta=60s -query.max-concurrency=100 -query.timeout=45s -storage.local.path=/prometheus-data -storage.tsdb.appendable-blocks=2 -storage.tsdb.max-block-duration=2h -storage.tsdb.min-block-duration=20m -storage.tsdb.retention=48h/] with error: exit status 1

@fabxc
Contributor

fabxc commented May 9, 2017

Okay, so I think the pid lock package does what we want overall: https://github.com/nightlyone/lockfile/blob/master/lockfile.go#L141-L146

The only failing case I can think of is a live process in your PID namespace that has the same PID Prometheus had before. That is a lot more likely to happen in containers, I think – OTOH it typically doesn't, because the restarted Prometheus tends to get the same PID as before.

FWIW, of all the file locking approaches this seemed to be the sanest choice, and I don't see an immediate way to work around this other than having an init script or something delete the lock file. Or just not locking at all from the Prometheus side... I'm a bit reluctant about that one.
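
A minimal sketch of such an init wrapper, written as a Go entrypoint: the file name prestart.go is hypothetical, the lock file name "lock" is an assumption, and the binary path and flags are taken from the logs above. It unconditionally removes a leftover lock file and then execs Prometheus:

    // prestart.go – hypothetical init wrapper: delete a leftover lock file
    // from an unclean shutdown, then replace this process with Prometheus.
    package main

    import (
        "log"
        "os"
        "syscall"
    )

    func main() {
        if err := os.Remove("/prometheus-data/lock"); err != nil && !os.IsNotExist(err) {
            log.Fatal(err)
        }
        argv := []string{
            "/prometheus/prometheus",
            "-config.file=/prometheus.yml",
            "-storage.local.path=/prometheus-data",
        }
        // Exec replaces the wrapper so Prometheus becomes the container's process.
        if err := syscall.Exec(argv[0], argv, os.Environ()); err != nil {
            log.Fatal(err)
        }
    }

This is only safe if at most one Prometheus instance can ever be attached to the volume at a time, which is exactly what the lock is meant to guarantee.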

@mwitkow
Author

mwitkow commented May 9, 2017

@fabxc can we have a flag that disables it, even if it is just a temporary workaround?

Also, you're right that it is highly likely that another process will take the previous PID. That will happen a lot if you have sidecars.

@mwitkow
Author

mwitkow commented May 10, 2017

@fabxc this fixes it for us, from our side it's no longer an issue. Cheers!

@mwitkow mwitkow closed this as completed May 10, 2017
@leoromanovsky

I just stumbled across this issue without a resolution, having seen the same thing in our Mesos/Marathon deployment. I am starting Prometheus with --storage.tsdb.no-lockfile in an attempt to avoid this problem; otherwise I will pursue deleting the PID file on task startup.
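
For reference, an invocation along those lines might look like the following (the config and data paths are illustrative; --storage.tsdb.no-lockfile is the flag mentioned above):

    prometheus --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/prometheus-data \
      --storage.tsdb.no-lockfile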

@tim-seoss

I just hit this issue on a container after reboot.

There is a relevant open bug against the locking library, where I've suggested a more robust fix:

nightlyone/lockfile#25

@krasi-georgiev
Contributor

@fabxc I think we should reopen this. I can reproduce it every time on Linux – NOT in a container:
go run main.go uname_default.go fdlimits_default.go --config.file=../../documentation/examples/prometheus.yml

I accidentally hit Ctrl+Z.

At this point the only way I found to stop Prometheus is:
killall -s 9 main

Then:
go run main.go uname_default.go fdlimits_default.go --config.file=../../documentation/examples/prometheus.yml
level=error ts=2018-02-15T14:44:24.543184691Z caller=main.go:583 err="Opening storage failed open DB in data/: Locked by other process"

I don't have an idea how we can tackle this.

@omerfarukz

Try just deleting the lock file in the data directory.
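
For example, assuming the data directory from the earlier logs and that the lock file inside it is simply named lock:

    rm /prometheus-data/lock

Only do this while no Prometheus process is actually using that directory, since it defeats the purpose of the lock.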

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019