
Prometheus 2: tsdb doesn't clean up locks on unclean shutdown #2689

Closed
mwitkow opened this issue May 8, 2017 · 12 comments

@mwitkow

mwitkow commented May 8, 2017

What did you do?
Ran Prom2 in a Kubernetes StatefulSet pod with a PVC. Re-created the stateful pod pointing it at the same PVC. This means that the pod got rescheduled and the Prometheus process was restarted against the same data volume.

What did you expect to see?
Upon restarts of the pod, I expected Prom2 to come back up using the same data.

What did you see instead? Under which circumstances?

2017-05-08T15:11:26.499973000Z time="2017-05-08T15:11:26Z" level=error msg="Opening storage failed: open DB in /prometheus-data: Locked by other process" source="main.go:83" 
2017-05-08T15:11:26.501451000Z 2017/05/08 15:11:26 dinit: pid 35 finished: [/prometheus/prometheus -config.file=/prometheus.yml -log.level=info -query.staleness-delta=60s -query.max-concurrency=100 -query.timeout=2m -storage.local.path=/prometheus-data -storage.tsdb.appendable-blocks=2 -storage.tsdb.max-block-duration=36h -storage.tsdb.min-block-duration=2h -storage.tsdb.retention=360h] with error: exit status 1
  • Prometheus version:

Manually built from commit:
8c483e2

@fabxc

@fabxc
Contributor

fabxc commented May 8, 2017

Interesting. So I tried out several file locking options. Ultimately, this post http://0pointer.de/blog/projects/locking.html sums it up quite well, and I went back to a simple PID file lock.

We are using github.com/nightlyone/lockfile, which reads the PID from a left-behind lock file. If it is equal to the current process's PID, it continues as usual. If the PID belongs to another process that is dead, it deletes the lock and proceeds. If it belongs to another live process, it errors.
Especially in a container, shouldn't the first case always apply?
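
For illustration, here is a minimal sketch in Go of the check described above. It is not the nightlyone/lockfile implementation – checkLock, the signal-0 probe, and the lock path (reusing /prometheus-data from the logs above plus an assumed file name "lock") are assumptions for this example:

    // Sketch of a PID-file lock check: same PID -> continue, dead PID -> remove
    // the stale lock, other live PID -> refuse to start. Not the
    // nightlyone/lockfile code, just the idea described above.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
        "syscall"
    )

    func checkLock(path string) error {
        b, err := os.ReadFile(path)
        if os.IsNotExist(err) {
            return nil // no lock file, nothing to do
        }
        if err != nil {
            return err
        }
        pid, err := strconv.Atoi(strings.TrimSpace(string(b)))
        if err != nil {
            return fmt.Errorf("malformed lock file %s: %v", path, err)
        }
        switch {
        case pid == os.Getpid():
            return nil // first case: our own PID, continue as usual
        case syscall.Kill(pid, 0) != nil:
            // second case: no such process, treat the lock as stale
            return os.Remove(path)
        default:
            // third case: another live process holds the lock
            return fmt.Errorf("locked by other process (pid %d)", pid)
        }
    }

    func main() {
        if err := checkLock("/prometheus-data/lock"); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }

Note that syscall.Kill(pid, 0) also fails with EPERM for a live process owned by another user, so a more careful check would distinguish that from ESRCH (process does not exist).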

@mwitkow
Author

mwitkow commented May 8, 2017

So we use K8s PVCs (persistent volumes, which are GCE persistent SSDs) for our Prometheus instances. This means that even if we need to redeploy the Prometheus instance (e.g. to change pod affinity rules or process requirements), we do not lose the data. As such it is completely normal for a different pod run (and hence a different PID) to re-attach to the previous instance's data.

Can we make the check:

  • check if lock file exists
  • if it exists, check for the PID in the local process namespace; if the PID doesn't exist, ignore the lock file?

@fabxc
Contributor

fabxc commented May 8, 2017

That's exactly what the used library does. Hence my confusion.
In general, with pods I think it will always have the same PID – but that case should also be handled properly.

@mwitkow
Author

mwitkow commented May 9, 2017

This has happened 3 times already. And the PIDs are not the same since we're using a complicated dinit (docker init system) inside the container (for legacy reasons):

2017-05-09T10:09:35.159223000Z time="2017-05-09T10:09:35Z" level=error msg="Opening storage failed: open DB in /prometheus-data: Locked by other process" source="main.go:83" 
2017-05-09T10:09:35.159543000Z 2017/05/09 10:09:35 dinit: pid 47 finished: [/prometheus/prometheus -config.file=/prometheus.yml -log.level=info -query.staleness-delta=60s -query.max-concurrency=100 -query.timeout=45s -storage.local.path=/prometheus-data -storage.tsdb.appendable-blocks=2 -storage.tsdb.max-block-duration=2h -storage.tsdb.min-block-duration=20m -storage.tsdb.retention=48h/] with error: exit status 1

@fabxc
Contributor

fabxc commented May 9, 2017

Okay, so I think the pid lock package does what we want overall: https://github.com/nightlyone/lockfile/blob/master/lockfile.go#L141-L146

The only failing case I can think of is a live process in your PID namespace that has the same PID Prometheus had before. That is a lot more likely to happen in containers, I think – OTOH it typically doesn't, because the restarted Prometheus tends to get the same PID as before.

FWIW, of all the file locking approaches this seemed to be the sanest choice, and I don't see an immediate way to work around this other than having an init script or something delete the lock file. Or just not locking at all from the Prometheus side... I'm a bit reluctant about that one.
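
A minimal sketch of such an init wrapper, written as a Go entrypoint: the file name prestart.go is hypothetical, the lock file name "lock" is an assumption, and the binary path and flags are taken from the logs above. It unconditionally removes a leftover lock file and then execs Prometheus:

    // prestart.go – hypothetical init wrapper: delete a leftover lock file
    // from an unclean shutdown, then replace this process with Prometheus.
    package main

    import (
        "log"
        "os"
        "syscall"
    )

    func main() {
        if err := os.Remove("/prometheus-data/lock"); err != nil && !os.IsNotExist(err) {
            log.Fatal(err)
        }
        argv := []string{
            "/prometheus/prometheus",
            "-config.file=/prometheus.yml",
            "-storage.local.path=/prometheus-data",
        }
        // Exec replaces the wrapper so Prometheus becomes the container's process.
        if err := syscall.Exec(argv[0], argv, os.Environ()); err != nil {
            log.Fatal(err)
        }
    }

This is only safe if at most one Prometheus instance can ever be attached to the volume at a time, which is exactly what the lock is meant to guarantee.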

@mwitkow
Author

mwitkow commented May 9, 2017

@fabxc can we have a flag that disables it, even if it is just a temporary workaround?

Also, you're right that it is highly likely that another process will take the previous PID. That will happen a lot if you have sidecars.

@mwitkow
Author

mwitkow commented May 10, 2017

@fabxc this fixes it for us, from our side it's no longer an issue. Cheers!

@mwitkow mwitkow closed this as completed May 10, 2017
@leoromanovsky

I just stumbled across this issue without a resolution, having seen the same thing in our Mesos/Marathon deployment. I am starting Prometheus with --storage.tsdb.no-lockfile in an attempt to avoid this problem; otherwise I will pursue deleting the PID file on task startup.
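
For reference, an invocation along those lines might look like the following (the config and data paths are illustrative; --storage.tsdb.no-lockfile is the flag mentioned above):

    prometheus --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/prometheus-data \
      --storage.tsdb.no-lockfile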

@tim-seoss

I just hit this issue on a container after reboot.

There is a relevant open bug against the locking library, where I've suggested a more robust fix:

nightlyone/lockfile#25

@krasi-georgiev
Contributor

@fabxc I think we should reopen this. I can reproduce it every time on Linux – NOT in a container:
go run main.go uname_default.go fdlimits_default.go --config.file=../../documentation/examples/prometheus.yml

I accidentally hit Ctrl+Z.

At this point the only way I found to stop Prometheus is:
killall -s 9 main

Then:
go run main.go uname_default.go fdlimits_default.go --config.file=../../documentation/examples/prometheus.yml
level=error ts=2018-02-15T14:44:24.543184691Z caller=main.go:583 err="Opening storage failed open DB in data/: Locked by other process"

I don't have an idea how we can tackle this.

@omerfarukz

Try just deleting the lock file in the data directory.
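
For example, assuming the data directory from the earlier logs and that the lock file inside it is simply named lock:

    rm /prometheus-data/lock

Only do this while no Prometheus process is actually using that directory, since it defeats the purpose of the lock.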

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019