Description
At SAP we're using Prometheus to monitor our 13+ Kubernetes clusters. The upgrade to Prometheus v2.0.0 was initially very smooth, but has since become somewhat painful, since we're seeing the following error on a daily basis. At first Prometheus returns inconsistent metric values, which affects alerting, and eventually it crashes with:
```
level=error ts=2017-12-01T19:22:24.923269594Z caller=db.go:255 component=tsdb msg="retention cutoff failed" err="read block meta /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"
level=info ts=2017-12-01T19:22:24.923332144Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1512129600000 maxt=1512136800000
level=error ts=2017-12-01T19:22:28.906791057Z caller=db.go:260 component=tsdb msg="compaction failed" err="reload blocks: read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"
```
On restart, it fails with:
```
level=info ts=2017-11-30T13:53:15.774449669Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2017-11-30T13:53:15.774567774Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2017-11-30T13:53:15.774584415Z caller=main.go:217 host_details="(Linux 4.13.9-coreos #1 SMP Thu Oct 26 03:21:00 UTC 2017 x86_64 prometheus-frontend-4217608546-6mkiw (none))"
level=info ts=2017-11-30T13:53:15.77544454Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-11-30T13:53:15.776060323Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-11-30T13:53:15.776080166Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2017-11-30T13:53:16.931485157Z caller=main.go:323 msg="Opening storage failed" err="read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"
```
The only way we have found to fix this is to manually delete at least the affected block directory.
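For reference, this is roughly how we identify the broken blocks before deleting them. A minimal sketch, assuming the data directory from the logs above (`/prometheus`) and that a broken block is a directory with no `meta.json`; `find_broken_blocks` is a name made up for this example:

```shell
#!/bin/sh
# List TSDB block directories that are missing their meta.json,
# which is the condition that makes Prometheus refuse to start above.
find_broken_blocks() {
  data_dir="$1"
  for block in "$data_dir"/*; do
    [ -d "$block" ] || continue
    [ -f "$block/meta.json" ] || echo "$block"
  done
}

# Usage (stop Prometheus first, and verify the list before removing anything):
#   find_broken_blocks /prometheus
#   find_broken_blocks /prometheus | xargs rm -rf
```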
Memory usage is consistent. Nothing obvious here.
Prometheus stores the data on an NFS mount, which worked perfectly with previous versions.
Since this makes our monitoring setup quite unreliable, I'm thinking about downgrading to Prometheus v1.8.2, which did a fantastic job in the past.
I cannot see where Prometheus fails to write the meta.json. Hopefully you know more, @fabxc?
Similar to #2805.
We also observed Prometheus v2.0.0 filling up the 300 GiB volume with data. This resulted in no space left on the device, followed by the above error. Best guess: retention was not kicking in.
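To check the retention guess, we flag blocks whose newest sample is older than the retention window, i.e. data that retention should already have deleted. A sketch under the assumption that each block's `meta.json` carries a `maxTime` field in milliseconds; `blocks_past_retention` is a name invented for this example:

```shell
#!/bin/sh
# Print block directories whose maxTime is older than (now - retention).
# Assumes meta.json contains a top-level "maxTime" field in ms since epoch.
blocks_past_retention() {
  data_dir="$1"; retention_ms="$2"
  cutoff=$(( $(date +%s) * 1000 - retention_ms ))
  for meta in "$data_dir"/*/meta.json; do
    [ -f "$meta" ] || continue
    # Crude JSON extraction; good enough for a quick check without jq.
    maxt=$(sed -n 's/.*"maxTime":[[:space:]]*\([0-9]*\).*/\1/p' "$meta" | head -n1)
    if [ -n "$maxt" ] && [ "$maxt" -lt "$cutoff" ]; then
      echo "$(dirname "$meta") maxTime=$maxt"
    fi
  done
}

# Usage, with a 15-day window:
#   blocks_past_retention /prometheus $((15 * 24 * 3600 * 1000))
```

If this prints anything, retention is not deleting old blocks, consistent with the volume filling up.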
Environment

* System information:

```
Linux 4.13.16-coreos-r1 x86_64
```

* Prometheus version:

```
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
  build user: root@615b82cb36b6
  build date: 20171108-07:11:59
  go version: go1.9.2
```

* Prometheus configuration file:

Configuration can be found here.