
Prometheus v2.0.0 data corruption #3534

@auhlig

Description

At SAP we're using Prometheus to monitor our 13+ Kubernetes clusters. The recent upgrade to Prometheus v2.0.0 was initially very smooth, but has since become somewhat painful, as we're seeing the following error on a daily basis. First Prometheus returns inconsistent metric values, which affects alerting, and eventually it crashes with:

level=error ts=2017-12-01T19:22:24.923269594Z caller=db.go:255 component=tsdb msg="retention cutoff failed" err="read block meta /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"
level=info ts=2017-12-01T19:22:24.923332144Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1512129600000 maxt=1512136800000
level=error ts=2017-12-01T19:22:28.906791057Z caller=db.go:260 component=tsdb msg="compaction failed" err="reload blocks: read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"

On restart, it fails with:

level=info ts=2017-11-30T13:53:15.774449669Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2017-11-30T13:53:15.774567774Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2017-11-30T13:53:15.774584415Z caller=main.go:217 host_details="(Linux 4.13.9-coreos #1 SMP Thu Oct 26 03:21:00 UTC 2017 x86_64 prometheus-frontend-4217608546-6mkiw (none))"
level=info ts=2017-11-30T13:53:15.77544454Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-11-30T13:53:15.776060323Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-11-30T13:53:15.776080166Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2017-11-30T13:53:16.931485157Z caller=main.go:323 msg="Opening storage failed" err="read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"

This can only be fixed manually by deleting at least the affected directory.
Memory usage is consistent. Nothing obvious here.
Prometheus stores the data on an NFS mount, which worked perfectly with previous versions.
Since this makes our monitoring setup quite unreliable, I'm thinking about downgrading to Prometheus v1.8.2, which did a fantastic job in the past.
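For context, the manual fix above means scanning the data directory for block directories that lack a meta.json and deleting them before restarting. A minimal sketch of that check (the function name and the data-directory argument are illustrative; the logs above use /prometheus):

```shell
# Hypothetical helper: list TSDB block directories that are missing meta.json,
# so the affected blocks can be inspected (or removed) before restarting.
find_broken_blocks() {
  local data_dir="$1"
  for dir in "$data_dir"/*/; do
    [ -d "$dir" ] || continue
    # A healthy block directory always contains a meta.json file.
    [ -f "${dir}meta.json" ] || echo "block without meta.json: ${dir%/}"
  done
}
```

Run as e.g. `find_broken_blocks /prometheus` and review the output before deleting anything.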

I cannot see where Prometheus fails to write the meta.json. Hopefully you know more, @fabxc?
Similar to #2805.

We also observed Prometheus v2.0.0 filling up the 300GiB volume with data. This resulted in "no space left on device" errors followed by the error above. Best guess: retention was not kicking in.
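For reference, the retention window that compaction should be enforcing is set via the TSDB retention flag. A sketch of the invocation (the 15d value and paths are assumptions, not our actual deployment):

```shell
# Illustrative invocation only; duration and paths are assumed values.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention=15d
```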

Environment

  • System information:

    Linux 4.13.16-coreos-r1 x86_64

  • Prometheus version:

    prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
    build user: root@615b82cb36b6
    build date: 20171108-07:11:59
    go version: go1.9.2

  • Prometheus configuration file:

Configuration can be found here.
