
Prometheus v2.0.0 data corruption #3534

Closed
auhlig opened this issue Dec 1, 2017 · 12 comments

@auhlig

commented Dec 1, 2017

At SAP we're using Prometheus to monitor our 13+ Kubernetes clusters. The recent upgrade to Prometheus v2.0.0 was initially very smooth, but it has since become somewhat painful, as we're seeing the following error on a daily basis. First Prometheus returns inconsistent metric values, which affects alerting, and eventually it crashes with:

level=error ts=2017-12-01T19:22:24.923269594Z caller=db.go:255 component=tsdb msg="retention cutoff failed" err="read block meta /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"
level=info ts=2017-12-01T19:22:24.923332144Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1512129600000 maxt=1512136800000
level=error ts=2017-12-01T19:22:28.906791057Z caller=db.go:260 component=tsdb msg="compaction failed" err="reload blocks: read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"

On restart it fails with

level=info ts=2017-11-30T13:53:15.774449669Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2017-11-30T13:53:15.774567774Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2017-11-30T13:53:15.774584415Z caller=main.go:217 host_details="(Linux 4.13.9-coreos #1 SMP Thu Oct 26 03:21:00 UTC 2017 x86_64 prometheus-frontend-4217608546-6mkiw (none))"
level=info ts=2017-11-30T13:53:15.77544454Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-11-30T13:53:15.776060323Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-11-30T13:53:15.776080166Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2017-11-30T13:53:16.931485157Z caller=main.go:323 msg="Opening storage failed" err="read meta information /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H: open /prometheus/01BZFAM16QFQ3ECY7E09DH7X7H/meta.json: no such file or directory"

So far this can only be fixed manually, by deleting at least the affected block directory.
Memory usage is consistent; nothing obvious there.
Prometheus stores the data on an NFS mount, which worked perfectly with previous versions.
Since this makes our monitoring setup quite unreliable, I'm thinking about downgrading to Prometheus v1.8.2, which did a fantastic job in the past.
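For anyone needing to do the same cleanup: the blocks are the ULID-named directories under the data path, so the affected ones can be found by looking for block directories without a meta.json. Below is a minimal, hypothetical sketch (not Prometheus code); the /prometheus path is taken from the logs above and may differ in your setup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/prometheus" // assumption: data path as seen in the logs above

	entries, err := os.ReadDir(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read data dir:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		// Blocks are ULID-named directories; skip the WAL and any plain files.
		if !e.IsDir() || e.Name() == "wal" {
			continue
		}
		meta := filepath.Join(dataDir, e.Name(), "meta.json")
		if _, err := os.Stat(meta); os.IsNotExist(err) {
			// Candidate for the manual cleanup described above.
			fmt.Println("block without meta.json:", filepath.Join(dataDir, e.Name()))
		}
	}
}
```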

I cannot see where Prometheus fails to write the meta.json. Hopefully you know more, @fabxc?
This looks similar to #2805.

We also observed Prometheus v2.0.0 filling up the 300GiB volume with data, which resulted in no space left on the disk, followed by the above error. Best guess: retention was not kicking in.
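For context, the retention check comes down to simple timestamp arithmetic: a block becomes eligible for deletion once its maxt falls before "now minus the retention window". A rough sketch of that arithmetic follows; it is not the tsdb implementation, and the 15d value is an assumption (the default retention).

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	retention := 15 * 24 * time.Hour // assumption: the default 15d retention window
	cutoffMs := time.Now().UnixMilli() - retention.Milliseconds()

	// Block bound in milliseconds, same format as mint/maxt in the compaction log above.
	blockMaxt := int64(1512136800000)

	if blockMaxt < cutoffMs {
		fmt.Println("block is entirely outside the retention window and should be removed")
	} else {
		fmt.Println("block is still (at least partly) inside the retention window")
	}
}
```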

Environment

  • System information:

    Linux 4.13.16-coreos-r1 x86_64

  • Prometheus version:

    prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
    build user: root@615b82cb36b6
    build date: 20171108-07:11:59
    go version: go1.9.2

  • Prometheus configuration file:

Configuration can be found here.

@brian-brazil

Member

commented Dec 1, 2017

Prometheus stores the data on an NFS mount, which worked perfectly with previous versions.

NFS is not supported by any version of Prometheus; it requires a POSIX filesystem.

@auhlig

Author

commented Dec 1, 2017

Thanks for the quick reply.
Please also consider the second part of the issue: in the same setup, using a Kubernetes PVC, the retention setting is not honored, so the volume fills up, eventually leading to the error described above. I saw a couple of potentially related commits in prometheus/tsdb. Is this issue known?

@gouthamve

Member

commented Dec 1, 2017

@auhlig

Author

commented Dec 2, 2017

Thanks for the answer @gouthamve.
Our Prometheis v2.0.0 instances seem to work fine after manually deleting the data outside of the retention window.
Do you already have a timeline for the next release?
Is the fix you mentioned already in the master branch, so I could build and test it?

@alexandrul


commented Dec 5, 2017

I have encountered the same error messages on a Windows server with local storage (the disk appears as HP LOGICAL VOLUME SCSI Disk Device).

In my case I was unable to start Prometheus after a config change until I deleted the affected folder.

Windows Server 2012 R2
Prometheus: version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0, go=go1.9.2

@BugRoger


commented Dec 8, 2017

For what it's worth, we built Prometheus against the latest prometheus/tsdb and that solved this particular issue with NFS.

@wkruse


commented Jan 16, 2018

Related to #3506 and should be fixed by prometheus/tsdb#213 and #3508.

@anguslees


commented Feb 14, 2018

Related to #3506 and should be fixed by prometheus/tsdb#213 and #3508.

I still see the above issue with Prometheus v2.1.0, which afaict includes #3508. I believe this indicates that the tsdb change was not sufficient.

Edit: to clarify, my .nfs* file was created with Prometheus v2.0.0. So it's possible v2.1.0 has removed the code that deleted files while they were still open (I need to run long enough to have a Prometheus node go offline before I can be sure). I was expecting/hoping that the fix would also involve correctly ignoring these files if present, but that part of the issue has not changed with v2.1.0.
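For illustration, "correctly ignoring these files" could be as simple as only treating ULID-named directories as blocks when scanning the data directory, so stray .nfs* silly-rename files are skipped. This is only a hypothetical sketch, not the actual tsdb change, and the /prometheus path is an assumption carried over from the earlier logs.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// ULIDs use 26 characters from Crockford's base32 alphabet (no I, L, O, U).
var ulidRe = regexp.MustCompile(`^[0-9A-HJKMNP-TV-Z]{26}$`)

// listBlockDirs returns only entries that look like TSDB blocks, silently
// skipping regular files (e.g. NFS ".nfs*" leftovers) and directories such
// as "wal" that are not blocks.
func listBlockDirs(dataDir string) ([]string, error) {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return nil, err
	}
	var dirs []string
	for _, e := range entries {
		if !e.IsDir() || !ulidRe.MatchString(e.Name()) {
			continue
		}
		dirs = append(dirs, e.Name())
	}
	return dirs, nil
}

func main() {
	dirs, err := listBlockDirs("/prometheus")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, d := range dirs {
		fmt.Println(d)
	}
}
```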

@AnilNeeluru


commented Feb 27, 2018

I am seeing a similar data corruption issue with Prometheus v2.0.0 when I restart Prometheus.

[screenshot: prom]

To clarify, I am not using NFS; it is just an ext4 filesystem. Below is the mount location where the empty meta.json error occurred.
/dev/vda3 on /data-2 type ext4 (rw,relatime,errors=remount-ro,data=ordered)

This can only be fixed manually by deleting at least the affected directory from the mount location. Please let me know whether this issue can be fixed by upgrading Prometheus to v2.1.0 or a later version.
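Since the symptom in this case is an empty meta.json rather than a missing one, a quick way to narrow down which blocks need the manual cleanup is to look for meta.json files that are empty or fail to parse. A minimal sketch follows; it is not Prometheus code, the /data-2 path is taken from the mount above, and the JSON field names are assumed from prometheus/tsdb's meta.json format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// blockMeta covers only the fields needed for this check; the real meta.json
// contains more (stats, compaction info, version), and names may vary by version.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"`
	MaxTime int64  `json:"maxTime"`
}

func main() {
	dataDir := "/data-2" // assumption: the mount shown above

	entries, err := os.ReadDir(dataDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if !e.IsDir() || e.Name() == "wal" {
			continue
		}
		path := filepath.Join(dataDir, e.Name(), "meta.json")
		b, err := os.ReadFile(path)
		if err != nil {
			fmt.Println("missing or unreadable:", path)
			continue
		}
		var m blockMeta
		if len(b) == 0 || json.Unmarshal(b, &m) != nil {
			fmt.Println("empty or corrupt:", path)
		}
	}
}
```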

@fuxes


commented Jul 31, 2018

Has this bug been fixed?

@simonpasquier

Member

commented Jul 31, 2018

@fuxes if you're asking about running Prometheus on NFS, what Brian answered hasn't changed.

If you have additional questions, please ask on the prometheus users mailing list or IRC.

@lock


commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019
