
Missing data / Prometheus data store corruption? #3233

Closed
cubranic opened this Issue Oct 2, 2017 · 5 comments


cubranic commented Oct 2, 2017

We noticed that our Prometheus database seems to be missing data going back more than a few days. This coincides with the time we had to restart the Prometheus systemd service, which had gone unresponsive.

I'm not familiar with the structure of /var/lib/prometheus/data, but I noticed that there is 2.2 GB of data in /var/lib/prometheus/data/orphans, all dated from shortly before the restart, when the server had gone unresponsive.
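For anyone wanting to check the same thing on their own server, a rough sketch (paths assume the default data directory shown here; adjust to your setup):

$ sudo du -sh /var/lib/prometheus/data/orphans
$ sudo ls -lt /var/lib/prometheus/data/orphans | head    # newest entries first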

Environment

  • System information:
    Linux 4.4.0-92-generic x86_64
  • Prometheus version:
    prometheus, version 1.6.3 (branch: master, revision: c580b60c67f2c5f6b638c3322161bcdf6d68d7fc)
      build user:       root@a6410e65f5c7
      build date:       20170522-09:15:06
      go version:       go1.8.1
  • Logs:

The journal only shows entries since the service was last restarted, and there is nothing in it indicating data errors or corruption:

$ sudo journalctl -u prometheus
-- Logs begin at Thu 2017-09-28 14:59:01 PDT, end at Mon 2017-10-02 12:28:52 PDT. --
Sep 28 15:02:57 hk-east prometheus[1422]: time="2017-09-28T15:02:57-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:03:01 hk-east prometheus[1422]: time="2017-09-28T15:03:01-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 4.101461923s." source="persistence.go:665"
Sep 28 15:08:01 hk-east prometheus[1422]: time="2017-09-28T15:08:01-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:08:05 hk-east prometheus[1422]: time="2017-09-28T15:08:05-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 4.176247805s." source="persistence.go:665"
Sep 28 15:13:05 hk-east prometheus[1422]: time="2017-09-28T15:13:05-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:13:09 hk-east prometheus[1422]: time="2017-09-28T15:13:09-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 4.012909007s." source="persistence.go:665"
Sep 28 15:14:27 hk-east prometheus[1422]: time="2017-09-28T15:14:27-07:00" level=info msg="Remote storage resharding from 6 to 5 shards." source="queue_manager.go:351"
Sep 28 15:18:09 hk-east prometheus[1422]: time="2017-09-28T15:18:09-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:18:13 hk-east prometheus[1422]: time="2017-09-28T15:18:13-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 4.273546963s." source="persistence.go:665"
Sep 28 15:20:47 hk-east prometheus[1422]: time="2017-09-28T15:20:47-07:00" level=info msg="Remote storage resharding from 5 to 9 shards." source="queue_manager.go:351"
Sep 28 15:21:57 hk-east prometheus[1422]: time="2017-09-28T15:21:57-07:00" level=info msg="Remote storage resharding from 9 to 7 shards." source="queue_manager.go:351"
Sep 28 15:23:13 hk-east prometheus[1422]: time="2017-09-28T15:23:13-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:23:17 hk-east prometheus[1422]: time="2017-09-28T15:23:17-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 4.236170522s." source="persistence.go:665"
Sep 28 15:28:17 hk-east prometheus[1422]: time="2017-09-28T15:28:17-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:28:21 hk-east prometheus[1422]: time="2017-09-28T15:28:21-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 3.990110605s." source="persistence.go:665"
Sep 28 15:33:21 hk-east prometheus[1422]: time="2017-09-28T15:33:21-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:33:25 hk-east prometheus[1422]: time="2017-09-28T15:33:25-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 3.771038866s." source="persistence.go:665"
Sep 28 15:38:25 hk-east prometheus[1422]: time="2017-09-28T15:38:25-07:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
Sep 28 15:38:29 hk-east prometheus[1422]: time="2017-09-28T15:38:29-07:00" level=info msg="Done checkpointing in-memory metrics and chunks in 3.988321103s." source="persistence.go:665"
....

rbobrovnikov commented Oct 10, 2017

I have the same behavior. Prometheus "forgets" previous data, and sometimes it gets stuck after a few days and becomes unreachable.


beorn7 commented Oct 10, 2017

The orphaned directory is filled with series data in two different ways:

  • The server had to “quarantine” a series because it encountered a data consistency issue. Those issues might be triggered by a bug in Prometheus or by actual physical data corruption of some kind. As of 1.8.0, there is no known bug that creates data corruption.
  • During crash recovery (after an unclean shutdown), data was found that could not be recovered into a consistent series.

There are *.hint files that tell you what happened, along with the orphaned series data, if any. At some point in the past, we planned to create forensic tools to do something with that data. As Prometheus 2.x has a completely new on-disk format, no such thing is planned anymore. You can of course analyze it manually at will.
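If you want to poke around yourself, a minimal sketch (using the data path from the report above; the exact layout may vary between 1.x releases):

$ sudo find /var/lib/prometheus/data/orphans -name '*.hint' | head    # list a few hint files
$ sudo cat /var/lib/prometheus/data/orphans/<one-of-them>.hint        # the hint records why that series was orphaned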

You should definitely upgrade to the latest 1.x release (currently 1.8.0) and then see what problems there still are. If you still see problems, a deeper investigation might be in order.
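To double-check what is actually running after the upgrade, a quick sketch (the version flag is single-dash in 1.x, and the build info is also shown on the web UI's /status page; the curl below assumes the default listen address):

$ prometheus -version
$ curl -s 'http://localhost:9090/api/v1/query?query=prometheus_build_info'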


beorn7 commented Oct 10, 2017

To be clear: if you have gigabytes of data in the orphaned directory, clearly something massive has gone wrong, and any visible data loss is unsurprising.

The nastiest data corruption caused by a bug in Prometheus happened in 1.5.0 and 1.5.1. Those corruptions could lie dormant in your storage for a while before they suddenly become apparent (resulting in quarantine, even while running on later releases). But in your case, I suspect some external corruption, as even that bug only corrupted a small fraction of series.


gouthamve commented Jan 18, 2018

Closing this as wont-fix, since it is superseded by the new 2.0 storage.

gouthamve closed this Jan 18, 2018


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
