
"unexpected gap to last checkpoint" error #5456

Closed
mtanda opened this Issue Apr 12, 2019 · 7 comments


mtanda commented Apr 12, 2019

Bug Report

What did you do?
Recorded metrics to Prometheus for several months. The instance suddenly stopped, and I relaunched Prometheus on a new instance.

What did you expect to see?
No errors; Prometheus should continue working without logging errors.

What did you see instead? Under which circumstances?
Prometheus logged an "unexpected gap to last checkpoint" error.

Environment
EC2 instance launched on AWS

  • System information:
$ uname -srm
Linux 4.4.0-1072-aws x86_64
  • Prometheus version:
$ /usr/local/prometheus/bin/prometheus --version
prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
  build user:       root@578ab108d0b9
  build date:       20181106-11:40:44
  go version:       go1.11.1
  • Prometheus configuration file:
$ head -3 /usr/local/prometheus/etc/prometheus_archive.yml
global:
  scrape_interval: 180s
  evaluation_interval: 180s
Launch command:
/usr/local/prometheus/bin/prometheus --config.file=/usr/local/prometheus/etc/prometheus_archive.yml --web.listen-address=127.0.0.1:19090 --web.external-url=http://127.0.0.1:19090/prometheus_archive --storage.tsdb.path=/home/prometheus_archive --storage.tsdb.retention=400d --web.enable-lifecycle --web.enable-admin-api --query.lookback-delta=15m

Timeline

Timeline of the logs and the actions I took.

[2019-04-02T23:33:53] The EC2 instance suddenly shut down; a new EC2 instance was automatically created without the EBS volume.

https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L26

[2019-04-02T23:34:12] Detached the EBS volume from the broken instance, attached and mounted it on the new instance, then restarted Prometheus.

https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L64

[2019-04-06T11:00:37] First error message: "reload blocks: head truncate failed: create checkpoint: read segments: corruption after 184385759 bytes: unexpected full record"

https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L197

[2019-04-06T15:28:38] Removed the suspected broken WAL file.

I found a related issue, and tried to fix the problem by removing the broken WAL file:
#4695

The timezone of the instance is JST (+9:00).

$ /usr/local/prometheus/bin/prometheus --version
prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
  build user:       root@578ab108d0b9
  build date:       20181106-11:40:44
  go version:       go1.11.1
$ ls -l /home/prometheus_archive/wal/
total 178856
-rw-r--r-- 1 prometheus prometheus 134217728 Apr  2 21:24 00000076
-rw-r--r-- 1 prometheus prometheus  48917874 Apr  7 00:34 00000077
drwxr-xr-x 2 prometheus prometheus      4096 Apr  4 06:00 checkpoint.000075
drwxr-xr-x 2 prometheus prometheus      4096 Apr  6 20:00 checkpoint.000077.tmp
$ find /home/prometheus_archive/wal/
/home/prometheus_archive/wal/
/home/prometheus_archive/wal/00000076
/home/prometheus_archive/wal/checkpoint.000077.tmp
/home/prometheus_archive/wal/checkpoint.000077.tmp/00000000
/home/prometheus_archive/wal/checkpoint.000075
/home/prometheus_archive/wal/checkpoint.000075/00000000
/home/prometheus_archive/wal/00000077

I guessed 00000076 was broken and removed it.
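
The removal itself was just this (reconstructed from memory; run with permissions appropriate for the WAL directory):

$ rm /home/prometheus_archive/wal/00000076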

$ ls -l /home/prometheus_archive/wal/
total 48868
-rw-r--r-- 1 prometheus prometheus 50032498 Apr  7 00:48 00000077
drwxr-xr-x 2 prometheus prometheus     4096 Apr  4 06:00 checkpoint.000075
drwxr-xr-x 2 prometheus prometheus     4096 Apr  6 20:00 checkpoint.000077.tmp

After removing the WAL file, I launched Prometheus again.
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L218

[2019-04-11T19:00:03] Another error message: "reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"

https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L452

I found a related bug fix (prometheus/tsdb#451) and tried updating to v2.7.1 (the latest version tested in our environment).

$ /usr/local/prometheus/bin/prometheus --version
prometheus, version 2.7.1 (branch: HEAD, revision: 62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)
  build user:       root@f9f82868fc43
  build date:       20190131-11:16:59
  go version:       go1.11.5

https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L482

But the error message did not disappear...
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L512

$ ls -l /home/data/prometheus_archive/wal/
total 581796
-rw-r--r-- 1 prometheus prometheus 134217728 Apr  7 21:25 00000077
-rw-r--r-- 1 prometheus prometheus 134217728 Apr  9 07:55 00000078
-rw-r--r-- 1 prometheus prometheus 134217728 Apr 10 18:06 00000079
-rw-r--r-- 1 prometheus prometheus 134217728 Apr 12 03:14 00000080
-rw-r--r-- 1 prometheus prometheus  58857948 Apr 12 18:52 00000081
drwxr-xr-x 2 prometheus prometheus      4096 Apr  4 06:00 checkpoint.000075
drwxr-xr-x 2 prometheus prometheus      4096 Apr  6 20:00 checkpoint.000077.tmp

I read the tsdb code, and now I understand the meaning of the error message.
There is a "checkpoint.000075" directory, so the last checkpoint is 75.
After checkpoint N, Prometheus expects the oldest remaining WAL segment to be N+1, i.e. segment 76 here, but I removed "00000076", so it no longer exists.
So, to make the last checkpoint match the current state, I stopped Prometheus, removed "00000077", and renamed "checkpoint.000077.tmp" to "checkpoint.000077" (see the commands below). (I guess the files in "checkpoint.000077.tmp" are not loaded, so this doesn't break Prometheus.)
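
For the record, the repair amounted to roughly these commands (the systemctl unit name is a guess; use whatever manages your Prometheus process):

$ sudo systemctl stop prometheus    # unit name is an assumption
$ cd /home/data/prometheus_archive/wal
$ rm 00000077                       # segment already covered by the checkpoint promoted below
$ mv checkpoint.000077.tmp checkpoint.000077
$ sudo systemctl start prometheus

This works because checkpoint.000077 tells Prometheus that everything up to segment 77 is checkpointed, and the oldest remaining segment (00000078) is exactly the next expected one, so there is no gap.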

$ ls -l /home/data/prometheus_archive/wal
total 453780
-rw-r--r-- 1 prometheus prometheus 134217728 Apr  9 07:55 00000078
-rw-r--r-- 1 prometheus prometheus 134217728 Apr 10 18:06 00000079
-rw-r--r-- 1 prometheus prometheus 134217728 Apr 12 03:14 00000080
-rw-r--r-- 1 prometheus prometheus  61997056 Apr 12 19:32 00000081
drwxr-xr-x 2 prometheus prometheus      4096 Apr  6 20:00 checkpoint.000077

The error disappeared.
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L564


mtanda commented Apr 12, 2019

@krasi-georgiev I created this new issue about the "unexpected gap to last checkpoint" error.
I removed the WAL file manually, so I'm not sure whether this is a bug or my own fault.
But I had no other way to recover when I hit the "unexpected full record" error.
If you have a troubleshooting guide for recovering from a broken state, please tell me.


krasi-georgiev commented Apr 12, 2019

Yeah, the "unexpected full record" error is caused by file corruption from the host crash, and I don't think we can do anything about it. In more recent Prometheus versions we log the number of the corrupted WAL segment, and the usual fix is to delete all segments after the corrupted one.
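
For illustration (the segment number below is hypothetical; take the real one from your own logs, and stop Prometheus first):

$ cd /path/to/storage.tsdb.path/wal    # your --storage.tsdb.path
$ # remove every segment after the corrupted one; segment names are
$ # zero-padded to a fixed width, so string order matches numeric order
$ for f in [0-9]*; do [[ "$f" > "00000076" ]] && rm -- "$f"; done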

I have a plan to add a tsdb scan command to the tsdb CLI tool that should help mitigate such corruptions, but I haven't had the chance to complete that PR:
prometheus/tsdb#320


krasi-georgiev commented Apr 12, 2019

By the way, very nice step-by-step description of the issue.


mtanda commented Apr 12, 2019

The tsdb CLI tool sounds really nice.
As a Prometheus user, I never hit trouble like this issue after upgrading Prometheus from 1.x to 2.x.
Prometheus 2.x is really stable, so thanks to all the developers and the well-managed release process :-)

But when I do hit an issue, I need to read the Prometheus code, as I did when using Prometheus 1.x.
Understanding error logs is not easy for everyone, so I think providing an easier way, or a tool, is really important.


krasi-georgiev commented Apr 12, 2019

Yes, I recognise this, hence the tsdb CLI scan addition.
I should eventually get back to it.

I think we can close the issue since this is expected behaviour after a host crash.


krasi-georgiev commented Apr 12, 2019

I mean, not expected, but there is nothing we can do about it apart from continuing the work on the tsdb scan tool, and we already have a PR for that, so there is no need to keep this one open as well.


mtanda commented Apr 12, 2019

I agree. I hope the Prometheus community takes notice of this issue: in some situations the repair process doesn't work.
And perhaps this issue will help someone who hits the same trouble.

mtanda closed this Apr 12, 2019
