"unexpected gap to last checkpoint" error #5456
Comments
@krasi-georgiev I created a new issue about the "unexpected gap to last checkpoint" error.
Yeah, the "unexpected full record" error is caused by file corruption from the host crash, and I don't think we can do anything about it. In more recent Prometheus versions the logs include the number of the corrupted WAL segment, and the usual fix is to delete all segments after the corrupted one. I plan to add a tsdb scan command to the tsdb CLI tool that should help mitigate such corruption, but I haven't had the chance to complete that PR.
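As a rough illustration of "all segments after the corrupted one", the sketch below only lists candidate file names and deletes nothing. It assumes segment files are named with eight digits (as in the "00000076" mentioned in this issue); the helper name `segments_after` is hypothetical and not part of Prometheus or the tsdb CLI.

```python
import re

# Assumption: WAL segment files are named with exactly eight digits, e.g. "00000076".
SEGMENT_RE = re.compile(r"^\d{8}$")


def segments_after(names, corrupted):
    """Return the segment file names whose number is greater than `corrupted`.

    `names` is a list of file names from the WAL directory; anything that
    does not look like a segment (checkpoints, temp dirs) is ignored.
    """
    return sorted(
        n for n in names if SEGMENT_RE.match(n) and int(n) > corrupted
    )


if __name__ == "__main__":
    listing = ["checkpoint.000075", "00000075", "00000076", "00000077", "00000078"]
    for name in segments_after(listing, 76):
        print(name)
```

In practice the list could come from `os.listdir()` on the WAL directory while Prometheus is stopped. Reviewing the output before removing anything keeps the step reversible.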
krasi-georgiev added the component/local storage label Apr 12, 2019
Btw, very nice step-by-step description of the issue.
The tsdb CLI tool is really nice. But when I hit an issue, I need to read the Prometheus code, as I did when using Prometheus 1.x.
Yes, I recognise this, hence the tsdb CLI scan addition. I think we can close the issue since this is expected behaviour after a host crash.
I mean, not expected, but there is nothing we can do about it apart from continuing the work on the tsdb scan tool. We already have a PR for that, so there is no need to keep this one open as well.
I agree. I hope the Prometheus community takes note of this issue: in some situations the repair process doesn't work.
mtanda commented Apr 12, 2019
Bug Report
What did you do?
I had been recording metrics to Prometheus for several months.
The instance suddenly stopped, and I relaunched Prometheus on a new instance.
What did you expect to see?
No errors; Prometheus continues working without any error logs.
What did you see instead? Under which circumstances?
Prometheus logged the "unexpected gap to last checkpoint" error.
Environment
EC2 instance launched on AWS
Timeline
Timeline of the logs and what I did.
[2019-04-02T23:33:53] the EC2 instance suddenly shut down; a new EC2 instance was created automatically, without the EBS volume
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L26
[2019-04-02T23:34:12] detached the EBS volume from the broken instance, attached and mounted it on the new instance, then restarted Prometheus
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L64
[2019-04-06T11:00:37] first error message "reload blocks: head truncate failed: create checkpoint: read segments: corruption after 184385759 bytes: unexpected full record"
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L197
[2019-04-06T15:28:38] removed the suspected broken WAL file
I found the related issue #4695 and tried to fix the problem by removing the broken WAL file.
The timezone of the instance is JST (+9:00).
I guessed that 00000076 was broken and removed it. After removing the WAL file, I launched Prometheus.
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L218
[2019-04-11T19:00:03] another error message "reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L452
I found a related bug fix, prometheus/tsdb#451, and tried updating to v2.7.1 (the latest version tested in our environment).
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L482
But the error message did not disappear...
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L512
I read the tsdb code and understood the meaning of the error message.
There is "checkpoint.000075", so the last checkpoint is 75.
Prometheus expects the next WAL segment after the checkpoint to be 76, but I removed "00000076", so it does not exist.
So, to make the last checkpoint match the current state, I stopped Prometheus, removed "00000077", and renamed "checkpoint.000077.tmp" to "checkpoint.000077". (I guess the files in "checkpoint.000077.tmp" are not loaded, so this doesn't break Prometheus.)
The error disappeared.
https://gist.github.com/mtanda/30a0aa2462b42e72bb6f63f0a11b136c#file-gistfile1-txt-L564
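The reasoning above can be sketched as a small check over a WAL directory listing. This is a simplified model of the condition as described in this report, not the actual tsdb code: it assumes checkpoint directories are named "checkpoint.NNNNNN", segment files are eight-digit names, and a gap exists when the first remaining segment is greater than the last checkpoint plus one. The function name `find_checkpoint_gap` is hypothetical.

```python
import re

# Assumptions about naming, taken from the file names quoted in this issue.
CHECKPOINT_RE = re.compile(r"^checkpoint\.(\d+)$")  # excludes "checkpoint.N.tmp"
SEGMENT_RE = re.compile(r"^\d{8}$")


def find_checkpoint_gap(names):
    """Return (expected, first_segment) if a gap is detected, else None.

    `names` is a list of entries from the WAL directory.
    """
    checkpoints = [
        int(m.group(1)) for n in names if (m := CHECKPOINT_RE.match(n))
    ]
    segments = sorted(int(n) for n in names if SEGMENT_RE.match(n))
    if not checkpoints or not segments:
        return None  # nothing to compare
    expected = max(checkpoints) + 1  # segment expected right after the checkpoint
    first = segments[0]
    if first > expected:
        return (expected, first)
    return None


if __name__ == "__main__":
    # State after removing 00000076: checkpoint 75, segments start at 77.
    before = ["checkpoint.000075", "checkpoint.000077.tmp", "00000077", "00000078"]
    print(find_checkpoint_gap(before))

    # State after the manual fix: checkpoint 77, segments start at 78.
    after = ["checkpoint.000075", "checkpoint.000077", "00000078"]
    print(find_checkpoint_gap(after))
```

Under these assumptions, the first listing reports a gap between the expected segment 76 and the first present segment 77, and the second listing reports none, which matches the behaviour observed after the rename.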