Opening storage failed invalid block sequence: block time ranges overlap after updating from 2.3.1 to 2.3.2 (potential corruption?) #4388
Comments
Can you clarify if the blocks it's complaining about were generated by 2.3.1 or 2.3.2?
The blocks it is complaining about should have been generated by 2.3.1. I only had 2.3.2 attempt to start up once, which failed in the described manner. Looking at the timestamps (ls output is UTC, logs are in GMT+2, sorry for that): the files are older than the log, so I'm fairly sure here. I've now deleted the 2h files and kept only the 6h file. This allowed Prometheus to start up.
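Before deleting block directories by hand, it can help to see which blocks actually overlap. Below is a minimal sketch of that check, assuming the standard TSDB on-disk layout (one directory per block containing a meta.json with millisecond "minTime"/"maxTime" fields); the script name and helper functions are illustrative, not part of any Prometheus tooling.

```python
#!/usr/bin/env python3
"""Sketch: report overlapping TSDB blocks by reading each block's meta.json.

Assumes the usual layout <data-dir>/<ULID>/meta.json; directories without a
meta.json (wal/, lock files, ...) are skipped.
"""
import json
import os
import sys


def load_ranges(data_dir):
    """Return (minTime, maxTime, block_dir) tuples, sorted by minTime."""
    ranges = []
    for name in os.listdir(data_dir):
        meta_path = os.path.join(data_dir, name, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a block directory
        with open(meta_path) as f:
            meta = json.load(f)
        ranges.append((meta["minTime"], meta["maxTime"], name))
    return sorted(ranges)


def find_overlaps(ranges):
    """Given (min, max, name) tuples sorted by min, return pairs of block
    names whose half-open time ranges [min, max) intersect."""
    overlaps = []
    max_end, max_name = None, None
    for bmin, bmax, name in ranges:
        # A block overlaps if it starts before the furthest end seen so far.
        if max_end is not None and bmin < max_end:
            overlaps.append((max_name, name))
        if max_end is None or bmax > max_end:
            max_end, max_name = bmax, name
    return overlaps


if __name__ == "__main__":
    for a, b in find_overlaps(load_ranges(sys.argv[1] if len(sys.argv) > 1 else ".")):
        print(f"blocks {a} and {b} overlap")
```

Run it against the Prometheus data directory (with Prometheus stopped) to see which block directories conflict, then move the offending ones aside rather than deleting them outright, in case anything is recoverable later.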
I believe this was the issue fixed by 2.3.2, so this is as expected. 2.3.2 can no longer produce such blocks.
Fair enough :) Would it make sense to consider this worthy of a "release note"? Something like: "If you see Prometheus failing to start up with [...], then you should manually clean out the blocks so that they don't overlap. Take into account the [...]
Fabian added some code in 2.3.2 that handles crash loops, but if the crash loop happened in 2.3.1 then the blocks wouldn't have the metadata needed to recover from this state. For such cases the only option would be to use the tsdb scan tool.
ntindall commented Aug 9, 2018
@krasi-georgiev can you point me to the source code / documentation for the tsdb scan tool? I don't think it is this, is it? https://github.com/prometheus/tsdb/blob/master/cmd/tsdb/main.go
It is still WIP.
ankon commented Jul 16, 2018
Bug Report
What did you do?
I was looking into an issue with Prometheus failing due to a disk-full situation. I resized the partition (with Prometheus not running, as far as I can tell), and then also noticed that Prometheus 2.3.2 with TSDB fixes was available, so I updated to it for good measure.
What did you expect to see?
Prometheus starting up again, potentially with reports of lost data due to the disk full problems.
What did you see instead? Under which circumstances?
Prometheus fails to start, reporting "Opening storage failed: invalid block sequence: block time ranges overlap".
This looks like it was introduced with prometheus/tsdb#347, which was pulled into Prometheus 2.3.2.
If I were to guess: things crashed mid-merging?
More data about these chunks:
Environment
Prometheus 2.3.2 running on 2.3.1 data in Kubernetes 1.5 on AWS (data directory is on EBS).
Disk was full, so data loss is expected.