block time ranges overlap #4302
Comments
This was the contents of the data directory when startup failed.
I killed off the …
I've hit a startup failure on this, and on missing metadata JSON files (which I think may have happened after running out of disk space at one point).
Can you share the logs of the first failure?
Continuing to work with a broken database seems unwise to me.
This sounds like prometheus/tsdb#347.
brian-brazil added the kind/bug and component/local storage labels on Jun 22, 2018
@brian-brazil can the .tmp files be cleaned at startup?
I'm not sure it's wise to delete information useful for debugging issues like this.
@brian-brazil I meant more generally.
It is safe for you to delete these files while Prometheus is shut down.
@brian-brazil prometheus/tsdb#347 is different and will not cause overlapping blocks. I think this one is similar to #4200; it happened at a failed compaction: a new 6h block was created as the result of a compaction, but it failed somewhere in the middle, leaving the original 2h block behind, so at the next startup these two overlapped. @tcolgate .tmp directories are created during compaction; when the compaction is complete, the old block is deleted and the .tmp is renamed by removing the .tmp extension. In theory these are safe to delete, as the original block will not be deleted on failure.
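The `<ULID>.tmp` naming convention described above can be checked with a small script before deciding what to delete. A minimal sketch (the helper name and the use of Python are my own; the naming pattern is taken from the comment above), to be run only while Prometheus is stopped:

```python
from pathlib import Path

def find_tmp_dirs(data_dir):
    """Return leftover compaction temp dirs (named <ULID>.tmp)
    found directly under the TSDB data directory."""
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_dir() and p.name.endswith(".tmp")
    )
```

Listing first and deleting by hand keeps the failed compaction output around for inspection, which matches the debugging concern raised earlier in the thread.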
@krasi-georgiev we've not deleted any TS data for a while, though I did have to delete a shard a while back that had lost its metadata file (I think it failed to write during a full-disk situation).
It seems you are hitting #4108, which has exactly the same behaviour as what you have described. Did you try lowering storage.tsdb.max-block-duration? I think when the compaction runs it expands all series from all compacted blocks in RAM and then writes a new merged block, so lowering the …
@krasi-georgiev I need to understand whether that can be changed on a pre-existing TSDB. However, since the nil dedupe bug was fixed, our compactions have been completing without problems (though that was only a couple of days ago).
Just double-checked and I don't see any problem with reducing the …
Did I misunderstand that you are still seeing OOMs during compaction?
The block size settings should not be changed; those flags are only there for load testing.
@brian-brazil would you also note why not, and a suggestion to help @tcolgate's issue? The only downside I see is that reducing the …
That's just what they're for, and no user should ever have a need to adjust them. Changing them is likely to only cause more problems, and frustrate debugging.
I also note that changing that setting would get us no closer to solving the actual issue. Let's debug the problem rather than changing random unrelated settings.
Crashing shouldn't break the tsdb though; we should recover gracefully.
I think I discussed this with @fabxc or @gouthamve and we decided to leave any recovery decisions out of tsdb and put them in the tsdb scan cmd.
I guess Prometheus can use the new …
A crash at any point shouldn't put us into a situation that requires manual action. Only bugs and data corruption should get us into such a state.
@tcolgate Could you give us the logs from Prometheus up until the point where it errors with overlapping blocks?
We are not seeing OOMs during compaction under 2.3.1 (this isn't because compaction is taking less RAM, but because our OOMs were largely due to other issues; compaction memory usage is still too high, but that is being dealt with elsewhere).
The logs are not terribly exciting, except for highlighting a few exits and OOMs prior to the crash. (We had an issue with our sidecar that was causing more frequent restarts than it should have. The OOMs in this case were caused by a runaway exporter (stackdriver-exporter exposing LB latencies) that decided to add an extra 500k time series.)
These logs show that the overlap was caused exactly by the OOMing: since the compaction didn't complete, it created overlapping blocks, which was caught at the next startup. We need to think about how this would best be handled, but in the meanwhile you can use the tsdb scan tool to delete the overlapping blocks.
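For reference, the overlap check behind the startup error is simple interval arithmetic. A hedged sketch (the function name is mine; in a real data directory the mint/maxt values would come from each block's meta.json rather than being passed in as tuples):

```python
def overlapping_blocks(blocks):
    """Return pairs of block ULIDs whose [mint, maxt) ranges overlap.

    `blocks` is an iterable of (ulid, mint, maxt) tuples. Two half-open
    ranges overlap exactly when each starts before the other ends, so
    adjacent blocks that merely touch are not flagged.
    """
    blocks = list(blocks)
    pairs = []
    for i, (ulid_a, mint_a, maxt_a) in enumerate(blocks):
        for ulid_b, mint_b, maxt_b in blocks[i + 1:]:
            if mint_a < maxt_b and mint_b < maxt_a:
                pairs.append((ulid_a, ulid_b))
    return pairs
```

Feeding it the four blocks from the startup error in this issue reports the 6h block overlapping each of the three 2h blocks, matching the three ranges in the log.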
I inferred the block from the message and just blatted it by hand.
The tsdb tool is only a POC, but at some point we can add it to Prometheus itself and automate it, maybe adding an explicit flag at startup for how to handle such cases: auto-delete conflicts, prompt before deleting, etc. We are still discussing and considering different options, and your input would be useful.
We should never automatically throw away user data; we should avoid getting into that situation in the first place.
@brian-brazil you could ignore bad data at start, though. By not starting, you are de facto ignoring the data I most care about: the data you could be collecting. Clearly there is a fine line to tread. Certainly we shouldn't be getting into these situations in the first place, but refusing to start (and thus not collecting data) is just as bad as ignoring data.
Should this be fixed via prometheus/tsdb#354?
That's the presumption, so closing for now.
brian-brazil closed this on Jun 28, 2018
brian-brazil referenced this issue on Jun 29, 2018: Prometheus does not start: Opening storage failed invalid block sequence #4324 (closed)
lock bot commented on Mar 22, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
tcolgate commented Jun 22, 2018

Bug Report

What did you do?
Found prom in a crashloop.

What did you expect to see?
Not that.

What did you see instead? Under which circumstances?

```
level=error ts=2018-06-22T07:54:13.895344831Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1529604000000, maxt: 1529611200000, range: 2h0m0s, blocks: 2]: <ulid: 01CGJ2B565RBWABAM80KJ9N2CW, mint: 1529604000000, maxt: 1529611200000, range: 2h0m0s>, <ulid: 01CGJQDTK2PE8ZCTK5XN2FDMEZ, mint: 1529604000000, maxt: 1529625600000, range: 6h0m0s>
[mint: 1529611200000, maxt: 1529618400000, range: 2h0m0s, blocks: 2]: <ulid: 01CGJQDTK2PE8ZCTK5XN2FDMEZ, mint: 1529604000000, maxt: 1529625600000, range: 6h0m0s>, <ulid: 01CGJ96WDY85B5RXBNHGTZY9TH, mint: 1529611200000, maxt: 1529618400000, range: 2h0m0s>
[mint: 1529618400000, maxt: 1529625600000, range: 2h0m0s, blocks: 2]: <ulid: 01CGJQDTK2PE8ZCTK5XN2FDMEZ, mint: 1529604000000, maxt: 1529625600000, range: 6h0m0s>, <ulid: 01CGJG7NSPZKGGVJ53W3SMXRDR, mint: 1529618400000, maxt: 1529625600000, range: 2h0m0s>"
```

In addition, the server fails to start.

Environment

System information:
insert output of `uname -srm` here

Prometheus version:
2.3.1
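The mint/maxt values in the error above are Unix timestamps in milliseconds; decoding them makes the 2h and 6h block boundaries human-readable. A small sketch (the helper name is mine):

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert a TSDB mint/maxt millisecond timestamp to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

For example, `ms_to_utc(1529604000000)` is 2018-06-21 18:00:00 UTC, the start of the first overlapping range in the log.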