Recover from LevelDB corruptions #1967
Comments
grobie commented Sep 9, 2016
I guess with LevelDB corruptions you're out of luck right now. @beorn7 implemented extensive crash recovery for our chunk storage, but that only covers the data we write directly to disk. There are some developments around a new index system by @fabxc, but it will take some time until that is ready. We could write a stress test for it right from the beginning to make sure corruptions get handled gracefully.
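[Editor's note: to make the stress-test idea concrete, here is a minimal sketch of what such a test helper could look like. It is illustrative only; the helper name is invented and this is not actual Prometheus test code.]

```go
// A minimal sketch (not actual Prometheus code) of the stress-test idea:
// flip random bytes in an on-disk file to simulate a crash-torn write,
// then run recovery against the damaged file and assert that it copes.
package storage_test

import (
	"math/rand"
	"os"
	"testing"
)

// corruptFile flips n random bytes in the file at path.
func corruptFile(t *testing.T, path string, n int) {
	t.Helper()
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	for i := 0; i < n && len(data) > 0; i++ {
		data[rand.Intn(len(data))] ^= 0xFF
	}
	if err := os.WriteFile(path, data, 0o644); err != nil {
		t.Fatal(err)
	}
}
```

Running recovery in a loop against files mutated this way, with varying seeds and byte counts, is a cheap way to surface corruption-handling bugs early.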
grobie added the dependency/external and component/local storage labels on Sep 9, 2016
zxwing commented Sep 9, 2016
Thanks @grobie! Is there any way to avoid deleting all data after a LevelDB crash (or any other crash)?
grobie commented Sep 9, 2016
That's a question for @beorn7.
grobie changed the title from "Any parameters to tune to relieve data crash and Prometheus panic?" to "Recover from LevelDB corruptions" on Sep 9, 2016
beorn7 commented Sep 9, 2016
@zxwing If you find any problems that are not LevelDB related, I'm highly interested in them. That's the part we really want to keep solid, and better testing is on my list (see #447).

When it comes to LevelDB, there are people much more familiar with it working on it upstream. We essentially treat it as a black box, or you could say we don't dare to open that can of worms. Crash recovery or crash resilience is something you could ask the upstream developers about. Besides waiting for improvements from upstream, our only strategy is to get rid of it completely, driven not so much by instability as by our desire for completely different indexing strategies that fit our use case better. #651 is mostly blocked by our inability to open the LevelDB black box; again, moving to a more integrated indexing solution will help us.

To come back to your original question: I don't know of any parameters to tune for LevelDB, but feel free to ask the goleveldb folks.

Hot backups and new indexing have medium priority for us. I'm pretty confident they will happen, but not during the next couple of months (unless a contributor shows up who wants to work on them). And as said, if you find any non-LevelDB-related errors, please report them here. Thanks!
zxwing commented Sep 9, 2016
@beorn7 Thank you! I will improve the test system to handle the LevelDB crash issue (by deleting all data) and continue my tests. I will definitely report back if new panic issues are found.
zxwing closed this on Sep 9, 2016
rektide commented Nov 20, 2016
I realize this is something we hope upstream will fix, but this severely impacts Prometheus usability, and it's important that this project leave this issue open and visible, as something to track, until it is no longer a colossal problem for Prometheus stability.

I've been running Prometheus on a number of laptops, and within roughly four months of regular usage, the LevelDB metadata holding the indexes gets corrupted and I have to nuke the node and start over. This is really, really sad. I thought I was storing some really interesting battery data, but all three systems have quite consistently nuked themselves after mere months, thanks to this issue.

As for coping strategies: I realize I can set up some federation, which would give some way to avoid having to nuke all the data. Since #651 is open, I'm under the impression that there is no way to create a backup of Prometheus data, and I don't believe it's possible to set up federation for past data to share old data. All together, that leaves effectively no strategies for coping with this issue.

I'd also challenge: if Prometheus still has the data, why can it not recover the indexes? It seems like a major flaw that it can't reprocess the raw data into new indexes if those indexes have to be dropped. Is there sufficient data to recreate them, or do the indexes need more data than the custom, bulk Prometheus data chunks to be built? If more data is needed, what data is that?
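[Editor's note: the federation mentioned above works by having a second Prometheus server scrape the /federate endpoint of the first. A minimal sketch of such a scrape config follows; the target hostname and the match[] selector are placeholders.]

```yaml
# Sketch of a federation scrape config on a second Prometheus server.
# 'source-prometheus:9090' and the match[] selector are placeholders.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true        # keep the original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'       # selects all series; narrow this in practice
    static_configs:
      - targets:
          - 'source-prometheus:9090'
```

As rektide notes, this only duplicates data going forward; it cannot backfill history that a corrupted node has already lost.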
A later commenter replied:
I believe the missing data is simply the "labels" -> fingerprint map. Each set of labels maps to an integer value (the fingerprint), which is then used in multiple places, including as the metric file name. So if Prometheus created a .txt file with such mappings on the side, that would probably be enough to rebuild the indexes.
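[Editor's note: as a rough illustration of that suggestion, here is a minimal sketch of such a side-file writer. The function name, file format, and fingerprint value are invented for illustration; this is not Prometheus's actual mapping code or on-disk format.]

```go
// Hypothetical sketch of the proposed side file: append each new
// "fingerprint -> labels" mapping as one line of plain text so the
// index could be rebuilt from it after a corruption.
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// appendMapping writes one "fingerprint<TAB>{labels}" line to the side file.
func appendMapping(path string, fingerprint uint64, labels map[string]string) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	// Serialize labels in sorted order so a given label set always
	// produces the same line.
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, labels[name]))
	}
	_, err = fmt.Fprintf(f, "%016x\t{%s}\n", fingerprint, strings.Join(pairs, ", "))
	return err
}

func main() {
	if err := appendMapping("fingerprint_mappings.txt", 0x1f2a, map[string]string{
		"__name__": "http_requests_total",
		"job":      "api-server",
	}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```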
beorn7 commented
To clarify:
Having said all that, there is a corner case where crashrecovery would be able to recover more than it does, and that is if archived_fingerprint_to_timerange or archived_fingerprint_to_metric are so corrupted that they cannot even be opened. (That's the error message that started this issue.) So far, this has been so rare that we didn't really bother. It's a relatively easy fix (as we already recover everything correctly if those two LevelDBs can still be opened). I have filed #2210 to separate it cleanly from this issue.
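[Editor's note: for the "cannot even be opened" case, goleveldb ships a salvage mechanism of its own. The sketch below shows how it could be applied; OpenFile, RecoverFile, and IsCorrupted are real goleveldb APIs, but the surrounding function and path are illustrative, and this is not what Prometheus's crashrecovery did at the time.]

```go
// Sketch: try a normal open first, and fall back to goleveldb's
// RecoverFile, which rebuilds the MANIFEST from whatever table files
// are still readable.
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	leveldberrors "github.com/syndtr/goleveldb/leveldb/errors"
)

// openOrRecover opens the LevelDB at path, salvaging it if corrupted.
func openOrRecover(path string) (*leveldb.DB, error) {
	db, err := leveldb.OpenFile(path, nil)
	if leveldberrors.IsCorrupted(err) {
		return leveldb.RecoverFile(path, nil)
	}
	return db, err
}

func main() {
	db, err := openOrRecover("data/archived_fingerprint_to_metric")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println("index LevelDB opened")
}
```

Data in unreadable tables is lost, but the database becomes openable again, after which the existing crash recovery could take over.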
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
zxwing commented Sep 9, 2016 (original issue)
I did the following tests of Prometheus's stability:

Unfortunately, Prometheus panics after 5~10 rounds of testing because of data corruption, with errors such as:

source="main.go:206"
time="2016-09-05T19:00:22+08:00" level=error msg="Error opening memory series storage: leveldb/storage: corrupted or incomplete manifest file" source="main.go:143"

I have read other, similar issues: #1496 and #651. My ten years of experience with open source tells me that this kind of issue is not a high priority; however, I am pursuing a way to mitigate it, as our system has very strict requirements on stability.

Can you suggest some parameters to tune? Thank you!