Recover from LevelDB corruptions #1967

Closed
zxwing opened this Issue Sep 9, 2016 · 8 comments

zxwing commented Sep 9, 2016

I did the following tests of Prometheus's stability:

  1. Run a Prometheus for a while to collect some data
  2. Do a cold shutdown of the machine (to simulate hardware failure)
  3. Power on the machine and check if the Prometheus continues to work
  4. Repeat steps 1-3

Unfortunately, Prometheus panics after 5-10 rounds of this test because of data corruption, with errors like:

time="2016-09-05T19:00:22+08:00" level=error msg="Error opening memory series storage: leveldb/storage: corrupted or incomplete manifest file" source="main.go:143"
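The kill-and-restart loop above can be sketched roughly as follows. This is an illustrative Python harness, not the one actually used; the stand-in `sleep` process replaces the real binary, and note that SIGKILL only approximates a true power cut, since the OS page cache survives it.

```python
import signal
import subprocess
import time

def crash_round(cmd, collect_seconds, startup_seconds):
    """One round of steps 1-3 above: run the server, kill it hard,
    restart it, and report whether it is still alive after a startup
    grace period (i.e. it did not panic while recovering)."""
    proc = subprocess.Popen(cmd)
    time.sleep(collect_seconds)        # 1. let it run and collect data
    proc.send_signal(signal.SIGKILL)   # 2. hard stop; weaker than a real
    proc.wait()                        #    power cut (page cache survives)
    proc = subprocess.Popen(cmd)       # 3. "power on" again
    time.sleep(startup_seconds)
    recovered = proc.poll() is None    # still running => no startup panic
    if recovered:
        proc.terminate()
        proc.wait()
    return recovered

# 4. repeat the rounds; a stand-in long-running process is used here
#    instead of e.g. ["./prometheus", "-storage.local.path", "data/"]
for i in range(3):
    print("round", i, "recovered:", crash_round(["sleep", "60"], 1, 1))
```

A real harness would additionally need to power-cycle the machine (or reset a VM) to reproduce the corruption described here, since a process-level kill never loses data already handed to the OS.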

I have read other similar issues, #1496 and #651. My ten years of experience with open source tell me that this kind of issue is not a high priority; however, I am pursuing a way to mitigate it, as our system has very strict stability requirements.

Can you suggest some parameters to tune? Thank you!

grobie (Member) commented Sep 9, 2016

I guess with leveldb corruptions you're out of luck right now. @beorn7 implemented an extensive crash recovery for our chunk storage, but that's only for the data we write directly to disk.

There are some developments around a new index system by @fabxc, but it will take some time until that is ready. We could write a stress test for it right at the beginning to make sure corruptions get handled gracefully.

zxwing commented Sep 9, 2016

Thanks @grobie! Is there any way to avoid deleting all data after a LevelDB crash (or any other crash)?

grobie (Member) commented Sep 9, 2016

That's a question for @beorn7.

grobie changed the title from "Any parameters to tune to relieve data crash and Prometheus panic?" to "Recover from LevelDB corruptions" Sep 9, 2016

beorn7 (Member) commented Sep 9, 2016

@zxwing If you find any problems that are not LevelDB related, I'm highly interested in them. That's the part we really want to keep solid, and better testing is on my list (see #447).

When it comes to LevelDB, there are people much more familiar with it working on it upstream. We essentially treat it as a black box, or you could say we don't dare to open the can of worms. Crash recovery or crash resilience is something you could ask the upstream developers about.

Besides waiting for improvements from upstream, our only strategy is to get rid of it completely, not so much driven by instability but by our desire for completely different indexing strategies that fit our use case better.

#651 is mostly kept open by our inability to look inside the LevelDB black box. Again, moving to a more integrated indexing solution will help us.

To come back to your original question: I don't know of any parameters to tune for LevelDB, but feel free to ask the goleveldb folks. Hot backups and new indexing have medium priority for us. I'm pretty confident they will happen, but not during the next couple of months (unless a contributor shows up who wants to work on it).

As said, if you find any non-LevelDB-related error reports, please follow up with them here. Thanks!

zxwing commented Sep 9, 2016

@beorn7 Thank you! I will improve the test system to handle the LevelDB crash issue (by deleting all data) and continue my tests. I will definitely report back if I find new panic issues.

@zxwing zxwing closed this Sep 9, 2016

rektide commented Nov 20, 2016

I realize this is something we hope upstream will fix, but it severely impacts Prometheus usability, and it's important that this project leave this issue open and visible, as something to track, until it is no longer a colossal problem for Prometheus stability. I've been running Prometheus on a number of laptops, and within 4 months or so of regular usage, the LevelDB metadata holding the indexes gets corrupted and I have to nuke the node and start over. This is really, really sad. I'd thought I was storing some really interesting battery data, but all three systems have quite consistently nuked themselves after mere months, thanks to this issue.

As for coping strategies: I realize I can set up some federation, which would give me some way to avoid having to nuke all the data. Since #651 is open, I'm under the impression that there is no way to create a backup of Prometheus data, and I don't believe it's possible to set up federation to share past data. All together, that leaves effectively no strategies for coping with this issue.

I'd also ask: if Prometheus still has the data, why can it not recover the indexes? It seems like a major flaw that it can't reprocess the raw data into new indexes if those indexes have to be dropped. Is there sufficient data to recreate them, or do the indexes need more data than the custom, bulk Prometheus data chunks to be built? If more data is needed, what data is that?

redbaron (Contributor) commented Nov 20, 2016

I believe the missing data is simply the "labels" -> fingerprint map. Each set of labels maps to an integer value (the fingerprint), which is then used in multiple places, but also as the metric file name.

So if Prometheus created a .txt file with such mappings on the side, it would probably be enough to rebuild the indexes.
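As a sketch of that idea (hypothetical throughout: FNV-1a here merely stands in for whatever fingerprint function Prometheus actually uses, and the sidecar format is made up), the mapping is cheap to append to a text file as each new label set appears:

```python
import json

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fingerprint(labels):
    """Hash a label set to a 64-bit integer. FNV-1a over the sorted
    label pairs is used purely as a stand-in stable hash; Prometheus's
    real fingerprint function differs."""
    h = FNV64_OFFSET
    for name in sorted(labels):
        for b in f"{name}={labels[name]};".encode():
            h = ((h ^ b) * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def append_mapping(path, labels):
    """Log one labels -> fingerprint line to a sidecar file, so the
    mapping could be replayed later to rebuild a lost index."""
    fp = fingerprint(labels)
    with open(path, "a") as f:
        f.write("%016x %s\n" % (fp, json.dumps(labels, sort_keys=True)))
    return fp

fp = append_mapping("mappings.txt",
                    {"__name__": "http_requests_total", "job": "api"})
```

Sorting the labels before hashing makes the fingerprint independent of insertion order, which is what lets an index be rebuilt deterministically from such a log.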

beorn7 (Member) commented Nov 21, 2016

To clarify:

  • Prometheus crash recovery does recover all the indices it can (with one exception in a corner case, see below).
  • There is one LevelDB that is simply the only place where its information is stored, and that is archived_fingerprint_to_metric. If that index is completely unreadable, all archived time series are lost.
  • We have no evidence that LevelDB has systematic bugs that lead to corruption eventually. On the contrary, LevelDB did a pretty good job with recovering from crashes so far. After three years of using it at SoundCloud on hundreds of servers, we only had a handful of unsuccessful recoveries where we had to blame LevelDB, and the assumption so far is that actual data corruption on disk is to blame, not a bug in the code. If @rektide sees regular data corruptions, I would strongly assume something else is wrong, perhaps in a subtle way, that leads to the impression LevelDB is to blame.
  • When we refer to upstream and treating LevelDB as a black-box, we don't want to imply there are bugs in LevelDB. This is more about Prometheus devs not being experts in LevelDB internals. LevelDB has a well-documented on-disk format, so there might even be tools out there to recover data from corrupted files and bring them back into readable state. That would not be true for a home-grown Prometheus solution. We would understand it better but we would not have a whole community around our on-disk format. In other words: In terms of stability and recoverability, using LevelDB is a net plus. When we talk about coming up with our own indexing solution, it is more about the ability to create consistent hot backups (see next item) and better suited indexing algorithms.
  • You can take a cold backup from Prometheus without problem. #651 is about hot backups.
  • We strongly recommend running two identically configured Prometheus servers for critical monitoring. You can then tolerate the corruption of one. (Which should still be a rare event; as said above, in @rektide's case, something else must be wrong.)
  • A crash recovery is never perfect. Prometheus tries to limit corruption and loss when the binary crashes. But "power off" type crashes can even lead to arbitrary data loss. (Note that even sync'ing files is no guarantee that the data is persisted to disk over this kind of crash.) In a non-distributed system, there is no way to be "safe". Even if the crash recovery succeeds, you might see missing data (and you cannot find out what exactly is missing). To protect against crashes, you need a strategy that doesn't depend on crash recovery, i.e. take cold backups or run Prometheus server in duplicate or triplicate.

Having said all that, there is a corner case where crash recovery would be able to recover more than it does: if archived_fingerprint_to_timerange or archived_fingerprint_to_metric are so corrupt that they cannot even be opened. (That's the error message that started this issue.) So far, this has been so rare that we didn't really bother. It's a relatively easy fix (as we already recover everything correctly if those two LevelDBs can still be opened). I have filed #2210 to separate it cleanly from this issue.
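The cold-backup route mentioned above can be sketched as follows (illustrative only: it assumes the server writing to the data directory has already been stopped and will be restarted afterwards, and the demo uses a stand-in directory rather than a real Prometheus storage path):

```python
import shutil
import time
from pathlib import Path

def cold_backup(data_dir, backup_root):
    """Copy the storage directory to a timestamped backup location.
    Only safe as a *cold* backup: the server writing to data_dir must
    be stopped before calling this and restarted after, otherwise the
    copy can be internally inconsistent."""
    dest = Path(backup_root) / time.strftime("prometheus-%Y%m%dT%H%M%S")
    shutil.copytree(data_dir, dest)
    return dest

# demo on a stand-in directory instead of a real Prometheus data dir
Path("demo_data").mkdir(exist_ok=True)
(Path("demo_data") / "heads.db").write_text("placeholder chunk data")
dest = cold_backup("demo_data", "backups")
```

Restoring is the same operation in reverse: stop the server, replace the data directory with the backup copy, and start it again.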
