WAL requires more robust corruption handling #4705

Closed
gouthamve opened this Issue Oct 7, 2018 · 14 comments

gouthamve (Member) commented Oct 7, 2018

From #4603 (comment)

We should be able to handle errors of the form:

level=error ts=2018-10-07T02:39:01.980632051Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error: invalid record type 255"
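For illustration only (a sketch, not the actual tsdb code), the kind of handling this asks for: an unknown record-type byte should surface as a positional corruption error that a repair pass can truncate at, rather than an error the reader cannot handle.

```go
package main

import "fmt"

// recType mirrors the kind of one-byte record header a WAL format uses;
// the names here are illustrative.
type recType uint8

const (
	recPageTerm recType = iota // rest of the page is empty
	recFull                    // a full record in one fragment
	recFirst                   // first fragment of a spanning record
	recMiddle                  // middle fragment
	recLast                    // final fragment
)

// corruptionErr carries the offset of the first bad byte so a repair routine
// can truncate the WAL there instead of refusing to start.
type corruptionErr struct {
	offset int64
	reason string
}

func (e *corruptionErr) Error() string {
	return fmt.Sprintf("corruption at offset %d: %s", e.offset, e.reason)
}

// checkType treats any record-type byte outside the known set as corruption.
func checkType(t recType, offset int64) error {
	switch t {
	case recPageTerm, recFull, recFirst, recMiddle, recLast:
		return nil
	default:
		return &corruptionErr{offset: offset, reason: fmt.Sprintf("invalid record type %d", t)}
	}
}

func main() {
	// The error in the report above was "invalid record type 255".
	if err := checkType(recType(255), 1234); err != nil {
		fmt.Println(err) // a repair pass would truncate the segment from err.offset
	}
}
```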
zegl commented Nov 1, 2018

I have seen two new WAL corruption errors (both of these are from the same instance) since upgrading to Prometheus 2.4.3.

The only fix for the following two situations has been to delete the WAL and restart Prometheus.

Both errors have occurred during (re)boot and cause Prometheus to crash/stop.

level=error ts=2018-10-31T08:54:51.552889842Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: read segments: corruption after 612039672 bytes: unexpected checksum a2ddd98, expected fb979772"
level=error ts=2018-11-01T08:22:16.121797899Z caller=main.go:617 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment 19 at 47213509: unexpected checksum bbac5c0d, expected 5e23349f"
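For context, these checksum errors come from a per-record CRC check on replay. A minimal, self-contained sketch of that check, assuming the Castagnoli CRC32 table the tsdb WAL uses (the rest of the details are illustrative):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// The WAL stores a CRC alongside each record; a mismatch on replay means the
// record, and anything written after it, cannot be trusted.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// verifyRecord reproduces the shape of the errors above: "unexpected" is the
// checksum computed from the bytes on disk, "expected" is the stored one.
func verifyRecord(data []byte, stored uint32, offset int64) error {
	if got := crc32.Checksum(data, castagnoli); got != stored {
		return fmt.Errorf("corruption after %d bytes: unexpected checksum %x, expected %x",
			offset, got, stored)
	}
	return nil
}

func main() {
	rec := []byte("example record payload")
	// A stored checksum that deliberately does not match, to show the failure mode.
	fmt.Println(verifyRecord(rec, 0xfb979772, 612039672))
}
```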
krasi-georgiev (Member) commented Nov 9, 2018

@gouthamve what is your idea for handling these?
I think even when only a single record is corrupted we don't know its length, so we wouldn't know how to skip it and continue reading the non-corrupted records that follow.
The only way to NOT shut down would be to ignore or delete any data after that point.

Alternatively, the wipe of corrupted records could be implemented in the tsdb scan CLI tool.

I am undecided whether a hard fail or an automatic data wipe is preferable, so I will let someone more involved in operations give opinions on that.

brian-brazil (Member) commented Nov 9, 2018

Our general approach to WAL corruption is to ignore everything after the corruption.
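In concrete terms that means repairing by truncation. A rough sketch, where the directory layout and the corruption struct are illustrative assumptions rather than the tsdb API (the real repair is more careful about preserving the readable prefix of the damaged segment):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// corruption records where replay failed.
type corruption struct {
	segment int   // index of the segment file that failed to read
	offset  int64 // byte offset of the first bad record
}

// repair drops all segments after the corrupted one and truncates the
// corrupted segment at the corruption offset.
func repair(walDir string, c corruption) error {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		n, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numbered segment file
		}
		if n > c.segment {
			// Everything after the corrupted segment is unusable.
			if err := os.Remove(filepath.Join(walDir, e.Name())); err != nil {
				return err
			}
		}
	}
	// Keep the readable prefix of the corrupted segment, drop the rest.
	name := filepath.Join(walDir, fmt.Sprintf("%08d", c.segment))
	return os.Truncate(name, c.offset)
}

func main() {
	if err := repair("data/wal", corruption{segment: 19, offset: 47213509}); err != nil {
		fmt.Println("repair failed:", err)
	}
}
```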

krasi-georgiev (Member) commented Nov 9, 2018

@gouthamve, @fabxc, do you agree with that so I can work on a fix?

krasi-georgiev (Member) commented Nov 9, 2018

I just had an idea to skip WAL pages that have corrupted records in them, so at least the data loss would be minimal.

Display a log line to notify about the skipped pages and continue business as usual.
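A minimal sketch of that idea, assuming the 32 KiB page size the tsdb WAL uses (everything else here is illustrative): on a bad record, jump to the next page boundary and resume reading there, so at most the rest of one page is lost.

```go
package main

import "fmt"

// The tsdb WAL lays records out in fixed-size pages.
const pageSize = 32 * 1024

// nextPageStart returns the offset of the first page boundary after off.
func nextPageStart(off int64) int64 {
	return (off/pageSize + 1) * pageSize
}

func main() {
	badRecordAt := int64(47213509)
	fmt.Printf("corrupt record at %d, resuming at %d\n",
		badRecordAt, nextPageStart(badRecordAt))
}
```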

brian-brazil (Member) commented Nov 9, 2018

nodox commented Nov 14, 2018

What is the progress on this issue? We are experiencing this problem as well.

krasi-georgiev (Member) commented Nov 14, 2018

I have an idea for how to fix it and will try to open a PR tomorrow.

@nodox are you getting the exact same error log?

Any pointers on how it was triggered?

nodox commented Nov 15, 2018

@krasi-georgiev I've encountered this in a Kubernetes environment, FYI. After a bit of analysis it seems the issue is caused by the retention window being longer than the available storage. After we did some capacity planning based on the formula provided in the documentation (roughly sketched below), we were in fact under the required capacity. After making the adjustment the problem seemed solved. However, increasing the storage did cause data loss, so we'll only be able to tell whether the problem is solved

  • when the retention window is approached
  • when the container is near its capacity, if ever

If we reach those points without the WAL error recurring, then it's resolved. At the Prometheus level, though, this should be handled gracefully so it doesn't brick our pod creation cycles. I would imagine that you could write code that triggers samples to be removed so the WAL has enough space on disk.
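For reference, the capacity-planning formula from the Prometheus storage documentation works out to roughly retention_time_seconds * ingested_samples_per_second * bytes_per_sample, with 1 to 2 bytes per sample as the usual estimate. A quick sketch with placeholder numbers rather than values from this deployment:

```go
package main

import "fmt"

// neededBytes applies the rough sizing formula from the storage docs.
func neededBytes(retentionSeconds, samplesPerSecond, bytesPerSample float64) float64 {
	return retentionSeconds * samplesPerSecond * bytesPerSample
}

func main() {
	retention := 15 * 24 * 3600.0 // 15d default retention
	rate := 100000.0              // ingested samples/s (example value)
	need := neededBytes(retention, rate, 2.0)
	fmt.Printf("~%.0f GiB needed; provision the volume above this\n", need/(1<<30))
}
```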

What do you think?

brian-brazil (Member) commented Nov 15, 2018

Our general stance is that capacity planning is the user's responsibility. Even detecting that we're in this situation isn't possible in the general case, and not all users would want silent data loss to occur if it happens.

krasi-georgiev (Member) commented Nov 15, 2018

@nodox on top of what Brian said, we are working on a size-based retention option in Prometheus, which should address your issue.

I have already opened a PR that improves the repair handling, so let's merge that and revisit.

alex88 commented Jan 7, 2019

Even now on 2.6.0 (which, from the changelog, seems to include the updated tsdb version) I still get the following on startup:

caller=main.go:624 err="opening storage failed: read WAL: backfill checkpoint: read records: corruption in segment /data/wal/checkpoint.001467/00000000 at 40173568: unexpected first record"

Is there a flag or something else to at least make it skip those blocks?

krasi-georgiev (Member) commented Jan 7, 2019

I think at this point it is best to delete anything after the corrupted segment: the entire /data/wal/checkpoint.001467 folder in your case.

alex88 commented Jan 7, 2019

Yeah I ended up doing that and it's now back up, thanks!
