WAL requires more robust corruption handling #4705
Comments
gouthamve added the kind/bug, priority/P1, and component/local storage labels on Oct 7, 2018
zegl commented Nov 1, 2018

I have seen two new WAL corruption errors (both from the same instance) since upgrading to Prometheus 2.4.3. The only fix in both situations has been to delete the WAL and restart Prometheus. Both errors occurred during (re)boot and caused Prometheus to crash.
@gouthamve what is your idea for handling these? Alternatively, the wipe of corrupted records could be implemented in the tsdb scan CLI tool. I am undecided whether a hard fail or an automatic data wipe is preferable, so I will let someone more involved in operations give an opinion on that.
Our general approach to WAL corruption is to ignore everything after the corruption.
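To make this "ignore everything after the corruption" policy concrete, here is a minimal sketch in Go. The record framing used here (a 4-byte length and 4-byte CRC32 per record) is an illustrative assumption, not the real Prometheus WAL encoding; the point is only that a sequential scan stops at the first record whose checksum fails and keeps the valid prefix.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// validPrefix scans a toy WAL byte stream framed as
// [4-byte length][4-byte CRC32][payload] records and returns the
// offset just past the last record whose checksum verifies.
// Everything from the first corrupt or truncated record onward
// is discarded.
func validPrefix(wal []byte) int {
	off := 0
	for off+8 <= len(wal) {
		length := int(binary.BigEndian.Uint32(wal[off:]))
		if off+8+length > len(wal) {
			break // truncated record
		}
		sum := binary.BigEndian.Uint32(wal[off+4:])
		payload := wal[off+8 : off+8+length]
		if crc32.ChecksumIEEE(payload) != sum {
			break // first corrupt record: stop here
		}
		off += 8 + length
	}
	return off
}

// appendRecord frames a payload with its length and checksum.
func appendRecord(wal, payload []byte) []byte {
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[:4], uint32(len(payload)))
	binary.BigEndian.PutUint32(hdr[4:], crc32.ChecksumIEEE(payload))
	return append(append(wal, hdr[:]...), payload...)
}

func main() {
	var wal []byte
	wal = appendRecord(wal, []byte("series-1"))
	wal = appendRecord(wal, []byte("sample-1"))
	good := len(wal)
	wal = appendRecord(wal, []byte("sample-2"))
	wal[good+9] ^= 0xFF // flip a payload byte in the third record
	fmt.Println(validPrefix(wal) == good) // only the first two records survive
}
```

The trade-off discussed below follows directly from this shape: stopping at the first bad record loses everything after it, while skipping over bad records risks replaying samples for series whose creation record was lost.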
@gouthamve, @fabxc do you agree with that, so I can work on a fix?
I just had an idea: skip WAL pages that have corrupted records in them, so that at least the data loss would be minimal. Log the skipped pages to notify the user and continue business as usual.
That could result in partial data, which can be very messy for the user, and we could miss a series creation too. I don't think it's safe to use anything after a corruption.
nodox commented Nov 14, 2018

What is the progress on this issue? We are experiencing this problem as well.
I have an idea for how to fix it and will try to open a PR tomorrow. @nodox, are you getting the exact same error log? Any pointers on how it was triggered?
krasi-georgiev referenced this issue on Nov 15, 2018: repair wal when the record cannot be decoded #453 (Merged)
nodox commented Nov 15, 2018

@krasi-georgiev I've encountered this in a Kubernetes environment, FYI. After a bit of analysis, it seems the issue is caused by the retention window being longer than the available storage allows. We did some capacity planning based on the formula provided in the documentation and found we were in fact under the required capacity. After making the adjustment, the problem seemed solved. However, increasing the storage did cause data loss, so we will only be able to tell over time whether the problem is truly resolved.

Even if the capacity requirements are met, Prometheus should handle the WAL error gracefully so it doesn't brick our pod creation cycles. I would imagine you could write code that removes samples so the WAL has enough space on disk. What do you think?
Our general stance is that capacity planning is the user's responsibility. Even detecting that we are in this situation isn't possible in the general case, and not all users would want silent data loss to occur if it happens.
@nodox, on top of what Brian said, we are working on a size-based retention option in Prometheus, which should address your issue. I have already opened a PR that improves the repair handling, so let's merge that and revisit.
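For context, the size-based retention option mentioned here later landed in Prometheus as a command-line flag. Assuming a release where it is available (it shipped as experimental in later 2.x versions), enabling it might look like the following; check your release's documentation for the exact flag name and supported units:

```shell
# Keep the TSDB under roughly 100GB on disk; the oldest blocks
# are removed first once the limit is exceeded.
prometheus \
  --storage.tsdb.retention.size=100GB \
  --storage.tsdb.path=/prometheus/data
```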
krasi-georgiev closed this in prometheus/tsdb#453 on Nov 30, 2018
alex88 commented Jan 7, 2019

Even now on 2.6.0 (which, reading the changelog, seems to include the updated tsdb version), I still get a corruption error on startup. Is there a flag or something else to at least make it skip those blocks?
I think at this point it is best to delete anything after the corrupted segment.
alex88 commented Jan 7, 2019

Yeah, I ended up doing that and it's now back up, thanks!
gouthamve commented Oct 7, 2018

From #4603 (comment):

We should be able to handle errors of the form: