Isolate corrupted series files instead of panicking. #877
Comments
beorn7 self-assigned this on Jul 6, 2015
This really looks like severe data corruption. One might argue whether panicking is the right response. If corruption of this kind happens, there is probably nothing to rescue... On the other hand, the affected series could be moved to the orphaned directory for forensics... I'll leave this open for the record, with a changed title, but it's not a high priority to deal with right now.
beorn7 changed the title from "panics due to storage errors" to "Isolate corrupted series files instead of panicking." on Jul 21, 2015
mdirkse commented on Sep 3, 2015
I'd second the notion that it'd be useful to know exactly where the data is corrupted and whether there is anything to rescue. Also, if the server randomly fails and data gets corrupted in such a way that Prometheus dies soon after it's started (as is the case at the moment for our server), then that's a problem. I'd prefer it to detect that it has corrupt data and then start collecting new data anyway. At least then you have some sort of monitoring.

I currently have a DB that is corrupted to the point that Prometheus shuts down after a couple of seconds of operation with a chunk panic. Right now I have no choice but to delete all the data and start over. Not sure what caused the corruption, probably random restarts of the process or machine. The top of the stacktrace looks like this:
I did an strace on the failing process which yielded this:
mdirkse commented on Sep 3, 2015
Ah, turns out that my corruption was probably due to the disk being full at a certain point in time.
brian-brazil added the bug label on Dec 16, 2015
Having the same problem. Tried stracing and deleting the files that Prometheus was accessing just before crashing, several times, but it's always crashing. Is there a tool one can use to determine if a database file is valid or not? I'd rather not have to delete our entire history.
Currently no tool. It will be a pretty easy fix to not crash in the case above, but my gut feeling is that those kinds of corruptions will come in bulk in most cases (e.g. because you ran out of disk space), so you will have a lot of lost series anyway, and the value of your history might be questionable... I'll definitely implement some kind of series isolation mechanism eventually, but not very soon. PRs welcome. ;)
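For readers wondering what "series isolation" could look like in practice, here is a minimal sketch in Go. It is not the Prometheus implementation: the function names `quarantine` and `orphanedPath`, the directory layout, and the file names are all assumptions made for illustration. The idea is only that a file which fails validation gets moved aside for forensics and an error is returned, rather than the process panicking.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// orphanedPath returns where a series file would be parked inside orphanedDir.
// Hypothetical helper: the flat layout is an assumption for illustration.
func orphanedPath(orphanedDir, seriesFile string) string {
	return filepath.Join(orphanedDir, filepath.Base(seriesFile))
}

// quarantine moves a series file that failed validation into orphanedDir
// and reports problems as an error instead of panicking.
func quarantine(seriesFile, orphanedDir string) error {
	if err := os.MkdirAll(orphanedDir, 0o755); err != nil {
		return fmt.Errorf("creating %s: %w", orphanedDir, err)
	}
	if err := os.Rename(seriesFile, orphanedPath(orphanedDir, seriesFile)); err != nil {
		return fmt.Errorf("isolating %s: %w", seriesFile, err)
	}
	return nil
}

func main() {
	// Work in a throwaway directory so the demo leaves nothing behind.
	dir, err := os.MkdirTemp("", "quarantine-demo")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	// Stand-in for a series file that was found to be corrupt.
	bad := filepath.Join(dir, "series.db")
	if err := os.WriteFile(bad, []byte("garbage"), 0o644); err != nil {
		fmt.Println("setup failed:", err)
		return
	}

	orphaned := filepath.Join(dir, "orphaned")
	if err := quarantine(bad, orphaned); err != nil {
		fmt.Println("could not isolate series file:", err)
		return
	}
	fmt.Println("isolated to", orphanedPath(orphaned, bad))
}
```

Using a rename keeps the original bytes intact for later inspection, and the server can keep ingesting new samples for the unaffected series. Note that `os.Rename` only works this simply when both paths are on the same filesystem.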
My problem was the same as the OP's: I ran out of disk space. I don't know anything about how Prometheus organizes its files, but presumably old history would still be valid?
Considering there's apparently no way to do consistent snapshots for backup purposes, it sounds like this problem might conceivably also arise if you try to restore a backup. I'm a bit worried now.
Cold backups are no problem. But even "hot" snapshots have always worked for me. The corruption we are talking about here can only happen because of...
With "hot" snapshots, I get constant errors from
Files that are incompletely written are fixed during crash recovery. The corruptions causing the crashes are different in nature: wrong rather than incomplete data, or wrong alignment.
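To illustrate the distinction, here is a rough Go sketch of how an incompletely written file can be repaired when records have a fixed size. This is not the actual crash-recovery code: the function names `partialTail` and `trimPartialRecord` and the record length used in the demo are assumptions. It handles only the "incomplete" case by cutting off a trailing partial record; it cannot detect a file whose size is fine but whose contents are wrong, which is the kind of corruption behind the crashes discussed here.

```go
package main

import (
	"fmt"
	"os"
)

// partialTail reports how many trailing bytes do not form a complete record.
// recordLen must be positive.
func partialTail(size, recordLen int64) int64 {
	return size % recordLen
}

// trimPartialRecord truncates the file at path to its last complete record
// and returns the number of bytes removed. The record length is supplied by
// the caller; no real on-disk format is assumed here.
func trimPartialRecord(path string, recordLen int64) (int64, error) {
	if recordLen <= 0 {
		return 0, fmt.Errorf("record length must be positive, got %d", recordLen)
	}
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	extra := partialTail(fi.Size(), recordLen)
	if extra == 0 {
		return 0, nil // already aligned, nothing to repair
	}
	if err := os.Truncate(path, fi.Size()-extra); err != nil {
		return 0, err
	}
	return extra, nil
}

func main() {
	// Create a throwaway file holding two and a half 10-byte records.
	f, err := os.CreateTemp("", "series-*.db")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(make([]byte, 25)); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	f.Close()

	trimmed, err := trimPartialRecord(f.Name(), 10)
	if err != nil {
		fmt.Println("repair failed:", err)
		return
	}
	fmt.Println("bytes trimmed:", trimmed)
}
```

A size check like this is cheap, which is why it suits a recovery pass over many files. Catching wrong data would need something stronger, such as per-record checksums, which this sketch does not attempt.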
beorn7 referenced this issue on Mar 2, 2016: Handle errors caused by data corruption more gracefully #1448 (Merged)
Fixed by #1448
This bug seems to be un-closeable... Meta bug in GitHub... :)
prometheus locked and limited conversation to collaborators on Mar 18, 2016
prometheus unlocked this conversation on Mar 21, 2016
I filed a support ticket about the un-closeability of this issue.
beorn7 closed this on Mar 23, 2016
GH support made this closeable. \o/
lock bot commented on Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
TheTincho commented on Jul 6, 2015
While looking through data with the Prometheus console, I get many errors in the log like this:
And sometimes this one:
I suspect this might be some DB corruption, as the process has been killed many times (usually through SIGTERM though). I have also seen the whole server come down because of other errors that might be related. For example: