tsdb Panic when reading chunk #4128
Comments
Apologies if this is in the wrong repo. Any advice on recovering from this error without removing all tsdb files would be appreciated.
I had a quick look, and it seems there is some corruption in the chunk, so reading the encoding fails. Is this easy to reproduce?
No, it happened in a 1 TB tsdb (which I eventually deleted), and it really could be the result of any failure (disk or whatever). So I don't think the problem is that things get corrupted; I think the problem is that tsdb can't work around it (or I can't tell it to work around it). I'd much prefer to lose a bit of data than have to start fresh with everything. I'm cool with closing this if it is too broad to be a bug.
In most cases I have seen, only one block is causing the issue, so if you delete that block's entire directory you lose that date range, but Prometheus should still read the remaining blocks. @gouthamve suggested that we should probably add this as a cleanup command to the tsdb tool, so if I find time I will try to implement it. Closing for now, but feel free to reopen if you can provide more info on how to reproduce.
krasi-georgiev
closed this
May 1, 2018
sr
commented
May 4, 2018
@krasi-georgiev In the future, how can we collect more useful debugging data to help diagnose corruption issues? Unfortunately our tsdb contains private information and we are not able to share it, unless there's some way we can anonymize labels.
That is a good idea. I will think about what useful information we can expose and how. By the way, I have already opened the PR for scanning and deleting corrupted data and am just waiting for some input from the other maintainers before I start.
One thing that surprised me is how the checksums failed to catch this. I think something more than a simple corruption is going on here. And if it is indeed corruption, I would like the querier to catch it and return an error rather than panic. Re-opening to track; this will be closed when fixed.
gouthamve
reopened this
May 7, 2018
bonan
commented
Jun 7, 2018
I'm seeing the same issue in both 2.2.1 and 2.3.0:
After updating to 2.3.0 (and running with debug log):
@bonan thanks for the report.
simonpasquier
referenced this issue
Jun 8, 2018
Merged
chunks: fix potential "index out of range" error #344
@krasi-georgiev see prometheus/tsdb#344. As I said in the PR, I'm not sure how the issue can be triggered unless the chunk is corrupted.
bonan
commented
Jun 8, 2018
A reboot and a forced fsck seem to have solved the problem for me, so I can't reproduce it any more.
brian-brazil
added
kind/bug
component/local storage
labels
Jun 13, 2018
@simonpasquier is this fixed with prometheus/tsdb#344?
@gouthamve can you point out where you think this should be caught?
I don't think there's an issue. This is indeed possibly some corruption. But if the corruption causes the file to be too short, we don't even get to a stage where we could validate a checksum. This was simply a bug where we didn't handle an error condition properly.
Yes, this was my understanding as well; I just didn't know if Goutham had anything else in mind. I will update tsdb and close this when it is merged.
Let's consider this fixed; it will be included in the next release.
krasi-georgiev
closed this
Jun 28, 2018
lock
bot
commented
Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
karlhungus commented May 1, 2018
Bug Report
What did you do?
Ran Prometheus
What did you expect to see?
It runs
What did you see instead? Under which circumstances?
Exception on startup: panic: runtime error: index out of range (looks similar to: prometheus/tsdb#251)
Environment