Crash due to violated storage invariant #367
Comments
When using the dumper tool, the storage around that fingerprint looks like this (note that there is only a single chunk for this fingerprint; it's possible that this is what triggers the error):
Ok, that's actually a corrupt chunk. The FirstTime is larger than the LastTime, and you can see that this is due to out-of-order samples in the chunk. So now the question is: what corrupted this chunk? Probably a bug in the compactor. I remember we once had this kind of problem in the compactor already, when two arguments to an append() were passed the wrong way around.
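To make the violated invariant concrete, here is a minimal sketch in Go. The types and names are hypothetical, not the actual code from prometheus/storage/metric; it only spells out the two symptoms described above:

```go
package chunkcheck

import (
	"fmt"
	"time"
)

// Sample and Chunk are hypothetical stand-ins for the real storage types.
type Sample struct {
	Timestamp time.Time
	Value     float64
}

type Chunk struct {
	FirstTime time.Time
	LastTime  time.Time
	Samples   []Sample
}

// CheckInvariants reports the corruption seen in the dump above: a FirstTime
// that lies after LastTime, or samples that are not in non-decreasing time order.
func CheckInvariants(c Chunk) error {
	if c.FirstTime.After(c.LastTime) {
		return fmt.Errorf("FirstTime %v is after LastTime %v", c.FirstTime, c.LastTime)
	}
	for i := 1; i < len(c.Samples); i++ {
		if c.Samples[i].Timestamp.Before(c.Samples[i-1].Timestamp) {
			return fmt.Errorf("sample %d is out of order", i)
		}
	}
	return nil
}
```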
I traced this down to two pernicious bugs, one in the compactor, one in the watermarker. I have fixes for both, but am holding on to them as I reflect on how to best add regression tests for them (so far I've tested them with a bunch of Go-written tools and scripts).
The relevant code is at prometheus/storage/metric/processor.go, line 118 in 4a87c00. The check at prometheus/storage/metric/processor.go, line 139 in 4a87c00, means that when the for loop exits, there will be unacted-on samples that belong to the wrong (next) fingerprint, and they get stuck into a chunk that is stored under the previous fingerprint. Thus, chunks get corrupted in the way we saw above. Ouch!
/cc @matttproud
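For illustration, a small hedged sketch of the class of bug described above (hypothetical, simplified grouping code, not the actual processor.go): samples arrive sorted by fingerprint, and flushing the buffered samples at the wrong point on a fingerprint boundary stores samples of the next fingerprint under the previous one:

```go
package compactsketch

// Fingerprint and Sample are hypothetical stand-ins for the real storage types.
type Fingerprint uint64

type Sample struct {
	Fingerprint Fingerprint
	Timestamp   int64
	Value       float64
}

// groupByFingerprint builds one group of samples per fingerprint from input
// sorted by fingerprint. The flush on the fingerprint boundary is the delicate
// spot: flushing after appending the new sample, or only after the loop under
// a stale `current`, stores samples of the *next* fingerprint in the chunk of
// the *previous* one, which is exactly the corruption seen in this issue.
func groupByFingerprint(samples []Sample) map[Fingerprint][]Sample {
	chunks := make(map[Fingerprint][]Sample)
	var (
		current Fingerprint
		pending []Sample
	)
	for i, s := range samples {
		if i == 0 {
			current = s.Fingerprint
		}
		if s.Fingerprint != current {
			// Correct order: flush the buffered samples under the fingerprint
			// they belong to before starting the new group.
			chunks[current] = pending
			pending = nil
			current = s.Fingerprint
		}
		pending = append(pending, s)
	}
	if len(pending) > 0 {
		chunks[current] = pending
	}
	return chunks
}
```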
We should evaluate whether it makes sense to entirely get rid of
@matttproud Fixes, along with regression tests for the two bugs reported in this issue, are uploaded for review here: http://review.prometheus.io:8080/#/c/51/
There's also a test case included in http://review.prometheus.io:8080/#/c/54/ which is currently commented out and which triggers #368 (not yet fixed), in case you want to reproduce it.
juliusv referenced this issue on Oct 26, 2013: Store times as Unix timestamps instead of time.Time #370 (closed)
juliusv closed this on Oct 26, 2013
@juliusv OK. Good, astute observations in the course of this! Please review the following for post-TBR code review:
@matttproud Thanks a lot! I'll get these comments fixed up in time. What is still most problematic right now is the compaction crash that I haven't figured out yet: #368. A number of production Prometheus servers are now crashing every few hours because of it. The compaction tests in http://review.prometheus.io/#/c/54/4/storage/metric/compaction_regression_test.go include a commented-out test case that reproduces this crash every time.
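For reference, a minimal regression-test sketch in the same hypothetical vein (it exercises the groupByFingerprint sketch above, not the actual compaction code from the linked review):

```go
package compactsketch

import "testing"

// TestSamplesStayWithTheirFingerprint guards against the corruption described
// in this issue: after grouping, every sample must be stored under its own
// fingerprint and appear in non-decreasing time order.
func TestSamplesStayWithTheirFingerprint(t *testing.T) {
	in := []Sample{
		{Fingerprint: 1, Timestamp: 10},
		{Fingerprint: 1, Timestamp: 20},
		{Fingerprint: 2, Timestamp: 5}, // first sample of the next fingerprint
	}
	for fp, samples := range groupByFingerprint(in) {
		for i, s := range samples {
			if s.Fingerprint != fp {
				t.Errorf("sample %+v stored under wrong fingerprint %d", s, fp)
			}
			if i > 0 && s.Timestamp < samples[i-1].Timestamp {
				t.Errorf("out-of-order samples under fingerprint %d", fp)
			}
		}
	}
}
```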
juliusv added a commit that referenced this issue on Dec 2, 2013
juliusv added a commit that referenced this issue on Dec 2, 2013
juliusv added a commit to prometheus/client_golang that referenced this issue on Dec 2, 2013
juliusv added a commit that referenced this issue on Dec 3, 2013
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue on Oct 12, 2017
lock bot commented on Mar 25, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
juliusv commented on Oct 17, 2013
A Prometheus instance was crashing every time a curator was run with:
I have a copy of the storage that triggers this crash if someone wants to look at it.