Storage: Possible bug handling chunks with very few (one?) samples. #2374
beorn7 added the component/local storage, kind/bug, and priority/P2 labels Jan 27, 2017
beorn7 self-assigned this Jan 27, 2017
One thought I had while looking through the code: this is done by the series maintenance process, so it could be 6h or more before it gets to closing the chunk. Could that be part of the problem?
Possibly. Especially since we haven't designed chunks to handle more than 1h time deltas well. This will be interesting to investigate... :)
If it is that, an extra check in the chunk append code on one of the two servers should verify it.
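Such a check could look roughly like the following sketch (hypothetical names and types, not the actual chunk append code): flag appends that land in a head chunk whose first sample is already older than the 1h delta the chunk encodings assume.

```go
package main

import (
	"log"
	"time"
)

// maxAssumedDelta mirrors the ~1h window discussed above (an assumption for
// this sketch, not a constant taken from the real code).
const maxAssumedDelta = time.Hour

// headChunk is a simplified stand-in for the open head chunk of a series.
type headChunk struct {
	firstTime  time.Time // timestamp of the first sample in the chunk
	numSamples int
}

// checkAppend logs whenever a new sample lands in a head chunk whose first
// sample is older than the assumed 1h delta -- the situation suspected here
// (sparse series whose head chunk is only closed by series maintenance).
func checkAppend(series string, hc *headChunk, t time.Time) {
	if hc == nil || hc.numSamples == 0 {
		return
	}
	if delta := t.Sub(hc.firstTime); delta > maxAssumedDelta {
		log.Printf("series %q: appending sample at %v into head chunk started at %v (delta %v)",
			series, t, hc.firstTime, delta)
	}
}

func main() {
	hc := &headChunk{firstTime: time.Now().Add(-5 * time.Hour), numSamples: 1}
	checkAppend(`up{job="sparse"}`, hc, time.Now())
}
```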
Research so far: not much luck. I stared at the code for quite some time and found suspects. To verify, I added a bunch of panics for invariants, and added error handling where it was missing so far (and could therefore have masked data corruption etc.). As it looks, all those suspects were red herrings (but it is good that we now have better checking in the code; it will be included in my PR).

Behavioral observation: the errors spike after the second start of v1.5 (i.e. the first time v1.5 was used, nothing happened, but once the server starts using a v1.5-generated snapshot, the errors happen for a while). The spike goes down, presumably because all affected series have been quarantined. Still, only a relatively small number of series is affected. Those series all seem to be short-lived and to have ended a while ago.

Next promising candidate: the new way of not checkpointing chunkdescs for persisted chunks is interacting badly with specific series (all evicted? all persisted? only one chunk?).
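For context, the suspected mechanism concerns which chunk descriptors a checkpoint writes per series. Roughly sketched (hypothetical types and fields, not the actual checkpoint code): only the not-yet-persisted descriptors are written, so a fully persisted series contributes nothing.

```go
package sketch

// chunkDesc and memorySeries are simplified stand-ins; the field names are
// assumptions for this sketch, not the real Prometheus 1.x types.
type chunkDesc struct {
	firstTime, lastTime int64
}

type memorySeries struct {
	chunkDescs       []chunkDesc
	persistWatermark int // chunkDescs[:persistWatermark] are already on disk
}

// chunkDescsToCheckpoint returns the descriptors that would be written for a
// series. With the "don't checkpoint persisted chunkDescs" optimisation this
// is empty for a series whose chunks are all persisted -- exactly the kind of
// series (evicted, fully persisted, possibly only one chunk) under suspicion.
func chunkDescsToCheckpoint(s *memorySeries) []chunkDesc {
	return s.chunkDescs[s.persistWatermark:]
}
```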
beorn7 added the priority/P1 label and removed the priority/P2 label Feb 6, 2017
I think I got it: series that have only persisted chunks but are not yet archived are completely ignored in the checkpoint. After a restart, those series are in limbo: they are not archived, but they are in the index, and they obviously have a series file. This is very bad; raising to P0 and working on a fix right now.
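A minimal sketch of that failure mode and of the fix (hypothetical names, not the real checkpointing code): if the checkpoint skips every series that has nothing left in memory to write, a non-archived series whose chunks are all persisted never gets an entry, and after a restart it can be recovered neither from the checkpoint nor from the archive.

```go
package sketch

// series is a simplified stand-in; field names are assumptions for this sketch.
type series struct {
	archived         bool
	numChunkDescs    int
	persistWatermark int // all chunk descriptors below this index are persisted
}

// shouldCheckpoint decides whether a series gets an entry in the checkpoint.
// The `fixed` flag contrasts the buggy and the corrected behaviour.
func shouldCheckpoint(s *series, fixed bool) bool {
	if s.archived {
		return false // recoverable via the archived-series index
	}
	allPersisted := s.persistWatermark >= s.numChunkDescs
	if !fixed {
		// Buggy behaviour: nothing left in memory to write, so the series is
		// skipped entirely and ends up in limbo after a restart.
		return !allPersisted
	}
	// Fixed behaviour: every non-archived series is checkpointed, even if all
	// of its chunks are already persisted.
	return true
}
```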
beorn7 added the priority/P0 label and removed the priority/P1 label Feb 6, 2017
beorn7 referenced this issue Feb 6, 2017: storage: Fix checkpointing of fully persisted memory series. #2400 (merged)
discordianfish added the in progress label Feb 6, 2017
beorn7 closed this Feb 7, 2017
discordianfish removed the in progress label Feb 7, 2017
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
beorn7 commented Jan 27, 2017
Across the SoundCloud fleet of Prometheus servers, I have seen rare series quarantining because of the dreaded "dropped more chunks from persistence than from memory" error. So far, we attributed this error to data corruption (as the most likely scenario is a truncated file, which can plausibly happen in many different error cases).

Now I have seen it happen with the same time series on two mirrored servers, which hints towards properties of that time series being a factor. The time series in question were very sparse, like one data point every couple of hours. This is a rare occurrence in typical Prometheus setups, so a bug caused by it would only be tickled rarely.

Since a head chunk is closed if it doesn't get any samples appended for more than 1h, the time series in question would have only chunks with one sample each. This might throw series maintenance off the rails.
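The quoted quarantine error comes from a consistency check during chunk truncation in series maintenance. Very roughly (hypothetical names, not the actual maintenance code), it compares how many chunks were dropped from the series file on disk with how many chunk descriptors the in-memory series dropped for the same cutoff:

```go
package sketch

import (
	"errors"
	"fmt"
)

// errChunkMismatch paraphrases the error quoted above; the surrounding code is
// a sketch, not the actual Prometheus 1.x series-maintenance implementation.
var errChunkMismatch = errors.New("dropped more chunks from persistence than from memory")

// reconcileDrop returns an error when the on-disk truncation removed more
// chunks than the in-memory bookkeeping accounts for. In the real storage
// layer a mismatch like this leads to the series being quarantined, which is
// why corruption was the first suspect before this issue was understood.
func reconcileDrop(droppedFromDisk, droppedFromMemory int) error {
	if droppedFromDisk > droppedFromMemory {
		return fmt.Errorf("%w: disk=%d, memory=%d", errChunkMismatch, droppedFromDisk, droppedFromMemory)
	}
	return nil
}
```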