Storage: Possible bug handling chunks with very few (one?) samples. #2374

Closed
beorn7 opened this issue on Jan 27, 2017 · 6 comments · 3 participants
beorn7 (Member) commented on Jan 27, 2017

Across the SoundCloud fleet of Prometheus servers, I have seen rare series quarantining because of the dreaded "dropped more chunks from persistence than from memory" error. So far, we attributed this error to data corruption (the most likely scenario being a truncated file, which can plausibly happen in many different error cases). Now I have seen it happen with the same time series on two mirrored servers, which hints that properties of that time series are a factor.

The time series in question were very sparse, with roughly one data point every couple of hours. This is a rare occurrence in typical Prometheus setups, so a bug caused by it would only be tickled rarely. Since a head chunk is closed if it doesn't get any samples appended for more than 1h, the time series in question would have only chunks with one sample each. This might throw series maintenance off the rails.
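For illustration, a minimal, self-contained Go sketch of that effect (purely hypothetical code, not the actual Prometheus storage implementation; only the 1h constant corresponds to the head-chunk timeout described above, everything else is made up): with one sample every ~3h, every chunk ends up holding exactly one sample.

```go
// Hypothetical sketch: a head chunk is "closed" once more than 1h passes
// without an append, so a series with one sample every few hours degenerates
// into one-sample chunks.
package main

import (
	"fmt"
	"time"
)

const headChunkTimeout = time.Hour // head chunk is closed after 1h without appends

type chunk struct{ samples []time.Time }

func main() {
	var (
		chunks []*chunk
		open   *chunk
	)
	start := time.Now()
	for i := 0; i < 5; i++ {
		t := start.Add(time.Duration(i) * 3 * time.Hour) // one sample every 3h
		// The previous head chunk has been idle for >1h by the time this
		// sample arrives, so a new chunk is started every single time.
		if open == nil || t.Sub(open.samples[len(open.samples)-1]) > headChunkTimeout {
			open = &chunk{}
			chunks = append(chunks, open)
		}
		open.samples = append(open.samples, t)
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d sample(s)\n", i, len(c.samples)) // prints "1 sample(s)" five times
	}
}
```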

brian-brazil (Member) commented on Jan 27, 2017

Since a head chunk is closed if it doesn't get any samples appended for more than 1h

One thought I had while looking through the code is that this is done by the series maintenance process, so it could be 6h or more before it gets around to closing the chunk. Could that be part of the problem?
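If I understand the concern correctly, it could be sketched like this (illustrative Go only, with invented names and a hypothetical ~6h cycle time; not the real maintenance code): the >1h condition is only evaluated when the maintenance loop visits the series, so the head chunk can effectively stay open for 1h plus up to a full maintenance cycle.

```go
// Hypothetical sketch of the timing concern: the timeout check runs only
// during a series maintenance visit, not at the moment the 1h deadline passes.
package main

import (
	"fmt"
	"time"
)

const headChunkTimeout = time.Hour

// closeHeadChunk reports whether a maintenance visit at time `now` would
// close a head chunk that last received a sample at `lastAppend`.
func closeHeadChunk(lastAppend, now time.Time) bool {
	return now.Sub(lastAppend) > headChunkTimeout
}

func main() {
	lastAppend := time.Now()
	// Hypothetical: the maintenance loop only gets back to this series ~6h later.
	visit := lastAppend.Add(6 * time.Hour)
	fmt.Println(closeHeadChunk(lastAppend, visit)) // true, but only ~6h after the last sample
}
```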

beorn7 (Member, Author) commented on Jan 27, 2017

Possibly. Especially since we haven't designed chunks to handle time deltas of more than 1h well.

This will be interesting to investigate... :)

brian-brazil (Member) commented on Jan 27, 2017

If it is that, an extra check in the chunk append code on one of the two servers should verify it.
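A sketch of what such a one-off verification check might look like (hypothetical code, not an actual patch against the Prometheus chunk code; all names are invented): flag any append whose delta to the previous sample in the same chunk exceeds the 1h head-chunk timeout.

```go
// Hypothetical verification hook for the chunk append path: log whenever a
// sample lands in a chunk more than 1h after that chunk's previous sample.
package main

import (
	"log"
	"time"
)

const headChunkTimeout = time.Hour

type sample struct {
	t time.Time
	v float64
}

type headChunk struct{ samples []sample }

func (c *headChunk) append(s sample) {
	if n := len(c.samples); n > 0 {
		if delta := s.t.Sub(c.samples[n-1].t); delta > headChunkTimeout {
			// A real check could also increment a metric or panic; logging
			// keeps the sketch side-effect free.
			log.Printf("suspicious append: %v since previous sample in the same chunk", delta)
		}
	}
	c.samples = append(c.samples, s)
}

func main() {
	c := &headChunk{}
	now := time.Now()
	c.append(sample{t: now, v: 1})
	c.append(sample{t: now.Add(3 * time.Hour), v: 2}) // triggers the log line
}
```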

beorn7 (Member, Author) commented on Feb 6, 2017

Research so far:

Not much luck. I stared at the code for quite some time and found some suspects. To verify, I added a bunch of panics for invariants and added error handling where it was missing so far (and could therefore have masked data corruption etc.). As it turns out, all those suspects were red herrings (but it's good that we now have better checking in the code; it will be included in my PR).

Behavioral observation: The errors spike after the second start of v1.5 (i.e. nothing happened the first time v1.5 was used, but once the server starts using a v1.5-generated snapshot, the errors happen for a while). The spike goes down, presumably because all affected series have been quarantined. Still, only a relatively small number of series is affected. Those series all seem to be short-lived and to have ended a while ago.

Next promising candidate: The new way of not checkpointing chunkdescs for persisted chunks is interacting badly with specific series (all evicted? all persisted? only one chunk?).

beorn7 added the priority/P1 label and removed the priority/P2 label on Feb 6, 2017

beorn7 (Member, Author) commented on Feb 6, 2017

I think I got it:

Series that have only persisted chunks but are not yet archived will be completely ignored in the checkpoint.

After a restart, those series are in limbo: they are not archived, but they are still in the index, and they obviously have a series file. If getOrCreateSeries is called for them, they will be treated as a completely new metric. That's bad enough, as all the persisted data is ignored. Things blow up once new chunks in the series get persisted and hit the already existing series file.

This is very bad, raising to P0 and working on a fix right now.
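To make the interaction concrete, here is a heavily simplified Go sketch of the failure mode described above (this is not the actual Prometheus 1.x storage or checkpoint code; all types, fields, and the inCheckpoint helper are invented for illustration):

```go
// Invented-for-illustration sketch of the three-step failure mode described
// above; not the actual Prometheus 1.x storage code.
package main

import "fmt"

type series struct {
	fingerprint   string
	memoryChunks  int  // chunk descs still backed by in-memory (non-persisted) chunks
	archived      bool // present in the archived-metrics index
	hasSeriesFile bool // persisted chunks already live in a series file on disk
}

// inCheckpoint mirrors the problematic condition: a series whose chunks are
// all persisted has no chunk descs left to write, so it contributes nothing
// to the checkpoint.
func inCheckpoint(s series) bool {
	return s.memoryChunks > 0
}

func main() {
	limbo := series{fingerprint: "abc123", memoryChunks: 0, archived: false, hasSeriesFile: true}

	// Step 1: the series is neither checkpointed nor archived.
	fmt.Println("in checkpoint:", inCheckpoint(limbo), "archived:", limbo.archived)

	// Step 2: after a restart nothing references the series anymore, so a
	// getOrCreateSeries-style lookup recreates it from scratch, ignoring all
	// the data already persisted for it.
	recreated := series{fingerprint: limbo.fingerprint}
	fmt.Println("recreated as new series:", recreated)

	// Step 3: once new chunks of the recreated series get persisted, they hit
	// the series file that already exists on disk -- the collision that blows up.
	fmt.Println("collision with existing series file:", limbo.hasSeriesFile)
}
```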

lock bot commented on Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 24, 2019
