Prometheus fails to recover after getting KILLed #435
discordianfish added the bug label on Dec 26, 2014
Unfortunately I can't reproduce the full situation. The first time I started Prometheus, I got tons of "Truncating file.. Recovered ..." messages. This took around an hour with prometheus_local_storage_memory_series = 83490. Now if I start it again, it no longer truncates anything but just says:

It might be a bit quicker, but it is still really slow. strace shows that Prometheus opens the DB files, lseeks, reads and writes a bit, which is generally what I would expect it to do. But it's slow. This is strace limited to open calls, with timestamps:

So I'm not sure what's wrong here. Either it's unexpected that Prometheus touches all those files, or it's just too slow doing that. The system isn't doing much; it's running an Elasticsearch node, but that's barely under load.
juliusv referenced this issue on Jan 6, 2015: "Prometheus panicing with panic: runtime error: index out of range" #437 (closed)
@brian-brazil encountered the same issue in #437 - see there for additional info.
If the file length of a series file is consistent with the information in the checkpoint, the file should only be stat'd during crash recovery, not open'd. When you see "Recovered metric [...]: head chunk found among the [...] recovered chunks in series file.", then the file size was inconsistent and the file had to be inspected. That happens if a chunk was written to the file after the last checkpoint was taken. Series with a high ingestion rate will be in that state more often. What was your ingestion rate? What was your checkpointing interval?
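For illustration, here is a minimal Go sketch of the kind of length check described above, assuming a fixed on-disk chunk size; the constant, file path, and function name are made up for this example and are not the actual Prometheus storage code.

```go
package main

import (
	"fmt"
	"os"
)

// Assumed fixed on-disk chunk length (including header) for this sketch only.
const chunkLenOnDisk = 1024

// needsFullRecovery stats the series file and compares its length with the
// number of persisted chunks recorded in the last checkpoint. Only if the
// two disagree does the file have to be opened and scanned chunk by chunk.
func needsFullRecovery(seriesFile string, chunksInCheckpoint int) (bool, error) {
	fi, err := os.Stat(seriesFile) // cheap: no open, no read
	if err != nil {
		return false, err
	}
	expected := int64(chunksInCheckpoint) * chunkLenOnDisk
	return fi.Size() != expected, nil
}

func main() {
	// Hypothetical series file path and chunk count.
	dirty, err := needsFullRecovery("data/series/aa/bbccdd.db", 42)
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println("needs chunk-by-chunk recovery:", dirty)
}
```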
We're using the defaults, and have about 23k series scraped once a minute.
My question was related to the slowness, which (in my current theory) can happen if you have relatively few series but write chunks quite frequently (more often than checkpoints are written). @brian-brazil You had no trouble with slowness of crash recovery, but @discordianfish did, so I was particularly interested in his numbers of samples per series.
@beorn7 My scrape interval is 60s, except for Prometheus scraping itself (still 15s).
@discordianfish Thanks. So there goes my theory. I have tried several scenarios now, and the crash recovery behaved as expected. In your case, it had to seek into many series files, and that can only happen if most series have a head chunk that is different from the one in the most recent checkpoint. I plan to implement a mechanism that will trigger a snapshot early if that happens. Perhaps something delayed the checkpointing in your case. All of that is still only about the slowness, not about the panic. Still investigating...
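A rough sketch of what such an early-trigger mechanism could look like, assuming a hypothetical checkpoint() function, a dirty-series count fed over a channel, and made-up interval/threshold values; this is not the actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

const (
	checkpointInterval   = 5 * time.Minute // assumed regular checkpoint interval
	dirtySeriesThreshold = 5000            // assumed limit before forcing an early checkpoint
)

// checkpoint stands in for writing a real head-chunk checkpoint.
func checkpoint() { fmt.Println("checkpointing head chunks...") }

// checkpointLoop writes a checkpoint on a fixed interval, but also as soon as
// the number of series whose head chunk is not yet checkpointed crosses a
// threshold, so a crash never leaves too many series to scan on recovery.
func checkpointLoop(dirtySeries <-chan int, stop <-chan struct{}) {
	ticker := time.NewTicker(checkpointInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			checkpoint()
		case n := <-dirtySeries:
			if n >= dirtySeriesThreshold {
				checkpoint() // don't wait for the ticker
			}
		case <-stop:
			return
		}
	}
}

func main() {
	dirty := make(chan int)
	stop := make(chan struct{})
	go checkpointLoop(dirty, stop)
	dirty <- 6000 // simulate many series going dirty after a burst of new chunks
	time.Sleep(100 * time.Millisecond)
	close(stop)
}
```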
Modified version of my theory: if many metrics have the same collection pattern (i.e. they are scraped with the same interval, and the scraped data is similarly structured and therefore compresses similarly), then all those metrics will start a new chunk at approximately the same time. Just after that has happened, all the corresponding time series are "dirty", i.e. the head chunk in the last checkpoint is not the current head chunk. If a crash happens at that time, all those series have to be checked during recovery, requiring a disk seek each. Three fixes related to this issue went into the code now:

I assume that addresses all the issues, but in general the whole crash recovery desperately needs proper automated testing, for which I'll open a separate issue.
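As a toy illustration of the "dirty" state described above (not the real Prometheus series code; the type and method names are invented), a series could be marked dirty when it starts a new head chunk and cleaned again once a checkpoint has captured it:

```go
package main

import "fmt"

// series tracks whether its current head chunk has been captured by the
// most recent checkpoint.
type series struct {
	name  string
	dirty bool // true: head chunk differs from the checkpointed one
}

// completeChunk is called when a full chunk is persisted and a new head
// chunk is started; from now on a crash would require scanning this
// series file during recovery.
func (s *series) completeChunk() { s.dirty = true }

// checkpointed is called after a checkpoint has recorded the current head
// chunk; the series file length is consistent again.
func (s *series) checkpointed() { s.dirty = false }

func main() {
	all := []*series{{name: "up"}, {name: "http_requests_total"}}
	// Series scraped on the same interval tend to complete chunks at
	// roughly the same time, so they all become dirty together.
	for _, s := range all {
		s.completeChunk()
	}
	dirty := 0
	for _, s := range all {
		if s.dirty {
			dirty++
		}
	}
	fmt.Printf("%d of %d series would need a disk seek on crash recovery\n", dirty, len(all))
}
```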
beorn7 closed this on Jan 12, 2015
lock bot commented on Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
discordianfish commented on Dec 26, 2014

Hi,

I KILLed Prometheus accidentally. After restarting, it apparently recovered a bunch of samples but then panicked due to an index being out of range: https://gist.github.com/discordianfish/18b5c2eaea2e2d66858f