storage.local.retention killing prometheus? #1550
Comments
Is there an error or stack trace produced when it dies?
It'd be best to keep all the data for a good while. You never know when you need to retroactively debug.
@brian-brazil Not sure what this means... is it looking for stored data to run the queries on?
@beorn7 See the panic "panic: dropped unpersisted chunks from memory". Possible bug?
That's a sanity check we have; in this case it's actually correct, as the retention is too small. We can probably refine the check.
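For readers unfamiliar with the check being discussed, the sketch below shows the general shape of such a guard. It is a hypothetical illustration, not the actual Prometheus source; the simplified `chunk` type with a `persisted` flag and the method name `dropChunksBefore` are assumptions made for the example.

```go
package main

// Hypothetical sketch of the kind of sanity check discussed above; this is
// not the actual Prometheus code. A series holds chunks in memory and records
// whether each one has already been written to disk.
type chunk struct {
	persisted bool // assumed field for illustration only
}

type memorySeries struct {
	chunks []*chunk
}

// dropChunksBefore removes the first n chunks of the series. The guard panics
// if any of them was never persisted, which is what happens when the retention
// window is shorter than the lifetime of an in-memory chunk.
func (s *memorySeries) dropChunksBefore(n int) {
	for _, c := range s.chunks[:n] {
		if !c.persisted {
			panic("dropped unpersisted chunks from memory")
		}
	}
	s.chunks = s.chunks[n:]
}

func main() {
	s := &memorySeries{chunks: []*chunk{{persisted: true}, {persisted: false}}}
	s.dropChunksBefore(1) // fine: the dropped chunk was already persisted
}
```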
brian-brazil added the bug label Apr 12, 2016
Seems rather like a bug to me. If a user configures a very short retention period that overlaps with chunks being in memory, it should of course not crash; it should keep the chunks only as long as absolutely needed (e.g. while they are still involved in an ongoing query) and then drop them from everywhere. This kind of panic usually only guards against programming mistakes.
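To illustrate the behaviour argued for here, a minimal hypothetical sketch follows (again not Prometheus code): a chunk still held by an ongoing query is never dropped, no matter how short the retention is. The `pinned` field and the `evictable` function are assumptions made for the example.

```go
package main

// Hypothetical sketch, not the actual Prometheus implementation: eviction
// considers both the retention window and whether the chunk is still in use.
type chunk struct {
	pinned int // number of ongoing queries still holding this chunk (assumed)
}

// evictable reports whether a chunk may be dropped. A short retention alone
// is never a reason to crash; chunks still in use are simply kept for now.
func evictable(c *chunk, olderThanRetention bool) bool {
	return olderThanRetention && c.pinned == 0
}

func main() {
	c := &chunk{pinned: 1}
	_ = evictable(c, true) // false: still involved in a query, keep it
}
```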
Thanks for the trace. Yes, that looks like a bug. Will look into it ASAP.
beorn7 self-assigned this Apr 12, 2016
Thanks, guys. Based on your discussion it seems like this is caused when the retention is too small. So far I have been trying a 1m-15m window, just playing around right now to see how small a retention can be.
@mokshpooja It's possible this is already fixed in HEAD (or in 0.18rc1). I vaguely remember I have touched that part.
@beorn7 What about the last stable release, 0.17.0? I have been using the stable 0.15.1.
Oh, please do try 0.18.0rc1 or HEAD. 0.15.1 is ancient and will have all kinds of bugs that are already fixed in master.
Yes, definitely try to reproduce with 0.18rc1. Which reminds me that we might want to write a howto for filing bugs. It should include running from HEAD in master if possible, or running the latest published binaries (including RCs), as we tend to close issues once the fix is in master.
@beorn7 I am trying 0.18rc1 from the binary tarball. Has anything changed in the prometheus.yml format? My old prometheus.yml is not working any more.
My prometheus.yml:
Hi guys, now I have 0.18rc1 working, however the problem remains the same. I have tried retentions from 1-15 minutes, and after the retention interval Prometheus still crashes. If you would like to see the massive error log, let me know.
Yes, please give us the crash log generated by 0.18rc1. If it's too long, post it as a gist.
prometheus/storage/local/series.go Line 323 in e83f05f
Thanks. Will look into it ASAP.
beorn7 added the Critical label Apr 14, 2016
I'll mark it as critical because, with the new varbit chunks, a chunk may last so long that this could become an issue with longer retention times, too.
Thanks for the explanation, I think I understand what's going on here.
beorn7 referenced this issue Apr 14, 2016: Checkpoint fingerprint mappings only upon shutdown #1555 (merged)
OK, got it. The problem occurs if the retention is shorter than the head-chunk timeout (currently hardcoded to 1h) or, in the more extreme case, shorter than the scrape interval. In both cases the storage will try to drop a still-open head chunk: either the chunk hasn't received a sample in a while but not long enough to be closed by the timeout, or the last scrape is regularly longer ago than the retention time. An open head chunk cannot be persisted by definition, and that triggers our sanity check. The good news is that this will strictly only happen with a retention time below 1h, so it will not get worse with the longer-lived varbit chunks. It should also be fairly simple to fix. Stay tuned.
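As a rough illustration of the condition described above (and only that; this is a hypothetical sketch, not the actual fix in #1559), the decision could look like the following, with `open`, `lastTime`, and `droppable` as assumed names for the example.

```go
package main

import "time"

// Hypothetical sketch of the droppability rule explained above, not the real
// Prometheus code: an open head chunk cannot be persisted, so the retention
// purge must never be allowed to drop it, however short the retention is.
type chunkDesc struct {
	open     bool      // head chunk still accepting samples (assumed field)
	lastTime time.Time // timestamp of the newest sample in the chunk
}

// droppable reports whether a chunk may be removed under the given retention.
// Open head chunks are always kept; they have to be closed and persisted
// first, which avoids the "dropped unpersisted chunks from memory" panic even
// when the retention is shorter than the 1h head-chunk timeout or the scrape
// interval.
func droppable(c chunkDesc, retention time.Duration, now time.Time) bool {
	if c.open {
		return false
	}
	return now.Sub(c.lastTime) > retention
}

func main() {
	now := time.Now()
	head := chunkDesc{open: true, lastTime: now.Add(-30 * time.Minute)}
	_ = droppable(head, 15*time.Minute, now) // false: keep the open head chunk
}
```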
beorn7 closed this in #1559 Apr 15, 2016
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
mokshpooja commented Apr 12, 2016 (original issue description)
Hi guys,
I have two setups: one on my local development machine (Vagrant + VM + Docker) and the other on AWS + Docker. In both setups I have Prometheus, cAdvisor, and Alertmanager running as separate containers.
I want to clear the disk storage every 15 minutes, so I added the following to my Prometheus Dockerfile:
CMD "-storage.local.retention=15m"
It seems like after 15 minutes my Prometheus container dies.
My aim is to clear the collected metrics and storage every 15 minutes and only persist the collected metrics from a timestamp if/when an error occurs.
Any idea what I might be doing wrong with storage.local.retention?