Prometheus hangs during checkpointing #2846
Comments
Thanks for the very comprehensive bug report, a real role model. Looking at the goroutine dump, this looks like a problem within LevelDB. There are a number of LevelDB goroutines that are stuck (some of them are normal, but others should never be stuck for that long). So a LevelDB access blocks forever, which blocks one of the goroutines currently in Prometheus code while it holds a series lock, which in turn blocks checkpointing (which has to lock all series sequentially) and access (read and write) to any series that shares a lock with that series. So everything handling series comes to a grinding halt. We had issues with LevelDB in the past, but none looked like this. Not sure what makes you run into it regularly (assuming that the other hangs all look the same). Perhaps it is some property of your storage device that LevelDB doesn't deal well with. Good news: Prometheus 2 will not use LevelDB anymore, but our own indexing engine.
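To make that chain of events concrete, here is a minimal, self-contained Go sketch of the failure mode. The names and structure are made up for illustration (striped series locks, a simulated index lookup that never returns, a checkpoint loop that locks each series in turn); it is not the actual Prometheus 1.x storage code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Series share locks via striping, similar in spirit to a fingerprint
// locker. All names here are hypothetical.
const numLockStripes = 8

type store struct {
	stripes [numLockStripes]sync.Mutex
}

func (s *store) lockSeries(fp uint64)   { s.stripes[fp%numLockStripes].Lock() }
func (s *store) unlockSeries(fp uint64) { s.stripes[fp%numLockStripes].Unlock() }

// indexLookup stands in for a LevelDB access that never returns.
func indexLookup() { select {} }

func main() {
	s := &store{}

	// An ingestion goroutine locks a series and then blocks forever inside
	// the (simulated) index access, so the series lock is never released.
	go func() {
		s.lockSeries(42)
		defer s.unlockSeries(42)
		indexLookup()
	}()

	time.Sleep(100 * time.Millisecond)

	// Checkpointing locks every series in turn; it stalls as soon as it
	// reaches a series that shares a lock stripe with the stuck one.
	done := make(chan struct{})
	go func() {
		for fp := uint64(0); fp < 1000; fp++ {
			s.lockSeries(fp)
			// ... persist this series to the checkpoint ...
			s.unlockSeries(fp)
		}
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("checkpoint finished")
	case <-time.After(time.Second):
		fmt.Println("checkpoint stuck behind a series lock held by a blocked goroutine")
	}
}
```

Running this, the checkpoint goroutine stalls at the first series whose lock stripe is held by the stuck ingestion goroutine, which is the pattern visible in the goroutine dump.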
brian-brazil added the component/local storage label Jul 14, 2017
Closing as this particular issue won't occur on 2.0.

brian-brazil closed this Jul 14, 2017
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
TimSimmons commented Jun 15, 2017
What did you do?
Ran Prometheus servers on cloud VMs (32 GB/12 CPU, some 64 GB/20 CPU) with ingest rates ranging
from 10k to 50k samples/s (with two at 100k) and 3 days of retention.
The number of time series per server ranges from 500k to 6 million, mostly in the 1-3 million range.
What did you expect to see?
Prometheus running consistently, which it usually does.
What did you see instead? Under which circumstances?
Semi-randomly distributed (skewed toward the higher-usage machines), roughly 2 out of the 30 instances
per week will "hang":
Environment
Startup flags:
There are no logs when this happens.
Other Notes
Goroutine dump
Metrics
Ingest rate, checkpoint duration, and I/O are all consistent until this happens; then all I/O essentially stops.
The servers recover just fine with a restart.
This happens across at least 10 different instances of varying load, but it does seem to happen more often on the busier instances.
IRC conversation with @brian-brazil
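Not part of the original report, but for anyone hitting the same symptoms: a small Go sketch of how the signals above can be checked from the outside and a goroutine dump collected for a report like this. It assumes a Prometheus server on localhost:9090 that exposes the /api/v1/query endpoint and the standard Go pprof handlers under /debug/pprof; the metric name is the Prometheus 1.x local-storage ingestion counter and may differ in other versions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

const (
	// Assumed server address and 1.x metric name; adjust for your setup.
	promURL     = "http://localhost:9090"
	ingestQuery = "rate(prometheus_local_storage_ingested_samples_total[5m])"
)

func main() {
	// Ask the server for its own ingestion rate. During a hang like the one
	// described above this query may itself stall or return a rate of zero.
	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(ingestQuery))
	if err != nil {
		fmt.Fprintln(os.Stderr, "query failed:", err)
		os.Exit(1)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Printf("ingestion rate response: %s\n", body)

	// Grab a full goroutine dump, which is what pins the hang on specific
	// blocked goroutines (LevelDB ones in this issue).
	dump, err := http.Get(promURL + "/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
		os.Exit(1)
	}
	defer dump.Body.Close()
	io.Copy(os.Stdout, dump.Body)
}
```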