0.14.0rc1 hangs on restart #721

Increasing the queue size just prevents the issue from showing, but it's not a fix. @beorn7, this was reported on IRC as well. @tlopatic, with the default queue capacity, how long did you wait until you terminated the hanging process? And do you have an estimate of how many time series you have?

@fabxc 5 million time series, as per the issue description.

Alrighty... two issues work together here: (1) the storage is considered dirty on restart, so crash recovery kicks in at all, and (2) crash recovery doesn't read from the indexing queue, so it blocks once the queue is full.

beorn7 added the bug label on May 23, 2015

@fabxc I can try to fix (2) later today, but if you want to give it a spin, go ahead...

One guess at what's causing the storage to become dirty: you're scraping 10,000 targets, so you could be running out of file descriptors just for the TCP connections, especially if you're still on the default ulimit of 1024.
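
For illustration, a minimal Go sketch (not from this thread) of checking the file-descriptor limits in question; with the default soft limit of 1024, a process holding TCP connections to thousands of scrape targets could indeed run out:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Query the soft and hard limits on open file descriptors for this process.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// Each in-flight scrape holds a TCP connection (and thus a descriptor),
	// on top of the descriptors the storage itself needs for its files.
	fmt.Printf("file descriptors: soft=%d, hard=%d\n", rl.Cur, rl.Max)
}
```

The soft limit can be raised with `ulimit -n` in the shell that starts the server.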

If that were the case, a storage write error would show up somewhere in the logs. In any case, it'd be interesting to take a look at the full logs.

Thanks for your quick response, guys. Let me fill in some of the gaps. I am running 20 instances of a small Python script that simulates a collectd_exporter, so there aren't more than 20 scrapes going on at any time. Each script simulates 500 nodes with 500 metrics, i.e., 250,000 time series. That multiplied by 20 gives the 5,000,000 time series.

I waited for at least 10 minutes after the "10000 metrics queued for indexing." message. There was zero CPU load from Prometheus (as per "top") and zero disk I/O (as per "iostat") during that time; that's when I aborted.

What else... I am running everything on a c4.8xlarge AWS instance with 500 GiB of SSD-based EBS storage for the Prometheus data. The scripts that I am using are here: http://lopatic.de/start.sh - that's how I start Prometheus. The initial scrape takes ~40 minutes; I assume that is because the labels / fingerprint mappings are initialized in the database. After that initial scrape, Prometheus is able to keep the 5-second interval.

I'll try to reproduce the issue later today and provide a log of the first Prometheus run after clearing out the data directory. Any particular (debug?) options that you'd like me to use?
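
The actual load generators are the Python scripts linked above; purely as a sketch (the metric names, label, and port below are invented), an equivalent simulator in Go exposing 500 nodes with 500 metrics each per scrape could look like this:

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
)

const (
	nodes          = 500 // simulated nodes per exporter instance
	metricsPerNode = 500 // metrics per simulated node
)

func main() {
	// Serve nodes*metricsPerNode synthetic samples (250,000 per scrape) in the
	// Prometheus text exposition format, mimicking a collectd_exporter.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		for n := 0; n < nodes; n++ {
			for m := 0; m < metricsPerNode; m++ {
				fmt.Fprintf(w, "sim_metric_%d{node=\"node-%d\"} %g\n", m, n, rand.Float64())
			}
		}
	})
	log.Fatal(http.ListenAndServe(":9103", nil)) // port chosen arbitrarily for the sketch
}
```

Twenty such instances scraped every 5 seconds amount to the 5,000,000 time series described above, i.e., roughly a million samples per second on average.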

Oh, and I set GOMAXPROCS to 24. See start.sh.
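
As a small aside, a Go sketch of the two ways that setting is commonly applied (the environment-variable route is what a wrapper script such as start.sh would use):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// The Go runtime picks up the GOMAXPROCS environment variable at startup
	// (e.g. `GOMAXPROCS=24 ./prometheus ...` from a wrapper script), and
	// runtime.GOMAXPROCS can change the value programmatically. Passing 0
	// merely reports the current setting without modifying it.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```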

fabxc referenced this issue on May 23, 2015: Read from indexing queue during crash recovery. #724 (merged)

You know what, I just restarted Prometheus so that you don't have to wait for the log. And, yes, there's indeed an error message. See this log snippet:

Looks like a race condition, no? The fingerprint mappings are checkpointed by two threads in parallel, one of them gets to rename mappings.db.tmp, the other doesn't, and an error is reported.
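
A sketch of that suspected failure mode, not the actual Prometheus code (the file names follow the comment above, everything else is invented): two goroutines checkpoint concurrently by writing a temporary file and renaming it into place, and one of them can lose the race on the rename.

```go
package main

import (
	"log"
	"os"
	"sync"
)

// checkpointMappings writes the serialized mappings to a temporary file and
// then renames it over the real file. Safe on its own, but not when two
// goroutines run it at the same time: one of them may find mappings.db.tmp
// already renamed away and report an error.
func checkpointMappings(data []byte) error {
	if err := os.WriteFile("mappings.db.tmp", data, 0o644); err != nil {
		return err
	}
	return os.Rename("mappings.db.tmp", "mappings.db")
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := checkpointMappings([]byte("fingerprint mappings")); err != nil {
				log.Println("checkpoint failed:", err)
			}
		}()
	}
	wg.Wait()
}
```

Whether the rename fails depends on the interleaving, which is why this kind of error only shows up occasionally.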

Nice testing setup – and thanks for all the details. Looks like we have to do a full lock in this section. But @beorn7 has to confirm that. If so, we have to do a v0.13.4 as well.

@tlopatic Ok, if that was the problem and not the file descriptors, I guess you actually increased your ulimit from the default of 1024 already? I'm quite impressed that you manage to sustain a 5-second scrape rate for 5 million time series. I think you may have just pushed the boundaries of what a single Prometheus server can do :)

By a factor of 2.5, IIRC :)

@fabxc Yeah, right, we need the RW lock in that section... So cool, two bugs found: one about the checkpointing of FP mappings, one about the startup order. The former never bit us because collisions are so rare, at least in non-pathological setups. @juliusv: The problems @tlopatic ran into are true bugs. Once they are fixed, we can see if the boundaries have been pushed too far at all... Prometheus might be just fine. :)
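
A minimal sketch of the fix under discussion, assuming the mappings sit behind a sync.RWMutex (the struct layout here is hypothetical): lookups take the read lock, while the checkpoint takes the exclusive lock, so two checkpoints can no longer race on the rename.

```go
package main

import "sync"

type fpMappings struct {
	mtx      sync.RWMutex
	mappings map[uint64]string // hypothetical in-memory representation
}

// lookup takes the read lock, so many readers can proceed concurrently.
func (m *fpMappings) lookup(fp uint64) (string, bool) {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	v, ok := m.mappings[fp]
	return v, ok
}

// checkpoint takes the write lock, serializing the write-and-rename of
// mappings.db.tmp against both readers and other checkpoints.
func (m *fpMappings) checkpoint() error {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	// ... serialize m.mappings, write mappings.db.tmp, rename it into place ...
	return nil
}

func main() {
	m := &fpMappings{mappings: map[uint64]string{}}
	_ = m.checkpoint()
	_, _ = m.lookup(42)
}
```

A plain sync.Mutex would also serialize the checkpoints; the RWMutex just keeps concurrent lookups cheap.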

@tlopatic, once #724 is merged, both bugs should be fixed in

Are you calling my setup PATHOLOGICAL?! ;) Thanks for the fast help on a Saturday, BTW. That's great!

Let's take the 5,000,000 time series every 5 seconds with a grain of salt for now. I've only been playing with Prometheus for a few hours, and so far I've mostly spent my time on understanding its internals and what exactly the configuration options do. Let me run this for a few hours or days to see how the chunks_to_persist number develops over time. Also, so far I am only ingesting and not running any queries or alerts.

In any case, I'll keep an eye on master and confirm that everything works when the fix is merged. Thanks again for your support.

Oh, and I didn't increase my file descriptor ulimit. I am only talking to 20 simulated collectd_exporters, each of which pretends to have 500 metrics for 500 targets/instances, i.e., 250,000 time series. So, unless I am missing something, there shouldn't be more than 20 concurrent TCP connections at any time.

Ah yes, that makes FD issues very unlikely. As for the ingestion rate, graphing

#724 is merged. This is fixed! Binaries will be released very soon.

beorn7 closed this on May 23, 2015

jmcfarlane commented May 23, 2015

Confirmed this fixed the deadlock issue for us as well (I'm the one who reported it on IRC). Thanks!

@jmcfarlane Great, thanks for the confirmation!

simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue on Oct 12, 2017

uncle-betty commented May 23, 2015
Hey guys,
I am trying out Prometheus from the master branch with 5,000,000 time series (10,000 simulated nodes with 500 simulated metrics each). When I start it initially with an empty data directory, things seem fine. When I control-C it, it seems to shut down cleanly. However, when I restart it, it hangs at this point:
...
Indexing metrics in memory.
10000 metrics queued for indexing.
The reason seems to be that indexingQueueCapacity is 16,384, so the queue blocks once it has 16,384 elements queued. When I increase that number beyond 5,000,000, things work.
Thomas
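
For illustration, a Go sketch of the blocking behaviour Thomas describes and of the ordering change that the title of #724 points at; the capacity constant and the 5,000,000 count come from this thread, everything else is invented:

```go
package main

import "fmt"

const indexingQueueCapacity = 16384 // the default capacity mentioned above

func indexer(queue <-chan int, done chan<- struct{}) {
	for range queue {
		// index the metric (elided in this sketch)
	}
	close(done)
}

func main() {
	queue := make(chan int, indexingQueueCapacity)
	done := make(chan struct{})

	// Broken ordering: if recovery enqueues everything before anything reads
	// from the queue, the send blocks as soon as the buffer is full and
	// startup hangs, which matches the observed behaviour:
	//
	//   for i := 0; i < 5000000; i++ { queue <- i } // hangs after 16,384 sends
	//   go indexer(queue, done)                     // never reached

	// Working ordering, as the title "Read from indexing queue during crash
	// recovery" suggests: drain the queue concurrently while recovery fills it.
	go indexer(queue, done)
	for i := 0; i < 5000000; i++ {
		queue <- i
	}
	close(queue)
	<-done
	fmt.Println("recovery finished")
}
```

Raising indexingQueueCapacity above the number of queued metrics only hides the problem, as noted at the top of the thread.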