
0.14.0rc1 hangs on restart #721

Closed
uncle-betty opened this Issue May 23, 2015 · 22 comments


uncle-betty commented May 23, 2015

Hey guys,

I am trying out Prometheus from the master branch with 5,000,000 time series (10,000 simulated nodes with 500 simulated metrics each). When I start it initially with an empty data directory, things seem fine. When I control-C it, it seems to shut down cleanly. However, when I restart it, it hangs at this point:

...
Indexing metrics in memory.
10000 metrics queued for indexing.

The reason seems to be that indexingQueueCapacity is 16,384, so the queue blocks once it has 16,384 elements queued. When I increase that number beyond 5,000,000, things work.

Thomas
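
For illustration, a minimal Go sketch of the blocking behavior described above. This is not the actual Prometheus code; queueCapacity below merely stands in for indexingQueueCapacity.

```go
package main

import "fmt"

func main() {
	// Stand-in for indexingQueueCapacity (16,384 in Prometheus); kept tiny here.
	const queueCapacity = 4

	queue := make(chan int, queueCapacity)

	// With no goroutine draining the queue, the fifth send blocks forever.
	// In this toy program the Go runtime reports "all goroutines are asleep -
	// deadlock!"; in a real server with other goroutines still running, the
	// blocked goroutine simply hangs, which is what the restart looked like.
	for i := 0; i < 10; i++ {
		queue <- i
		fmt.Println("queued", i)
	}
}
```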


fabxc commented May 23, 2015

Increasing the queue size just prevents the issue from showing but it's not a fix.
Thanks for reporting this.

@beorn7, this was reported on IRC as well.
Could this be a regression from #704 (I cannot see how), or could it be related to @juliusv's observation from yesterday, where indexing took very long for some reason (which was on a v0.13.x machine, I assume)?

@tlopatic With the default queue capacity, how long did you wait before you terminated the hanging process? And do you have an estimate of how many time series you have?


juliusv commented May 23, 2015

@fabxc 5 million time series as per the above.


beorn7 commented May 23, 2015

Alrighty... two issues work together here:

  1. @tlopatic Your storage is dirty after the seemingly clean shutdown. Either the shutdown was not clean after all, or something happened during the runtime that made it dirty. Could you check the logs of the first run of the server? There should be an error-level message: "The storage is now inconsistent. Restart Prometheus ASAP to initiate recovery."
  2. The other issue is a bug in the crash recovery (triggered by the first issue). @fabxc This does indeed look like a regression introduced by #704. Crash recovery needs the indexing goroutine to be running already, but we now start it only after crash recovery, i.e., if crash recovery has more metrics to index than the queue has capacity, we are deadlocked. (Note that all metrics are always reindexed during crash recovery.) See the sketch below.
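
For illustration, a minimal Go sketch of this start-order bug and its conceptual fix. This is not the actual Prometheus storage code; the storage type, startIndexer, and crashRecovery below are made up just to show the ordering problem.

```go
package main

import "fmt"

// storage, startIndexer, and crashRecovery are made-up stand-ins, not the
// real Prometheus types; they only demonstrate the ordering problem.
type storage struct {
	indexQueue chan string   // bounded indexing queue
	done       chan struct{} // closed once the indexer has drained the queue
}

// startIndexer launches the goroutine that drains the indexing queue.
func (s *storage) startIndexer() {
	go func() {
		for m := range s.indexQueue {
			_ = m // actual indexing work elided
		}
		close(s.done)
	}()
}

// crashRecovery re-queues every known metric for indexing.
func (s *storage) crashRecovery(metrics []string) {
	for _, m := range metrics {
		// Blocks as soon as the buffer is full and nothing is draining it.
		s.indexQueue <- m
	}
}

func main() {
	s := &storage{
		indexQueue: make(chan string, 4), // tiny capacity for illustration
		done:       make(chan struct{}),
	}

	metrics := make([]string, 10)
	for i := range metrics {
		metrics[i] = fmt.Sprintf("metric_%d", i)
	}

	// Buggy order: calling s.crashRecovery(metrics) before s.startIndexer()
	// deadlocks once more metrics are queued than the queue has capacity.
	// The conceptual fix is to start the indexer first:
	s.startIndexer()
	s.crashRecovery(metrics)
	close(s.indexQueue)
	<-s.done
}
```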

beorn7 added the bug label May 23, 2015


beorn7 commented May 23, 2015

@fabxc I can try to fix (2) later today, but if you want to give it a spin, go ahead...


juliusv commented May 23, 2015

One guess at what's causing the storage to become dirty: you're scraping 10,000 targets, so you could be running out of file descriptors for the TCP connections alone if you're still on the default ulimit of 1024.
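
If you want to check that hypothesis, a tiny standalone program (Linux only; not part of Prometheus) can print the process's open-file limit:

```go
package main

import (
	"fmt"
	"syscall"
)

// Standalone check, not part of Prometheus: print the process's open-file
// limit to see whether it is still at the common default of 1024.
func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("open file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```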


juliusv commented May 23, 2015

If that were the case, a storage write error would show up somewhere in the logs. In any case, it'd be interesting to take a look at the full logs.


uncle-betty commented May 23, 2015

Thanks for your quick response, guys. Let me fill in some of the gaps.

I am running 20 instances of a small Python script that simulates a collectd_exporter. So, there aren't more than 20 scrapes going on at any time. Each script simulates 500 nodes with 500 metrics, i.e., 250,000 time series. That multiplied by 20 gives the 5,000,000 time series.

I waited for at least 10 minutes after the "10000 metrics queued for indexing." message. There was zero CPU load from Prometheus (as per "top") and zero disk I/O (as per "iostat") during that time; that's when I aborted.

What else... I am running everything on a c4.8xlarge AWS instance with 500 GiB of SSD-based EBS storage for the Prometheus data.

The scripts that I am using are here:

http://lopatic.de/start.sh - That's how I start Prometheus.
http://lopatic.de/prom.yml - That's my Prometheus configuration.
http://lopatic.de/sim.py - That's the Python script that simulates a collectd_exporter instance. It listens on port 10,000 + x, where x is a command line argument.
http://lopatic.de/sim.sh - That's how I start 20 of the Python scripts for ports 10,000 through 10,019.

The initial scrape takes ~40 minutes. I assume that is because the labels / fingerprint mappings are initialized in the database. After that initial scrape, Prometheus is able to keep the 5-second interval.

I'll try to reproduce the issue later today and provide a log of the first Prometheus run after clearing out the data directory. Any particular (debug?) options that you'd like me to use?


uncle-betty commented May 23, 2015

Oh, and I set GOMAXPROCS to 24. See start.sh.


uncle-betty commented May 23, 2015

You know what, I just restarted Prometheus so that you don't have to wait for the log. And, yes, there's indeed an error message. See this log snippet:

http://lopatic.de/log-1.txt

Looks like a race condition, no? The fingerprint mappings are checkpointed by two threads in parallel; one of them gets to rename mappings.db.tmp, the other doesn't, and an error is reported.
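
For illustration, a minimal Go sketch of that suspected race. The checkpoint function below is made up (only the mappings.db.tmp / mappings.db file names come from the log above), and, being a race, the failure may or may not reproduce on any given run.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// checkpoint is a made-up stand-in for the FP-mappings checkpoint: write a
// temporary file, then rename it into place. Without a lock, two concurrent
// checkpoints race on the same temp file.
func checkpoint(dir string, data []byte) error {
	tmp := filepath.Join(dir, "mappings.db.tmp")
	final := filepath.Join(dir, "mappings.db")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// If the other goroutine renamed (and thereby removed) the temp file
	// between our write and this rename, the rename fails with ENOENT.
	return os.Rename(tmp, final)
}

func main() {
	dir, err := os.MkdirTemp("", "checkpoint-race")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := checkpoint(dir, []byte(fmt.Sprintf("checkpoint %d", i))); err != nil {
				// This may or may not trigger on a given run.
				fmt.Println("checkpoint error:", err)
			}
		}(i)
	}
	wg.Wait()
}
```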


fabxc commented May 23, 2015

Nice testing setup – and thanks for all the details.
I think your analysis is correct.

Looks like we have to do a full lock in this section. But @beorn7 has to confirm that.

If so, we have to do a v0.13.4 as well.


juliusv commented May 23, 2015

@tlopatic OK, if that was the problem and not the file descriptors, I guess you actually increased your ulimit from the default of 1024 already? I'm quite impressed that you manage to sustain a 5-second scrape interval for 5 million time series. I think you may have just pushed the boundaries of what a single Prometheus server can do :)


fabxc commented May 23, 2015

By a factor of 2.5, IIRC :)


beorn7 commented May 23, 2015

@fabxc Yeah, right, we need the RW lock in that section... (see the sketch below)

So cool, two bugs found: one in the checkpointing of FP mappings, one in the startup order.

The former never bit us because collisions are so rare - in non-pathological setups.

@juliusv: The problems @tlopatic ran into are true bugs. Once they are fixed, we can see whether the boundaries have been pushed too far at all... Prometheus might be just fine. :)
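
Conceptually, taking the full (write) lock around the checkpoint section could look like the sketch below. This is only an illustration under the assumption of an fpMapper-like type guarded by an RWMutex; the actual fix is whatever lands in #724.

```go
package main

import (
	"os"
	"path/filepath"
	"sync"
)

// fpMapper is a simplified, assumed stand-in: one RWMutex guards both the
// in-memory mappings and the checkpoint that writes them to disk.
type fpMapper struct {
	mtx      sync.RWMutex
	mappings map[string]string
	dir      string
}

// checkpoint takes the full (write) lock, so only one goroutine at a time can
// write and rename the temp file - this removes the race sketched earlier.
func (m *fpMapper) checkpoint(data []byte) error {
	m.mtx.Lock()
	defer m.mtx.Unlock()

	tmp := filepath.Join(m.dir, "mappings.db.tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(m.dir, "mappings.db"))
}

// lookup only reads the in-memory mappings, so a shared (read) lock suffices.
func (m *fpMapper) lookup(key string) (string, bool) {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	v, ok := m.mappings[key]
	return v, ok
}

func main() {
	m := &fpMapper{mappings: map[string]string{"fp": "metric"}, dir: os.TempDir()}
	_ = m.checkpoint([]byte("example"))
	_, _ = m.lookup("fp")
}
```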


fabxc commented May 23, 2015

@tlopatic, once #724 is merged, both bugs should be fixed in master.
Thanks again for the detailed report.


uncle-betty commented May 23, 2015

Are you calling my setup PATHOLOGICAL?! ;)

Thanks for the fast help on a Saturday, BTW. That's great!

Let's take the 5,000,000 time series every 5 seconds with a grain of salt for now. I've only been playing with Prometheus for a few hours and so far I've mostly spent my time on understanding its internals and what exactly the configuration options do.

Let me run this for a few hours or days to see how the chunks_to_persist number develops over time. Also, so far I am only ingesting and not running any queries or alerts.

In any case, I'll keep an eye on master and confirm that everything works when the fix is merged.

Thanks again for your support.


uncle-betty commented May 23, 2015

Oh, and I didn't increase my file descriptor ulimit. I am only talking to 20 simulated collectd_exporters, each of which pretends to have 500 metrics for 500 targets/instances, i.e., 250,000 time series. So, unless I am missing something, there shouldn't be more than 20 concurrent TCP connections at any time.


juliusv commented May 23, 2015

Ah yes, that makes FD issues very unlikely.

As for the ingestion rate, graphing rate(prometheus_local_storage_ingested_samples_total[5m]) could be interesting (the per-second rate of ingested samples). We would expect ingestion to be really fast as long as nothing has to be written to disk. Over time, as you are running through the whole lifecycle of time series (writing out chunks, accumulating a lot of history on disk, and then finally also purging old chunks after the configured retention time), ingestion rates are likely to be much lower.


beorn7 commented May 23, 2015

#724 is merged. This is fixed!

Binaries will be released very soon.

beorn7 closed this May 23, 2015


juliusv commented May 23, 2015

@beorn7 @fabxc You guys rock!


jmcfarlane commented May 23, 2015

Confirmed this fixed the deadlock issue for us as well (I'm the one who reported it on IRC).

Thanks!


juliusv commented May 23, 2015

@jmcfarlane Great, thanks for the confirmation!

simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017

