Prometheus hangs during checkpointing #2846

Closed
TimSimmons opened this Issue Jun 15, 2017 · 3 comments

TimSimmons commented Jun 15, 2017

What did you do?

Ran Prometheus servers on cloud VMs (32 GB/12 CPU, some 64 GB/20 CPU) with ingest rates ranging
from 10k to 50k samples/s (two at 100k) and 3 days of retention.

The number of time series ranges from 500k to 6 million, mostly in the 1-3 million range.

What did you expect to see?

Prometheus running consistently, which it usually does.

What did you see instead? Under which circumstances?

Semi-randomly distributed (skewed toward the higher-usage machines), roughly twice a week out of 30 instances,
one will "hang":

  • Scraping stops
  • The UI is responsive
  • Queries time out
  • /metrics responds
  • The server is checkpointing indefinitely
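
A quick way to confirm this state from the outside, assuming the default listen address (exact metric names vary between 1.x releases):

$ curl -s http://localhost:9090/metrics | grep -i checkpoint
$ curl -s --max-time 30 'http://localhost:9090/api/v1/query?query=up'

The first command still returns the checkpoint-related metrics; the second runs into the timeout once ingestion and querying have stopped.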

Environment

  • System information:
$ uname -srm
Linux 4.4.0-66-generic x86_64
  • Prometheus version:
$ /opt/prometheus/bin/prometheus -version
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
  build user:       root@ca1b452514b7
  build date:       20170419-13:26:24
  go version:       go1.8.1
  • Prometheus configuration file:
---
global:
  scrape_interval:     1m
  evaluation_interval: 1m
  scrape_timeout: 30s

scrape_configs:
  - job_name: <redacted>
    file_sd_configs:
      - files:
        - /opt/prometheus/services/<redacted>.json

Startup flags:

/opt/prometheus/bin/prometheus \
    -config.file "/opt/prometheus/prometheus.yml" \
    -storage.local.target-heap-size 38283265092 \
    -storage.local.chunk-encoding-version 2 \
    -storage.local.path "/data/prometheus" \
    -storage.local.retention 72h0m0s \
    -log.format="logger:syslog?appname=prometheus&local=7&json=true" \
    -log.level "warn"
  • Logs:

There are no logs when this happens.

Other Notes

Goroutine dump

Metrics

Ingest rate, checkpoint duration, and I/O are all consistent until this happens; then all I/O essentially stops.

The servers recover just fine with a restart.

This happens across at least 10 different instances of varying load, but it does seem to happen more often on the busier instances.
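
For reference, a goroutine dump like the one linked above can be captured from the pprof endpoints the server exposes on its web port (default listen address assumed):

$ curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > dump.txt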

IRC conversation with @brian-brazil

09:24 <timsim> Hello! Is there precedent for a (busy) Prometheus server to hang while checkpointing and seem to stay there forever? It's still up and still responds via the UI. But it's not scraping, and it's been checkpointing for days. This happens in our fleet pretty often. The educated guess is I/O issues, but is there any way I can confirm that? Most of my instances work 100% of the time, others do this every week or two.
09:27 <•bbrazil> timsim: we believe that's due to running out of I/O capacity
09:28 <•bbrazil> timsim: if you have more information, there's an issue tracking it
09:28 <timsim> Interesting, it does seem to happen more often on my busiest instances
09:29 <•bbrazil> ah, mail thread actually. "Degrading sample ingestion rate"
09:29 — timsim reads
09:37 <timsim> Hm. My checkpoint sizes are more 4-15 GB than that guy's 70, but he's probably on metal and I'm on cloud VMs.
09:37 <timsim> Well, I also only have 64gb of RAM at most. So it's not possible for mine to be that big
09:38 <timsim> My biggest one has a million more time series though.
09:38 <timsim> I think I'm going to drop the scrape interval down from 1m to 3m, hopefully that gives it some headroom.
09:43 <timsim> bbrazil: Just to be totally clear (and thank you for the information, by the way). Mine seem to checkpoint at a consistent duration (most under five minutes, one at 40), but then all of a sudden, it will checkpoint forever. It's not possible that I'm hitting some kind of hung goroutine, but rather just having I/O issues on this random checkpoint that never resolve?
09:44 <timsim> also, my sample ingestion rate doesn't degrade at all until this happens, then it goes to 0
09:47 <•bbrazil> timsim: that would imply something is getting stuck somewhere
09:47 <•bbrazil> timsim: does all i/o stop on the machine? that'd imply a kernel/hardware issue
09:47 <timsim> bbrazil: Works fine when I restart Prometheus
09:48 <•bbrazil> can you get a goroutine dump?
09:48 <timsim> sure
09:50 <timsim> bbrazil: https://gist.githubusercontent.com/TimSimmons/56841047eecdf8011fc29dddff99da67/raw/06fe8f119b00e50e6167473d4d4b8275b98b900e/dump.txt
09:52 <•bbrazil> the checkpoint code is waiting on a lock
09:53 <•bbrazil> so probably not an I/O problem
09:53 <timsim> more `storage.local.num-fingerprint-mutexes` maybe?
09:54 <•bbrazil> might mitigate, but not it
09:54 <timsim> I tried that in our staging env with mixed results, mostly inconclusive. 
09:55 <•bbrazil> hmm, 1520 goroutines in that state
09:56 <•bbrazil> this needs deeper investigation, can you file a bug?
09:56 <timsim> Absolutely. Thank you so much.
10:02 <•bbrazil> is there any log output around when this happens?
10:03 <timsim> None. I'm on WARN, but I think I've tried capturing on INFO. 
10:04 <•bbrazil> I'm wondering if a goroutine might have panicked while keeping a lock
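
To illustrate that last hypothesis, here is a minimal, self-contained Go sketch (invented names, not Prometheus code): a goroutine that panics after Lock() with no deferred Unlock() leaves the mutex locked for good, so any later Lock() on it blocks forever.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex

	// Writer goroutine: locks the mutex, then panics before unlocking.
	// Because there is no deferred Unlock, the mutex stays locked even
	// though the panic is recovered and the goroutine exits.
	go func() {
		defer func() { recover() }()
		mu.Lock()
		panic("failed while holding the lock")
	}()

	time.Sleep(100 * time.Millisecond)

	// "Checkpoint" goroutine: tries to take the same lock and never gets it.
	done := make(chan struct{})
	go func() {
		mu.Lock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("lock acquired")
	case <-time.After(2 * time.Second):
		fmt.Println("still waiting on the lock after 2s -- a permanently hung 'checkpoint'")
	}
}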

beorn7 commented Jun 19, 2017

Thanks for the very comprehensive bug report, a real role model.

Looking at the goroutine dump, this looks like a problem within LevelDB. There are a number of LevelDB goroutines that are stuck (some of them are normal, but there are others that should never be stuck for that long). So LevelDB accesses block forever, which blocks one of the goroutines currently in Prometheus code while it holds a series lock, which in turn blocks checkpointing (which has to lock all series sequentially) and access (read and write) to any series that shares a lock with that series. So everything handling series comes to a grinding halt.

We had issues with LevelDB in the past, but none looked like this. Not sure what makes you run into it regularly (assuming that the other hangs all look the same). Perhaps it is some property of your storage device that LevelDB doesn't deal well with.

Good news: Prometheus 2 will no longer use LevelDB, but our own indexing engine.
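
A toy model of that chain of blocking (simplified, invented types; not the actual Prometheus 1.x storage code): one goroutine holds a per-series lock while stuck on something standing in for a LevelDB access, and a checkpoint loop that has to take every series lock in turn hangs as soon as it reaches that series, so nothing behind it gets checkpointed and any ingestion sharing that lock stops as well.

package main

import (
	"fmt"
	"sync"
	"time"
)

// series is a stand-in for a memory series protected by its own lock.
type series struct {
	mu   sync.Mutex
	name string
}

func main() {
	all := []*series{{name: "a"}, {name: "b"}, {name: "c"}}

	// A writer locks series "b" and then blocks forever on a call that
	// stands in for a stuck LevelDB index access.
	go func() {
		s := all[1]
		s.mu.Lock()
		defer s.mu.Unlock() // never runs: the goroutine never returns
		select {}           // simulated LevelDB access that never completes
	}()

	time.Sleep(100 * time.Millisecond)

	// The checkpoint walks every series and locks each one in turn.
	// It gets through "a", then hangs on "b"; "c" is never reached.
	go func() {
		for _, s := range all {
			s.mu.Lock()
			fmt.Println("checkpointing series", s.name)
			s.mu.Unlock()
		}
		fmt.Println("checkpoint finished")
	}()

	time.Sleep(time.Second)
	fmt.Println("checkpoint still running after 1s -- hung on series b")
}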


brian-brazil commented Jul 14, 2017

Closing as this particular issue won't occur on 2.0.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
