2.0.0-beta.3 long sluggish startup #3166

Closed
TimSimmons opened this Issue Sep 13, 2017 · 2 comments


TimSimmons commented Sep 13, 2017

What did you do?

Installed Prometheus 2.0.0-beta.3 on a 64g RAM 20 CPU cloud VM.

What did you expect to see?

Relatively stable collection of around 25-30k metrics/s across ~6.5 million time series from ~2600 targets.

What did you see instead? Under which circumstances?

When I start Prometheus, there is a long "spin-up" period where it appears to be busy with something and scrapes only slowly before eventually getting up to speed. During this time (30+ minutes), the UI is very slow to respond (/targets takes >60s, simple queries time out, /status takes 5+ seconds) and CPU load is high. It takes over an hour before all targets have been scraped for the first time. Eventually the behavior levels out and becomes consistent with what I would expect.

Environment

  • System information:
    Linux 4.4.0-78-generic x86_64

  • Prometheus version:

prometheus, version 2.0.0-beta.3 (branch: HEAD, revision: 066783b3991dd64729325fc4f880dfffb484a2c2)
  build user:       root@0cbc320660dc
  build date:       20170912-10:17:45
  go version:       go1.8.3
  • Prometheus configuration file:
---
global:
  scrape_interval:     300s
  evaluation_interval: 1m
  scrape_timeout: 30s

rule_files:
  - /opt/prometheus/rules/*

scrape_configs:
  - job_name: <redacted>
    file_sd_configs:
      - files:
        - /opt/prometheus/services/<redacted>.json
  • Logs:
INFO[0000] Starting prometheus (version=2.0.0-beta.3, branch=HEAD, revision=066783b3991dd64729325fc4f880dfffb484a2c2)  source="main.go:210"
INFO[0000] Build context (go=go1.8.3, user=root@0cbc320660dc, date=20170912-10:17:45)  source="main.go:211"
INFO[0000] Host details (Linux 4.4.0-78-generic #99~14.04.2-Ubuntu SMP Thu Apr 27 18:49:46 UTC 2017 x86_64 ... (none))  source="main.go:212"
INFO[0000] Starting tsdb                                 source="main.go:224"
INFO[0000] tsdb started                                  source="main.go:230"
INFO[0000] Loading configuration file /opt/prometheus/prometheus.yml  source="main.go:363"
INFO[0000] Starting target manager...                    source="targetmanager.go:67"
{"level":"info","msg":"Server is ready to receive requests.","source":"main.go:340","time":"2017-09-13T15:09:41Z"}
{"level":"info","msg":"Listening on 0.0.0.0:9090","source":"web.go:359","time":"2017-09-13T15:09:41Z"}

[graphs] Restarts are indicated by the local minima in the top graph.

Note: I believe I saw similar behavior on beta.0, but it lasted a much shorter time, maybe 3-5 minutes. On beta.3 this often goes over an hour.

I also saw, after about 12 hours, that it ran up against my limit of 25k open files, which wasn't a problem before.
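
(A quick standalone way to watch the descriptor count, independent of Prometheus's own process_open_fds metric. Rough Go sketch only, assuming a Linux /proc filesystem; pass the prometheus PID as the argument and run as the same user or root.)

package main

// Illustrative sketch: count open file descriptors for a process by listing
// /proc/<pid>/fd (Linux only).

import (
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]
	fds, err := ioutil.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %s has %d open file descriptors\n", pid, len(fds))
}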

This is new: every once in a while I'll get a few of these popping up:

Message from syslogd@hostname at Sep 13 19:39:07 ...
 kernel:[4421453.944136] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [prometheus:21484]

Message from syslogd@hostname at Sep 13 19:39:07 ...
 kernel:[4421453.944820] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [prometheus:21482]

Message from syslogd@hostname at Sep 13 19:39:07 ...
 kernel:[4421453.981558] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [prometheus:21478]

Message from syslogd@hostname at Sep 13 19:39:07 ...
 kernel:[4421453.985994] NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [prometheus:21472]

TimSimmons commented Sep 13, 2017

Update: This host is experiencing bad CPU steal, which is likely causing these issues. A duplicate host performed much better on initial startup. I'll continue to monitor both hosts and report anything I find. But I'm going to close this issue because it looks like it's just me. Sorry for the noise!
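
(In case it helps anyone who lands here: steal time is the 8th value on the aggregate "cpu" line in /proc/stat, so you can eyeball it without node_exporter with something like the rough Go sketch below. Illustrative only; assumes Linux.)

package main

// Illustrative sketch: sample the aggregate "cpu" line in /proc/stat twice and
// print what fraction of CPU time was steal over the interval (Linux only).

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPU returns total and steal jiffies from the aggregate "cpu" line.
func readCPU() (total, steal uint64, err error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 9 || fields[0] != "cpu" {
			continue
		}
		// user nice system idle iowait irq softirq steal
		for i, v := range fields[1:9] {
			n, _ := strconv.ParseUint(v, 10, 64)
			total += n
			if i == 7 {
				steal = n
			}
		}
		return total, steal, nil
	}
	return 0, 0, fmt.Errorf("no cpu line in /proc/stat")
}

func main() {
	t1, s1, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	time.Sleep(10 * time.Second)
	t2, s2, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("steal over the last 10s: %.1f%%\n", 100*float64(s2-s1)/float64(t2-t1))
}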

TimSimmons closed this Sep 13, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
