
High CPU usage running prometheus #1301

Closed
Rucknar opened this Issue Jan 11, 2016 · 19 comments

Rucknar (Contributor) commented Jan 11, 2016

I'm seeing an issue when running Prometheus: it's currently using more CPU than it has in the past, and more than I would expect it to.

For example, here is a screenshot of the Prometheus container we have running:
[screenshot: screen shot 2016-01-11 at 11 39 07]

Here is the config we are using to run Prometheus:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: exporter-metrics

scrape_configs:
- job_name: ptl
  scrape_interval: 15s

  target_groups:
    - targets:
      - 'XXXX:9100'
      - 'XXXX:9100'
      - 'XXXX:9104'
      - 'XXXX:9104'
- job_name: control
  scrape_interval: 15s

  target_groups:
    - targets:
      - 'XXXXX:9100'
      - 'XXXXX.100:9104'
      - 'XXXXX:9010'

rule_files:
  - /alertrules.conf

In the STDOUT logs we see a few lines like the one below, but nothing strange aside from that:
INFO[0347] Done checkpointing in-memory metrics and chunks in 35.485310843s. source=persistence.go:563

I tried to troubleshoot this without raising a ticket, but no joy.

fabxc (Member) commented Jan 11, 2016

Any chance your setup just hit the retention period for the first time?
What resource limits does the container have and how many time series does the server hold in memory (prometheus_local_storage_memory_series metric)?

Rucknar (Contributor, Author) commented Jan 11, 2016

Not to my knowledge; it's been running for 2-3 months now.
How do I validate the existing setting for prometheus_local_storage_memory_series? I can see how one would set that value manually, but can't seem to find its current value.

fabxc (Member) commented Jan 11, 2016

It's a metric that tells you the current state of your server, not a configuration option.
Check the /metrics endpoint of your Prometheus server. It's also only meaningful if we know the resources available to your instance.
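For example, something along these lines (with <yourserver> adjusted to wherever the container is reachable) should print the current value:

# sketch: fetch the metrics page and pick out the series count
curl -s http://<yourserver>/metrics | grep prometheus_local_storage_memory_series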

Rucknar (Contributor, Author) commented Jan 11, 2016

Thanks:
prometheus_local_storage_memory_series 17785

The container is running unrestricted, on a VM with 8 cores and 16 GB of memory. It is, however, shared with some other containers. Do you need more detailed info?

fabxc (Member) commented Jan 20, 2016

That's plenty of resources for the size of the server.
From the info at hand it's not really possible to draw any conclusions.

The next step would be profiling your server. If you have the Go toolchain installed, run:

go tool pprof --svg http://<yourserver>/debug/pprof/profile > prof.svg

If you can share the resulting SVG, we might find out more.
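By default that endpoint samples for about 30 seconds; if the high CPU only shows up intermittently, a longer sample can be requested through the standard net/http/pprof seconds parameter, for example:

# sketch: capture a 60-second CPU profile instead of the default 30 seconds
go tool pprof --svg 'http://<yourserver>/debug/pprof/profile?seconds=60' > prof_60s.svg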

juliusv (Member) commented Jan 21, 2016

@Rucknar Just checking, did this happen without any Prometheus version upgrade? And what's the version you're running?

Rucknar (Contributor, Author) commented Jan 21, 2016

@juliusv It's entirely possible; I pull the container image from :latest and it has been re-pulled a number of times.

branch  stable
buildDate   20160118-19:00:26
buildUser   @9f8f0f8d724a
goVersion   1.5.1
revision    968ee35
version 0.16.1

Rucknar (Contributor, Author) commented Jan 21, 2016

@fabxc The container doesn't have the toolchain installed; I'll look at getting that on there.

fabxc (Member) commented Jan 21, 2016

Since you run this command against an HTTP endpoint of this server, you can execute it from any machine that can reach the container.

Rucknar (Contributor, Author) commented Jan 21, 2016

prof.svg - from when it's behaving
prof2.svg - from when the error seems present

[attachment: Archive.zip]

fabxc (Member) commented Jan 21, 2016

From a first look, those look basically identical. The second one seems a bit busier, though not extremely so.

It's mostly spending time on query evaluation – rate() more precisely. Do you have recording rules? Dashboards?
What do the queries look like? It's very possible that your dashboards are just causing too much load by calculating many rates. That wouldn't explain any significant spikiness, though.

A graph of CPU load over time would probably help us understand, too.
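If the server scrapes itself, its own process metrics can provide that graph; a sketch, assuming the self-scrape job is labelled "prometheus":

# approximate number of CPU cores used by the Prometheus process, averaged over 5m
rate(process_cpu_seconds_total{job="prometheus"}[5m])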

juliusv (Member) commented Jan 21, 2016

In case the rate of incoming queries is the (or a) culprit, check how it has changed with a query like this:
rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])
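If that has gone up, breaking it down by handler separates query load from scrapes of the /metrics endpoint; a possible variant:

# sketch: per-handler request rate, to tell query traffic apart from /metrics scrapes
sum by (handler) (rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]))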

beorn7 (Member) commented Feb 2, 2016

Assuming this is resolved. Please re-open if it's still an issue and you want to provide more information.

beorn7 closed this Feb 2, 2016

ryan5rdx commented Sep 21, 2016

@fabxc @beorn7
I'm also having this issue, and it doesn't seem to be the rate of incoming queries. It just seems to be at 100% CPU all the time. I've attached the SVG as well. Also, I do have a dashboard and it works when I use it, so querying seems fine.
[screenshot: screen shot 2016-09-20 at 5 27 30 pm]
[attachment: prof1.svg.zip]

hairyhenderson (Contributor) commented Mar 16, 2017

FWIW, I seem to also be seeing this issue, with Prometheus 1.5.2... Has anyone had luck troubleshooting this?

lunemec commented Jul 12, 2017

I have the same issue. Is there any way to display the queries which consume the most CPU time?

EDIT: I have tried running rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]) and it shows almost zero, and yet the Prometheus server is consuming all the CPU time.

gauravarora commented Aug 25, 2017

I have the exact same problem as @lunemec above, and the metric count is 0.

j0nimost commented Aug 25, 2017

Increase the rate window, preferably to 15m; probably the scrape period is too long.
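Applied to the query suggested earlier in the thread, that would look something like:

# same query with a 15m window instead of 5m
rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[15m])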

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
