High CPU and memory in Prometheus 2.3.2 #5301

Closed
anumercian opened this Issue Mar 4, 2019 · 13 comments

anumercian commented Mar 4, 2019

Proposal

Use case. Why is this important?
We are seeing high CPU and memory usage from Prometheus at scale.

Bug Report

What did you do?
Created 126-130 time series and ran Prometheus for 2-3 days.

What did you expect to see?
Prometheus running at normal CPU and memory.

What did you see instead? Under which circumstances?
Prometheus CPU spikes to 100% for about 5 seconds roughly every minute.
[screenshot: CPU spiking to 100% roughly every minute]

On average, CPU usage appears elevated throughout:
[screenshot: average CPU usage over time]

Memory reported via pprof is about 120 MB, but via the OS (/proc) it is about 450 MB.
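
A gap of this kind is common for Go programs: the pprof heap profile only accounts for heap allocations, while the OS resident set also includes Go runtime overhead, goroutine stacks, freed-but-not-yet-returned memory, and resident pages of memory-mapped files such as Prometheus's TSDB chunks. As a rough illustration only (a generic Go sketch, not Prometheus's own code), these are the two numbers a Go process can report about itself:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        // HeapInuse is roughly what a heap profile accounts for; Sys is everything
        // the Go runtime has obtained from the OS. The RSS in /proc can be larger
        // still, because it also counts resident pages of memory-mapped files.
        fmt.Printf("heap in use:      %d MiB\n", ms.HeapInuse/1024/1024)
        fmt.Printf("obtained from OS: %d MiB\n", ms.Sys/1024/1024)
    }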

Environment

  • System information:

    insert output of uname -srm here

  • Prometheus version:

    Prometheus 2.3.2

  • Alertmanager version:

    insert output of alertmanager --version here (if relevant to the issue)

  • Prometheus configuration file:

    global:
      scrape_interval: 15s
      evaluation_interval: 5s
      external_labels:
        monitor: 'codelab-monitor'

    rule_files:
      - '/var/run/prometheus/alerts/*.yml'

    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['127.0.0.1:9091']

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 127.0.0.1:9091

  • Alertmanager configuration file:

    insert configuration here (if relevant to the issue)

  • Logs:

    insert Prometheus and Alertmanager logs relevant to the issue here

anumercian (Author) commented Mar 4, 2019

[two screenshots attached]

SuperQ (Member) commented Mar 5, 2019

I would recommend upgrading to a more current release. There have been a lot of performance improvements and bug fixes since 2.3.2.

I'm not sure where you're getting this data, but you might want to look at rate(process_cpu_seconds_total[1m]).

Also, please include the queries used for the included graphs. Without the queries we have no idea what the graphs mean.

anumercian (Author) commented Mar 5, 2019

Sure, that was a REST query that reads the "top"-style usage of the Prometheus daemon.

Query: rate(process_cpu_seconds_total[1m])

For 6 hours:
[screenshot: rate(process_cpu_seconds_total[1m]) over 6 hours]

For 2 days:
[screenshot: rate(process_cpu_seconds_total[1m]) over 2 days]

SuperQ (Member) commented Mar 5, 2019

Based on the graph, it looks like whatever is producing the other metric data is reporting false information.

anumercian (Author) commented Mar 5, 2019

@SuperQ What do you mean by "other metric data"? The difference between the OS-reported memory of 450 MB and the Prometheus heap pprof of 120 MB?

SuperQ (Member) commented Mar 5, 2019

No, the CPU spikes seem non-existent in the Prometheus direct instrumentation. I'm not sure where Daemon_resource_utilization comes from, but it appears inaccurate.

anumercian (Author) commented Mar 5, 2019

daemon_resource_utilization is similar to the output of "top" in a Linux shell, collected for each running daemon, and I am seeing Prometheus spike to above 100% every 1-3 minutes.

SuperQ (Member) commented Mar 5, 2019

Prometheus CPU usage depends on what queries are being performed. The average CPU use seems to be about 10% of one core.

I don't see an actual issue here, other than "it's using resources".
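
A rough sanity check, not from the thread: a spike to 100% of one core for about 5 seconds out of every 60 seconds averages to 5/60 ≈ 8% of a core, which is consistent with the ~10% average above, so short periodic spikes in top and a modest 1-minute rate are not necessarily contradictory.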

SuperQ (Member) commented Mar 5, 2019

As for the memory issue, can you please share the results of the query for process_resident_memory_bytes? This is the correct metric for memory use.
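
For anyone reproducing this, a minimal Go sketch of running that instant query against the Prometheus HTTP API with the official client library (assuming a recent github.com/prometheus/client_golang; the address and job label are taken from the configuration above and may differ in other setups):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/prometheus/client_golang/api"
        v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    func main() {
        // Address taken from the scrape config in this issue; adjust as needed.
        client, err := api.NewClient(api.Config{Address: "http://127.0.0.1:9091"})
        if err != nil {
            log.Fatal(err)
        }
        promAPI := v1.NewAPI(client)

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Instant query for the server's own resident memory.
        result, warnings, err := promAPI.Query(ctx,
            `process_resident_memory_bytes{job="prometheus"}`, time.Now())
        if err != nil {
            log.Fatal(err)
        }
        if len(warnings) > 0 {
            log.Println("warnings:", warnings)
        }
        fmt.Println(result)
    }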

anumercian (Author) commented Mar 5, 2019

Okay, could the CPU spikes that we see in "top" be from the Go garbage collector?
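
Brief GC cycles can show up as short CPU bursts in top. Prometheus exports its own GC metrics (for example go_gc_duration_seconds and the go_memstats_* series), so graphing those alongside the spikes is the simplest way to check. As a generic Go sketch only (not Prometheus code), the same information is available inside any Go process:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        var ms runtime.MemStats
        for i := 0; i < 5; i++ {
            runtime.ReadMemStats(&ms)
            // NumGC counts completed GC cycles; GCCPUFraction is the fraction of
            // the program's available CPU time used by the GC since it started.
            fmt.Printf("GC cycles: %d  GC CPU fraction: %.4f  total pause: %s\n",
                ms.NumGC, ms.GCCPUFraction, time.Duration(ms.PauseTotalNs))
            time.Sleep(10 * time.Second)
        }
    }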

anumercian (Author) commented Mar 5, 2019

Will post the result of the process_resident_memory_bytes query shortly.

anumercian (Author) commented Mar 5, 2019

This is the heap profile:

[screenshot: heap profile]

simonpasquier (Member) commented Mar 6, 2019

Closing since the discussion is also happening on the mailing list (here) and there is no evidence of a problem.
