CPU usage skyrocketed after update from 2.3 to 2.6 #5162

Closed

rsommer opened this Issue Jan 31, 2019 · 18 comments · 4 participants

rsommer commented Jan 31, 2019

Bug Report

Today I updated from Prometheus 2.3 to Prometheus 2.6. After installation, CPU usage skyrocketed on both nodes (HA setup).

What did you expect to see?
Nearly the same CPU usage as with 2.3

What did you see instead? Under which circumstances?
CPU usage increased from 5% to 70%

Environment

  • System information:

    Linux 4.9.0-8-amd64 x86_64

  • Prometheus version:

    prometheus, version 2.6.0+ds (branch: debian/sid, revision: 2.6.0+ds-1)
    build user: pkg-go-maintainers@lists.alioth.debian.org
    build date: 20181219-15:52:20
    go version: go1.10.4

  • Prometheus configuration file:

global:
  scrape_interval:     10s
  evaluation_interval: 10s
  external_labels:
      monitor: 'infrastructure'
      environment: 'production'
      replica: 'prom02'

scrape_configs:
  - job_name: 'prometheus'

    static_configs:
        - targets: ['localhost:9090']

    file_sd_configs:
        - files:
            - '/etc/prometheus/*.json'
          refresh_interval: 1m

  - job_name: 'slow'
    scrape_interval: 1m
    scrape_timeout: 45s
    file_sd_configs:
        - files:
            - '/etc/prometheus/slow/*.json'
          refresh_interval: 1m

There are currently 642 targets configured via file_sd.
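For context, each file_sd JSON file in those directories is a list of target groups with optional labels. A minimal sketch (hostnames and labels here are hypothetical, not taken from this setup):

[
  {
    "targets": ["node01.example.com:9100", "node02.example.com:9100"],
    "labels": {"role": "node"}
  }
]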
(screenshot: node-monitoring)

simonpasquier (Member) commented Jan 31, 2019

I see that you've installed from the Debian package. Can you try the same configuration with the binary from this archive instead?

rsommer (Author) commented Jan 31, 2019

I switched to

prometheus, version 2.6.0 (branch: HEAD, revision: dbd1d58c894775c0788470944b818cc724f550fb)
  build user:       root@bf5760470f13
  build date:       20181217-15:14:46
  go version:       go1.11.3

but I still see a huge increase in CPU usage; after the restart it's about 50% instead of 5%.

brian-brazil (Member) commented Jan 31, 2019

Could you get a CPU profile of the 2.6.0?
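A CPU profile like the one requested can be captured from Prometheus's built-in pprof endpoint; a rough sketch, assuming the server listens on localhost:9090:

# capture a 60-second CPU profile from the running server
curl -sf -o pprof.cpu.pb.gz 'http://localhost:9090/debug/pprof/profile?seconds=60'
# list the hottest functions
go tool pprof -top pprof.cpu.pb.gz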

rsommer (Author) commented Jan 31, 2019

Attached is a 60-second sample: pprof.prometheus.samples.cpu.003.pb.gz

simonpasquier (Member) commented Jan 31, 2019

Hmm, it looks like most of the load comes from the remote read endpoint.

brian-brazil (Member) commented Jan 31, 2019

Yip, what do you have hitting that?

rsommer (Author) commented Jan 31, 2019

These instances are part of a Thanos HA setup, so that should be the query nodes. But besides the version bump of the Prometheus servers, everything else was untouched. The number of queries has not increased. Any idea what additional metrics or profiling data I should look at to find the culprit?

rsommer (Author) commented Jan 31, 2019

(screenshot: prometheus-2-0-overview)
Mmh, there is something visible on the GC dashboard.

brian-brazil (Member) commented Jan 31, 2019

I don't see any obvious culprits in the change history. Do the remote read metrics show an increase in traffic?
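One way to check that from the Prometheus side is via its own metrics; a sketch, where the first metric is the gauge mentioned later in this thread and the second is hedged, since handler metric and label names vary between Prometheus versions:

# remote read queries currently in flight
prometheus_api_remote_read_queries

# approximate remote read request rate via the HTTP handler metrics
# (metric/label names may differ depending on the Prometheus version)
rate(prometheus_http_request_duration_seconds_count{handler="/api/v1/read"}[5m])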

rsommer (Author) commented Jan 31, 2019

Ok, very strange ... about 30 minutes ago the GC graph dropped, and so has the CPU usage.

rsommer (Author) commented Jan 31, 2019

The prometheus_api_remote_read_queries metric was not available in the Prometheus 2.3 setup. Since restarting with 2.6, its value was constantly around 10, and it dropped to 0 at the same time the CPU and GC graphs dropped.

brian-brazil (Member) commented Jan 31, 2019

Sounds like something on the Thanos side then. @bwplotka

bwplotka (Contributor) commented Jan 31, 2019

Not sure how it can be on the Thanos side if the Thanos version & load did not change but Prometheus's resource consumption did.

What I think changed across those versions on the Prometheus remote read side are the new sample and concurrency limits:

--storage.remote.read-sample-limit=5e7
--storage.remote.read-concurrent-limit=10

Maybe that affected performance? For example, you might constantly be hitting one of the limits (the concurrency limit is quite low, 10 by default), or the checks themselves might cost some raw performance.

I would increase the concurrency limit to at least 100, since Thanos makes remote read your main read path from Prometheus, and see whether that improves things.
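As an illustration of that suggestion, the limits are set on the Prometheus command line; the config path and values below are only an example, not the reporter's actual settings:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.remote.read-sample-limit=5e7 \
  --storage.remote.read-concurrent-limit=100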

rsommer (Author) commented Jan 31, 2019

I increased the read-concurrent-limit and will keep an eye on the servers to see whether the CPU load stays low or rises again.

rsommer (Author) commented Jan 31, 2019

The load is high again and it's coming from the Thanos sidecar. I'll try to dig deeper into why it's happening periodically.

rsommer (Author) commented Feb 1, 2019

Ok, I think this can be closed. After digging backwards through the complete metric stack, it seems to have been triggered by a query over a large time window with a 10s refresh rate, which was coincidentally started around the time the update occurred. Nevertheless, I did a few tests in our staging environment and can reproducibly see higher load on Prometheus when comparing 2.3 to 2.6. When displaying a node exporter dashboard with a 10s refresh, I see around 28% CPU usage with Prometheus 2.3; after updating to 2.6 I end up at nearly 40%.
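One hedged way to quantify such a comparison is Prometheus's own process metrics; a sketch, assuming the self-scrape job name from the configuration above:

# CPU cores used by the Prometheus process, averaged over 5 minutes
rate(process_cpu_seconds_total{job="prometheus"}[5m])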

simonpasquier (Member) commented Feb 1, 2019

Closing it then; feel free to re-open if you need to. Can you compare memory usage too? It could be that garbage collection is more aggressive, given that 2.3 and 2.6 aren't built with the same Go runtime version.
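For the memory and GC comparison, the Go runtime metrics exported by Prometheus itself are a reasonable starting point; a sketch, assuming the same job label as above:

# resident memory of the Prometheus process
process_resident_memory_bytes{job="prometheus"}

# heap in use and GC rate, to spot a more aggressive garbage collector
go_memstats_heap_inuse_bytes{job="prometheus"}
rate(go_gc_duration_seconds_count{job="prometheus"}[5m])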

rsommer (Author) commented Feb 1, 2019

Memory usage was also higher after switching to 2.6. What I also see on the node dashboard is much more network traffic after the update while the dashboard queries are running. I will try to look into that.
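The network traffic observation could be followed up with node exporter counters on the Prometheus hosts; a sketch, using the metric names of node_exporter 0.16+ and a hypothetical instance label:

# bytes transmitted by the Prometheus host, e.g. remote read responses to the Thanos queriers
rate(node_network_transmit_bytes_total{instance="prom02:9100", device!="lo"}[5m])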
