
High CPU Usage with Recording Rules and ZK Patch #2481

Closed
jjneely opened this Issue Mar 7, 2017 · 6 comments


jjneely commented Mar 7, 2017

What did you do?

I grabbed Prometheus 1.5.2, applied the ZK patch from PR #2470, and built and tested it. Next I had a team use the new version; they monitor a pretty healthy list of Aurora (ZooKeeper discovery) jobs and make heavy use of recording rules. We saw an immediate jump in CPU compared to the prior Prometheus version (1.2.1): it started consuming all the CPU resources we had granted to the Aurora job running it.

[Screenshot: prometheus-zk-cpu-bug (CPU usage graph)]

What did you expect to see?

As the configuration was unchanged, we expected to see similar CPU usage from prior versions.

What did you see instead? Under which circumstances?

CPU usage appeared normal when I attempted to reproduce the issue with just the prometheus.yml config file copied over to my test environment; it stayed under 2 cores. However, when I added the recording rules, CPU usage spiked to 17 cores on my test hardware. This bare-metal server now runs at a load of 17+.

I got out the profiling tools to take a look and produced the following SVG:

[Image: p-test pprof (CPU profile SVG)]
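For anyone wanting to reproduce the profile: Prometheus embeds Go's net/http/pprof handlers on its web port, so a CPU profile can be pulled over HTTP and rendered with go tool pprof. A minimal Go sketch of that fetch follows (the localhost:9090 URL and the file names are assumptions about my setup; normally you would just point go tool pprof or curl at the URL):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch a 30-second CPU profile from Prometheus's built-in pprof endpoint.
	resp, err := http.Get("http://localhost:9090/debug/pprof/profile?seconds=30")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Save the raw profile; render it afterwards, e.g.:
	//   go tool pprof -svg ./prometheus prometheus.pprof > p-test-pprof.svg
	out, err := os.Create("prometheus.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```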

Why would recording rules cause more time to be spent in ZookeeperDiscovery?

Environment

My test hardware is Ubuntu Trusty running kernel 3.13.0-85-generic, with 64 GB of RAM and 24 cores.

Prometheus was compiled with Go 1.7.4.

  • System information:

    Linux 3.13.0-85-generic x86_64

  • Prometheus version:

    prometheus, version 1.5.2 (branch: foo, revision: 1.5.2-1+JENKINS~trusty.20170306212211)
    build user: jjneely@42lines.net
    build date: 2017-03-06T21:22:40Z
    go version: go1.7.4

  • Prometheus configuration file:

prometheus-test.txt

  • Prometheus Rules:

prometheus-test.rules.txt

jjneely commented Mar 7, 2017

Testing this a bit more: the condition is triggered not by the addition of recording rules, but when I SIGHUP Prometheus to reload its rules.

StephanErb commented Mar 22, 2017

Were you even doing a SIGHUP for the Prometheus 1.2.1 setup? If I am not mistaken, it was not even working there. At least for us, combining ZK service discovery and reloads always ended in crashes.

In addition, did this happen after a single reload or only after multiple ones? Maybe we are missing a necessary cleanup of old watchers or something like that.
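For illustration, here is a minimal sketch (hypothetical code, not the actual util/treecache implementation; watch, conn, and path are made-up names) of the kind of leak a missing cleanup would produce: each reload starts another watcher loop with no way to stop the previous ones.

```go
package sketch

import "github.com/samuel/go-zookeeper/zk"

// watch is a hypothetical stand-in for a watcher loop. If a config reload
// calls watch() again without first stopping the previous goroutine, every
// SIGHUP leaves one more loop behind, each re-arming ZooKeeper watches
// forever: goroutines and watch events pile up with every reload.
func watch(conn *zk.Conn, path string) {
	go func() {
		for {
			// ChildrenW registers a one-shot watch on the node's children;
			// the loop re-arms it each time the watch fires.
			_, _, events, err := conn.ChildrenW(path)
			if err != nil {
				return
			}
			<-events // no stop channel, so this goroutine can never be told to exit
		}
	}()
}
```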

jjneely commented Mar 22, 2017

Yeah, later testing reveals that this affects 1.2.1 as well. Maybe not as badly, as reloads are not uncommon. But yeah, there's some cleanup that's not working here.

StephanErb commented Apr 14, 2017

I am able to reproduce the issue with Prometheus 1.6.0.

Observations after a SIGHUP:

  • level=info msg="Loading configuration file prometheus.yml" source="main.go:251" is logged.
  • CPU utilisation jumps significantly, maxing out the host.
  • Discovered targets drop out of /targets.
  • While it was constant before, the prometheus_treecache_watcher_goroutines metric now increases steadily.
  • After a while, CPU utilisation drops, targets show up again, and prometheus_treecache_watcher_goroutines stays constant, until we send the next SIGHUP.

Edit:

After a couple of further reloads, the server crashed with:

time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
time="2017-04-14T22:59:12Z" level=debug msg="Received Zookeeper event: {EventNodeChildrenChanged Unknown /aurora/testuser/devel/user1-test-ps-survives-gw.backend.ps <nil> }" source="treecache.go:118"
panic: close of closed channel

goroutine 259 [running]:
github.com/prometheus/prometheus/util/treecache.(*ZookeeperTreeCache).loop(0xc428f6dbc0, 0xc4201618b7, 0x9)
        /go/src/github.com/prometheus/prometheus/util/treecache/treecache.go:171 +0xb54
created by github.com/prometheus/prometheus/util/treecache.NewZookeeperTreeCache
        /go/src/github.com/prometheus/prometheus/util/treecache/treecache.go:89 +0x1bb
time="2017-04-14T22:59:13Z" level=info msg="Starting prometheus (version=1.6.0, branch=master, revision=10f645382988add09c8e3420d0d7ff2dc4d36a29)" source="main.go:88"
StephanErb commented Apr 30, 2017

@jjneely I have filed the pull request #2669 with a fix. Would be great if you could give it a test run as well.
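I have not tied this to the exact change in the PR, but the general shape of a fix for a double close is to make the teardown idempotent, e.g. with a sync.Once guard (a sketch, not necessarily what #2669 does):

```go
package sketch

import "sync"

// stopper makes teardown idempotent: no matter how many goroutines call
// Stop(), the channel is closed exactly once, so the "close of closed
// channel" panic cannot occur.
type stopper struct {
	once sync.Once
	ch   chan struct{}
}

func newStopper() *stopper {
	return &stopper{ch: make(chan struct{})}
}

// Stop may be called any number of times, from any goroutine.
func (s *stopper) Stop() {
	s.once.Do(func() { close(s.ch) })
}

// Done returns a channel that is closed once Stop has been called.
func (s *stopper) Done() <-chan struct{} {
	return s.ch
}
```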

@juliusv juliusv closed this in #2669 May 2, 2017

