High CPU after config change and SIGHUP #1114
I'm not entirely sure how to interpret these, but here's an strace before / after a config reload:
After:
fabxc added the Critical label Sep 24, 2015
fabxc added this to the v0.16.0 milestone Sep 24, 2015
fabxc added the bug label Sep 25, 2015
So it looks like when a FileSD shuts down, it closes its target groups channel, causing the https://github.com/prometheus/prometheus/blob/master/retrieval/targetmanager.go#L142-L146 Even though @fabxc How about instead of
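To illustrate the general Go behavior being discussed: a receive from a closed channel never blocks, so a loop that keeps selecting on it without checking the `ok` flag spins at full speed. The following is a minimal sketch, not Prometheus code; `countSpins` and `drain` are hypothetical helpers invented for this example.

```go
package main

import "fmt"

// countSpins receives from ch up to limit times WITHOUT checking whether the
// channel is closed, and reports how many receives completed immediately.
// Hypothetical helper for illustration only.
func countSpins(ch <-chan int, limit int) int {
	spins := 0
	for i := 0; i < limit; i++ {
		select {
		case <-ch: // on a closed channel this case is always ready
			spins++
		default:
			return spins // channel is open and empty: nothing to receive
		}
	}
	return spins
}

// drain is the guarded variant: it stops as soon as the channel is closed.
func drain(ch <-chan int) []int {
	var values []int
	for {
		v, ok := <-ch
		if !ok {
			return values // closed: return instead of looping again
		}
		values = append(values, v)
	}
}

func main() {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2
	close(ch)

	fmt.Println(drain(ch))            // buffered values are still delivered
	fmt.Println(countSpins(ch, 1000)) // every receive on the closed channel succeeds
}
```

In a real `for { select { ... } }` loop there is no `limit`, so the unguarded variant would keep a core busy until the goroutine exits some other way.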
Actually, no. It looks like sometimes the
The outstanding PR for 0.17.0 is not a rewrite of this section.
@fabxc Hmm, ok. But looking at https://github.com/prometheus/prometheus/pull/1064/files, you actually did change exactly this code to always return in case of a closed target group channel. I still have to understand why the
Yes, that's a line I touched (and probably should have backported in this way as we decided to defer this PR).
Yeah, good to know and keep in mind. Still trying to figure out why
Tell me when you find out – I feel like this is related to the mysterious out-of-order appending in #1064.
Seems I found it out: since the target manager is reused and simply stopped and restarted, it gets a new
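For readers unfamiliar with the pattern, here is a generic sketch of a component that is stopped and restarted, where each run gets its own done channel and the goroutine captures that channel locally rather than re-reading a shared field. This is an illustration under assumptions, not the code from the actual fix; the `worker` type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical restartable component used only for illustration.
type worker struct {
	mtx       sync.Mutex
	done      chan struct{} // replaced on every Start
	wg        sync.WaitGroup
	processed int // only read after Stop has returned
}

// Start launches one goroutine for this run. The goroutine uses the local
// variable done, so a later Start that replaces w.done cannot confuse it.
func (w *worker) Start(updates <-chan int) {
	done := make(chan struct{})
	w.mtx.Lock()
	w.done = done
	w.mtx.Unlock()

	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for {
			select {
			case _, ok := <-updates:
				if !ok {
					return // closed input: exit instead of spinning
				}
				w.processed++
			case <-done:
				return
			}
		}
	}()
}

// Stop closes the done channel of the current run and waits for its goroutine.
func (w *worker) Stop() {
	w.mtx.Lock()
	done := w.done
	w.mtx.Unlock()
	close(done)
	w.wg.Wait()
}

func main() {
	w := &worker{}
	updates := make(chan int) // unbuffered: each send waits for a receive

	w.Start(updates)
	updates <- 1
	updates <- 2
	w.Stop()

	w.Start(updates) // restart with a fresh done channel
	updates <- 3
	w.Stop()

	fmt.Println(w.processed) // 3
}
```

The sketch assumes Stop is called exactly once per Start; calling it twice would close an already closed channel and panic.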
juliusv added a commit that referenced this issue Sep 25, 2015
juliusv referenced this issue Sep 25, 2015: Fix target manager CPU busyloop caused by bad done-channel handling. #1116 (Merged)
@dan-cleinmark Could you take a look whether #1116 fixes this issue for you? It fixes it for me.
juliusv added a commit that referenced this issue Sep 28, 2015
juliusv closed this in #1116 Sep 28, 2015
@juliusv belatedly, this looks great! Thanks for the quick turnaround.
@dan-cleinmark Great to hear! No, thank you for catching this pre-0.16.0!
fabxc added a commit that referenced this issue Jan 11, 2016
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

dan-cleinmark commented Sep 24, 2015
We're seeing very high CPU use on all cores after a SIGHUP if the prometheus config file changes. Originally we were on 0.15.1 rev. 7a6d12a and the issue still persists on 0.16.0rc1. CPU performance is as expected when the service first starts but consistently drops off after a config change & SIGHUP. A screenshot showing CPU load after a config change & HUP is below.
The green line shows idle CPU cores. The first drop follows an addition to the global labels and a HUP; the next two follow HUPs of prometheus after removing target groups. The increases correlate with restarts of the prometheus service. When CPU spikes after a config reload, we see ~4/4 cores in 'user' state, and on a 16-core box we see ~25% user, ~25% system and ~50% iowait. Interestingly, disk load stayed stable even though CPU iowait was very high.
Our prometheus config (200 total file_sd_config groups):