Idle cluster, CPU usage increasing #10659
Comments
@dchen1107 for the node problem. @thockin, can we replace the exec liveness probe with something?
I made a mental note to myself that if anything was using exec on the node we might run into weirdness because of a potential timer leak, but I dismissed it at the time. Does your pprof have a bloated siftdown timer block, by any chance?
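For context, siftdownTimer is part of maintaining the Go runtime's timer heap, and it shows up prominently in profiles when many pending timers pile up. Below is a minimal, hypothetical sketch of the kind of leak being alluded to; it is not actual kubelet code, just the common pattern:

```go
package main

import (
	"fmt"
	"time"
)

// pollWithTimerLeak shows the leak pattern: time.After allocates a brand-new
// runtime timer on every loop iteration, and any timer abandoned because an
// event arrived first stays in the runtime's timer heap until it expires.
// A busy loop therefore keeps the heap large and siftdownTimer busy.
func pollWithTimerLeak(events <-chan string, timeout time.Duration) {
	for {
		select {
		case e := <-events:
			fmt.Println("handled", e)
		case <-time.After(timeout): // fresh timer every iteration
			return
		}
	}
}

// pollWithoutLeak reuses a single timer across iterations instead.
func pollWithoutLeak(events <-chan string, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case e := <-events:
			fmt.Println("handled", e)
			if !timer.Stop() {
				<-timer.C // drain a timer that fired while we were busy
			}
			timer.Reset(timeout)
		case <-timer.C:
			return
		}
	}
}

func main() {
	events := make(chan string, 3)
	events <- "a"
	events <- "b"
	events <- "c"
	pollWithoutLeak(events, 200*time.Millisecond)
}
```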
Hmm, I take that back; the timeout is too small, really. But can you post the pprof?
Sorry, I didn't actually save the pprof from the broken state. But it's easy to reproduce. It did have a small timer section, but the vast majority of the time went to JSON decoding of docker output (and the associated garbage cleanup).
Then ignore my eagerness to correlate this with other CPU issues I've seen.
So is this an issue with any exec liveness probe, or just this probe?
@bprashanth You're not wrong, that's clearly a bug, just not this bug :) @derekwaynecarr I think it's probably all exec probes, barring further evidence. Further detail: after turning off the exec probe, my cluster's CPU stopped growing but did not go down.
Will look into it tomorrow.
Without that probe we have no real way to know when DNS is in trouble.
Further followup: I rebooted the offending kubelet and the CPU dropped back to its initial level. Rebooting a different kubelet didn't have any effect. So it seems clear that exec leaks something.
I did measurements over the long weekend through heapster and couldn't reproduce the issue. I did observe a slight cpu usage increase from running docker exec, but not as obvious as what was reported here. Chatted with @lavalamp offline; he observed such an increase from the GCE Pantheon UI, which is known to be NOT ACCURATE since it includes all the overhead of running the VM itself.
I can readily believe it's not accurate (it doesn't agree with
I synced my client to HEAD, which includes #10763, then brought up a cluster after lunch and monitored kubelet usage through heapster; I couldn't reproduce the issue. I will leave the cluster running for one night and recheck it.
I started a cluster from head just now, too. We'll see if I repro. :)
I left my cluster running overnight. Node af12 has the kubernetes dns pod running and invokes docker exec every 10s. Below is the kubelet cpu usage on node af12 over the last 18 hours: there is a slight cpu usage increase, but overall usage is ~2%. Based on this graph, it is not worth doing anything for 1.0. We are going to continue monitoring this for more soak time.
I can no longer reproduce this, either. Will assume that #10763 fixed the vast majority of the problem. Thanks, Prashanth!
Triaged to P0 in war room -- continually increasing CPU usage breaks clusters, so until we're sure it's fixed (and not just likely fixed), leaving open as P0. FWIW, the CPU reporting is in some sense the most accurate, with the caveat that it includes all of the VM's usage -- user level, system level, kernel level, and yes, hypervisor overheads -- and is reflective of what the user sees as available for their own use. Even if the cpu usage is coming from a different source than kubelet, kubernetes isn't usable if $unknown_source climbs to 50+% usage.
Also, @dchen1107 or @lavalamp -- seems like there's a test or release coverage gap here? Presumably this isn't something that'd make it out to an official borg or omega release. Do either of you have enough context to file an issue and/or propose something that would catch these next time?
We could also use a TCP socket check on port 53 as a poor man's health check.
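For illustration, here is a rough sketch of the two probe flavors expressed with the current k8s.io/api Go types. The exec command is a placeholder rather than the exact kube-dns check, and older client versions lay the fields out slightly differently:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Exec-style liveness probe, similar in shape to what kube-dns used.
	// The command below is illustrative, not the exact kube-dns check.
	execProbe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"sh", "-c", "nslookup kubernetes.default 127.0.0.1"},
			},
		},
		InitialDelaySeconds: 30,
		TimeoutSeconds:      5,
	}

	// "Poor man's" alternative from the comment above: only verify that
	// something is listening on port 53, avoiding docker exec entirely.
	tcpProbe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(53),
			},
		},
		InitialDelaySeconds: 30,
		TimeoutSeconds:      5,
	}

	fmt.Printf("exec probe: %+v\ntcp probe: %+v\n", execProbe, tcpProbe)
}
```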
This would be only marginally better than no liveness probe - I am not so
Yeah, though is bouncing the container really going to help it produce results?
Maybe. It helped in the past when skydns could not talk to etcd because it
@alex-mohr on why not switch back to nsinit: kubelet still has that support if docker exec is not available, so we could fall back to nsinit. But the containervm image removed nsinit recently, after we switched to docker's native exec. Also, this introduces another dependency on our node.
Disable liveness for dns due to #10659
Moving from P0 to P1 and to v1.0-post milestone. A workaround is in, and that's enough for v1.
#10760 for the release document.
moby/moby#12899 is probably relevant to this issue. We ran into a number of weird issues (cpu, memory, stalls, failure to exec) when using docker exec for a periodic health check (not even using k8s).
This may also be caused by Heapster itself: kubernetes-retired/heapster#397
/cc @jayunit100 @rrati.
@piosz that explains Heapster's memory leakage issue I observed when working on
Heapster wasn't on my node, and that doesn't correlate with the exec probe restart causing cpu to fall. Of course, heapster has issues of its own :)
Heapster may have its own problems, but it wasn't the cause of my initial
The docker exec issue (moby/moby#14444) was closed long ago, and we already have resource usage tests running repeatedly in the soak cluster. Closing this ancient issue.
Over the course of ~ 18 hours, my idle cluster's CPU usage went from ~5% -> ~20%. I repeated it with another cluster, which seems to have a single node with increasing CPU usage.
Surprisingly, kubelet seems to be the hog. I used pprof to figure it out. It turns out to be the exec-based liveness probe that the kube-dns pod uses. Turning that off makes kubelet stop using so much CPU. The before & after pprof chart was very clear.
I don't have an answer for why the usage grows over time; maybe docker keeps some history such that every exec returns a bit more data than the previous one?
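For reference, the kubelet exposes Go's standard net/http/pprof endpoints, which is what makes this kind of CPU profiling possible. A minimal sketch of how any Go daemon wires that up (the port here is arbitrary, not the kubelet's actual port):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Once this is serving, a 30-second CPU profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```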