
Idle cluster, CPU usage increasing #10659

Closed
lavalamp opened this issue Jul 2, 2015 · 46 comments
Labels: kind/bug, priority/important-soon, sig/node

Comments

@lavalamp
Member

lavalamp commented Jul 2, 2015

Over the course of ~18 hours, my idle cluster's CPU usage went from ~5% to ~20%. I repeated this with another cluster, which also seems to have a single node with increasing CPU usage.

Surprisingly, kubelet seems to be the hog. I used pprof to figure it out: it turns out to be the exec-based liveness probe that the kube-dns pod uses. Turning that off makes kubelet stop using so much CPU. The before-and-after pprof charts were very clear.
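For readers who want to reproduce this kind of diagnosis: a CPU profile of the kubelet can be captured from its debug/pprof handlers, roughly as in the sketch below. The node address, port, and the assumption that the endpoint is reachable without authentication are placeholders, not details from the setup described in this issue.

```go
// Minimal sketch, assuming the kubelet exposes /debug/pprof on the node:
// fetch a 30-second CPU profile and save it for `go tool pprof`.
// The node name and port below are placeholders.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	const profileURL = "http://node-1:10250/debug/pprof/profile?seconds=30" // placeholder address

	resp, err := http.Get(profileURL)
	if err != nil {
		log.Fatalf("fetching profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("kubelet.pprof")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote kubelet.pprof; inspect with: go tool pprof kubelet.pprof")
}
```

Comparing a profile taken with the probe enabled against one taken with it disabled is what made the culprit so visible here.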

I don't have an answer for why the usage grows over time; maybe docker keeps some history such that every exec returns a bit more data than the previous one?

@lavalamp lavalamp added the priority/important-soon and sig/node labels Jul 2, 2015
@lavalamp
Member Author

lavalamp commented Jul 2, 2015

@dchen1107 for node problem

@thockin can we replace the exec liveness probe with something?

@lavalamp lavalamp added this to the v1.0-candidate milestone Jul 2, 2015
@bprashanth
Contributor

I made a mental note to myself that if anything was using exec on the node we might run into weirdness because of a potential timer leak, but I dismissed it at the time. Does your pprof have a bloated siftdown timer block, by any chance?

@bprashanth
Contributor

Hmm, I take that back; the timeout is really too small. But can you post the pprof?

@lavalamp
Member Author

lavalamp commented Jul 2, 2015

Sorry, I didn't actually save the pprof from the broken state, but it's easy to reproduce. It did have a small timer section, but the vast majority of the time went to JSON-decoding docker output (and the associated garbage collection).

@bprashanth
Contributor

Then ignore my eagerness to correlate this with other CPU issues I've seen.

@derekwaynecarr
Member

So is this an issue with any exec liveness probe or just this probe?


@lavalamp
Member Author

lavalamp commented Jul 2, 2015

@bprashanth You're not wrong, that's clearly a bug, just not this bug :)

@derekwaynecarr I think it's probably all exec probes, barring further evidence.

Further detail: after turning off the exec probe, my cluster's CPU stopped growing but did not go down.

@dchen1107
Member

Will look into it tomorrow.

@dchen1107 dchen1107 added the kind/bug label Jul 2, 2015
@dchen1107 dchen1107 modified the milestones: v1.0, v1.0-candidate Jul 2, 2015
@thockin
Member

thockin commented Jul 2, 2015

Without that probe we have no real way to know when DNS is in trouble.


@lavalamp
Member Author

lavalamp commented Jul 4, 2015

Further follow-up: I restarted the offending kubelet and its CPU usage dropped back to the initial level. Restarting a different kubelet had no effect. So it seems clear that exec leaks something.

@dchen1107
Member

I took measurements over the long weekend through Heapster and couldn't reproduce the issue. I did observe a slight CPU usage increase from running docker exec, but nothing as pronounced as what is reported here. I chatted with @lavalamp offline; he observed the increase in the GCE Pantheon UI, which is known to be inaccurate since it includes all the overhead of running the VM itself.

@lavalamp
Member Author

lavalamp commented Jul 6, 2015

I can readily believe it's not accurate (it doesn't agree with top, for example), but since restarting kubelet made the numbers go back down, I don't believe the VM overhead is a contributing factor.

@dchen1107
Member

I synced my client to HEAD, which includes #10763, then brought up a cluster after lunch and monitored kubelet usage through Heapster; I couldn't reproduce the issue. I will leave the cluster running overnight and recheck it.

@lavalamp
Member Author

lavalamp commented Jul 7, 2015

I started a cluster from HEAD just now, too. We'll see if I repro. :)


@dchen1107
Member

I left my cluster running overnight. Node af12 has the kube-dns pod running, which invokes docker exec every 10s. Below is the kubelet CPU usage on node af12 over the last 18 hours:

[Graph: kubelet_cpuusage_af12 — kubelet CPU usage on node af12 over the last 18 hours]

There is a slight CPU usage increase, but overall usage is ~2%. Based on this graph, it is not worth doing anything for 1.0.

We will continue monitoring this over a longer soak period.

@dchen1107 dchen1107 self-assigned this Jul 7, 2015
@alex-mohr alex-mohr added the priority/critical-urgent label and removed the priority/important-soon label Jul 7, 2015
@lavalamp
Member Author

lavalamp commented Jul 7, 2015

I can no longer reproduce this, either. Will assume that #10763 fixed the vast majority of the problem. Thanks, Prashanth!

@lavalamp lavalamp closed this as completed Jul 7, 2015
@alex-mohr
Contributor

Triaged to P0 in war room -- continually increasing CPU usage breaks clusters, so until we're sure it's fixed (and not just likely fixed), leaving open as P0.

FWIW, that CPU reporting is in some sense the most accurate, with the caveat that it includes all of the VM's usage -- user level, system level, kernel level, and yes, hypervisor overhead -- and reflects what the user sees as available for their own use. Even if the CPU usage is coming from a source other than kubelet, Kubernetes isn't usable if $unknown_source climbs to 50+% usage.

@alex-mohr alex-mohr reopened this Jul 7, 2015
@alex-mohr
Contributor

Also, @dchen1107 or @lavalamp -- seems like there's a test or release coverage gap here? Presumably this isn't something that'd make it out to an official borg or omega release. Do either of you have enough context to file an issue and/or propose something that would catch these next time?

@brendandburns
Contributor

We could also use a TCP socket health check on port 53 as a poor man's health check.
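For reference, a plain TCP check of the DNS port amounts to roughly the sketch below; the service address and timeout are placeholders, not values taken from this cluster.

```go
// Minimal sketch of a "poor man's" DNS health check: confirm that something
// is accepting TCP connections on port 53. The address and timeout are
// placeholder values for illustration.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "10.0.0.10:53", 2*time.Second)
	if err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	conn.Close()
	fmt.Println("healthy: port 53 is accepting connections")
}
```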

@thockin
Member

thockin commented Jul 7, 2015

This would be only marginally better than no liveness probe - I am not so much worried about the server not listening as I am about it not producing results.

@brendandburns
Contributor

yeah, though is bouncing the container really going to help it produce results?

@thockin
Member

thockin commented Jul 7, 2015

Maybe. It helped in the past when skydns could not talk to etcd because it raced at startup. I really want multi-container or whole-pod bounces, but that is a larger feature (already filed).


dchen1107 added a commit to dchen1107/kubernetes-1 that referenced this issue Jul 7, 2015
@dchen1107
Member

@alex-mohr on why not switch back to nsinit: kubelet still has that support if docker exec is not available, so we could fall back to nsinit. But the containervm image removed nsinit recently, after we switched to docker's native exec. It would also introduce another dependency on our nodes.

yujuhong added a commit that referenced this issue Jul 8, 2015
Disable liveness for dns due to #10659
@goltermann goltermann added the priority/important-soon label and removed the priority/critical-urgent label Jul 8, 2015
@goltermann goltermann modified the milestones: v1.0-post, v1.0 Jul 8, 2015
@goltermann
Contributor

Moving from P0 to P1 and to the v1.0-post milestone. A workaround is in, and that's enough for v1.

@dchen1107
Member

#10760 for release document

@philk

philk commented Jul 8, 2015

moby/moby#12899 is probably relevant to this issue. We ran into a number of weird issues (CPU, memory, stalls, failure to exec) when using docker exec for a periodic health check (not even using k8s).
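For context, the pattern philk describes (a health check driven by docker exec on a timer) looks roughly like the sketch below; the container name and the probed command are hypothetical stand-ins, and each tick creates a fresh exec session that the daemon has to set up and clean up.

```go
// Sketch of a periodic docker-exec health check of the kind that ran into
// trouble in moby/moby#12899. The container name and the command being
// exec'd are hypothetical stand-ins.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Every tick spawns a new `docker exec`; the daemon has to create
		// and tear down an exec instance each time.
		out, err := exec.Command("docker", "exec", "kube-dns", "nslookup", "kubernetes.default").CombinedOutput()
		if err != nil {
			log.Printf("probe failed: %v (output: %s)", err, out)
			continue
		}
		log.Print("probe ok")
	}
}
```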

@piosz
Member

piosz commented Jul 9, 2015

This may also be caused by Heapster itself: kubernetes-retired/heapster#397

@timothysc
Member

/cc @jayunit100 @rrati.

@dchen1107
Member

@piosz that explains the Heapster memory leak issue I observed when working on #10653 (comment).

@bprashanth
Contributor

Heapster wasn't on my node, and that doesn't square with the exec probe restart causing CPU to fall. Of course, Heapster has issues of its own :)

@lavalamp
Member Author

lavalamp commented Jul 9, 2015

Heapster may have its own problems, but it wasn't the cause of my initial report.


@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@yujuhong
Contributor

The docker exec issue (moby/moby#14444) was closed long ago, and we already have a resource usage test running repeatedly in the soak cluster. Closing this ancient issue.
