e2e flake: Kubelet [Serial] regular resource usage tracking for 35 pods per node over 20m0s #19014
Comments
Thanks for reporting. The test flaked twice in the last 30 builds. Looking now.
It hasn't happened in the last 30 builds, so I'm lowering the priority to p2 for the time being. This seems to happen in the extreme case where all add-on pods run on the same node. As we increase the number of pods, the noise becomes less significant. The better solution is to disable the add-on pods, but that could be problematic for other tests run before and after this test. I'll wait until the next occurrence to confirm my theory.
Update: no failure in the past week.
Can we move the test out of flaky and close the issue here?
The test is not in flaky and has been running fine. I'll close the issue for now until we see it happen again.
Happening again in the soak cluster.
Looking at the numbers, it seems to me that there's a memory leak in the kubelet.
I checked jenkins-e2e-minion-rcef and dumped the goroutines: https://gist.github.com/yujuhong/64f0f2c08a5f20f51099
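For reference, a minimal sketch of how such a dump can be collected from a Go daemon that serves net/http/pprof, as the kubelet does on its debug port; the node name and port below are placeholders for illustration, not values taken from this job:

```go
// goroutinedump.go: fetch full goroutine stacks from a pprof endpoint.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks net/http/pprof for complete, unabridged goroutine stacks.
	// Host and port here are assumptions; point this at the node's kubelet
	// debug endpoint in your own cluster.
	url := "http://jenkins-e2e-minion-rcef:10248/debug/pprof/goroutine?debug=2"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Dump the stacks to stdout so they can be saved to a file or gist.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
		os.Exit(1)
	}
}
```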
/cc @ncdc, @kubernetes/sig-node
@yujuhong what exec / attach / port-forward calls are in play during this test?
I am not sure which e2e test caused this, but https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go does test all of those functions.
Can you describe how this perf test relates to the other e2es?
The test was just monitoring the memory/CPU usage of docker and the kubelet, and it failed because of the high memory usage caused by the ~200 goroutines.
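(For context, a rough sketch of the kind of polling a resource-tracking test does; this reads RSS straight from /proc rather than from the kubelet/cAdvisor stats the e2e framework actually uses, and the pid and interval are placeholders.)

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the VmRSS line for the given pid (e.g. "123456 kB") by
// parsing /proc/<pid>/status. Linux-only; purely illustrative.
func rssKB(pid int) (string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %d", pid)
}

func main() {
	pid := 1 // hypothetical: substitute the kubelet's or docker's pid
	for i := 0; i < 3; i++ {
		rss, err := rssKB(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		fmt.Printf("pid %d VmRSS: %s\n", pid, rss)
		time.Sleep(10 * time.Second)
	}
}
```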
Right, but what is creating those spdy goroutines? Those only come from exec / attach / port-forward.
No, they must have been left over from previous e2e tests.
So it runs a bunch of e2es against a cluster, and then this test eventually runs against the same cluster?
Yes, the e2e tests are run continuously in the same cluster (the soak cluster), and this test is just one of them.
Ok, thanks for the clarification. I'll see if I can reproduce locally. I had previously fixed one of these issues (a spdy connection hanging while shutting down, which ultimately clears after 10 minutes), but that fix is already in Kube, afaik.
FYI, we create a new soak cluster every week, and the current version is v1.3.0-alpha.0.232+8b186991e2a857
@ncdc, I ran the "kubectl client" e2e test and after 15 minutes my node was still left with 25 connections/goroutines. You should be able to reproduce this easily. Let me know if there's anything I can help with.
@yujuhong yes, I can reproduce. I'll let you know what I find out (might not be until tomorrow).
The spdy goroutines are lingering because of the way that the httpstream wrapper handles closing a connection. The code currently calls stream.Close() for all streams, then conn.Close() for the underlying spdy connection. The problem with this approach is that stream.Close() is unidirectional and only indicates that no more data will be sent over that stream, while the spdystream connection's Close() function is designed to be graceful, in that it waits up to 10 minutes for all streams to be torn down. In order for a stream to be fully torn down, either a) both the client and server must each issue stream.Close(), or b) either side can issue stream.Reset(), which fully tears down the stream in both the client and the server.

I have changed stream.Close() to stream.Reset() locally and that appears to fix the majority of the lingering goroutines. What I'm seeing now is that if you exec with a tty (e.g.

I'm still looking into why the spdy stream isn't unblocking the Read call appropriately.
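To make the Close()/Reset() distinction concrete, here is a minimal sketch using simplified stand-in interfaces rather than the real httpstream/spdystream types: Close() only half-closes a stream, so a connection-level Close() can block waiting on the peer, while Reset() tears the stream down on both ends immediately.

```go
package main

import "log"

// Stream and Connection are hypothetical stand-ins whose method names mirror
// the semantics described above; they are not the Kubernetes interfaces.
type Stream interface {
	// Close sends a half-close ("no more data from this side"); the peer
	// must also close before the stream is fully torn down.
	Close() error
	// Reset fully tears down the stream on both the client and the server.
	Reset() error
}

type Connection interface {
	// Close is graceful: it can block (up to ~10 minutes in spdystream)
	// waiting for every stream to be fully torn down.
	Close() error
}

// teardown forcibly resets every stream before closing the connection, so
// Close does not sit around waiting on half-closed streams.
func teardown(conn Connection, streams []Stream) {
	for _, s := range streams {
		// Using Reset instead of Close is the change described above.
		if err := s.Reset(); err != nil {
			log.Printf("resetting stream: %v", err)
		}
	}
	if conn != nil {
		if err := conn.Close(); err != nil {
			log.Printf("closing connection: %v", err)
		}
	}
}

func main() {
	// No real connection is wired up here; teardown is shown for illustration.
	teardown(nil, nil)
}
```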
How much memory are we talking about? /cc @jayunit100
moby/spdystream#66 will fix the lingering goroutine created by go-dockerclient.
#22802 + the spdystream PR should fix the spdy aspect of this issue.
spdystream PR merged. I'm preparing a bump PR now.
Looks like the port forwarding e2es are still leaking some goroutines. I'll look into that next.
The port forwarding e2es are leaking because of the way that the shelled-out call to kubectl port-forward is terminated.
Because the flow is kubectl -> apiserver -> kubelet, when kubectl is forcibly killed, the proxy code running in the apiserver isn't smart enough at the moment to terminate the proxying, which results in dangling goroutines in both the apiserver (the proxying) and the kubelet (the spdystream frame workers and connection handler). |
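As an illustration of why a killed client leaves goroutines dangling, here is a generic proxy sketch (not the apiserver's actual proxy code): unless the side that is still alive gets closed when the other side goes away, its io.Copy goroutine blocks forever.

```go
package main

import (
	"io"
	"net"
)

// proxy copies bytes between a client connection (the kubectl side) and a
// backend connection (the kubelet side). The key detail is that when either
// side dies, the other connection must be closed too; otherwise the second
// io.Copy blocks forever and its goroutine (plus whatever the backend holds)
// dangles.
func proxy(client, backend net.Conn) {
	done := make(chan struct{}, 2)

	go func() {
		io.Copy(backend, client) // returns when the client goes away
		done <- struct{}{}
	}()
	go func() {
		io.Copy(client, backend) // returns when the backend goes away
		done <- struct{}{}
	}()

	// As soon as one direction finishes (e.g. kubectl was killed), close both
	// connections so the other copy unblocks instead of leaking a goroutine.
	<-done
	client.Close()
	backend.Close()
	<-done
}

func main() {
	// Wiring up real connections is out of scope; proxy is shown for
	// illustration only.
	_ = proxy
}
```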
Now that #22802 and #22804 are merged, this should be better, at least for the spdystream parts. I'm going to close this for now.
@k8s-oncall, @dchen1107 the soak clusters will continue failing without the fixes. |
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-slow/1226/
cc @kubernetes/goog-node