
e2e flake: Kubelet [Serial] regular resource usage tracking for 35 pods per node over 20m0s #19014

Closed
gmarek opened this issue Dec 22, 2015 · 31 comments
Labels: kind/flake, priority/important-soon, sig/node
Milestone: v1.2

Comments

@gmarek
Contributor

gmarek commented Dec 22, 2015

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-slow/1226/

Kubelet [Serial] regular resource usage tracking for 35 pods per node over 20m0s

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubelet_perf.go:183 Dec 22 04:39:35.502: CPU usage exceeding limits: node e2e-slow-minion-2opu: container "/kubelet": expected 95th% usage < 0.350; got 0.521, container "/docker-daemon": expected 50th% usage < 0.060; got 0.066, container "/docker-daemon": expected 95th% usage < 0.300; got 0.381

cc @kubernetes/goog-node
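
For context, the failing test compares percentile resource usage per container against fixed limits. A minimal sketch of that kind of check, using the simple nearest-rank percentile method (illustrative only, not the actual kubelet_perf.go code; the sample values are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of samples using the
// nearest-rank method: the value at rank ceil(p/100 * n), 1-based.
func percentile(samples []float64, p int) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := (len(sorted)*p + 99) / 100 // integer ceil of p/100 * n
	if idx < 1 {
		idx = 1
	}
	return sorted[idx-1]
}

func main() {
	// Hypothetical CPU samples (cores) collected over the monitoring window.
	cpu := []float64{0.10, 0.12, 0.15, 0.20, 0.35, 0.52}
	limit := 0.35
	if got := percentile(cpu, 95); got >= limit {
		fmt.Printf("CPU usage exceeding limits: expected 95th%% usage < %.3f; got %.3f\n", limit, got)
	}
}
```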

@gmarek gmarek added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 22, 2015
@yujuhong
Contributor

Thanks for reporting. The test flaked twice in the last 30 builds. Looking now.

@yujuhong
Contributor

yujuhong commented Jan 4, 2016

It hasn't happened in the last 30 builds, so I'm lowering the priority to p2 for the time being.

This seems to happen in the extreme case where all add-on pods land on the same node. As the number of pods increases, the noise becomes negligible. A better solution would be to disable the add-on pods, but that could cause problems for other tests run before and after this one. I'll wait until the next occurrence to confirm my theory.

@yujuhong yujuhong added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jan 4, 2016
@yujuhong
Contributor

Update: no failure in the past week.

@dchen1107
Member

Can we move the test out of flaky and close the issue here?

@yujuhong
Contributor

The test is not in the flaky suite and has been running fine. I'll close the issue for now until we see it happen again.

@gmarek
Contributor Author

gmarek commented Feb 25, 2016

Happening again in soak cluster:

Feb 24 19:45:06.334: CPU usage exceeding limits: node gke-jenkins-e2e-20805c06-node-zyyk: container "/kubelet": expected 95th% usage < 0.140; got 0.141 node gke-jenkins-e2e-20805c06-node-feqr: container "/kubelet": expected 95th% usage < 0.140; got 0.151

@gmarek gmarek reopened this Feb 25, 2016
@gmarek
Contributor Author

gmarek commented Mar 9, 2016

Looking at the numbers, it seems to me there's a memory leak in the Kubelet:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-soak-continuous-e2e-gce/4870/

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubelet_perf.go:251 Mar 9 07:42:13.166: Memory usage exceeding limits: node jenkins-e2e-minion-rcef: container "/kubelet": expected RSS memory (MB) < 78643200; got 307929088 node jenkins-e2e-minion-i8pl: container "/kubelet": expected RSS memory (MB) < 78643200; got 118616064

@gmarek gmarek added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Mar 9, 2016
@yujuhong
Contributor

yujuhong commented Mar 9, 2016

I checked jenkins-e2e-minion-rcef and dumped the goroutines: https://gist.github.com/yujuhong/64f0f2c08a5f20f51099
There were only 10~15 containers running at the time, but pprof showed ~700 goroutines running.

215 @ 0x416b65 0x416bd8 0x41a281 0x4bbcde 0x13420fc 0x133d883 0x1344342 0x440291
#   0x4bbcde    sync.(*Cond).Wait+0x9e                      /usr/src/go/src/sync/cond.go:62
#   0x13420fc   github.com/docker/spdystream.(*PriorityFrameQueue).Pop+0x11c    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/priority.go:86
#   0x133d883   github.com/docker/spdystream.(*Connection).frameHandler+0x33    /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:398
#   0x1344342   github.com/docker/spdystream.func·005+0x72         /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:315

43 @ 0x416b65 0x418e0c 0x4182a2 0x133b92e 0x440291
#   0x133b92e   github.com/docker/spdystream.(*idleAwareFramer).monitor+0x69e   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:61

39 @ 0x416b65 0x416bd8 0x41a281 0x4bbcde 0x1344688 0x440291
#   0x4bbcde    sync.(*Cond).Wait+0x9e              /usr/src/go/src/sync/cond.go:62

39 @ 0x416b65 0x418e0c 0x4182a2 0x13403d3 0x440291
#   0x13403d3   github.com/docker/spdystream.(*Connection).shutdown+0x473   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:704
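
The dump above is Go's grouped goroutine profile (debug level 1: each unique stack is prefixed by the number of goroutines sharing it, e.g. `215 @ 0x...`). A self-contained sketch of how such a dump is produced in-process, with toy sleeper goroutines standing in for the leaked spdystream workers:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

func main() {
	// A few parked goroutines, standing in for the leaked spdystream workers.
	for i := 0; i < 3; i++ {
		go func() { time.Sleep(time.Hour) }()
	}
	time.Sleep(100 * time.Millisecond) // let them start and park

	// Same profile the kubelet exposes over its pprof debug endpoint;
	// debug=1 groups identical stacks with a leading count, which is the
	// "215 @ 0x..." format quoted above.
	var buf bytes.Buffer
	pprof.Lookup("goroutine").WriteTo(&buf, 1)
	fmt.Print(buf.String())
}
```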

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

heap profile: 130: 55846400 [343271: 18892453824] @ heap/1048576
43: 45088768 [492: 515899392] @ 0x41b01c 0x1203acc 0x1205361 0x1205b8d 0x1205cb3 0x13044ad 0x1304699 0xa7115e 0x15c9b95 0x15ca9a7 0x15c86a4 0x15c976a 0x133ba00 0x1340b5b 0x13436bf 0xf26dfc 0xf276e1 0x133e638 0x133db12 0x1344342 0x440291
#       0x1203acc       compress/flate.(*compressor).initDeflate+0x3c                                                                                   /usr/src/go/src/compress/flate/deflate.go:215
#       0x1205361       compress/flate.(*compressor).init+0x201                                                                                         /usr/src/go/src/compress/flate/deflate.go:410
#       0x1205b8d       compress/flate.NewWriter+0x6d                                                                                                   /usr/src/go/src/compress/flate/deflate.go:487
#       0x1205cb3       compress/flate.NewWriterDict+0xd3                                                                                               /usr/src/go/src/compress/flate/deflate.go:501
#       0x13044ad       compress/zlib.(*Writer).writeHeader+0x28d                                                                                       /usr/src/go/src/compress/zlib/writer.go:134
#       0x1304699       compress/zlib.(*Writer).Write+0x49                                                                                              /usr/src/go/src/compress/zlib/writer.go:148
#       0xa7115e        encoding/binary.Write+0x1afe                                                                                                    /usr/src/go/src/encoding/binary/binary.go:333
#       0x15c9b95       github.com/docker/spdystream/spdy.writeHeaderValueBlock+0x105                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:157
#       0x15ca9a7       github.com/docker/spdystream/spdy.(*Framer).writeSynReplyFrame+0x167                                                            /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:237
#       0x15c86a4       github.com/docker/spdystream/spdy.(*SynReplyFrame).write+0x44                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:19
#       0x15c976a       github.com/docker/spdystream/spdy.(*Framer).WriteFrame+0x4a                                                                     /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:138
#       0x133ba00       github.com/docker/spdystream.(*idleAwareFramer).WriteFrame+0xd0                                                                 /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:140
#       0x1340b5b       github.com/docker/spdystream.(*Connection).sendReply+0xdb                                                                       /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:853
#       0x13436bf       github.com/docker/spdystream.(*Stream).SendReply+0x22f                                                                          /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/stream.go:220
#       0xf26dfc        k8s.io/kubernetes/pkg/util/httpstream/spdy.(*connection).newSpdyStream+0x22c                                                    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/httpstream/spdy/connection.go:133
#       0xf276e1        k8s.io/kubernetes/pkg/util/httpstream/spdy.*connection.(k8s.io/kubernetes/pkg/util/httpstream/spdy.newSpdyStream)·fm+0x31       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/httpstream/spdy/connection.go:68
#       0x133e638       github.com/docker/spdystream.(*Connection).handleStreamFrame+0x178                                                              /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:490
#       0x133db12       github.com/docker/spdystream.(*Connection).frameHandler+0x2c2                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:406
#       0x1344342       github.com/docker/spdystream.func·005+0x72                                                                                      /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:315

25: 6553600 [216: 56623104] @ 0x41b01c 0x1203b3c 0x1205361 0x1205b8d 0x1205cb3 0x13044ad 0x1304699 0xa7115e 0x15c9b95 0x15ca9a7 0x15c86a4 0x15c976a 0x133ba00 0x1340b5b 0x13436bf 0xf26dfc 0xf276e1 0x133e638 0x133db12 0x1344342 0x440291
#       0x1203b3c       compress/flate.(*compressor).initDeflate+0xac                                                                                   /usr/src/go/src/compress/flate/deflate.go:216
#       0x1205361       compress/flate.(*compressor).init+0x201                                                                                         /usr/src/go/src/compress/flate/deflate.go:410
#       0x1205b8d       compress/flate.NewWriter+0x6d                                                                                                   /usr/src/go/src/compress/flate/deflate.go:487
#       0x1205cb3       compress/flate.NewWriterDict+0xd3                                                                                               /usr/src/go/src/compress/flate/deflate.go:501
#       0x13044ad       compress/zlib.(*Writer).writeHeader+0x28d                                                                                       /usr/src/go/src/compress/zlib/writer.go:134
#       0x1304699       compress/zlib.(*Writer).Write+0x49                                                                                              /usr/src/go/src/compress/zlib/writer.go:148
#       0xa7115e        encoding/binary.Write+0x1afe                                                                                                    /usr/src/go/src/encoding/binary/binary.go:333
#       0x15c9b95       github.com/docker/spdystream/spdy.writeHeaderValueBlock+0x105                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:157
#       0x15ca9a7       github.com/docker/spdystream/spdy.(*Framer).writeSynReplyFrame+0x167                                                            /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:237
#       0x15c86a4       github.com/docker/spdystream/spdy.(*SynReplyFrame).write+0x44                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:19
#       0x15c976a       github.com/docker/spdystream/spdy.(*Framer).WriteFrame+0x4a                                                                     /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/spdy/write.go:138
#       0x133ba00       github.com/docker/spdystream.(*idleAwareFramer).WriteFrame+0xd0                                                                 /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:140
#       0x1340b5b       github.com/docker/spdystream.(*Connection).sendReply+0xdb                                                                       /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:853
#       0x13436bf       github.com/docker/spdystream.(*Stream).SendReply+0x22f                                                                          /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/stream.go:220
#       0xf26dfc        k8s.io/kubernetes/pkg/util/httpstream/spdy.(*connection).newSpdyStream+0x22c                                                    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/httpstream/spdy/connection.go:133
#       0xf276e1        k8s.io/kubernetes/pkg/util/httpstream/spdy.*connection.(k8s.io/kubernetes/pkg/util/httpstream/spdy.newSpdyStream)·fm+0x31       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/httpstream/spdy/connection.go:68
#       0x133e638       github.com/docker/spdystream.(*Connection).handleStreamFrame+0x178                                                              /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:490
#       0x133db12       github.com/docker/spdystream.(*Connection).frameHandler+0x2c2                                                                   /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:406
#       0x1344342       github.com/docker/spdystream.func·005+0x72                                                                                      /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/docker/spdystream/connection.go:315

/cc @ncdc, @kubernetes/sig-node

@ncdc
Member

ncdc commented Mar 9, 2016

@yujuhong what exec / attach / port-forward calls are in play during this test?

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

I am not sure which e2e test caused this, but https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go does test all those functions.

@yujuhong yujuhong added this to the v1.2 milestone Mar 9, 2016
@ncdc
Member

ncdc commented Mar 9, 2016

Can you describe how this perf test relates to other e2es?

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

The test was just monitoring the memory/cpu usage of docker and kubelet, and it failed because of the high memory usage caused by the ~200 goroutines.

@ncdc
Member

ncdc commented Mar 9, 2016

Right, but what is creating those spdy goroutines? Those only come from exec / attach / port-forward.

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

This test doesn't make any of those calls itself; the goroutines must have been left over from previous e2e tests.

@ncdc
Member

ncdc commented Mar 9, 2016

So it runs a bunch of e2es against a cluster, and then this test eventually runs against the same cluster?

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

Yes, the e2e tests are run continuously in the same cluster (soak cluster), and this test is just one of them.

@ncdc
Member

ncdc commented Mar 9, 2016

Ok, thanks for the clarification. I'll see if I can reproduce locally. I had previously fixed one of these issues (spdy connection hanging while shutting down, which ultimately will clear after 10 minutes), but that fix afaik is in Kube already.

@yujuhong
Contributor

yujuhong commented Mar 9, 2016

FYI, we create a new soak cluster every week, and the current version is v1.3.0-alpha.0.232+8b186991e2a857

@dchen1107 dchen1107 assigned ncdc and unassigned yujuhong Mar 9, 2016
@yujuhong
Contributor

yujuhong commented Mar 9, 2016

@ncdc, I ran the "kubectl client" e2e test and after 15 minutes, my node was still left with 25 connections/goroutines. You should be able to reproduce this easily. Let me know if there's anything I can help with.

@ncdc
Member

ncdc commented Mar 9, 2016

@yujuhong yes, I can reproduce. I'll let you know what I find out (might not be until tomorrow).

@ncdc
Member

ncdc commented Mar 10, 2016

The spdy goroutines are lingering because of the way that the httpstream wrapper handles closing a connection. The code currently calls stream.Close() for all streams, then conn.Close() for the underlying spdy connection. The problem with this approach is that stream.Close() is unidirectional and only indicates that no more data will be sent over that stream. And the spdystream connection Close() function is designed to be graceful, in that it waits up to 10 minutes for all streams to be torn down. In order for a stream to be fully torn down, either a) both the client and server must each issue stream.Close(), or b) either side can issue stream.Reset(), which fully tears down the stream in both the client and the server.
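
The half-close vs. full-teardown distinction can be modeled in a few lines (toy types for illustration, not the real spdystream API):

```go
package main

import "fmt"

// Stream is a toy model of the semantics described above.
type Stream struct {
	localClosed  bool // we sent a FIN: no more data from our side
	remoteClosed bool // the peer sent a FIN
}

// Close is unidirectional: it only signals that our side is done sending.
func (s *Stream) Close() { s.localClosed = true }

// Reset tears the stream down in both directions immediately.
func (s *Stream) Reset() { s.localClosed, s.remoteClosed = true, true }

// fullyTornDown reports whether the connection may release the stream;
// the graceful connection Close waits (up to 10 minutes) until this is
// true for every stream.
func (s *Stream) fullyTornDown() bool { return s.localClosed && s.remoteClosed }

func main() {
	a, b := &Stream{}, &Stream{}
	a.Close() // only our half closed; the peer never closes its half
	b.Reset() // both halves torn down at once
	fmt.Println(a.fullyTornDown(), b.fullyTornDown())
}
```

This is why switching the wrapper from `Close()` to `Reset()` stops the connection shutdown from waiting out its 10-minute grace period.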

I have changed stream.Close() to stream.Reset() locally and that appears to fix the majority of the lingering goroutines. What I'm seeing now is that if you exec with a tty (e.g. kubectl exec -it somepod bash), sometimes 1 goroutine doesn't get cleaned up:

2: select [26~34 minutes] [Created by go-dockerclient.(*Client).hijack.func1 @ client.go:670]
    spdystream      stream.go:80                   (*Stream).Read(*, *, 0x8000, 0x8000, 0x1, 0, 0)
    server          <autogenerated>:73             (*streamAndReply).Read(*, *, 0x8000, 0x8000, 0x1, 0, 0)
    io              io.go:381                      copyBuffer(#372, *, #356, *, *, 0x8000, 0x8000, *, 0, 0)
    io              io.go:351                      Copy(#372, *, #356, *, *, 0, 0)
    go-dockerclient client.go:664                  (*Client).hijack.func1.2(*, #357, *, *)

I'm still looking into why the spdy stream isn't unblocking the Read call appropriately.
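
A minimal reproduction of that shape of leak, with an `io.Pipe` standing in for the spdy stream: the copy goroutine parks in `Read`, exactly like the hijack goroutine in the dump above, until someone closes the write side.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	// r stands in for the spdy stream the hijack goroutine reads from.
	r, w := io.Pipe()
	done := make(chan error, 1)
	go func() {
		_, err := io.Copy(io.Discard, r) // parks in Read, like the dump above
		done <- err
	}()
	time.Sleep(50 * time.Millisecond)

	// Until the write side is closed, the copy goroutine is leaked forever.
	// Closing it delivers EOF and the goroutine exits cleanly.
	w.Close()
	fmt.Println("copy returned:", <-done)
}
```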

@timothysc
Member

How much memory are we talking about?
Also, this seems like an easy test to automate: back-to-back fills and drains, as a separate test.

/cc @jayunit100

@ncdc
Member

ncdc commented Mar 10, 2016

moby/spdystream#66 will fix the lingering goroutine created by go-dockerclient.

@ncdc
Member

ncdc commented Mar 10, 2016

#22802 + the spdystream PR should fix the spdy aspect of this issue.

@ncdc
Member

ncdc commented Mar 10, 2016

spdystream PR merged. I'm preparing a bump PR now.

@ncdc
Member

ncdc commented Mar 10, 2016

Looks like the port forwarding e2es are still leaking some goroutines. I'll look into that next.

@ncdc
Member

ncdc commented Mar 10, 2016

The port forwarding e2es are leaking because of the way the test terminates the shelled-out kubectl port-forward call. It sends SIGKILL to the process, which doesn't give the client a chance to tell the server it's going away. I would hope the server eventually figures it out, but I need to verify that.
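
A sketch of the gentler alternative (POSIX only; the shell child here is a hypothetical stand-in for `kubectl port-forward`): send SIGINT so the process can run its cleanup path and notify its peer, instead of SIGKILL, which it can never observe.

```go
package main

import (
	"fmt"
	"io"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Stand-in for the shelled-out kubectl process: it says goodbye to its
	// peer only if it receives a catchable signal.
	cmd := exec.Command("sh", "-c",
		`trap 'echo closing connection; exit 0' INT; sleep 10 >/dev/null & wait`)
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(200 * time.Millisecond) // let the trap install

	// SIGINT (unlike the SIGKILL the e2e test sends) lets the child run its
	// cleanup path before exiting.
	cmd.Process.Signal(syscall.SIGINT)

	out, _ := io.ReadAll(stdout)
	cmd.Wait()
	fmt.Printf("child said: %s", out)
}
```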

@ncdc
Member

ncdc commented Mar 10, 2016

Because the flow is kubectl -> apiserver -> kubelet, when kubectl is forcibly killed, the proxy code running in the apiserver isn't smart enough at the moment to terminate the proxying, which results in dangling goroutines in both the apiserver (the proxying) and the kubelet (the spdystream frame workers and connection handler).

@ncdc
Member

ncdc commented Mar 11, 2016

Now that #22802 and #22804 are merged, this should be better, at least for the spdystream parts. I'm going to close this for now.

@ncdc ncdc closed this as completed Mar 11, 2016
@yujuhong
Contributor

@k8s-oncall @dchen1107: the soak clusters will continue failing until the fixes are picked up.
