e2e flake: Kubelet [Serial] regular resource usage tracking for 35 pods per node over 20m0s #19014
Comments
Thanks for reporting. The test flaked twice in the last 30 builds. Looking now.
It hasn't happened in the last 30 builds, so I'm lowering the priority to p2 for the time being. This seems to happen in the extreme case where all add-on pods run on the same node. As we increase the number of pods, the noise becomes less significant. The better solution is to disable the add-on pods, but that could be problematic for other tests run before and after this test. I'll wait until the next occurrence to confirm my theory.
Update: no failure in the past week.
Can we move the test out of flaky and close the issue here?
The test is not in flaky and has been running fine. I'll close the issue for now until we see it happen again.
Happening again in the soak cluster.
Looking at the numbers, it seems to me that there's a memory leak in the kubelet.
I checked jenkins-e2e-minion-rcef and dumped the goroutines: https://gist.github.com/yujuhong/64f0f2c08a5f20f51099
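For reference, a minimal sketch of how such a dump can be collected from a Go daemon that serves net/http/pprof, as the kubelet does on its debug port; the node name and port below are placeholders for illustration, not values taken from this job:

```go
// goroutinedump.go: fetch full goroutine stacks from a pprof endpoint.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks net/http/pprof for complete, unabridged goroutine stacks.
	// Host and port here are assumptions; point this at the node's kubelet
	// debug endpoint in your own cluster.
	url := "http://jenkins-e2e-minion-rcef:10248/debug/pprof/goroutine?debug=2"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Dump the stacks to stdout so they can be saved to a file or gist.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
		os.Exit(1)
	}
}
```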
/cc @ncdc, @kubernetes/sig-node
@yujuhong what exec / attach / port-forward calls are in play during this test?
I am not sure which e2e test caused this, but https://github.com/kubernetes/kubernetes/blob/master/test/e2e/kubectl.go does test all of those functions.
Can you describe how this perf test relates to the other e2es?
The test was just monitoring the memory/CPU usage of docker and the kubelet, and it failed because of the high memory usage caused by the ~200 goroutines.
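(For context, a rough sketch of the kind of polling a resource-tracking test does; this reads RSS straight from /proc rather than from the kubelet/cAdvisor stats the e2e framework actually uses, and the pid and interval are placeholders.)

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the VmRSS line for the given pid (e.g. "123456 kB") by
// parsing /proc/<pid>/status. Linux-only; purely illustrative.
func rssKB(pid int) (string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %d", pid)
}

func main() {
	pid := 1 // hypothetical: substitute the kubelet's or docker's pid
	for i := 0; i < 3; i++ {
		rss, err := rssKB(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		fmt.Printf("pid %d VmRSS: %s\n", pid, rss)
		time.Sleep(10 * time.Second)
	}
}
```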
Right, but what is creating those spdy goroutines? Those only come from exec / attach / port-forward.
No, they must have been left over from previous e2e tests.
So it runs a bunch of e2es against a cluster, and then this test eventually runs against the same cluster?
Yes, the e2e tests are run continuously in the same cluster (the soak cluster), and this test is just one of them.
Ok, thanks for the clarification. I'll see if I can reproduce locally. I had previously fixed one of these issues (a spdy connection hanging while shutting down, which ultimately clears after 10 minutes), but that fix is already in Kube, afaik.
FYI, we create a new soak cluster every week, and the current version is v1.3.0-alpha.0.232+8b186991e2a857
@ncdc, I ran the "kubectl client" e2e test and after 15 minutes my node was still left with 25 connections/goroutines. You should be able to reproduce this easily. Let me know if there's anything I can help with.
@yujuhong yes, I can reproduce. I'll let you know what I find out (might not be until tomorrow).
The spdy goroutines are lingering because of the way that the httpstream wrapper handles closing a connection. The code currently calls stream.Close() for all streams, then conn.Close() for the underlying spdy connection. The problem with this approach is that stream.Close() is unidirectional and only indicates that no more data will be sent over that stream, while the spdystream connection's Close() function is designed to be graceful, in that it waits up to 10 minutes for all streams to be torn down. In order for a stream to be fully torn down, either a) both the client and server must each issue stream.Close(), or b) either side can issue stream.Reset(), which fully tears down the stream in both the client and the server.

I have changed stream.Close() to stream.Reset() locally and that appears to fix the majority of the lingering goroutines. What I'm seeing now is that if you exec with a tty (e.g.

I'm still looking into why the spdy stream isn't unblocking the Read call appropriately.
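To make the Close()/Reset() distinction concrete, here is a minimal sketch using simplified stand-in interfaces rather than the real httpstream/spdystream types: Close() only half-closes a stream, so a connection-level Close() can block waiting on the peer, while Reset() tears the stream down on both ends immediately.

```go
package main

import "log"

// Stream and Connection are hypothetical stand-ins whose method names mirror
// the semantics described above; they are not the Kubernetes interfaces.
type Stream interface {
	// Close sends a half-close ("no more data from this side"); the peer
	// must also close before the stream is fully torn down.
	Close() error
	// Reset fully tears down the stream on both the client and the server.
	Reset() error
}

type Connection interface {
	// Close is graceful: it can block (up to ~10 minutes in spdystream)
	// waiting for every stream to be fully torn down.
	Close() error
}

// teardown forcibly resets every stream before closing the connection, so
// Close does not sit around waiting on half-closed streams.
func teardown(conn Connection, streams []Stream) {
	for _, s := range streams {
		// Using Reset instead of Close is the change described above.
		if err := s.Reset(); err != nil {
			log.Printf("resetting stream: %v", err)
		}
	}
	if conn != nil {
		if err := conn.Close(); err != nil {
			log.Printf("closing connection: %v", err)
		}
	}
}

func main() {
	// No real connection is wired up here; teardown is shown for illustration.
	teardown(nil, nil)
}
```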
How much memory are we talking about? /cc @jayunit100
moby/spdystream#66 will fix the lingering goroutine created by go-dockerclient.
#22802 + the spdystream PR should fix the spdy aspect of this issue.
spdystream PR merged. I'm preparing a bump PR now.
Looks like the port forwarding e2es are still leaking some goroutines. I'll look into that next.
The port forwarding e2es are leaking because of the way that the shelled-out call to kubectl port-forward is terminated.
Because the flow is kubectl -> apiserver -> kubelet, when kubectl is forcibly killed, the proxy code running in the apiserver isn't smart enough at the moment to terminate the proxying, which results in dangling goroutines in both the apiserver (the proxying) and the kubelet (the spdystream frame workers and connection handler). |
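As an illustration of why a killed client leaves goroutines dangling, here is a generic proxy sketch (not the apiserver's actual proxy code): unless the side that is still alive gets closed when the other side goes away, its io.Copy goroutine blocks forever.

```go
package main

import (
	"io"
	"net"
)

// proxy copies bytes between a client connection (the kubectl side) and a
// backend connection (the kubelet side). The key detail is that when either
// side dies, the other connection must be closed too; otherwise the second
// io.Copy blocks forever and its goroutine (plus whatever the backend holds)
// dangles.
func proxy(client, backend net.Conn) {
	done := make(chan struct{}, 2)

	go func() {
		io.Copy(backend, client) // returns when the client goes away
		done <- struct{}{}
	}()
	go func() {
		io.Copy(client, backend) // returns when the backend goes away
		done <- struct{}{}
	}()

	// As soon as one direction finishes (e.g. kubectl was killed), close both
	// connections so the other copy unblocks instead of leaking a goroutine.
	<-done
	client.Close()
	backend.Close()
	<-done
}

func main() {
	// Wiring up real connections is out of scope; proxy is shown for
	// illustration only.
	_ = proxy
}
```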
Now that #22802 and #22804 are merged, this should be better, at least for the spdystream parts. I'm going to close this for now.
@k8s-oncall, @dchen1107 the soak clusters will continue failing without the fixes. |
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-slow/1226/
cc @kubernetes/goog-node