
Slow "leak" of kubectl exec requests in kube 1.10.5 #67913

Closed
jmcmeek opened this issue Aug 27, 2018 · 6 comments
Closed

Slow "leak" of kubectl exec requests in kube 1.10.5 #67913

jmcmeek opened this issue Aug 27, 2018 · 6 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@jmcmeek
Contributor

jmcmeek commented Aug 27, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

Over a period of a week or so, we see apiservers grow to over 11 GB of memory. The apiserver /metrics endpoint shows a huge number of long-running exec requests:

apiserver_longrunning_gauge{resource="pods",scope="namespace",subresource="exec",verb="CREATE"} 71816

And completed requests:

apiserver_request_count{client="Go-http-client/1.1",code="0",contentType="",resource="pods",scope="namespace",subresource="exec",verb="CONNECT"} 837824
apiserver_request_count{client="Go-http-client/1.1",code="404",contentType="application/json",resource="pods",scope="namespace",subresource="exec",verb="CONNECT"} 357043

Output from ss showed a similar number of connections from the apiserver to kubelets, but only a small number of client connections to the apiserver.
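
Something like this on a master node gives a rough count of those apiserver-to-kubelet connections (a sketch, assuming the default kubelet port 10250):

ss -tnp state established '( dport = :10250 )' | grep -c kube-apiserver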

All the exec requests are sh -c "df -P | sed 1d" and complete successfully and quickly when run from kubectl. The 404 responses appear to be due to the client making exec requests to pods that no longer exist.

We have seen multiple clusters doing this. All clusters are at 1.10.5 (though I do not know if the workload is also being run on 1.9 or earlier). If we restart the apiserver, memory growth and accumulation of long-running exec requests resumes. The clusters in question run customer applications that appear to loop through all pods in a given namespace, issuing df -P requests every 10 seconds or so.
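
A loop along these lines would approximate that workload (a hypothetical sketch; NAMESPACE stands in for the customer's namespace):

while true; do
  # pods deleted between the list and the exec would account for the 404 responses above
  for pod in $(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}'); do
    kubectl exec -n "$NAMESPACE" "$pod" -- sh -c 'df -P | sed 1d'
  done
  sleep 10
done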

apiserver log entries show some number of these requests completing with times ranging from 1 hour to at least 12 hours.

What you expected to happen:

The exec request ought to complete and the connection to the kubelet be closed. On the surface, given that the client connections no longer exist, it seems that the apiserver ought to be able to terminate / clean up a request if it really hasn't completed.

How to reproduce it (as minimally and precisely as possible):

We have not been able to reproduce this problem in a portable manner. Killing kubectl exec sometimes results in "leaked" exec requests, but as far as we know, the actual requests are running normally.
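
The closest we have come is abruptly killing the client mid-exec, roughly like this (a sketch; the pod name is a placeholder):

kubectl exec some-pod -- sh -c 'sleep 600' &
EXEC_PID=$!
sleep 2
kill -9 "$EXEC_PID"   # abandon the client side; the apiserver-to-kubelet leg sometimes lingers as a long-running exec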

Anything else we need to know?:

Kubernetes 1.11.2 seems to have similar behavior on a much worse scale, and it is readily reproducible there. See #67659

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5+IKS", GitCommit:"7593549b33fb8ab65a9a112387f2e0f464a1ae87", GitTreeState:"clean", BuildDate:"2018-07-19T06:26:20Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: IBM Cloud Kubernetes Service
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 27, 2018
@jmcmeek
Contributor Author

jmcmeek commented Aug 27, 2018

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 27, 2018
@dlamotte

I witnessed similar behavior and believe it's the same issue (or at least related). I can reproduce it and have a traceback. It appears to be an issue within the kubelet.

Effectively, to reproduce, spin up a container with a large image that takes time to download (so the pod stays in the ContainerCreating state). Any kubectl exec call to the container in this state results in a kubelet goroutine panic and leaks a single socket.
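
Roughly (a sketch; the image reference is a placeholder, any image large enough to keep the pod pulling for a while will do):

kubectl run stuck-pod --image=registry.example.com/very-large-image:latest --restart=Never
# while the pod is still in ContainerCreating:
kubectl exec stuck-pod -- ls   # fails with "container not found" and produces the kubelet traceback below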

I0830 15:12:53.427355   70222 server.go:796] GET /exec/ns-a/pod-a/container-a?command=...&output=1: (30.920097ms) 500
goroutine 1878829 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc4459278f0, 0x1f4)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc4459278f0, 0x1f4)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35
k8s.io/kubernetes/pkg/kubelet/server/streaming.WriteError(0xb189460, 0xc44ef84620, 0xb1cf9e0, 0xc4459278f0, 0x0, 0xc4743865fe)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/streaming/errors.go:52 +0xd0
k8s.io/kubernetes/pkg/kubelet/server.(*Server).getExec(0xc4207c6be0, 0xc453607530, 0xc4549a5740)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:686 +0x446
k8s.io/kubernetes/pkg/kubelet/server.(*Server).(k8s.io/kubernetes/pkg/kubelet/server.getExec)-fm(0xc453607530, 0xc4549a5740)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:316 +0x3e
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch.func5(0xc453607530, 0xc4549a5740)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:272 +0x3e
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc4536075c0, 0xc453607530, 0xc4549a5740)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/filter.go:21 +0x8e
k8s.io/kubernetes/pkg/kubelet/server.(*Server).InstallAuthFilter.func1(0xc453607530, 0xc4549a5740, 0xc4536075c0)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:254 +0x4f0
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc4536075c0, 0xc453607530, 0xc4549a5740)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/filter.go:19 +0x68
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc420832090, 0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:274 +0x8ff
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm(0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:120 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4208f96e0, 0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /usr/local/go/src/net/http/server.go:1918 +0x44
net/http.(*ServeMux).ServeHTTP(0xc420efe3f0, 0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /usr/local/go/src/net/http/server.go:2254 +0x130
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).ServeHTTP(0xc420832090, 0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:292 +0x4d
k8s.io/kubernetes/pkg/kubelet/server.(*Server).ServeHTTP(0xc4207c6b90, 0xb1cf9e0, 0xc4459278f0, 0xc47e46e100)
        /workspace/anago-v1.10.1-beta.0.68+d4ab47518836c7/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:795 +0x106
net/http.serverHandler.ServeHTTP(0xc4213daa90, 0xb1d13a0, 0xc450a6c000, 0xc47e46e100)
        /usr/local/go/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc46b4d9860, 0xb1d4be0, 0xc45b5d0dc0)
        /usr/local/go/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:2720 +0x288
logging error output: "container not found (\"container-a\")"
 [[Go-http-client/1.1] 10.81.251.193:27703]

(anonymized output)

HTH.

@fedebongio
Contributor

/assign @caesarxuchao

@jmcmeek
Contributor Author

jmcmeek commented Aug 30, 2018

I definitely see many stacks like that in the kubelet.log I just checked. I had noticed those before, but mistakenly associated them with the apiserver returning 404 errors - not with the long-running requests.

From the apiserver perspective... When the client has closed its connection, can the apiserver consider the exec request complete and close its connection to the kubelet? That might make the apiserver more tolerant of this kind of situation.

@caesarxuchao
Member

Duplicate of #57992. Should be fixed by #67288. Please re-open if not so.

@jmcmeek
Contributor Author

jmcmeek commented Sep 15, 2018

@caesarxuchao This appears to have fixed a related issue I saw in 1.11.2 (#67659). I do not see that issue in 1.12 beta.1.

What I have read suggests that the PR should fix this issue. We're awaiting the cherry picks to earlier releases to confirm that.
