controller-manager and apiserver leaking goroutines and TCP connections #39734

majewsky opened this Issue Jan 11, 2017 · 6 comments




Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug report

Kubernetes version (use kubectl version): v1.4.6-1+530ad0e577004a


  • Cloud provider or hardware configuration: private cloud on bare metal
  • OS (e.g. from /etc/os-release): Container Linux by CoreOS 1235.6.0 (Ladybug)
  • Kernel (e.g. uname -a): Linux <hostname> 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others: Docker 1.12.3

My team's Kubernetes clusters performed the automatic update from CoreOS 1185.5.0 to 1235.6.0, which also upgraded Docker from 1.11.2 to 1.12.3. The production clusters all work as intended, but the staging cluster exhibits some resource leaks. This is the go_goroutines metric from our Prometheus:

[screenshot: go_goroutines metric graph, 2017-01-11 12:43]

The highest curve is the controller-manager that currently serves as leader, and the curve below it is the apiserver on the same node. Near the right of the graph, you can see the colors changing because a new leader was elected when a node went down.

Same pattern for the TCP connections, as observed by the Prometheus node exporter's node_netstat_Tcp_CurrEstab metric:

[screenshot: node_netstat_Tcp_CurrEstab metric graph, 2017-01-11 12:46]

To wit, the leaking TCP connections are between the apiserver and the controller-manager:

core@master0 ~ $ sudo netstat -ntp | grep -E 'ESTABLISHED|^Proto' | head -n10
Proto Recv-Q Send-Q Local Address             Foreign Address           State       PID/Program name
tcp    59696      0          ESTABLISHED 20149/hyperkube
tcp   855260      0          ESTABLISHED 20149/hyperkube
tcp        0      0        ESTABLISHED 6092/hyperkube
tcp        0      0        ESTABLISHED 6092/hyperkube
tcp        0      0        ESTABLISHED 6092/hyperkube
tcp        0      0        ESTABLISHED 6092/hyperkube
tcp   682982      0          ESTABLISHED 20149/hyperkube
tcp   651767      0          ESTABLISHED 20149/hyperkube
tcp        0      0        ESTABLISHED 6092/hyperkube
tcp        0 166880        ESTABLISHED 6092/hyperkube

core@master0 ~ $ ps aux | grep -Ew '20149|6092' | grep -v grep
root      6092 33.1  0.1 5700856 794184 ?      Ssl  04:10 151:59 /hyperkube apiserver --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota --allow-privileged=true --advertise-address= --authorization-mode=ABAC --authorization-policy-file=/etc/kubernetes/authorizations.jsonl --client-ca-file=/etc/kubernetes/ssl/ca.pem --etcd-servers= --kubelet-certificate-authority=/etc/kubernetes/ssl/ca.pem --kubelet-client-certificate=/etc/kubernetes/ssl/kubelet-master0- --kubelet-client-key=/etc/kubernetes/ssl/kubelet-master0- --kubelet-https=true --runtime-config=extensions/v1beta1=true,extensions/v1beta1/thirdpartyresources=true --secure_port=443 --service-account-key-file=/etc/kubernetes/ssl/ca-key.pem --service-cluster-ip-range= --tls-cert-file=/etc/kubernetes/ssl/apiserver-master0- --tls-private-key-file=/etc/kubernetes/ssl/apiserver-master0-
root     20149 47.0  0.1 3038252 663256 ?      Ssl  07:07 132:18 /hyperkube controller-manager --kubeconfig=/etc/kubernetes/config/controller-manager --leader-elect=true --root-ca-file=/etc/kubernetes/ssl/ca.pem --service-account-private-key-file=/etc/kubernetes/ssl/ca-key.pem

I looked through the issue tracker and #33472 sounds very similar to what I'm observing, but I'm not sure since the controller-manager is not involved there. I looked at the logs of the offending components, but didn't see anything unusual (as in, nothing that doesn't also occur on the other clusters where no leaking can be observed).

If you have any additional suggestions on how to perform further analysis, I'll be happy to help.

cc @BugRoger @SchwarzM from my team who might be able to provide additional details


The problem appears to be specific to the node where the leading controller-manager runs. Since the switch to a new leader, the situation has stabilized:

[screenshot: go_goroutines metric after the leader switch, 2017-01-12 13:01]


We're seeing an issue similar to this on 1.4.7. Looking at the goroutines in the apiserver, controller-manager, and scheduler, we see that they are largely stuck in ListAndWatch.func1:

curl localhost:8080/debug/pprof/goroutine?debug=2
goroutine 441 [select, 1776 minutes]:*Reflector).ListAndWatch.func1(0xc820e9a0b0, 0xc82008cfc0, 0xc820e98f80, 0xc8224a8240, 0xc820e9a0b8)
        /go/src/ +0x31e
created by*Reflector).ListAndWatch
        /go/src/ +0xde3

Which makes us think it is fixed by:

But those were only cherry-picked into 1.5.x.




Yeah, I think this is #35672.


Any chance of getting the fix applied to 1.4.9?


Here's what our goroutine graphs look like just to establish that it presents the same way from a metric perspective:
