controller-manager and apiserver leaking goroutines and TCP connections #39734

Open
majewsky opened this Issue Jan 11, 2017 · 6 comments


@majewsky

Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug report

Kubernetes version (use kubectl version): v1.4.6-1+530ad0e577004a

Environment:

  • Cloud provider or hardware configuration: private cloud on bare metal
  • OS (e.g. from /etc/os-release): Container Linux by CoreOS 1235.6.0 (Ladybug)
  • Kernel (e.g. uname -a): Linux <hostname> 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others: Docker 1.12.3

My team's Kubernetes clusters performed the automatic update from CoreOS 1185.5.0 to 1235.6.0, which also upgraded Docker from 1.11.2 to 1.12.3. The production clusters all work as intended, but the staging cluster exhibits some resource leaks. This is the go_goroutines metric from our Prometheus:

screen shot 2017-01-11 at 12 43 16

The highest curve is the controller-manager that currently serves as leader, and the curve below it is the apiserver on the same node. Near the right of the graph, you can see the colors changing because a new leader was elected when a node went down.
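
(For reference, the go_goroutines counter can also be read directly from the components' /metrics endpoints, so the trend is easy to double-check without Prometheus. The ports below are an assumption — in our case the apiserver's insecure port 8080 and the controller-manager's default 10252 — and may differ per setup:)

core@master0 ~ $ curl -s http://localhost:8080/metrics | grep '^go_goroutines'
core@master0 ~ $ curl -s http://localhost:10252/metrics | grep '^go_goroutines'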

Same pattern for the TCP connections, as observed by the Prometheus node exporter's node_netstat_Tcp_CurrEstab metric:

screen shot 2017-01-11 at 12 46 20
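
(If I'm reading the node exporter right, node_netstat_Tcp_CurrEstab is just the CurrEstab column of the Tcp: line in /proc/net/snmp, so the raw value can be spot-checked on the node itself:)

core@master0 ~ $ grep '^Tcp:' /proc/net/snmp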

As further evidence, the leaking TCP connections are between the apiserver and the controller-manager:

core@master0 ~ $ sudo netstat -ntp | grep -E 'ESTABLISHED|^Proto' | head -n10
Proto Recv-Q Send-Q Local Address             Foreign Address           State       PID/Program name
tcp    59696      0 192.168.8.63:48968        192.168.8.63:443          ESTABLISHED 20149/hyperkube
tcp   855260      0 192.168.8.63:41380        192.168.8.63:443          ESTABLISHED 20149/hyperkube
tcp        0      0 192.168.8.63:443          192.168.8.63:37446        ESTABLISHED 6092/hyperkube
tcp        0      0 192.168.8.63:443          192.168.8.63:41292        ESTABLISHED 6092/hyperkube
tcp        0      0 192.168.8.63:443          192.168.8.63:51962        ESTABLISHED 6092/hyperkube
tcp        0      0 192.168.8.63:443          192.168.8.63:54896        ESTABLISHED 6092/hyperkube
tcp   682982      0 192.168.8.63:58054        192.168.8.63:443          ESTABLISHED 20149/hyperkube
tcp   651767      0 192.168.8.63:46152        192.168.8.63:443          ESTABLISHED 20149/hyperkube
tcp        0      0 192.168.8.63:443          192.168.8.63:38200        ESTABLISHED 6092/hyperkube
tcp        0 166880 192.168.8.63:443          192.168.8.63:56650        ESTABLISHED 6092/hyperkube

core@master0 ~ $ ps aux | grep -Ew '20149|6092' | grep -v grep
root      6092 33.1  0.1 5700856 794184 ?      Ssl  04:10 151:59 /hyperkube apiserver --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota --allow-privileged=true --advertise-address=192.168.8.63 --authorization-mode=ABAC --authorization-policy-file=/etc/kubernetes/authorizations.jsonl --client-ca-file=/etc/kubernetes/ssl/ca.pem --etcd-servers=http://127.0.0.1:2379 --kubelet-certificate-authority=/etc/kubernetes/ssl/ca.pem --kubelet-client-certificate=/etc/kubernetes/ssl/kubelet-master0-192.168.8.140.pem --kubelet-client-key=/etc/kubernetes/ssl/kubelet-master0-192.168.8.140-key.pem --kubelet-https=true --runtime-config=extensions/v1beta1=true,extensions/v1beta1/thirdpartyresources=true --secure_port=443 --service-account-key-file=/etc/kubernetes/ssl/ca-key.pem --service-cluster-ip-range=100.127.0.0/16 --tls-cert-file=/etc/kubernetes/ssl/apiserver-master0-192.168.8.140.pem --tls-private-key-file=/etc/kubernetes/ssl/apiserver-master0-192.168.8.140-key.pem
root     20149 47.0  0.1 3038252 663256 ?      Ssl  07:07 132:18 /hyperkube controller-manager --kubeconfig=/etc/kubernetes/config/controller-manager --leader-elect=true --root-ca-file=/etc/kubernetes/ssl/ca.pem --service-account-private-key-file=/etc/kubernetes/ssl/ca-key.pem
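
To see how concentrated this is per process, the established connections can be counted per PID (column 7 of the netstat output above is PID/Program name), e.g.:

core@master0 ~ $ sudo netstat -ntp | awk '/ESTABLISHED/ { print $7 }' | sort | uniq -c | sort -rn | head -n5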

I looked through the issue tracker and #33472 sounds very similar to what I'm observing, but I'm not sure since the controller-manager is not involved there. I looked at the logs of the offending components, but didn't see anything unusual (as in, nothing that doesn't also occur on the other clusters where no leaking can be observed).

If you have any additional suggestions on how to perform further analysis, I'll be happy to help.

cc @BugRoger @SchwarzM from my team who might be able to provide additional details

@majewsky

The problem appears to be specific to the node where the leading controller-manager runs. Since the switch to a new leader, the situation has stabilized:

screen shot 2017-01-12 at 13 01 49

@huggsboson
Contributor

We're seeing a similar issue on 1.4.7. Looking at the goroutines in the apiserver, controller-manager, and scheduler, we see that they are largely stuck in ListAndWatch.func1:

curl localhost:8080/debug/pprof/goroutine?debug=2
...
goroutine 441 [select, 1776 minutes]:
k8s.io/kubernetes/pkg/client/cache.(*Reflector).ListAndWatch.func1(0xc820e9a0b0, 0xc82008cfc0, 0xc820e98f80, 0xc8224a8240, 0xc820e9a0b8)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/client/cache/reflector.go:279 +0x31e
created by k8s.io/kubernetes/pkg/client/cache.(*Reflector).ListAndWatch
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/client/cache/reflector.go:292 +0xde3
...

This makes us think it is fixed by:
#35672
#35015

But those were only cherry-picked into 1.5.x.
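
To put a rough number on it, the stuck reflectors can be counted from the same pprof dump on each component. The ports below are an assumption (the default insecure ports: 8080 for the apiserver, 10251 for the scheduler, 10252 for the controller-manager) and require --profiling to be left at its default:

for port in 8080 10251 10252; do
  echo -n "port $port: "
  curl -s "localhost:$port/debug/pprof/goroutine?debug=2" | grep -c 'ListAndWatch.func1('
done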

@kargakis
Member

@kubernetes/sig-api-machinery-misc

@wojtek-t
Member

Yeah - I think this is #35672.

@huggsboson
Contributor

Any chance of getting the fix applied to 1.4.9?

@huggsboson
Contributor

Here's what our goroutine graphs look like, just to establish that this presents the same way from a metrics perspective:
goroutines
