What version of Go are you using (go version)?
go version go1.13.9 linux/amd64
Does this issue reproduce with the latest release?
Yes. It still happens after kubelet was recompiled with Go 1.14.4.
But I saw a commit, connection health check (golang/net@0ba52f6), which should partly help avoid this problem. It is not enabled by default, though.
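For reference, a minimal sketch of what opting in to that health check could look like, assuming a golang.org/x/net/http2 version that already contains golang/net@0ba52f6 (the ReadIdleTimeout and PingTimeout fields come from that change; the 30s/15s values below are arbitrary):

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newClient() *http.Client {
	t := &http2.Transport{
		// If no frame is received for this long on an otherwise quiet
		// connection, the transport sends a PING frame to check its health.
		ReadIdleTimeout: 30 * time.Second,
		// If the PING is not answered within this window, the connection is
		// closed, which unblocks readLoop and triggers the usual cleanup.
		PingTimeout: 15 * time.Second,
	}
	return &http.Client{Transport: t}
}

func main() {
	_ = newClient() // use this client for HTTPS requests, e.g. to the apiserver
}
```

Leaving ReadIdleTimeout at its zero value keeps the check disabled, which matches the "not enabled by default" note above.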
What operating system and processor architecture are you using (go env)?
go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/root/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/root/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
What did you do?
If we close the connection via conn.Close, the readLoop will immediately receive a read tcp ... use of closed connection error. Then a cleanup is performed: every stream on this connection is notified, the broken connection is removed from the clientConnPool, and so on.
Everything seems fine, right?
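For comparison, here is a minimal, self-contained sketch of that healthy cleanup path (it needs Go 1.14+ for httptest's EnableHTTP2, and it kills the connection from the server side simply because that is easy to do in a test program; the recovery is analogous when the socket is closed on the client side, since readLoop observes a read error either way and drops the dead connection from the pool):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// HTTP/2-enabled test server (EnableHTTP2 requires Go 1.14+).
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "proto=%s", r.Proto)
	}))
	srv.EnableHTTP2 = true
	srv.StartTLS()
	defer srv.Close()

	client := srv.Client() // already trusts the test cert and attempts HTTP/2

	doGet := func(label string) {
		resp, err := client.Get(srv.URL)
		if err != nil {
			fmt.Println(label, "error:", err)
			return
		}
		resp.Body.Close()
		fmt.Println(label, "ok:", resp.Status, resp.Proto)
	}

	doGet("first")               // establishes the pooled HTTP/2 connection
	srv.CloseClientConnections() // kill that connection underneath the client
	doGet("second")              // may fail once while readLoop notices the error...
	doGet("third")               // ...but later requests get a fresh connection
}
```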
But sometimes the readLoop of a clientConn may block on an idle connection. And the writeHeaders method of clientConn, called from roundTrip, never closes a broken connection; it just reports the error to its caller.
This is where the problem begins.
Imagine this: an incoming request is about to use this idle connection. Right after it is taken from the pool, the connection dies, so the writeHeaders method fails with a write tcp ... use of closed connection error and leaves the broken connection untouched in the pool. Since there is no traffic on the connection, readLoop stays blocked and no cleanup is ever triggered.
Plus, incoming requests keep calling roundTrip -> newStream, which bumps the stream count on the connection and makes it look "busy", so closeIfIdle will not clean up the dead connection either.
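To make that sequence concrete, here is a deliberately simplified, hypothetical model of the failure; the names echo the x/net/http2 internals, but this is not the real source, just a paraphrase of the behaviour described in this report:

```go
package main

import (
	"errors"
	"fmt"
)

// clientConn is a toy stand-in for the real http2 clientConn.
type clientConn struct {
	streams int  // active streams; > 0 means closeIfIdle treats the conn as busy
	dead    bool // the underlying net.Conn has already been closed
}

// writeHeaders reports the write error to its caller but, as described above,
// never closes or unregisters the broken connection.
func (cc *clientConn) writeHeaders() error {
	if cc.dead {
		return errors.New("write tcp ...: use of closed network connection")
	}
	return nil
}

// roundTrip models the request path: a new stream is counted before the write,
// so the connection looks "busy" even though every write on it fails.
func (cc *clientConn) roundTrip() error {
	cc.streams++
	defer func() { cc.streams-- }()
	return cc.writeHeaders()
}

func main() {
	cc := &clientConn{dead: true} // an idle pooled conn whose socket just died
	for i := 1; i <= 3; i++ {
		// Every request that picks this conn from the pool fails, and nothing
		// removes it: readLoop is blocked and closeIfIdle sees it as busy.
		fmt.Printf("request %d -> %v\n", i, cc.roundTrip())
	}
}
```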
An actual case
We encountered this problem several times on heavily loaded VMs. Heavy load can cause some VMs to be temporarily suspended by the host machine, which is why a connection may die right after being taken from the pool. But this scenario is hard to reproduce.
kubelet keeps using dead connections:
kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-06-16 08:45:59 CST; 1h 25min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 21344 (kubelet)
    Tasks: 0
   CGroup: /system.slice/kubelet.service
           ‣ 21344 /usr/bin/kubelet --config=/opt/kubernetes/cfg/kubelet-config --root-dir=/opt/kubernetes/data/kubelet --cert-dir=/opt/kubernetes/data/kubelet/pki --network-plugin=cni --cni-bin-dir=/opt/kubernetes/cni --cni-conf-dir=/opt/kubernetes/cni/conf --kubeconfig=/opt/kubernetes/ssl/native.kubeconfig --hostname-override=autovmcos76vm01.company.com --node-labels=master=true,role=loadbalancer --v=1 --logtostderr=true --runtime-cgroups=/systemd/system.slice --pod-infra-container-image=localhost:5000/sandbox/pause:3.1
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.160008 21344 reflector.go:153] object-"kube-system"/"dns-hosts-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Ddns-hosts-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.195291 21344 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://autovmcos76vm01.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/autovmcos76vm01.company.com?timeout=10s: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.358961 21344 reflector.go:153] object-"core"/"cdf-deployer-token-4qr6w": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/secrets?fieldSelector=metadata.name%3Dcdf-deployer-token-4qr6w&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.558938 21344 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://autovmcos76vm01.company.com:8443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.759003 21344 reflector.go:153] object-"demo2ns"/"registrypullsecret": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/demo2ns/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: W0616 10:11:39.958867 21344 status_manager.go:530] Failed to get status for pod "kube-registry-h67dx_core(4603bbf2-b773-4155-95a4-f73bdf2048eb)": Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/pods/kube-registry-h67dx: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.159135 21344 reflector.go:153] object-"core"/"base-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Dbase-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.358980 21344 reflector.go:153] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcoredns&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.559314 21344 reflector.go:153] object-"kube-system"/"itom-cdf-keepalived-token-c76dd": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ditom-cdf-keepalived-token-c76dd&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.761201 21344 reflector.go:153] object-"core"/"default-database-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Ddefault-database-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection