
net/http: http2 clientConnPool may cache a dead connection forever (write tcp use of closed network connection) #39750


Description

@povsister

What version of Go are you using (go version)?

go version go1.13.9 linux/amd64

Does this issue reproduce with the latest release?

Yes. It still happens after kubelet was re-compiled with Go 1.14.4.
However, I saw the connection health check commit golang/net@0ba52f6.
It should partly help avoid this problem, but it is not enabled by default.
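
For reference, a minimal sketch of enabling that health check, assuming a golang.org/x/net version that already contains the commit above (the timeout values here are arbitrary examples, not recommendations):

package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHTTP2Client returns a client whose HTTP/2 transport pings idle
// connections so a dead connection is eventually detected and dropped
// from the pool.
func newHTTP2Client() *http.Client {
	t := &http2.Transport{
		TLSClientConfig: &tls.Config{},
		// ReadIdleTimeout enables the health check: if no frame is read
		// on a connection for this long, a PING frame is sent.
		ReadIdleTimeout: 30 * time.Second,
		// PingTimeout bounds how long to wait for the PING ack before
		// the connection is considered broken and closed.
		PingTimeout: 15 * time.Second,
	}
	return &http.Client{Transport: t}
}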

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/root/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/root/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

If we close the connection via conn.Close, the readLoop immediately receives a read tcp ... use of closed network connection error. A cleanup is then performed: every stream on this connection is notified, the broken connection is removed from the clientConnPool, and so on.
Everything seems fine, right?

But sometimes the readLoop of a clientConn is blocked on an idle connection.
And the writeHeaders method of clientConn, called from roundTrip, never closes a broken connection; it only reports the error to its caller.
This is where the problem begins.

Imagine this: an incoming request is about to use this idle connection. Right after it is taken from the pool, the connection dies, so writeHeaders fails with write tcp ... use of closed network connection and the broken connection is left untouched in the pool.

https://github.com/golang/net/blob/627f9648deb96c27737b83199d44bb5c1010cbcf/http2/transport.go#L1083-L1100

Since there is no traffic on the connection, the readLoop stays blocked, so no cleanup is triggered from that side.
On top of that, incoming requests keep calling roundTrip -> newStream, which raises the stream count on the connection and keeps it looking busy, so closeIfIdle will not clean up the dead connection either.
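
To illustrate just the error class involved (outside the http2 package), here is a minimal, self-contained sketch: writing on a net.Conn after it has been closed produces the same write tcp ...: use of closed network connection error seen in the logs below, and the failed write by itself has no side effect that would remove anything from a pool.

package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// The accepted side stays open but silent, like an idle HTTP/2
	// connection with nothing for readLoop to read.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		_ = c     // keep the server side referenced
		select {} // hold the connection open, never send a frame
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}

	// The connection "dies" right after being handed out.
	conn.Close()

	// The next write fails the same way writeHeaders does; nothing else
	// is cleaned up as a consequence of this error.
	_, werr := conn.Write([]byte("HEADERS"))
	fmt.Println(werr)
	// write tcp 127.0.0.1:...->127.0.0.1:...: use of closed network connection
}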

An actual case

kubernetes/kubernetes#87615

We have encountered this problem several times on heavily loaded VMs. Heavy load can cause some VMs to be temporarily suspended by the master machine, which is why a connection may die right after it is taken from the pool.
Unfortunately, this scenario is hard to reproduce.

kubelet keeps using dead connections
kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-06-16 08:45:59 CST; 1h 25min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 21344 (kubelet)
    Tasks: 0
   CGroup: /system.slice/kubelet.service
           ‣ 21344 /usr/bin/kubelet --config=/opt/kubernetes/cfg/kubelet-config --root-dir=/opt/kubernetes/data/kubelet --cert-dir=/opt/kubernetes/data/kubelet/pki --network-plugin=cni --cni-bin-dir=/opt/kubernetes/cni --cni-conf-dir=/opt/kubernetes/cni/conf --kubeconfig=/opt/kubernetes/ssl/native.kubeconfig --hostname-override=autovmcos76vm01.company.com --node-labels=master=true,role=loadbalancer --v=1 --logtostderr=true --runtime-cgroups=/systemd/system.slice --pod-infra-container-image=localhost:5000/sandbox/pause:3.1

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.160008 21344 reflector.go:153] object-"kube-system"/"dns-hosts-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Ddns-hosts-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.195291 21344 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://autovmcos76vm01.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/autovmcos76vm01.company.com?timeout=10s: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.358961 21344 reflector.go:153] object-"core"/"cdf-deployer-token-4qr6w": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/secrets?fieldSelector=metadata.name%3Dcdf-deployer-token-4qr6w&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.558938 21344 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://autovmcos76vm01.company.com:8443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.759003 21344 reflector.go:153] object-"demo2ns"/"registrypullsecret": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/demo2ns/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: W0616 10:11:39.958867 21344 status_manager.go:530] Failed to get status for pod "kube-registry-h67dx_core(4603bbf2-b773-4155-95a4-f73bdf2048eb)": Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/pods/kube-registry-h67dx: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.159135 21344 reflector.go:153] object-"core"/"base-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Dbase-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.358980 21344 reflector.go:153] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcoredns&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.559314 21344 reflector.go:153] object-"kube-system"/"itom-cdf-keepalived-token-c76dd": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ditom-cdf-keepalived-token-c76dd&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.761201 21344 reflector.go:153] object-"core"/"default-database-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Ddefault-database-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection
