
(1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection) #87615

Closed
ghost opened this issue Jan 28, 2020 · 163 comments · Fixed by #104444
Labels
kind/support Categorizes issue or PR as a support question. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@ghost

ghost commented Jan 28, 2020

We've just upgraded our production cluster to 1.17.2.

Since the update on Saturday, we've had this strange outage: after a NIC bond failure (which recovers shortly afterwards), the kubelet has all of its connections broken and never tries to re-establish them unless it is manually restarted.

Here is the timeline of the last time it occurred:

01:31:16: the kernel detects a failure on the bond interface. It stays down for a while and eventually recovers.

Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Configured

As expected, all watches are closed. The message is the same for all of them:

...
Jan 28 01:31:44 baremetal044 kubelet-wrapper[2039]: W0128 04:31:44.352736    2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received
...

So these messages begin:

`Jan 28 01:31:44 baremetal44 kubelet-wrapper[2039]: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`

I'm guessing this shouldn't be a problem for a while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart the kubelet around 09:00 to get things back to normal.

# journalctl --since '2020-01-28 01:31'   | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d1 -d':' | sort | uniq -dc
   9757 01
  20663 02
  20622 03
  20651 04
  20664 05
  20666 06
  20664 07
  20661 08
  16655 09
      3 10

The API servers were up and running, all other nodes were up and running, and everything else was pretty uneventful. This was the only node affected (today) by the problem.

Is there any way to mitigate this kind of event?

Would this be a bug?

@ghost ghost added the kind/support Categorizes issue or PR as a support question. label Jan 28, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 28, 2020
@rikatz
Contributor

rikatz commented Jan 28, 2020

/sig node
/sig api-machinery

Taking a look at the code, the error happens here.

The explanation in the code is that it assumes the error is probably an EOF (IsProbableEOF), while in this case it doesn't seem to be.
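For reference, here is a minimal sketch of that classification from a caller's point of view, using the apimachinery helper linked above (illustrative only, assuming the current k8s.io/apimachinery behavior; this is not the kubelet's actual code path):

package main

import (
	"errors"
	"fmt"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

func main() {
	// The error string kubelet logs after the NIC failure.
	err := errors.New("write tcp 10.0.0.1:42518->10.79.32.131:443: use of closed network connection")
	if utilnet.IsProbableEOF(err) {
		// The reflector treats this as a benign end-of-watch and simply
		// re-establishes the watch rather than surfacing a hard error.
		fmt.Println("classified as probable EOF; watch will be retried")
	} else {
		fmt.Println("classified as a real error")
	}
}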

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 28, 2020
@fedebongio
Contributor

/assign @caesarxuchao

@caesarxuchao
Member

@rikatz can you elaborate on how you tracked it down to the code you pasted?

My thought is that the reflector would have restarted the watch no matter how it handles the error (code), so it doesn't explain the failure to recover.

@rikatz
Contributor

rikatz commented Jan 29, 2020

Exactly, @caesarxuchao, so that is our question as well.

I tracked the error basically by grepping through the code and cross-referencing with what the kubelet was doing at the time (watching secrets) to get to that part.

Not a very advanced method, though this does seem to be the exact point where the error is handled.

The question is: because the connection is closed, is something flagging this as the watch EOF instead of treating it as an error?

@ghost
Author

ghost commented Jan 29, 2020

I don't have anything smarter to add, other than that we had another node fail the same way, bringing the occurrences over the last 4 days to 4.

I will try to map whether bond disconnect events are happening on other nodes and whether the kubelet recovers from them - it might be bad luck on some recoveries, and not a 100% reproducible event.

@towolf
Contributor

towolf commented Feb 1, 2020

I think we are seeing this too, but we do not have bonds; we only see these networkd "carrier lost" messages for Calico cali* interfaces, which are local veth devices.

@abays

abays commented Feb 4, 2020

I have encountered this as well, with no bonds involved. Restarting the node fixes the problem, but just restarting the Kubelet service does not (all API calls fail with "Unauthorized").

@abays

abays commented Feb 4, 2020

I have encountered this as well, with no bonds involved. Restarting the node fixes the problem, but just restarting the Kubelet service does not (all API calls fail with "Unauthorized").

Update: restarting Kubelet did fix the problem after enough time (1 hour?) was allowed to pass.

@cranky-coder

I am seeing this same behavior. Ubuntu 18.04.3 LTS clean installs, cluster built with Rancher 2.3.4. I have seen this happen periodically lately, and just restarting the kubelet tends to fix it for me. Last night all 3 of my worker nodes exhibited this same behavior. I corrected two to bring my cluster up; the third is still in this state while I'm digging around.

@r-catania

We are seeing the same issue on CentOS 7, on a cluster freshly built with Rancher (1.17.2). We are using Weave. All three worker nodes are showing this issue. Restarting the kubelet does not work for us; we have to restart the entire node.

@mlmhl
Contributor

mlmhl commented Mar 9, 2020

/sig node
/sig api-machinery

Taking a look at the code, the error happens here.

The explanation in the code is that it assumes the error is probably an EOF (IsProbableEOF), while in this case it doesn't seem to be.

We are also seeing the same issue. From the logs, we found that after the problem occurred, all subsequent requests were still sent on the same connection. It seems that although the client resends the request to the apiserver, the underlying http2 library still maintains the old connection, so all subsequent requests are still sent on that connection and receive the same use of closed connection error.

So the question is why http2 still maintains an already-closed connection. Maybe the connection it maintains is indeed alive, but some intermediate connection was closed unexpectedly?
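What eventually addressed this failure mode is a client-side ping health check in golang.org/x/net/http2 (later enabled by default in client-go). Below is a minimal sketch of turning it on with a raw http2.Transport, assuming a recent x/net version; this is not how kubelet actually wires its transport:

package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t := &http2.Transport{
		// Sketch only: use a real CA bundle instead of skipping verification.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		// If no frame arrives within ReadIdleTimeout, send an HTTP/2 ping;
		// if the ping is not answered within PingTimeout, close the
		// connection so the next request dials a fresh one instead of
		// reusing a dead TCP connection forever.
		ReadIdleTimeout: 30 * time.Second,
		PingTimeout:     15 * time.Second,
	}
	client := &http.Client{Transport: t}
	_ = client // e.g. client.Get("https://apiserver:443/healthz")
}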

@sbiermann

I have the same issue very often with a Raspberry Pi cluster running k8s 1.17.3. Based on some older issues, I set the kube-apiserver HTTP/2 stream limit to 1000 ("- --http2-max-streams-per-connection=1000"); it was fine for more than 2 weeks, but now it has started again.

@ik9999

ik9999 commented Mar 12, 2020

Is it possible to rebuild kube-apiserver (https://github.com/kubernetes/apiserver/blob/b214a49983bcd70ced138bd2717f78c0cff351b2/pkg/server/secure_serving.go#L50)
with s.DisableHTTP2 set to true by default?
Is there a Dockerfile for the official image (k8s.gcr.io/kube-apiserver:v1.17.3)?
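Not an answer about rebuilding the kube-apiserver image, but for reference: a plain Go client can opt out of HTTP/2 without any rebuild by setting a non-nil, empty TLSNextProto map on its transport (the documented net/http mechanism). This is a generic sketch, not kubelet's or kube-apiserver's configuration:

package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	// A non-nil, empty TLSNextProto map disables HTTP/2 on this transport,
	// so requests fall back to HTTP/1.1 and do not share one multiplexed
	// TCP connection.
	tr := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	client := &http.Client{Transport: tr}
	_ = client
}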

@mritd

mritd commented Mar 13, 2020

Same here (Ubuntu 18.04, Kubernetes 1.17.3).

@JensErat

We also observed this in two of our clusters. Not entirely sure about the root cause, but at least we were able to see it happen in clusters with very high watch counts. I was not able to reproduce it by forcing a high number of watches per kubelet though (I started pods with 300 secrets per pod, which also resulted in 300 watches per pod in the Prometheus metrics). Setting very low http2-max-streams-per-connection values did not trigger the issue either, but at least I was able to observe some unexpected scheduler and controller-manager behavior (which might have just been overload from endless re-watch loops or something like that).

@sbiermann

As a workaround, all of my nodes restart the kubelet every night via a local cronjob. Now, after 10 days, I can say it works for me; I no longer see "use of closed network connection" on my nodes.

@ik9999

ik9999 commented Mar 20, 2020

@sbiermann
Thank you for posting this. What time interval do you use for the cronjob?

@sbiermann

24 hours

@chrischdi
Member

I can also confirm this issue. We are not yet on 1.17.3 and are currently running Ubuntu 19.10:

Linux <STRIPPED>-kube-node02 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:27:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

NAME                  STATUS   ROLES    AGE   VERSION       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
STRIPPED-kube-node02   Ready    <none>   43d   v1.16.6   10.6.0.12     <none>        Ubuntu 19.10   5.3.0-29-generic   docker://19.3.3

@RikuXan

RikuXan commented Mar 30, 2020

I can also confirm this on Kubernetes 1.17.4 deployed through Rancher 2.3.5 on RancherOS 1.5.5 nodes. Restarting the kubelet seems to work for me; I don't have to restart the whole node.

The underlying cause for me seems to be RAM getting close to running out and kswapd0 getting up to 100% CPU usage due to that, since I forgot to set the swappiness to 0 for my Kubernetes nodes. After setting the swappiness to 0 and adding some RAM to the machines, the issue hasn't reoccurred for me yet.

@caesarxuchao
Member

If the underlying issue was "http2 using dead connections", then restarting the kubelet should fix the problem. #48670 suggested that reducing TCP_USER_TIMEOUT can mitigate the problem. I have opened golang/net#55 to add a client-side connection health check to the http2 library, but it's going to take more time to land.

If restarting kubelet didn't solve the issue, then probably it's a different root cause.
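For the TCP_USER_TIMEOUT mitigation mentioned above, here is a Linux-only sketch of setting it on a dialer (illustrative, using golang.org/x/sys/unix; not the actual kubelet dialer). With this option, data that stays unacknowledged for the given duration fails the connection instead of letting requests hang on a dead path:

package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

func dialerWithUserTimeout(timeout time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Fail the connection if transmitted data stays
				// unacknowledged for longer than the timeout.
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
					unix.TCP_USER_TIMEOUT, int(timeout.Milliseconds()))
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}

func main() {
	d := dialerWithUserTimeout(30 * time.Second)
	_ = d // plug into an http.Transport via DialContext: d.DialContext
}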

@pytimer
Contributor

pytimer commented Apr 4, 2020

I have the same issue with v1.17.2 when restarting the network, but only one node has this issue (my cluster has five nodes), and I can't reproduce it. Restarting the kubelet solved the problem.

How can I avoid this issue? Should I upgrade to the latest version, or is there another way to fix it?

@ik9999

ik9999 commented Apr 5, 2020

I've worked around it by running this bash script every 5 minutes:

#!/bin/bash
# Check the most recent kubelet journal line for the error string and
# restart the kubelet when it is found.
output=$(journalctl -u kubelet -n 1 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
  echo "Error not found in logs"
elif [[ -n $output ]]; then
  echo "Restart kubelet"
  systemctl restart kubelet
fi

christarazi pushed a commit to cilium/metallb that referenced this issue Jun 14, 2021
Golang 1.13 has horrible bugs that might have affected MetalLB
as it's using client-go / watches,
see kubernetes/kubernetes#87615
Newer client-go versions (which we are now using) have implemented the
HTTP/2 health check by default, but this is a workaround, not a fix.

Also Golang 1.13 is not supported anymore

Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
brb added a commit to cilium/cilium that referenced this issue Nov 26, 2021
As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which
could result in lost connections to kube-apiserver. Worse than this was
that the client couldn't recover.

In the case of CoreDNS, the loss of connectivity to kube-apiserver was
not even logged. I have validated this by adding the following rule on
the node which was running the CoreDNS pod (6443 port as the socket-lb
was doing the service xlation):

    iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP

After upgrading CoreDNS to the one which was compiled with Go >= 1.16,
the pod was not only logging the errors, but also was able to recover
from them in a fast way. An example of such an error:

    W1126 12:45:08.403311       1 reflector.go:436]
    pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch
    of *v1.Endpoints ended with: an error on the server ("unable to
    decode an event from the watch stream: http2: client connection
    lost") has prevented the request from succeeding

To determine the min vsn bump, I was using the following:

    for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
        docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" \
            --version
    done

    CoreDNS-1.7.0
    linux/amd64, go1.14.4, f59c03d
    CoreDNS-1.7.1
    linux/amd64, go1.15.2, aa82ca6
    CoreDNS-1.8.0
    linux/amd64, go1.15.3, 054c9ae
    k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
    k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
    CoreDNS-1.8.3
    linux/amd64, go1.16, 4293992
    CoreDNS-1.8.4
    linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service
domain name is not available after 7min. In other words, CoreDNS is not
able to resolve the name which means that it hasn't received update from
the kube-apiserver for the service.

[1]: kubernetes/kubernetes#87615 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb added a commit to cilium/cilium that referenced this issue Nov 27, 2021
qmonnet pushed a commit to cilium/cilium that referenced this issue Nov 29, 2021
nathanjsweet pushed a commit to cilium/cilium that referenced this issue Dec 2, 2021
nathanjsweet pushed a commit to cilium/cilium that referenced this issue Dec 3, 2021
nathanjsweet pushed a commit to cilium/cilium that referenced this issue Dec 6, 2021
nathanjsweet pushed a commit to cilium/cilium that referenced this issue Dec 6, 2021
joestringer pushed a commit to cilium/cilium that referenced this issue Dec 10, 2021
nbusseneau pushed a commit to cilium/cilium that referenced this issue Dec 14, 2021
tklauser pushed a commit to cilium/cilium that referenced this issue Dec 15, 2021
@LuChenjing

@cloud-66 I know, the point here is that a kubelet restart fixes the issue, so this is not fixed in 1.18.18, you need 1.19+

@champtar Hi, may I ask which solution (or which exact K8s version) you tried? I noticed someone merged this into 1.18 and I was able to find it in the 1.18.18 CHANGELOG (#100376).
However, we hit this issue in 1.18.20 as well...

@hashwing

hashwing commented Dec 27, 2021 via email

@champtar
Contributor

@cloud-66 I know, the point here is that a kubelet restart fixes the issue, so this is not fixed in 1.18.18, you need 1.19+

@champtar Hi, may I ask which solution (or which exact K8s version) you tried? I noticed someone merged this into 1.18 and I was able to find it in the 1.18.18 CHANGELOG (#100376). However, we hit this issue in 1.18.20 as well...

The exact version at the time was v1.19.10

rata pushed a commit to kinvolk/metallb that referenced this issue Feb 17, 2022
@kwenzh

kwenzh commented Jul 17, 2023

The same issue here, with kernel 5.15, kubelet 1.18.19, and kube-apiserver v1.18.6:

E0716 17:14:05.352004 55625 server.go:253] Unable to authenticate the request due to an error: Post "https://1xxxxx:8443/apis/authentication.k8s.io/v1/tokenreviews": read tcp 10.2xxxx:33288->10.xxxx 8443: use of closed network connection

novad03 pushed a commit to novad03/k8s-meta that referenced this issue Nov 25, 2023