
Nodes not hosting the API VIP fail with dial tcp 192.168.123.5:6443: connect: invalid argument #344

Closed
yprokule opened this issue Apr 12, 2019 · 14 comments

@yprokule (Contributor)

After the cluster has been up for some time, it starts to fail with errors like:

Apr 12 09:48:31 master-1 hyperkube[23740]: E0412 09:48:31.776526   23740 kubelet.go:2273] node "master-1" not found
Apr 12 09:48:31 master-1 hyperkube[23740]: E0412 09:48:31.843612   23740 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://api.ostest.test.metalkube.org:6443/ap
i/v1/nodes?fieldSelector=metadata.name%3Dmaster-1&limit=500&resourceVersion=0: dial tcp 192.168.123.5:6443: connect: invalid argument
Apr 12 09:48:31 master-1 hyperkube[23740]: E0412 09:48:31.844422   23740 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://api.ostest.test.metalkube.org:
6443/api/v1/pods?fieldSelector=spec.nodeName%3Dmaster-1&limit=500&resourceVersion=0: dial tcp 192.168.123.5:6443: connect: invalid argument
Apr 12 09:48:31 master-1 hyperkube[23740]: E0412 09:48:31.876801   23740 kubelet.go:2273] node "master-1" not found
Apr 12 09:48:31 master-1 hyperkube[23740]: E0412 09:48:31.977082   23740 kubelet.go:2273] node "master-1" not found
Apr 12 09:48:32 master-1 hyperkube[23740]: E0412 09:48:32.054813   23740 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://api.ostest.test.metalkube.org:6443
/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.123.5:6443: connect: invalid argument

Attempting to ping this IP from any other master node (except the one that hosts it) fails:

[root@master-1 ~]# ip a | grep 192.168.123.5
[root@master-1 ~]# ping 192.168.123.5
connect: Invalid argument
[root@master-1 ~]# curl -4kvL 192.168.123.5:6443
* Rebuilt URL to: 192.168.123.5:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Immediate connect fail for 192.168.123.5: Invalid argument
* Closing connection 0
curl: (7) Couldn't connect to server
[root@master-1 ~]# curl -4kvL api.ostest.test.metalkube.org:6443                                                                                                                                                   
* Rebuilt URL to: api.ostest.test.metalkube.org:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Immediate connect fail for 192.168.123.5: Invalid argument
* Closing connection 0
curl: (7) Couldn't connect to server
[root@master-1 ~]# 
[root@master-2 ~]# ip a | grep 192.168.123.5
[root@master-2 ~]# ping 192.168.123.5
connect: Invalid argument
[root@master-2 ~]# curl -4kvL 192.168.123.5:6443
* Rebuilt URL to: 192.168.123.5:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Immediate connect fail for 192.168.123.5: Invalid argument
* Closing connection 0
curl: (7) Couldn't connect to server
[root@master-2 ~]# curl -4kvL api.ostest.test.metalkube.org:6443                                                                                                                                                   
* Rebuilt URL to: api.ostest.test.metalkube.org:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Immediate connect fail for 192.168.123.5: Invalid argument
* Closing connection 0
curl: (7) Couldn't connect to server
[root@master-2 ~]# 

and from master-0, which hosts the API VIP:

[root@master-0 ~]# ip a |grep 192.168.123.5
    inet 192.168.123.5/32 scope global ens4
[root@master-0 ~]# curl -4klv api.ostest.test.metalkube.org:6443
* Rebuilt URL to: api.ostest.test.metalkube.org:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Connected to api.ostest.test.metalkube.org (192.168.123.5) port 6443 (#0)
> GET / HTTP/1.1
> Host: api.ostest.test.metalkube.org:6443
> User-Agent: curl/7.61.1
> Accept: */*
> 
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
* Failed writing body (0 != 7)
* Closing connection 0
[root@master-0 ~]# curl -4klv 192.168.123.5:6443
* Rebuilt URL to: 192.168.123.5:6443/
*   Trying 192.168.123.5...
* TCP_NODELAY set
* Connected to 192.168.123.5 (192.168.123.5) port 6443 (#0)
> GET / HTTP/1.1
> Host: 192.168.123.5:6443
> User-Agent: curl/7.61.1
> Accept: */*
> 
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
* Failed writing body (0 != 7)
* Closing connection 0
[root@master-0 ~]# 
@yprokule (Contributor, Author)

/cc @mcornea @achuzhoy @celebdor

@hardys commented Apr 12, 2019

Thanks for the report. Can you provide the keepalived logs from all masters, please?
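(For collecting these: assuming keepalived runs as a container on each master, something along the following lines should pull the logs; whether it is managed by CRI-O or by podman depends on the deployment, so use whichever applies.)

# if keepalived is a CRI-O managed container:
[core@master-0 ~]$ sudo crictl ps --name keepalived -q
[core@master-0 ~]$ sudo crictl logs <container-id>
# if it runs under podman instead:
[core@master-0 ~]$ sudo podman logs keepalived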

@yprokule (Contributor, Author) commented Apr 12, 2019

@hardys logs from the keepalived containers

@yprokule (Contributor, Author)

Worth mentioning that both master and worker nodes end up in NotReady status:

oc get nodes
NAME       STATUS     ROLES    AGE     VERSION
master-0   NotReady   master   3d2h    v1.13.4+1ad602308
master-1   NotReady   master   3d2h    v1.13.4+1ad602308
master-2   NotReady   master   3d2h    v1.13.4+1ad602308
worker-0   NotReady   worker   2d22h   v1.13.4+1ad602308

@jtaleric (Contributor)

Hit this issue in my deployment; only master-2 went NotReady. Exact same symptoms as described here.

@yboaron (Contributor) commented Apr 15, 2019

@yprokule, it seems like there's an L2 connectivity issue for 192.168.123.5 (since ping doesn't work):

  1. Could you please run the same test for 192.168.123.6 (DNS) and 192.168.123.10 (INGRESS)?
  2. Could you please attach the output of the ARP table (arp -a) from all nodes? (Example commands below.)
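(A sketch of the requested checks, using the VIPs named above; run on each node. If arp isn't available on the host, ip neigh show gives the same information.)

[root@master-0 ~]# ping -c1 192.168.123.6    # DNS VIP
[root@master-0 ~]# ping -c1 192.168.123.10   # Ingress VIP
[root@master-0 ~]# arp -a                    # or: ip neigh show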

@yprokule (Contributor, Author)

@yboaron

master-0

[root@master-0 ~]# ping -c1 192.168.123.6
PING 192.168.123.6 (192.168.123.6) 56(84) bytes of data.
64 bytes from 192.168.123.6: icmp_seq=1 ttl=64 time=0.029 ms

--- 192.168.123.6 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms
[root@master-0 ~]# ping -c1 192.168.123.5
connect: Invalid argument

master-1

[root@master-1 ~]# ping -c1 192.168.123.5
PING 192.168.123.5 (192.168.123.5) 56(84) bytes of data.
64 bytes from 192.168.123.5: icmp_seq=1 ttl=64 time=0.174 ms

--- 192.168.123.5 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.174/0.174/0.174/0.000 ms
[root@master-1 ~]# ping -c1 192.168.123.6
PING 192.168.123.6 (192.168.123.6) 56(84) bytes of data.
64 bytes from 192.168.123.6: icmp_seq=1 ttl=64 time=0.213 ms

--- 192.168.123.6 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.213/0.213/0.213/0.000 ms

master-2

[root@master-2 ~]# ping -c1 192.168.123.6
PING 192.168.123.6 (192.168.123.6) 56(84) bytes of data.
64 bytes from 192.168.123.6: icmp_seq=1 ttl=64 time=0.163 ms

--- 192.168.123.6 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.163/0.163/0.163/0.000 ms
[root@master-2 ~]# ping -c1 192.168.123.5
PING 192.168.123.5 (192.168.123.5) 56(84) bytes of data.
64 bytes from 192.168.123.5: icmp_seq=1 ttl=64 time=0.030 ms

--- 192.168.123.5 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.030/0.030/0.030/0.000 ms

@yprokule (Contributor, Author)

On the other nodes:

Apr 15 16:15:41 master-2 hyperkube[121307]: E0415 16:15:41.413449  121307 kubelet.go:2273] node "master-2" not found
Apr 15 16:15:41 master-2 hyperkube[121307]: E0415 16:15:41.497345  121307 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
Apr 15 16:15:41 master-2 hyperkube[121307]: E0415 16:15:41.498449  121307 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope

@yboaron (Contributor) commented Apr 16, 2019

@yprokule, I think I found something.

Master-0 doesn't hold the API VIP (192.168.123.5), but I can still see the following host entry in its routing table:
192.168.123.5 dev ens4 proto kernel scope link src 192.168.123.5 metric 101

So, when Master-0 tries to send any packet to 192.168.123.5, the network stack fails with 'connect: Invalid argument'.

I deleted the 192.168.123.5 route from Master-0, and now I'm able to ping 192.168.123.5.

[core@master-0 ~]$ sudo ip route del 192.168.123.5/32
[core@master-0 ~]$ ping 192.168.123.5
PING 192.168.123.5 (192.168.123.5) 56(84) bytes of data.
64 bytes from 192.168.123.5: icmp_seq=1 ttl=64 time=0.170 ms
64 bytes from 192.168.123.5: icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from 192.168.123.5: icmp_seq=3 ttl=64 time=0.252 ms
64 bytes from 192.168.123.5: icmp_seq=4 ttl=64 time=0.210 ms
^C
--- 192.168.123.5 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 92ms
rtt min/avg/max/mdev = 0.098/0.182/0.252/0.058 ms
[core@master-0 ~]$

@karmab (Contributor) commented Apr 16, 2019

Yep, deleting the incorrect route fixed the NotReady state of the node, which was simply unable to reach the API to report its status.
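For reference, a minimal check-and-clean sketch of this workaround, using the VIP (192.168.123.5) and interface (ens4) from this environment; on a node that does not own the VIP, any leftover host route to it is the stale one:

[core@master-0 ~]$ ip a | grep 192.168.123.5       # no output => this node does not own the VIP
[core@master-0 ~]$ ip route | grep 192.168.123.5   # a host route showing up here is stale
192.168.123.5 dev ens4 proto kernel scope link src 192.168.123.5 metric 101
[core@master-0 ~]$ sudo ip route del 192.168.123.5/32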

@russellb (Member)

I see @karmab has a WIP patch for this here: #369

@yboaron (Contributor) commented Apr 16, 2019

Seems like an RHCOS/RHEL bug; I filed a BZ for it: https://bugzilla.redhat.com/show_bug.cgi?id=1700415

@russellb (Member)

A different workaround here: #377

@russellb (Member)

We've got an open bug tracking the kernel issue. In the meantime, we've updated our config such that the undeleted route won't cause a problem anymore. See #377
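(For anyone who can't pick up that change yet, one way to automate the manual workaround above is a keepalived notify script that removes any leftover host route whenever the node leaves the MASTER state. This is a sketch only, not necessarily what #377 does, and the script path is hypothetical.)

#!/bin/bash
# hypothetical /etc/keepalived/cleanup-vip-route.sh
# drop the host route to the API VIP that the kernel may leave behind after the address is removed
ip route del 192.168.123.5/32 2>/dev/null || true

and, inside the vrrp_instance for the API VIP in keepalived.conf:

    notify_backup "/etc/keepalived/cleanup-vip-route.sh"
    notify_fault "/etc/keepalived/cleanup-vip-route.sh"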
