Cluster failing to start: Failed to list with EOF and net/http: TLS handshake timeout #2687

@glennodickson

Description

Any help would be really appreciated!

  • OpenShift: 4.1.18
  • Platform: UPI
  • Load balancer: HAProxy
  • DNS: dnsmasq (as attached)
  • PXE boot: default (as attached)

Attached:

What happened?

  1. Generated all Ignition files and made sure that hidden files had been cleared.
  2. The bootstrap node starts and waits for etcd.
  3. Masters 0, 1 and 2 start and download their config from the bootstrap node.
  4. The bootstrap node reports "committed proposal" for each etcd member (see the output below).
  5. It then errors with "failed to fetch discovery .... :6443 ... connection refused" (see below).
  6. The kube-apiserver on the bootstrap node (port 6443) now seems to restart constantly.

All firewalls are disabled on the DNS and load balancer servers.

Could it be:

  • Certificate related?
  • A race condition because the PC is not fast enough? (It shouldn't be: it is an i9 with 8 physical cores.)
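
One way to narrow down the two hypotheses above is to probe port 6443 in stages. A refused TCP connection, a connection closed mid-handshake (EOF) and a TLS handshake timeout each point at a different layer. The following is a minimal Python sketch, not part of the installer. The `probe` name and its return strings are invented for illustration, and certificate verification is deliberately switched off because only the handshake stage is of interest here.

```python
import socket
import ssl


def probe(host, port, timeout=5.0):
    """Report which stage of a TLS connection to host:port fails (hypothetical helper)."""
    # Stage 1: plain TCP connect. "connection refused" means nothing is listening.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "tcp: connection refused"
    except socket.timeout:
        return "tcp: connect timeout"
    except OSError as exc:  # includes DNS lookup failures
        return f"tcp: {exc}"

    # Stage 2: TLS handshake. Verification is disabled on purpose (assumption:
    # we only want to know whether the handshake completes, not whether the
    # certificate is trusted).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return f"ok: {tls.version()}"
    except socket.timeout:
        return "tls: handshake timeout"
    except (ssl.SSLEOFError, ConnectionResetError):
        return "tls: EOF"
    except OSError as exc:  # any other TLS or socket error
        return f"tls: {exc}"
    finally:
        sock.close()
```

Calling `probe("localhost", 6443)` on the bootstrap node and `probe("api.test.fritz.box", 6443)` from the same machine would compare the direct path with the path through the load balancer. If only the second one fails, the load balancer or DNS is a more likely suspect than the certificates.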

Output highlights:

Bootstrap starts:
Masters 0/1/2 download their config and the etcd members are detected:

Nov 19 12:48:07 bootstrap bootkube.sh[1379]: Error: unhealthy cluster
Nov 19 12:48:08 bootstrap bootkube.sh[1379]: etcdctl failed. Retrying in 5 seconds...
Nov 19 12:49:42 bootstrap bootkube.sh[1379]: https://etcd-0.test.fritz.box:2379 is healthy: successfully committed proposal: took = 502.844138ms
Nov 19 12:50:20 bootstrap bootkube.sh[1379]: https://etcd-2.test.fritz.box:2379 is healthy: successfully committed proposal: took = 109.785824ms
Nov 19 12:50:25 bootstrap bootkube.sh[1379]: https://etcd-1.test.fritz.box:2379 is healthy: successfully committed proposal: took = 43.563555ms
Nov 19 12:50:26 bootstrap bootkube.sh[1379]: etcd cluster up. Killing etcd certificate signer...
Nov 19 12:50:32 bootstrap bootkube.sh[1379]: e0d21c36b68a594bc7b3260bfe9521df3c3dde39beaa352c5a690a4d33744eb2
Nov 19 12:50:33 bootstrap bootkube.sh[1379]: Starting cluster-bootstrap...
Nov 19 12:51:08 bootstrap bootkube.sh[1379]: Starting temporary bootstrap control plane...
Nov 19 12:51:08 bootstrap bootkube.sh[1379]: [#1] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
Nov 19 12:51:09 bootstrap bootkube.sh[1379]: E1119 12:51:09.175787       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
[core@bootstrap ~]$ watch netstat -plnt
tcp        0      0 0.0.0.0:57839           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::53471                :::*                    LISTEN      -
tcp6       0      0 :::9537                 :::*                    LISTEN      -
tcp6       0      0 :::10250                :::*                    LISTEN      -
tcp6       0      0 :::6443                 :::*                    LISTEN      -
tcp6       0      0 :::9099                 :::*                    LISTEN      -
tcp6       0      0 :::19531                :::*                    LISTEN      -
tcp6       0      0 :::10255                :::*                    LISTEN      -
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::10259                :::*                    LISTEN      -

The masters restart, and shortly afterwards the service on the bootstrap node (port 6443) restarts:

[core@bootstrap ~]$ journalctl -b -f -u bootkube.service
...
Nov 19 13:24:04 bootstrap bootkube.sh[31694]: E1119 13:24:04.841357       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:06 bootstrap bootkube.sh[31694]: E1119 13:24:06.332132       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:07 bootstrap bootkube.sh[31694]: E1119 13:24:07.440597       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:08 bootstrap bootkube.sh[31694]: E1119 13:24:08.497494       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:09 bootstrap bootkube.sh[31694]: E1119 13:24:09.522265       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:10 bootstrap bootkube.sh[31694]: E1119 13:24:10.597813       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
Nov 19 13:24:21 bootstrap bootkube.sh[31694]: E1119 13:24:21.630283       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: net/http: TLS handshake timeout
Nov 19 13:24:33 bootstrap bootkube.sh[31694]: E1119 13:24:33.132800       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: net/http: TLS handshake timeout
Nov 19 13:24:44 bootstrap bootkube.sh[31694]: E1119 13:24:44.168483       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: net/http: TLS handshake timeout
Nov 19 13:24:55 bootstrap bootkube.sh[31694]: E1119 13:24:55.276698       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: net/http: TLS handshake timeout
Nov 19 13:25:06 bootstrap bootkube.sh[31694]: E1119 13:25:06.414705       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: net/http: TLS handshake timeout
Nov 19 13:25:17 bootstrap bootkube.sh[31694]: E1119 13:25:17.438228       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.test.fritz.box:6443/api/v1/pods: EOF
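
The journal output above alternates between `EOF` and `TLS handshake timeout`, and it can help to see how often each one occurs per endpoint. Here is a small Python sketch for tallying them. The `tally_failures` helper and the regular expression are illustrative assumptions based only on the line format shown in this issue.

```python
import re
from collections import Counter

# Matches client errors of the form "Get https://host:port/path: <reason>",
# as they appear in the bootkube and kubelet lines quoted in this issue.
GET_ERROR = re.compile(r"Get (https://[^/\s]+)\S*: (.+)$")


def tally_failures(lines):
    """Count (endpoint, reason) pairs found in journal lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Drop trailing whitespace and any stray backslash left by copy and paste.
        match = GET_ERROR.search(line.rstrip().rstrip("\\"))
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Feeding it the saved output of `journalctl -b -u bootkube.service`, for example via `tally_failures(open("bootkube.log"))` where `bootkube.log` is a file name chosen for this example, gives a count for each combination of endpoint and error text.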

Masters 0/1/2:

Nov 19 13:33:51 master0 hyperkube[2876]: E1119 13:33:51.904677    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.008815    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: I1119 13:33:52.031523    2876 eviction_manager.go:229] eviction manager: synchronize housekeeping
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.038710    2876 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to get node info: node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.109972    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.215617    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.304643    2876 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://api-int.test.fritz.box:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dmaster0&limit=500&resourceVersion=0: EOF
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.319586    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.421361    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.523345    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.625008    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.727907    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.799153    2876 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://api-int.test.fritz.box:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster0&limit=500&resourceVersion=0: EOF
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.809021    2876 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://api-int.test.fritz.box:6443/api/v1/services?limit=500&resourceVersion=0: EOF
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.849490    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:52 master0 hyperkube[2876]: E1119 13:33:52.955039    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.056800    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.159608    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.262447    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: I1119 13:33:53.313235    2876 reflector.go:160] Listing and watching *v1.Pod from k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.374617    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.485333    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.600113    2876 kubelet.go:2276] node "master0" not found
Nov 19 13:33:53 master0 hyperkube[2876]: E1119 13:33:53.713770    2876 kubelet.go:2276] node "master0" not found
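
The master logs above reach the API through `api-int.test.fritz.box`, while the bootstrap logs use `api.test.fritz.box` and `etcd-N.test.fritz.box`. Since DNS is served by dnsmasq in this setup, one quick sanity check is whether all of these names resolve from each node. The Python sketch below covers only that. The helper names are invented, the default of three masters mirrors this issue, and it checks address records only, not any SRV records the installer may also need.

```python
import socket


def cluster_hostnames(cluster_domain, masters=3):
    """Host names that appear in the logs of this issue, for a given cluster domain."""
    names = [f"api.{cluster_domain}", f"api-int.{cluster_domain}"]
    names += [f"etcd-{i}.{cluster_domain}" for i in range(masters)]
    return names


def resolve_all(names, resolver=socket.gethostbyname_ex):
    """Map each name to its sorted IPv4 addresses, or to the lookup error text."""
    results = {}
    for name in names:
        try:
            # gethostbyname_ex returns (hostname, aliases, ip_addresses).
            results[name] = sorted(resolver(name)[2])
        except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
            results[name] = f"lookup failed: {exc}"
    return results
```

Running `resolve_all(cluster_hostnames("test.fritz.box"))` on the bootstrap node and on a master, then comparing the results with the load balancer and node addresses, would show whether a name is missing or points at the wrong machine.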

What you expected to happen?

The cluster starts successfully.
Services are established on the masters.

How to reproduce it

PXE Boot:
