
Cluster deployment hangs at "Failed to connect events watcher: Get https://mycluster-api.openshift.testing:6443/apikube-system/events? #721

Closed
jihed opened this issue Nov 23, 2018 · 8 comments


jihed commented Nov 23, 2018

Version

openshift-install version
openshift-install v0.3.0-250-g30bb25ac57d7c7d3dae519186cbfca9af8aeaca2
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html
/usr/local/bin/terraform-provider-libvirt -version
/usr/local/bin/terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 4.3.0
Using library: libvirt 4.3.0
Running hypervisor: QEMU 2.10.0
Running against daemon: 4.3.0

Platform (aws|libvirt|openstack):

Libvirt

What happened?

The installation is stuck at the API check:
level=warning msg="Failed to connect events watcher: Get https://mycluster-api.openshift.testing:6443/api/v1/namespaces/kube-system/events?resourceVersion=2176&watch=true: dial tcp 192.168.126.10:6443: connect: connection refused"
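For reference, the endpoint can be probed directly from the hypervisor; a quick check (assuming curl is available, with -k because the host does not trust the cluster CA):

# Probe the API endpoint from the hypervisor; a "connection refused" here
# matches the events-watcher error above.
curl -k https://mycluster-api.openshift.testing:6443/healthz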
etcd is running, but with an issue regarding the TLS certificate:

...
2018-11-23 09:52:56.673956 I | etcdserver: setting up the initial cluster version to 3.2
2018-11-23 09:52:56.674057 I | embed: ready to serve client requests
2018-11-23 09:52:56.674405 I | embed: serving client requests on [::]:2379
2018-11-23 09:52:56.677078 N | etcdserver/membership: set the initial cluster version to 3.2
2018-11-23 09:52:56.677356 I | etcdserver/api: enabled capabilities for version 3.2
WARNING: 2018/11/23 09:52:56 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-11-23 09:57:36.927049 W | etcdserver: apply entries took too long [136.407186ms for 1 entries]
2018-11-23 09:57:36.927368 W | etcdserver: avoid queries with large range/delete range!
2018-11-23 09:58:53.660410 W | etcdserver: apply entries took too long [368.197198ms for 1 entries]
2018-11-23 09:58:53.671038 W | etcdserver: avoid queries with large range/delete range!
2018-11-23 09:58:55.388843 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")
2018-11-23 09:58:55.591253 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")
2018-11-23 09:58:55.646246 W | etcdserver: apply entries took too long [322.505777ms for 2 entries]
2018-11-23 09:58:55.646335 W | etcdserver: avoid queries with large range/delete range!

The openshift-apiserver and openshift-kube-apiserver pods are dead.
Logs of openshift-apiserver:

crictl logs $(sudo crictl ps -a --pod=$(sudo crictl pods --name=openshift-apiserver --quiet) --quiet)
I1123 10:36:08.862280       1 cmd.go:128] Using service-serving-cert provided certificates
I1123 10:36:08.863163       1 observer_polling.go:93] Starting file observer
W1123 10:36:09.669411       1 authentication.go:237] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
F1123 10:36:09.669488       1 cmd.go:79] Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.3.0.1:443: connect: connection refused

However, on the openshift-kube-apiserver I see that it was not able to resolve etcd.kube-system.svc (I don't know if it's related):

I1123 10:32:37.389333       1 patch_handlerchain.go:63] Starting OAuth2 API at /oauth
F1123 10:32:47.395137       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.key /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.crt /etc/kubernetes/static-pod-resources/configmaps/etcd-serving-ca/ca-bundle.crt true false 1000 0xc421174bd0 <nil> 5m0s 1m0s}), err (dial tcp: lookup etcd.kube-system.svc on 192.168.126.1:53: no such host)
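For reference, the lookup failure can be reproduced from the master node against the resolver shown in the error (assuming dig is available; the service name and resolver IP are taken from the log above):

# Query the libvirt/dnsmasq resolver directly for the service name;
# an empty/NXDOMAIN answer matches the "no such host" error.
dig +short etcd.kube-system.svc @192.168.126.1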

What you expected to happen?

The cluster installation process finishes: the API is up and the worker nodes are created.

How to reproduce it (as minimally and precisely as possible)?

create-cluster mycluster

wking commented Nov 23, 2018

Sometimes the bad certificate errors are because you didn't clean up your state between runs. My other guess would be your kubelet certs expiring (#650).
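Roughly the kind of cleanup I mean, as a sketch (domain, volume, and directory names are illustrative; adjust them to whatever your previous run created):

# List and remove leftover libvirt domains and volumes from the old run,
# then drop the old installer asset directory before retrying.
virsh list --all
virsh destroy <domain> && virsh undefine <domain>
virsh vol-list default
virsh vol-delete --pool default <volume>
rm -rf <previous-install-dir>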


jihed commented Nov 24, 2018

I did clean up the state (rm -rf and virt-cleanup.sh) and ran the installer in a new dir.
I also tried hard-coding the ValidityThirtyMinutes value from 30 to 120 minutes and rebuilt the installer, but I still get the same message.

2018-11-24 01:23:24.199305 I | etcdserver: published {Name:etcd-member-cluster1-master-0 ClientURLs:[https://192.168.126.11:2379]} to cluster 3e61dfa82ee6f64e
2018-11-24 01:23:24.199528 I | embed: ready to serve client requests
2018-11-24 01:23:24.200137 I | embed: serving client requests on [::]:2379
WARNING: 2018/11/24 01:23:24 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-11-24 01:28:23.455573 W | etcdserver: apply entries took too long [667.598635ms for 1 entries]
2018-11-24 01:28:23.455804 W | etcdserver: avoid queries with large range/delete range!


wking commented Nov 26, 2018

I somehow missed that the issue is with etcd certs. In order to reach the event watcher, you need to have a reasonably happy etcd cluster. If you run that health check after seeing this TLS problem, is your etcd cluster still reporting itself as healthy? And when you connect to one of the etcd nodes on 2379, is it giving you a cert signed by the generated etcd CA?
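Something along these lines should show what is being served (a sketch; 192.168.126.11 is taken from your logs, and the issuer should be the generated etcd CA):

# Dump the certificate presented on 2379 and print its issuer/subject/dates.
echo | openssl s_client -connect 192.168.126.11:2379 -showcerts 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates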


jihed commented Nov 26, 2018

podman run --rm --network host --name etcdctl --env ETCDCTL_API=3 --volume /opt/tectonic/tls:/opt/tectonic/tls:ro,z quay.io/coreos/etcd /usr/local/bin/etcdctl --dial-timeout=10m --cacert=/opt/tectonic/tls/etcd-client-ca.crt --cert=/opt/tectonic/tls/etcd-client.crt --key=/opt/tectonic/tls/etcd-client.key --endpoints=https://192.168.126.11:2379 endpoint health
https://192.168.126.11:2379 is healthy: successfully committed proposal: took = 42.256423ms

It's healthy.


jihed commented Nov 26, 2018

 sudo crictl logs $(sudo crictl ps -a --pod=$(sudo crictl pods --name=openshift-apiserver --quiet) --quiet)
W1126 11:29:05.787963       1 cmd.go:132] Using insecure, self-signed certificates
I1126 11:29:05.788437       1 crypto.go:459] Generating new CA for openshift-cluster-openshift-apiserver-operator-signer@1543231745 cert, and key in /tmp/serving-cert-574829093/serving-signer.crt, /tmp/serving-cert-574829093/serving-signer.key
I1126 11:29:06.419884       1 observer_polling.go:93] Starting file observer
W1126 11:29:06.979326       1 authentication.go:237] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
F1126 11:29:06.979406       1 cmd.go:79] Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.3.0.1:443: connect: connection refused

On the hypervisor's dnsmasq, there's no A record for etcd.kube-system.svc:

F1126 11:31:39.365700       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.key /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.crt /etc/kubernetes/static-pod-resources/configmaps/etcd-serving-ca/ca-bundle.crt true false 1000 0xc42113b830 <nil> 5m0s 1m0s}), err (dial tcp: lookup etcd.kube-system.svc on 192.168.126.1:53: no such host)


abhinavdahiya commented Nov 26, 2018

Can you paste the output of oc get pods,nodes --all-namespaces?

On the hypervisor dnsmasq, there's no A record for etcd.kube-system.svc

etcd.kube-system.svc is a Kubernetes service that aggregated API servers use to talk to the etcd cluster, so it is resolved by the cluster DNS rather than the hypervisor's dnsmasq. You can check the state of the service using:
oc -n kube-system get svc etcd
oc -n kube-system get endpoints etcd


jihed commented Nov 26, 2018

The API pods are down; that's my original issue.


eparis commented Feb 19, 2019

This should be resolved by now.
