
Cluster deployment hangs at "Failed to connect events watcher: Get https://mycluster-api.openshift.testing:6443/apikube-system/events? #721

Closed
jihed opened this issue Nov 23, 2018 · 8 comments


jihed commented Nov 23, 2018

Version

openshift-install version
openshift-install v0.3.0-250-g30bb25ac57d7c7d3dae519186cbfca9af8aeaca2
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html
/usr/local/bin/terraform-provider-libvirt -version
/usr/local/bin/terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 4.3.0
Using library: libvirt 4.3.0
Running hypervisor: QEMU 2.10.0
Running against daemon: 4.3.0

Platform (aws|libvirt|openstack):

Libvirt

What happened?

The installation is stuck at the API check:
level=warning msg="Failed to connect events watcher: Get https://mycluster-api.openshift.testing:6443/api/v1/namespaces/kube-system/events?resourceVersion=2176&watch=true: dial tcp 192.168.126.10:6443: connect: connection refused"
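For reference, the endpoint can be probed directly from the hypervisor; a quick check (assuming curl is available, with -k because the host does not trust the cluster CA):

# Probe the API endpoint from the hypervisor; a "connection refused" here
# matches the events-watcher error above.
curl -k https://mycluster-api.openshift.testing:6443/healthz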
etcd is running, but with an issue regarding the TLS certificate:

...
2018-11-23 09:52:56.673956 I | etcdserver: setting up the initial cluster version to 3.2
2018-11-23 09:52:56.674057 I | embed: ready to serve client requests
2018-11-23 09:52:56.674405 I | embed: serving client requests on [::]:2379
2018-11-23 09:52:56.677078 N | etcdserver/membership: set the initial cluster version to 3.2
2018-11-23 09:52:56.677356 I | etcdserver/api: enabled capabilities for version 3.2
WARNING: 2018/11/23 09:52:56 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-11-23 09:57:36.927049 W | etcdserver: apply entries took too long [136.407186ms for 1 entries]
2018-11-23 09:57:36.927368 W | etcdserver: avoid queries with large range/delete range!
2018-11-23 09:58:53.660410 W | etcdserver: apply entries took too long [368.197198ms for 1 entries]
2018-11-23 09:58:53.671038 W | etcdserver: avoid queries with large range/delete range!
2018-11-23 09:58:55.388843 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")
2018-11-23 09:58:55.591253 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")
2018-11-23 09:58:55.646246 W | etcdserver: apply entries took too long [322.505777ms for 2 entries]
2018-11-23 09:58:55.646335 W | etcdserver: avoid queries with large range/delete range!

The openshift-apiserver and openshift-kube-apiserver pods are dead.
Logs of openshift-apiserver:

crictl logs $(sudo crictl ps -a --pod=$(sudo crictl pods --name=openshift-apiserver --quiet) --quiet)
I1123 10:36:08.862280       1 cmd.go:128] Using service-serving-cert provided certificates
I1123 10:36:08.863163       1 observer_polling.go:93] Starting file observer
W1123 10:36:09.669411       1 authentication.go:237] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
F1123 10:36:09.669488       1 cmd.go:79] Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.3.0.1:443: connect: connection refused

However, on the openshift-kube-apiserver I see that it was not able to resolve etcd.kube-system.svc (I don't know if it's related):

I1123 10:32:37.389333       1 patch_handlerchain.go:63] Starting OAuth2 API at /oauth
F1123 10:32:47.395137       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.key /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.crt /etc/kubernetes/static-pod-resources/configmaps/etcd-serving-ca/ca-bundle.crt true false 1000 0xc421174bd0 <nil> 5m0s 1m0s}), err (dial tcp: lookup etcd.kube-system.svc on 192.168.126.1:53: no such host)
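For reference, the lookup failure can be reproduced from the master node against the resolver shown in the error (assuming dig is available; the service name and resolver IP are taken from the log above):

# Query the libvirt/dnsmasq resolver directly for the service name;
# an empty/NXDOMAIN answer matches the "no such host" error.
dig +short etcd.kube-system.svc @192.168.126.1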

What you expected to happen?

The cluster installation process finishes: the API is up and the worker nodes are created.

How to reproduce it (as minimally and precisely as possible)?

create-cluster mycluster

wking commented Nov 23, 2018

Sometimes the bad certificate errors are because you didn't clean up your state between runs. My other guess would be your kubelet certs expiring (#650).
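Roughly the kind of cleanup I mean, as a sketch (domain, volume, and directory names are illustrative; adjust them to whatever your previous run created):

# List and remove leftover libvirt domains and volumes from the old run,
# then drop the old installer asset directory before retrying.
virsh list --all
virsh destroy <domain> && virsh undefine <domain>
virsh vol-list default
virsh vol-delete --pool default <volume>
rm -rf <previous-install-dir>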


jihed commented Nov 24, 2018

I did clean up the state (rm -rf and virt-cleanup.sh) and ran the installer in a new dir.
I also tried hard-coding the ValidityThirtyMinutes value from 30 to 120 minutes and rebuilt the installer, but I still get the same message.

2018-11-24 01:23:24.199305 I | etcdserver: published {Name:etcd-member-cluster1-master-0 ClientURLs:[https://192.168.126.11:2379]} to cluster 3e61dfa82ee6f64e
2018-11-24 01:23:24.199528 I | embed: ready to serve client requests
2018-11-24 01:23:24.200137 I | embed: serving client requests on [::]:2379
WARNING: 2018/11/24 01:23:24 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-11-24 01:28:23.455573 W | etcdserver: apply entries took too long [667.598635ms for 1 entries]
2018-11-24 01:28:23.455804 W | etcdserver: avoid queries with large range/delete range!


wking commented Nov 26, 2018

I somehow missed that the issue is with etcd certs. In order to reach the event watcher, you need to have a reasonably happy etcd cluster. If you run that health check after seeing this TLS problem, is your etcd cluster still reporting itself as healthy? And when you connect to one of the etcd nodes on 2379, is it giving you a cert signed by the generated etcd CA?
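Something along these lines should show what is being served (a sketch; 192.168.126.11 is taken from your logs, and the issuer should be the generated etcd CA):

# Dump the certificate presented on 2379 and print its issuer/subject/dates.
echo | openssl s_client -connect 192.168.126.11:2379 -showcerts 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates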


jihed commented Nov 26, 2018

podman run --rm --network host --name etcdctl --env ETCDCTL_API=3 --volume /opt/tectonic/tls:/opt/tectonic/tls:ro,z quay.io/coreos/etcd /usr/local/bin/etcdctl --dial-timeout=10m --cacert=/opt/tectonic/tls/etcd-client-ca.crt --cert=/opt/tectonic/tls/etcd-client.crt --key=/opt/tectonic/tls/etcd-client.key --endpoints=https://192.168.126.11:2379 endpoint health
https://192.168.126.11:2379 is healthy: successfully committed proposal: took = 42.256423ms

It's healthy.


jihed commented Nov 26, 2018

 sudo crictl logs $(sudo crictl ps -a --pod=$(sudo crictl pods --name=openshift-apiserver --quiet) --quiet)
W1126 11:29:05.787963       1 cmd.go:132] Using insecure, self-signed certificates
I1126 11:29:05.788437       1 crypto.go:459] Generating new CA for openshift-cluster-openshift-apiserver-operator-signer@1543231745 cert, and key in /tmp/serving-cert-574829093/serving-signer.crt, /tmp/serving-cert-574829093/serving-signer.key
I1126 11:29:06.419884       1 observer_polling.go:93] Starting file observer
W1126 11:29:06.979326       1 authentication.go:237] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
F1126 11:29:06.979406       1 cmd.go:79] Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.3.0.1:443: connect: connection refused

On the hypervisor's dnsmasq, there's no A record for etcd.kube-system.svc:

F1126 11:31:39.365700       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.key /etc/kubernetes/static-pod-resources/secrets/etcd-client/tls.crt /etc/kubernetes/static-pod-resources/configmaps/etcd-serving-ca/ca-bundle.crt true false 1000 0xc42113b830 <nil> 5m0s 1m0s}), err (dial tcp: lookup etcd.kube-system.svc on 192.168.126.1:53: no such host)


abhinavdahiya commented Nov 26, 2018

Can you paste the output of oc get pods,nodes --all-namespaces?

On the hypervisor dnsmasq, there's no A record for etcd.kube-system.svc

etcd.kube-system.svc is a Kubernetes service that aggregated API servers use to talk to the etcd cluster, so it is resolved by the cluster DNS rather than the hypervisor's dnsmasq. You can check the state of the service using:
oc -n kube-system get svc etcd
oc -n kube-system get endpoints etcd


jihed commented Nov 26, 2018

The API pods are down; that's my original issue.


eparis commented Feb 19, 2019

This should be resolved by now.
