
Issue with etcd during OCP 4.3 installation on vSphere #3028

Closed
ac06012014 opened this issue Jan 30, 2020 · 27 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ac06012014

Hi there,

Version

OCP 4.3

Platform: vSphere 6.7 U3

What happened?

I'm hitting this issue on the bootstrap/master nodes even though I've followed all of the prerequisites.

journalctl -b -f -u bootkube.service
desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.23.3.55:2379: connect: connection refused""}
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: Error: unhealthy cluster

Do you have any ideas, please?
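Not part of the original report, but a quick way to narrow down errors like the ones above is to probe the etcd client port directly: "connection refused" means the host answered but nothing is listening, while "no route to host" usually points at firewall or routing problems. A minimal sketch (the member hostnames below are this cluster's and are assumptions for illustration):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A refused connection or a timeout both return False; the
    distinction between the two shows up in the raised OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical member names; substitute your own etcd FQDNs.
    for i in range(3):
        host = f"etcd-{i}.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net"
        print(f"{host}:2379 reachable: {port_open(host, 2379)}")
```

Running this from the bootstrap node against each master tells you whether the problem is network-level or whether etcd simply never started listening.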

@abhinavdahiya
Contributor

Can you provide the log bundle from openshift-install gather bootstrap?

@ac06012014
Author

Port 2379 is not bound on master0, master1, and master2. (Same issue with OCP 4.2.)

Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.64:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.63:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.62:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: Error: unhealthy cluster
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: etcdctl failed. Retrying in 5 seconds...

@ac06012014
Author

Ping @abhinavdahiya. Do you have any news, please?

@ac06012014
Author

It works with BIND, so it's probably an issue with dnsmasq. I'll keep you posted.
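For anyone who wants to stay on dnsmasq rather than switch to BIND: dnsmasq can serve the same SRV records with srv-host entries. A hedged sketch using this cluster's domain (the file path is hypothetical; the srv-host argument order is service, target, port, priority, weight):

```
# Hypothetical file: /etc/dnsmasq.d/ocp4.conf
# srv-host=<_service._proto.domain>,<target>,<port>,<priority>,<weight>
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
```

Note the targets here are absolute names, so the trailing-dot concern from BIND zone files does not apply in this syntax.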

@jameslabocki

@ac06012014 - did you ever figure out the root cause of this? I'm having the same issue on bare metal.

@ac06012014
Author

ac06012014 commented Mar 22, 2020

Yes, it was due to etcd SRV record resolution. This project can help you: https://github.com/RedHatOfficial/ocp4-helpernode

You can view the BIND DNS entries there.

@alfredzoto

Hi,

I am having the same issue. My etcd SRV entries are as below
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-0.ocp4.example.com.
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-1.ocp4.example.com.
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-2.ocp4.example.com.

They look fine, but I keep getting that error. Any ideas?
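Not from the thread, but a small sanity-checker for zone lines like the ones above can catch the two classic SRV mistakes: the wrong port, and a missing trailing dot on the target (which makes BIND treat it as a relative name). A sketch, assuming BIND-style one-record-per-line input:

```python
import re

# Matches BIND-style SRV lines such as:
#   _etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-0.ocp4.example.com.
SRV_RE = re.compile(
    r"^(?P<name>\S+)\s+IN\s+SRV\s+(?P<priority>\d+)\s+(?P<weight>\d+)\s+"
    r"(?P<port>\d+)\s+(?P<target>\S+)$"
)

def check_srv(line: str) -> list:
    """Return a list of problems found in one SRV zone-file line."""
    m = SRV_RE.match(line.strip())
    if not m:
        return ["line does not parse as an SRV record"]
    problems = []
    if int(m.group("port")) != 2380:
        problems.append(
            "etcd peer SRV records should use port 2380, not " + m.group("port")
        )
    if not m.group("target").endswith("."):
        problems.append(
            "target lacks a trailing dot, so it is a relative name"
        )
    return problems
```

For example, the records quoted above pass cleanly, while the same line without the final dot would be flagged.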

@AlekseyUsov

Facing the same problem with 4.3. The SRV records are absolutely fine; in fact, I did a successful install just 41 days ago. Has anyone found a solution yet?

@alfredzoto

@AlekseyUsov what version did you manage to install?

@AlekseyUsov

@alfredzoto 4.3.0.

@AlekseyUsov

@ac06012014 Just checked: all the records are in place. They are absolutely identical to those of the successfully installed cluster; the only difference is time. Something must have broken between then and now.
Just in case, below are CoreOS details:

VERSION="43.81.202001142154.0"
VERSION_ID="4.3"
OPENSHIFT_VERSION="4.3"
RHEL_VERSION=8.0
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 43.81.202001142154.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.3"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.3"
OSTREE_VERSION='43.81.202001142154.0'

Again, they are completely identical to the previously deployed cluster.

@alfredzoto

All my configuration seems OK. Please see the attached DNS and HAProxy files.
In my case the error is as below:

Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-2.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-0.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-1.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded

My understanding is that port 2379 is not open, or something along those lines.

haproxy.cfg.txt
fwd.example.com.txt
64.168.192.txt

@AlekseyUsov

@alfredzoto Yes, it's configured exactly the same way in my environment. The port is not the issue, as the subnet is the same as the already installed cluster's. It's also not that the etcd members can't connect to each other: the etcd processes just won't start, and the logs are completely useless.
Thanks for trying, though.

@alfredzoto

@AlekseyUsov So the same configuration used to work somehow, but now it doesn't.
I guess I'll experiment with an older version.

@AlekseyUsov

@alfredzoto Exactly. I was very careful not to change anything from the recent installation, just to make sure I didn't introduce any unknowns. So it seems like they were introduced somewhere else.

@ac06012014
Author

Here are my DNS entries:
; The SRV records are IMPORTANT....make sure you get these right...note the trailing dot at the end...
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.

@alfredzoto

Do you think the SRV records should point at port 2379 instead of 2380, given that the error I'm getting is:

Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-2.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded

@AlekseyUsov

@alfredzoto I don't think so: 2380/tcp is used for peer-to-peer communication, while client requests use 2379/tcp. My understanding is that the etcd members first form a quorum over 2380/tcp, and then the bootstrap process tries to contact each of them individually over 2379/tcp.
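To make that two-port split concrete: probing both ports on each member separates "peers never reached each other" (2380 closed) from "quorum may be up but clients can't connect" (2379 closed). A minimal sketch, not from the thread, with hypothetical hostnames:

```python
import socket

# Standard etcd port assignments: 2380 for peer traffic, 2379 for clients.
ETCD_PORTS = {"peer": 2380, "client": 2379}

def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def member_status(host: str) -> dict:
    """Map each port role ('peer'/'client') to reachability for one member."""
    return {role: probe(host, port) for role, port in ETCD_PORTS.items()}

if __name__ == "__main__":
    # Hypothetical names; substitute your own etcd FQDNs.
    for i in range(3):
        print(f"etcd-{i}:", member_status(f"etcd-{i}.ocp4.example.com"))
```

If both ports are closed on every member, the etcd static pods likely never started at all, which matches the symptom described above.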

@SteeleDesmond

Having the same issue as well. I didn't have it with 4.2, and I used an install script for both the 4.2 and 4.3 installs. Here is part of the bootstrap output from the journalctl command:

Apr 01 15:48:01 bootstrap bootkube.sh[2881]: Error: unhealthy cluster
Apr 01 15:48:01 bootstrap podman[6510]: 2020-04-01 15:48:01.128193206 +0000 UTC m=+5.483432628 container died 4c4a92f873da2b4661720254f0a81c59db2ef3e7ca45da2ea7259f7e512f294c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:01 bootstrap podman[6510]: 2020-04-01 15:48:01.178457356 +0000 UTC m=+5.533696791 container remove 4c4a92f873da2b4661720254f0a81c59db2ef3e7ca45da2ea7259f7e512f294c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:01 bootstrap bootkube.sh[2881]: etcdctl failed. Retrying in 5 seconds...
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.316992269 +0000 UTC m=+0.111921637 container create 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.67172174 +0000 UTC m=+0.466651153 container init 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.688797966 +0000 UTC m=+0.483727354 container start 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.688918499 +0000 UTC m=+0.483847922 container attach 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-42a4ab3e-7a5a-4da4-a1af-9b7641ee1329/etcd-2.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.102:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-ad421969-4ebb-4754-81b6-627096c23e80/etcd-1.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.101:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7649f7e7-22ae-498e-8f34-2ac027d640cf/etcd-0.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.100:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-2.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-1.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-0.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: Error: unhealthy cluster

@DustinTrap

Note: this also occurs with VMware vSphere 6.7 U2.

@DustinTrap

SOLVED: https://github.com/vchintal/ocp4-vsphere-upi-automation/issues/12#issuecomment-612164871

Read the note above to see how I solved this with the help of the great @jimbarlow (https://github.com/jimbarlow).

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 9, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

9 participants