
Issue with etcd during OCP 4.3 installation on vSphere #3028

Closed
ac06012014 opened this issue Jan 30, 2020 · 27 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ac06012014

Hi there,

Version

OCP 4.3

Platform: vSphere 6.7 U3

What happened?

I'm hitting this issue on the bootstrap/master nodes even though I've followed all of the prerequisites.

journalctl -b -f -u bootkube.service
desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.23.3.55:2379: connect: connection refused""}
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: https://etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jan 30 18:38:46 master0 bootkube.sh[2802]: Error: unhealthy cluster

Do you have any ideas, please?
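Not part of the original report, but a quick way to narrow down errors like the ones above is to probe the etcd client port directly: "connection refused" means the host answered but nothing is listening, while "no route to host" usually points at firewall or routing problems. A minimal sketch (the member hostnames below are this cluster's and are assumptions for illustration):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A refused connection or a timeout both return False; the
    distinction between the two shows up in the raised OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical member names; substitute your own etcd FQDNs.
    for i in range(3):
        host = f"etcd-{i}.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net"
        print(f"{host}:2379 reachable: {port_open(host, 2379)}")
```

Running this from the bootstrap node against each master tells you whether the problem is network-level or whether etcd simply never started listening.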

@abhinavdahiya
Contributor

Can you provide the log bundle from openshift-install gather bootstrap?

@ac06012014
Author

Port 2379 is not bound on master0, master1, and master2. (Same issue with OCP 4.2.)

Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.64:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.63:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: https://etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net:2379 is unhealthy: failed to connect: dial tcp 172.23.3.62:2379: connect: no route to host
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: Error: unhealthy cluster
Jan 31 10:21:28 bootstrap bootkube.sh[1782]: etcdctl failed. Retrying in 5 seconds...

@ac06012014
Author

Ping @abhinavdahiya. Do you have any news, please?

@ac06012014
Author

It works with BIND, so it's probably an issue with dnsmasq. I'll keep you posted.
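For anyone who wants to stay on dnsmasq rather than switch to BIND: dnsmasq can serve the same SRV records with srv-host entries. A hedged sketch using this cluster's domain (the file path is hypothetical; the srv-host argument order is service, target, port, priority, weight):

```
# Hypothetical file: /etc/dnsmasq.d/ocp4.conf
# srv-host=<_service._proto.domain>,<target>,<port>,<priority>,<weight>
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
srv-host=_etcd-server-ssl._tcp.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net,2380,0,10
```

Note the targets here are absolute names, so the trailing-dot concern from BIND zone files does not apply in this syntax.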

@jameslabocki

@ac06012014 - did you ever figure out the root cause of this? I'm having the same issue on bare metal.

@ac06012014
Author

ac06012014 commented Mar 22, 2020

Yes, it was due to etcd SRV record resolution. This project can help you: https://github.com/RedHatOfficial/ocp4-helpernode

You can view the BIND DNS entries there.

@alfredzoto

Hi,

I am having the same issue. My etcd SRV entries are as below
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-0.ocp4.example.com.
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-1.ocp4.example.com.
_etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-2.ocp4.example.com.

They look fine, but I keep getting that error. Any ideas?
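Not from the thread, but a small sanity-checker for zone lines like the ones above can catch the two classic SRV mistakes: the wrong port, and a missing trailing dot on the target (which makes BIND treat it as a relative name). A sketch, assuming BIND-style one-record-per-line input:

```python
import re

# Matches BIND-style SRV lines such as:
#   _etcd-server-ssl._tcp.ocp4 IN SRV 0 10 2380 etcd-0.ocp4.example.com.
SRV_RE = re.compile(
    r"^(?P<name>\S+)\s+IN\s+SRV\s+(?P<priority>\d+)\s+(?P<weight>\d+)\s+"
    r"(?P<port>\d+)\s+(?P<target>\S+)$"
)

def check_srv(line: str) -> list:
    """Return a list of problems found in one SRV zone-file line."""
    m = SRV_RE.match(line.strip())
    if not m:
        return ["line does not parse as an SRV record"]
    problems = []
    if int(m.group("port")) != 2380:
        problems.append(
            "etcd peer SRV records should use port 2380, not " + m.group("port")
        )
    if not m.group("target").endswith("."):
        problems.append(
            "target lacks a trailing dot, so it is a relative name"
        )
    return problems
```

For example, the records quoted above pass cleanly, while the same line without the final dot would be flagged.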

@AlekseyUsov

Facing the same problem with 4.3. The SRV records are absolutely fine; in fact, I did a successful install just 41 days ago. Has anyone found a solution yet?

@alfredzoto

@AlekseyUsov what version did you manage to install?

@AlekseyUsov

@alfredzoto 4.3.0.

@AlekseyUsov

@ac06012014 Just checked: all the records are in place. They are absolutely identical to those of the successfully installed cluster; the only difference is time. Something must have broken between then and now.
Just in case, below are CoreOS details:

VERSION="43.81.202001142154.0"
VERSION_ID="4.3"
OPENSHIFT_VERSION="4.3"
RHEL_VERSION=8.0
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 43.81.202001142154.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.3"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.3"
OSTREE_VERSION='43.81.202001142154.0'

Again, they are completely identical to the previously deployed cluster.

@alfredzoto

All my configuration seems OK. Please see the attached DNS and HAProxy files.
In my case the error is as below:

Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-2.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-0.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-1.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded

My understanding is that port 2379 is not open, or something along those lines.

haproxy.cfg.txt
fwd.example.com.txt
64.168.192.txt

@AlekseyUsov

@alfredzoto Yes, it's configured exactly the same way in my environment. The port is not the issue, as the subnet is the same as the already installed cluster's. It's also not that the etcd members can't connect to each other: the etcd processes just won't start, and the logs are completely useless.
Thanks for trying, though.

@alfredzoto

@AlekseyUsov So the same configuration used to work somehow, but now it doesn't.
I guess I'll experiment with an older version.

@AlekseyUsov

@alfredzoto Exactly. I was very careful not to change anything from the recent installation, just to make sure I didn't introduce any unknowns. So it seems like they were introduced somewhere else.

@ac06012014
Author

Here are my DNS entries:
; The SRV records are IMPORTANT....make sure you get these right...note the trailing dot at the end...
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-0.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-1.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.
_etcd-server-ssl._tcp IN SRV 0 10 2380 etcd-2.vmware.cpod-aca-ocp.az-rbx.cloud-garage.net.

@alfredzoto

Do you think the SRV records should point at port 2379 instead of 2380, given that the error I'm getting is:

Mar 30 09:02:05 bootstrap.ocp4.example.com bootkube.sh[1434]: https://etcd-2.ocp4.example.com:2379 is unhealthy: failed to commit proposal: context deadline exceeded

@AlekseyUsov

@alfredzoto I don't think so: 2380/tcp is used for peer-to-peer communication, while client requests use 2379/tcp. My understanding is that the etcd members first form a quorum over 2380/tcp, and then the bootstrap process tries to contact each of them individually over 2379/tcp.
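To make that two-port split concrete: probing both ports on each member separates "peers never reached each other" (2380 closed) from "quorum may be up but clients can't connect" (2379 closed). A minimal sketch, not from the thread, with hypothetical hostnames:

```python
import socket

# Standard etcd port assignments: 2380 for peer traffic, 2379 for clients.
ETCD_PORTS = {"peer": 2380, "client": 2379}

def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def member_status(host: str) -> dict:
    """Map each port role ('peer'/'client') to reachability for one member."""
    return {role: probe(host, port) for role, port in ETCD_PORTS.items()}

if __name__ == "__main__":
    # Hypothetical names; substitute your own etcd FQDNs.
    for i in range(3):
        print(f"etcd-{i}:", member_status(f"etcd-{i}.ocp4.example.com"))
```

If both ports are closed on every member, the etcd static pods likely never started at all, which matches the symptom described above.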

@SteeleDesmond

Having the same issue as well. I didn't have it with 4.2, and I used an install script for both the 4.2 and 4.3 installs. Here is part of the bootstrap output from the journalctl command:

Apr 01 15:48:01 bootstrap bootkube.sh[2881]: Error: unhealthy cluster
Apr 01 15:48:01 bootstrap podman[6510]: 2020-04-01 15:48:01.128193206 +0000 UTC m=+5.483432628 container died 4c4a92f873da2b4661720254f0a81c59db2ef3e7ca45da2ea7259f7e512f294c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:01 bootstrap podman[6510]: 2020-04-01 15:48:01.178457356 +0000 UTC m=+5.533696791 container remove 4c4a92f873da2b4661720254f0a81c59db2ef3e7ca45da2ea7259f7e512f294c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:01 bootstrap bootkube.sh[2881]: etcdctl failed. Retrying in 5 seconds...
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.316992269 +0000 UTC m=+0.111921637 container create 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.67172174 +0000 UTC m=+0.466651153 container init 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.688797966 +0000 UTC m=+0.483727354 container start 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:06 bootstrap podman[6600]: 2020-04-01 15:48:06.688918499 +0000 UTC m=+0.483847922 container attach 0ae68d1c4d20b50315281d754b2efe1f672fb9c34d5a1238b1066a2f51d86936 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:742b729094abbc4ebeaf62323e9393b0d6bf06606d4fe349e8458f9191d9905a, name=etcdctl)
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-42a4ab3e-7a5a-4da4-a1af-9b7641ee1329/etcd-2.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.102:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-ad421969-4ebb-4754-81b6-627096c23e80/etcd-1.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.101:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: {"level":"warn","ts":"2020-04-01T15:48:11.702Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7649f7e7-22ae-498e-8f34-2ac027d640cf/etcd-0.srd.ocp.csplab.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.18.5.100:2379: connect: no route to host\""}
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-2.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-1.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: https://etcd-0.srd.ocp.csplab.local:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 01 15:48:11 bootstrap bootkube.sh[2881]: Error: unhealthy cluster

@DustinTrap

Note: this also occurs with VMware vSphere 6.7 U2.

@DustinTrap

SOLVED: https://github.com/vchintal/ocp4-vsphere-upi-automation/issues/12#issuecomment-612164871

Read the note above to see how I solved this with the help of the great @jimbarlow (https://github.com/jimbarlow).

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 9, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

9 participants