This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

incubator/etcd scaling and recovery not working #685

Closed
lwolf opened this issue Feb 18, 2017 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@lwolf
Collaborator

lwolf commented Feb 18, 2017

The etcd cluster does not recover if a pod is deleted.
The issue seems to be in rejoining nodes with the same name.

Steps to reproduce:

  1. Deploy etcd from incubator (needs a manual change from PetSet to StatefulSet)
  2. Check that the cluster is healthy
> $ kubectl exec -it factual-crocodile-etcd-0 etcdctl cluster-health
member 6a7fe15c528ef50d is healthy: got healthy result from http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
member 8167cc828aa0f298 is healthy: got healthy result from http://factual-crocodile-etcd-1.factual-crocodile-etcd:2379
member b3d79057b17efe5f is healthy: got healthy result from http://factual-crocodile-etcd-2.factual-crocodile-etcd:2379
cluster is healthy
  3. Delete any pod
> $ kubectl delete pod factual-crocodile-etcd-0
pod "factual-crocodile-etcd-0" deleted
  4. Watch its logs after it is recreated
> $ kubectl logs -f factual-crocodile-etcd-0
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
sh: al-crocodile-etcd-0: bad number
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-1.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-2.factual-crocodile-etcd to come up
2017-02-18 14:44:40.028499 I | etcdmain: etcd Version: 2.2.5
2017-02-18 14:44:40.028612 I | etcdmain: Git SHA: bc9ddf2
2017-02-18 14:44:40.028617 I | etcdmain: Go Version: go1.5.3
2017-02-18 14:44:40.028623 I | etcdmain: Go OS/Arch: linux/amd64
2017-02-18 14:44:40.028632 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-02-18 14:44:40.033004 I | etcdmain: listening for peers on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.033062 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2017-02-18 14:44:40.033838 I | etcdmain: listening for client requests on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.052563 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.053037 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.054740 I | etcdserver: name = factual-crocodile-etcd-0
2017-02-18 14:44:40.054756 I | etcdserver: data dir = /var/run/etcd/default.etcd
2017-02-18 14:44:40.054762 I | etcdserver: member dir = /var/run/etcd/default.etcd/member
2017-02-18 14:44:40.054766 I | etcdserver: heartbeat = 100ms
2017-02-18 14:44:40.054769 I | etcdserver: election = 1000ms
2017-02-18 14:44:40.054773 I | etcdserver: snapshot count = 10000
2017-02-18 14:44:40.054901 I | etcdserver: advertise client URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.054911 I | etcdserver: initial advertise peer URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.054925 I | etcdserver: initial cluster = factual-crocodile-etcd-0=http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380,factual-crocodile-etcd-1=http://factual-crocodile-etcd-1.factual-crocodile-etcd:2380,factual-crocodile-etcd-2=http://factual-crocodile-etcd-2.factual-crocodile-etcd:2380
2017-02-18 14:44:40.084529 I | etcdserver: starting member 6a7fe15c528ef50d in cluster d141f24a7d5c19f9
2017-02-18 14:44:40.084623 I | raft: 6a7fe15c528ef50d became follower at term 0
2017-02-18 14:44:40.084638 I | raft: newRaft 6a7fe15c528ef50d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-02-18 14:44:40.084645 I | raft: 6a7fe15c528ef50d became follower at term 1
2017-02-18 14:44:40.100290 E | rafthttp: failed to dial 8167cc828aa0f298 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.100367 E | rafthttp: failed to dial 8167cc828aa0f298 on stream Message (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.101500 I | etcdserver: starting server... [version: 2.2.5, cluster version: to_be_decided]
2017-02-18 14:44:40.102363 E | etcdserver: the member has been permanently removed from the cluster
2017-02-18 14:44:40.102389 I | etcdserver: the data-dir used by this member must be removed.
2017-02-18 14:44:40.102800 E | rafthttp: failed to dial b3d79057b17efe5f on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2017-02-18 14:44:40.103008 E | rafthttp: failed to dial b3d79057b17efe5f on stream Message (net/http: request canceled while waiting for connection)

On the next restart it shows only this:

> $ kubectl logs -f factual-crocodile-etcd-0
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Meanwhile, the logs of the other nodes show something like this:

2017-02-18 14:44:40.098380 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed
2017-02-18 14:44:40.098469 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed

There is a similar problem with scaling down and up: after scaling down, it is not possible to scale back up, since the new pod is unable to rejoin.
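
For what it's worth, the two log lines "the member has been permanently removed from the cluster" and "the data-dir used by this member must be removed" point at a manual recovery path. The following is only a sketch using the v2 etcdctl bundled in the image; the pod and service names match this reproduction, and the member ID has to be taken from your own member list output:

# Check from a surviving pod whether the old member is still registered
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member list
# If the old member ID is still listed, remove it explicitly
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member remove 6a7fe15c528ef50d
# Wipe the stale data dir on the recreated pod's volume, since etcd refuses
# to start with data belonging to a removed member
$ kubectl exec factual-crocodile-etcd-0 -- rm -rf /var/run/etcd/default.etcd
# Re-add the member under the same name and peer URL, then restart the pod
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member add factual-crocodile-etcd-0 http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
$ kubectl delete pod factual-crocodile-etcd-0

If the recreated pod never becomes ready, the data dir may have to be wiped on the persistent volume directly rather than via kubectl exec.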

@lachie83
Contributor

lachie83 commented Jun 5, 2017

Is this still an issue?

@lwolf
Collaborator Author

lwolf commented Jun 9, 2017

As far as I can see, no changes were made to the chart, so yes, it still is.

@phyrwork

phyrwork commented Nov 2, 2017

I am experiencing a similar issue.

Re-joining etcd member
client: etcd cluster is unavailable or misconfigured

I have a 3-node etcd cluster on top of a 3-node GKE container cluster on preemptible nodes. By design, I expect to lose etcd pods every now and then, have new ones spun up, and have the cluster recover.

It's not happening.

I don't know etcd well enough to understand and fix the problem. I would, however, like it to work.

I am happy to help collect logs/data if someone with more etcd background is interested in taking a stab at a fix.
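
A minimal set of commands to collect that data would be something like the following; the pod names here are assumptions (substitute the pod names of your own release), and etcdctl is the v2 client shipped in the chart's image:

# Cluster view from one of the surviving pods
$ kubectl exec etcd-0 -- etcdctl cluster-health
$ kubectl exec etcd-0 -- etcdctl member list
# Logs of the pod that fails to rejoin, including the previous crashed container
$ kubectl logs etcd-1
$ kubectl logs etcd-1 --previous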

@Elexy

Elexy commented Nov 27, 2017

@lwolf @phyrwork @lachie83 This PR should fix that: #2864

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 26, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@aperrot42

Seems to still be an issue...

@pcornelissen

and it's still an issue :-(

Waiting for etcd-0.etcd to come up
Waiting for etcd-1.etcd to come up
Waiting for etcd-2.etcd to come up
ping: bad address 'etcd-2.etcd'
Waiting for etcd-2.etcd to come up
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

(Just copied from my cluster)

@lbornov2

I get this in AWS, but not in GKE.

@miguelaferreira

For anyone landing here looking for a workaround: in my case, all it took to get the pods to join the cluster was to delete all of them. When they restart, they join each other and the cluster forms again.
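
For reference, assuming a three-replica release whose pods are named etcd-0 through etcd-2 (substitute your own pod names), that workaround looks roughly like this:

# Delete every etcd pod; the StatefulSet recreates them
$ kubectl delete pod etcd-0 etcd-1 etcd-2
# Watch them come back, then confirm they re-formed the cluster
$ kubectl get pods -w
$ kubectl exec etcd-0 -- etcdctl cluster-health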

@rajeshneo

Workaround for this without losing your data or recreating your whole cluster:
Use helm to scale down your cluster by 1 node:
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Scale it back up and voila :)
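
Spelled out, and assuming a 3-replica cluster with the release name etcd used in the command above (pod names depend on your release), the cycle would be roughly:

# Scale down by one replica and wait for the rolling restart to settle
$ helm upgrade etcd incubator/etcd --set replicas=2
# Then scale back up to the original size
$ helm upgrade etcd incubator/etcd --set replicas=3
# Verify from any pod once they are all running again
$ kubectl exec <etcd-pod-0> -- etcdctl cluster-health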
