This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

incubator/etcd scaling and recovery not working #685

Closed
lwolf opened this issue Feb 18, 2017 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@lwolf
Collaborator

lwolf commented Feb 18, 2017

The etcd cluster does not recover if a pod is deleted.
The issue seems to be in rejoining nodes with the same name.

Steps to reproduce:

  1. Deploy etcd from incubator (needs a manual change from PetSet to StatefulSet)
  2. Check that the cluster is healthy
> $ kubectl exec -it factual-crocodile-etcd-0 etcdctl cluster-health
member 6a7fe15c528ef50d is healthy: got healthy result from http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
member 8167cc828aa0f298 is healthy: got healthy result from http://factual-crocodile-etcd-1.factual-crocodile-etcd:2379
member b3d79057b17efe5f is healthy: got healthy result from http://factual-crocodile-etcd-2.factual-crocodile-etcd:2379
cluster is healthy
  3. Delete any pod
> $ kubectl delete pod factual-crocodile-etcd-0
pod "factual-crocodile-etcd-0" deleted
  4. Watch its logs after it is recreated
> $ kubectl logs -f factual-crocodile-etcd-0
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
sh: al-crocodile-etcd-0: bad number
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-1.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-2.factual-crocodile-etcd to come up
2017-02-18 14:44:40.028499 I | etcdmain: etcd Version: 2.2.5
2017-02-18 14:44:40.028612 I | etcdmain: Git SHA: bc9ddf2
2017-02-18 14:44:40.028617 I | etcdmain: Go Version: go1.5.3
2017-02-18 14:44:40.028623 I | etcdmain: Go OS/Arch: linux/amd64
2017-02-18 14:44:40.028632 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-02-18 14:44:40.033004 I | etcdmain: listening for peers on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.033062 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2017-02-18 14:44:40.033838 I | etcdmain: listening for client requests on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.052563 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.053037 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.054740 I | etcdserver: name = factual-crocodile-etcd-0
2017-02-18 14:44:40.054756 I | etcdserver: data dir = /var/run/etcd/default.etcd
2017-02-18 14:44:40.054762 I | etcdserver: member dir = /var/run/etcd/default.etcd/member
2017-02-18 14:44:40.054766 I | etcdserver: heartbeat = 100ms
2017-02-18 14:44:40.054769 I | etcdserver: election = 1000ms
2017-02-18 14:44:40.054773 I | etcdserver: snapshot count = 10000
2017-02-18 14:44:40.054901 I | etcdserver: advertise client URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.054911 I | etcdserver: initial advertise peer URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.054925 I | etcdserver: initial cluster = factual-crocodile-etcd-0=http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380,factual-crocodile-etcd-1=http://factual-crocodile-etcd-1.factual-crocodile-etcd:2380,factual-crocodile-etcd-2=http://factual-crocodile-etcd-2.factual-crocodile-etcd:2380
2017-02-18 14:44:40.084529 I | etcdserver: starting member 6a7fe15c528ef50d in cluster d141f24a7d5c19f9
2017-02-18 14:44:40.084623 I | raft: 6a7fe15c528ef50d became follower at term 0
2017-02-18 14:44:40.084638 I | raft: newRaft 6a7fe15c528ef50d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-02-18 14:44:40.084645 I | raft: 6a7fe15c528ef50d became follower at term 1
2017-02-18 14:44:40.100290 E | rafthttp: failed to dial 8167cc828aa0f298 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.100367 E | rafthttp: failed to dial 8167cc828aa0f298 on stream Message (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.101500 I | etcdserver: starting server... [version: 2.2.5, cluster version: to_be_decided]
2017-02-18 14:44:40.102363 E | etcdserver: the member has been permanently removed from the cluster
2017-02-18 14:44:40.102389 I | etcdserver: the data-dir used by this member must be removed.
2017-02-18 14:44:40.102800 E | rafthttp: failed to dial b3d79057b17efe5f on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2017-02-18 14:44:40.103008 E | rafthttp: failed to dial b3d79057b17efe5f on stream Message (net/http: request canceled while waiting for connection)

On the next restart it shows only this:

> $ kubectl logs -f factual-crocodile-etcd-0
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Meanwhile, the logs of the other nodes show something like this:

2017-02-18 14:44:40.098380 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed
2017-02-18 14:44:40.098469 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed

There is a similar problem with scaling down and up: after scaling down, it is not possible to scale back up, since the new pod is unable to rejoin.
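
For what it's worth, the two log lines "the member has been permanently removed from the cluster" and "the data-dir used by this member must be removed" point at a manual recovery path. The following is only a sketch using the v2 etcdctl bundled in the image; the pod and service names match this reproduction, and the member ID has to be taken from your own member list output:

# Check from a surviving pod whether the old member is still registered
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member list
# If the old member ID is still listed, remove it explicitly
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member remove 6a7fe15c528ef50d
# Wipe the stale data dir on the recreated pod's volume, since etcd refuses
# to start with data belonging to a removed member
$ kubectl exec factual-crocodile-etcd-0 -- rm -rf /var/run/etcd/default.etcd
# Re-add the member under the same name and peer URL, then restart the pod
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member add factual-crocodile-etcd-0 http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
$ kubectl delete pod factual-crocodile-etcd-0

If the recreated pod never becomes ready, the data dir may have to be wiped on the persistent volume directly rather than via kubectl exec.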

@lachie83
Contributor

lachie83 commented Jun 5, 2017

Is this still an issue?

@lwolf
Collaborator Author

lwolf commented Jun 9, 2017

As far as I can see, no changes were made to the chart, so yes, it still is.

@phyrwork

phyrwork commented Nov 2, 2017

I am experiencing a similar issue.

Re-joining etcd member
client: etcd cluster is unavailable or misconfigured

I have a 3-node etcd cluster on top of a 3-node GKE container cluster on preemptible nodes. By design, I expect to lose etcd pods every now and then, have new ones spun up, and have the cluster recover.

It's not happening.

I don't know etcd well enough to understand and fix the problem. I would, however, like it to work.

I am happy to help collect logs/data if someone with more etcd background is interested in taking a stab at a fix.
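
A minimal set of commands to collect that data would be something like the following; the pod names here are assumptions (substitute the pod names of your own release), and etcdctl is the v2 client shipped in the chart's image:

# Cluster view from one of the surviving pods
$ kubectl exec etcd-0 -- etcdctl cluster-health
$ kubectl exec etcd-0 -- etcdctl member list
# Logs of the pod that fails to rejoin, including the previous crashed container
$ kubectl logs etcd-1
$ kubectl logs etcd-1 --previous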

@Elexy

Elexy commented Nov 27, 2017

@lwolf @phyrwork @lachie83 This PR should fix that: #2864

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 26, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@aperrot42

Seems to still be an issue...

@pcornelissen

and it's still an issue :-(

Waiting for etcd-0.etcd to come up
Waiting for etcd-1.etcd to come up
Waiting for etcd-2.etcd to come up
ping: bad address 'etcd-2.etcd'
Waiting for etcd-2.etcd to come up
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

(Just copied from my cluster)

@lbornov2

I get this in AWS, but not in GKE.

@miguelaferreira

For anyone landing here looking for a workaround: in my case, all it took to get the pods to join the cluster was to delete all of them. When they restart, they join each other and the cluster forms again.
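
For reference, assuming a three-replica release whose pods are named etcd-0 through etcd-2 (substitute your own pod names), that workaround looks roughly like this:

# Delete every etcd pod; the StatefulSet recreates them
$ kubectl delete pod etcd-0 etcd-1 etcd-2
# Watch them come back, then confirm they re-formed the cluster
$ kubectl get pods -w
$ kubectl exec etcd-0 -- etcdctl cluster-health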

@rajeshneo

Workaround for this without losing your data or recreating your whole cluster:
Use helm to scale down your cluster by 1 node:
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Scale it back up and voila :)
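
Spelled out, and assuming a 3-replica cluster with the release name etcd used in the command above (pod names depend on your release), the cycle would be roughly:

# Scale down by one replica and wait for the rolling restart to settle
$ helm upgrade etcd incubator/etcd --set replicas=2
# Then scale back up to the original size
$ helm upgrade etcd incubator/etcd --set replicas=3
# Verify from any pod once they are all running again
$ kubectl exec <etcd-pod-0> -- etcdctl cluster-health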
