etcd cluster losing quorum, causing cluster outage #2235
Had exactly the same problem recently, but ended up failing over the whole cluster. I forced one node out of etcd due to the split-brain scenario and removed its persistent storage. I never managed to restore the cluster before we failed over entirely.
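For anyone hitting the same split-brain, here is a rough sketch of that kind of forced removal, assuming the etcd v2 `etcdctl` and the client endpoint on port 4001 mentioned later in this thread; the member ID and data path are illustrative, not taken from this cluster:

```sh
# List members to find the ID of the wedged node (etcd v2 etcdctl assumed).
etcdctl --endpoints http://127.0.0.1:4001 member list

# Force-remove that member by its ID (hypothetical ID shown).
etcdctl --endpoints http://127.0.0.1:4001 member remove 8e9e05c52164694d

# On the removed node, wipe the persistent data directory before it rejoins;
# this path is an assumption and depends on how kops mounts the etcd volume.
sudo rm -rf /var/lib/etcd/member
```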
Ouch - sorry about the problems. The 64K connections... what are they? Were they in CLOSE_WAIT by any chance? Or which ports were they going to? I'm not sure that we actually have a 64K limit on connections, but possibly if they are all loopback connections...
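A quick way to answer that kind of question on an affected master is to bucket connections by state; a minimal sketch using standard tooling:

```sh
# Count TCP connections by state; a large CLOSE_WAIT pile-up is the
# signature of the leak discussed below.
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Narrow to a specific port (4001 here, matching the etcd client port).
ss -tan '( sport = :4001 or dport = :4001 )' | head
```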
I was in a hurry today, but was able to find a transcript from when this happened a couple of weeks back. The second time this happened, a few weeks ago, I also tried clearing the persistent etcd datastore on one of the masters and ran into the issue that @ajohnstone experienced. I ended up migrating everything to a new cluster.
I also had this take down our entire cluster earlier this week, which prompted me to file this issue: #2216
Wow, OK. I'm not sure why this is suddenly more of a problem, but this sounds like kubernetes/kubernetes#43212. I don't think that an external etcd cluster would be immune to this @egalano - though now I understand better where that request was coming from!

So there is a fix for that, and it looks like it did make it into 1.6.0, but it hasn't yet been backported. I just proposed cherry-picks to 1.4 and 1.5 this morning.

The "workaround" is to restart kube-proxy on the machine, or to remove any ELB services with no pods. What happens is that inbound connections - including ELB health checks - leak whenever a service has no pods. So it's typically "fine" for a few minutes, but if you have a service that actually has no pods you will gradually leak connections. Restarting kube-proxy will close those connections. More gory details in the PR & the issue :-)
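To make the workaround concrete, a minimal sketch follows; exactly how kube-proxy runs differs by deployment, so the static-pod assumption and all names below are illustrative:

```sh
# If kube-proxy runs as a static pod managed by the kubelet (common on kops
# nodes of this era), killing its container closes the leaked connections,
# and the kubelet restarts the pod automatically.
docker ps | grep kube-proxy
docker kill <container-id>   # placeholder; substitute the real container ID

# Alternatively, delete any ELB-backed Service that has no pods behind it
# (service name and namespace are hypothetical).
kubectl delete svc my-elb-service -n my-namespace
```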
Thanks Justin! We do have the occasional ELB service with no pods on this cluster.
@justinsb I'm having difficulty trying to figure out whether the fix for this has been backported to 1.5.* yet. Any chance you could advise?
I believe the fix is in two parts, and so it has not been backported to 1.5. The workaround is not to have LoadBalancer/NodePort services with no endpoints for prolonged periods of time. The fix needs both kubernetes/kubernetes#43415 and kubernetes/kubernetes#43972, and neither has been backported to 1.5, I believe.
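Until the backports land, here is a small sketch for spotting the risky services using standard kubectl; the namespace and service name in the second command are placeholders:

```sh
# List all LoadBalancer services across namespaces.
kubectl get svc --all-namespaces \
  -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'

# For each one, an empty ENDPOINTS column means no backing pods, i.e. the
# leak condition described above.
kubectl get endpoints -n my-namespace my-elb-service
```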
Thanks, that helps. Are there plans for these to be cherry-picked?
Just reopened kubernetes/kubernetes#43858, which is the first cherry-pick. (I had closed it when the challenges that needed the second PR came to light; we'll need both.)
Just chiming in to say we're seeing this too on 1.5.7.
This appears to be an upstream issue, which I think is resolved. Closing, as it appears solved. Please reopen if needed.
Not sure if this is the appropriate place to report this, but we do use kops to manage our cluster.

TL;DR: it feels like losing one master triggers a full cluster failure.

We've had a similar issue 3 times in the past 6 weeks. Saw this in the etcd logs:

```
client: etcd member http://127.0.0.1:4001 has no leader
```
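When etcd reports "has no leader" like this, the usual first diagnostic step is a health check against the members; a sketch assuming the v2 `etcdctl` and the same client port as in the log line:

```sh
# Ask the local member for overall cluster health; with no leader this will
# report unhealthy or unreachable members (endpoint matches the log above).
etcdctl --endpoints http://127.0.0.1:4001 cluster-health
```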
Thinking that 1c had exhausted all of its SSH connections, I rebooted it via the AWS console.
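For reference, the CLI equivalent of that console reboot (the instance ID is a placeholder):

```sh
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0
```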