
etcd cluster losing quorum, causing cluster outage #2235

Closed
gclyatt opened this issue Mar 30, 2017 · 12 comments

Comments

@gclyatt

gclyatt commented Mar 30, 2017

Not sure if this is the appropriate place to report this, but we do use kops to manage our cluster.

  • kops version: 1.5.1
  • kubernetes versions: 1.5.4, was 1.5.2 first time
  • running in AWS: us-east-1a, us-east-1b, us-east-1c
  • 3 masters, each in its own subnet/AZ

TL;DR: it feels like losing one master triggers a cluster-wide failure.

We've had a similar issue 3 times in the past 6 weeks.

  • I notice that I can no longer create new entities from kubectl:
    client: etcd member http://127.0.0.1:4001 has no leader
  • One of the masters (1b) has its pods in the 'Unknown' state.
  • Notice that the pods for the master in 1c have only been running for 10 minutes.
  • Some minion nodes report 'NotReady'.
  • netstat on 1b reports over 64K connections (rough commands for checking this are sketched after this list).
  • Having seen this once before, I reboot 1b.
  • When 1b is running again, etcd is not able to communicate with either of the other masters.

Saw this in etcd logs:

2017-03-29 19:13:18.424460 E | rafthttp: failed to dial 5871757f7d6915d1 on stream Message (read tcp 172.16.2.94:33822->172.16.3.133:2380: i/o timeout)
2017-03-29 19:13:21.038708 E | etcdhttp: got unexpected response error (etcdserver: request timed out)
  • Unable to log in to the master in 1c over ssh; it never connects.
  • Notice that 1a also has nearly 64K TCP connections reported in netstat.
    Thinking that 1c has exhausted all of its ssh connections, I reboot it via the AWS console.
  • 1b and 1c are now able to communicate with each other, but not 1a.
  • Also no longer able to ssh into 1a.
  • I ultimately needed to reboot all 3 masters to get quorum again.
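
For anyone hitting the same thing, these are roughly the checks referenced above (a rough sketch, assuming etcdctl v2 and the default kops client port 4001; exact flags may differ by etcd version):

# does the local etcd member see a leader, and is the cluster healthy?
etcdctl --endpoints=http://127.0.0.1:4001 cluster-health

# how many connections are stuck in CLOSE_WAIT on this host?
netstat -tan | grep -c CLOSE_WAIT
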
@ajohnstone
Contributor

ajohnstone commented Mar 30, 2017

We had exactly the same problem recently, but ended up failing over the whole cluster.
We rebooted 2 of the 3 nodes that had issues, and ended up with a split-brain type scenario, whereby you would see different pods available in the pool depending on which master you hit.

I forced one node to be removed from etcd due to the split-brain scenario, and removed its persistent storage. I never managed to restore the cluster and resolve this before we failed over the whole cluster.

@justinsb
Member

Ouch - sorry about the problems.

The 64K connections... what are they? Were they in CLOSE_WAIT by any chance? Or which ports were they going to?

I'm not sure that we actually have a 64K limit on connections, but possibly if they are all loopback connections...

@gclyatt
Author

gclyatt commented Mar 30, 2017

I was in a hurry today, but I was able to find a transcript from when this happened a couple of weeks back.
There were 58K connections in CLOSE_WAIT on these 7 ports:
30976
31119
31607
31652
31724
31855
32060

Example:
tcp6 1 0 172.16.2.55:31119 172.16.2.184:50019 CLOSE_WAIT -
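
For reference, the per-port breakdown came from something along these lines (a rough sketch; the awk field numbers assume the netstat output format shown above):

netstat -tan | awk '$6 == "CLOSE_WAIT" { n = split($4, a, ":"); print a[n] }' | sort | uniq -c | sort -rn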

On the 2nd time this happened, a few weeks ago, I also tried clearing the persistent etcd datastore on one of the masters and ran into the issue that @ajohnstone experienced. I ended up migrating everything to a new cluster.

@egalano

egalano commented Mar 30, 2017

I also had this take down our entire cluster earlier this week, which prompted me to file issue #2216.

@justinsb
Member

Wow, OK. I'm not sure why this is suddenly more of a problem, but this sounds like kubernetes/kubernetes#43212.

I don't think that an external etcd cluster would be immune to this @egalano - though now I understand better where that request was coming from!

So there is a fix for that, and it looks like it did make it into 1.6.0, but it hasn't yet been backported. I just proposed cherry-picks to 1.4 and 1.5 this morning.

The "workaround" is to restart kube-proxy on the machine, or to remove any ELB services with no pods. What happens is that inbound connections - including ELB health checks - leak connections whenever a service has no pods. So it's typically "fine" for a few minutes, but if you have a service that actually has no pods you will gradually leak connections. Restarting kube-proxy will close those connections. More gory details in the PR & the issue :-)

@gclyatt
Author

gclyatt commented Mar 30, 2017

Thanks Justin!

We do have the occasional ELB service with no pods on this cluster.
I'll be more diligent about removing them from now on.

@gileshinchcliff

@justinsb I'm having difficulty figuring out whether the fix for this has been backported to 1.5.* yet. Any chance you could advise?

@justinsb
Member

justinsb commented May 11, 2017

I believe the fix is in two parts, and so has not been backported to 1.5.

The workaround is not to have LoadBalancer/NodePort services with no endpoints for prolonged periods of time.

The fix needs both kubernetes/kubernetes#43415 and kubernetes/kubernetes#43972, and I believe neither has been backported to 1.5.

@gileshinchcliff

Thanks, that helps. Are there plans for these to be cherry-picked?

@justinsb
Member

justinsb commented May 11, 2017

Just reopened kubernetes/kubernetes#43858, which is the first cherry-pick. (I had closed it when the challenges that needed the second PR came to light; we'll need both.)

@negz

negz commented Jun 2, 2017

Just chiming in to say we're seeing this too on 1.5.7.

@chrislovecnm
Contributor

This appears to be an upstream issue, which I think is resolved.

Closing, as it appears to be solved. Please reopen if needed.
