
etcd cluster losing quorum, causing cluster outage #2235

Closed
gclyatt opened this issue Mar 30, 2017 · 12 comments

Comments

@gclyatt

gclyatt commented Mar 30, 2017

Not sure if this is the appropriate place to report this, but we do use kops to manage our cluster.

  • kops version: 1.5.1
  • kubernetes versions: 1.5.4, was 1.5.2 first time
  • running in AWS: us-east-1a, us-east-1b, us-east-1c
  • 3 masters, each in its own subnet/AZ

TL;DR: it feels like losing one master triggers a cluster-wide failure.

We've had a similar issue 3 times in the past 6 weeks.

  • I notice that I can no longer create new entities from kubectl:
    client: etcd member http://127.0.0.1:4001 has no leader
  • One of the masters (1b) has its pods in the 'Unknown' state.
  • Notice that the pods for the master in 1c have only been running for 10 minutes.
  • Some minion nodes report 'NotReady'.
  • netstat on 1b reports over 64K connections (rough commands for checking this are sketched after this list).
  • Having seen this once before, I reboot 1b.
  • When 1b is running again, etcd is not able to communicate with either of the other masters.

Saw this in etcd logs:

2017-03-29 19:13:18.424460 E | rafthttp: failed to dial 5871757f7d6915d1 on stream Message (read tcp 172.16.2.94:33822->172.16.3.133:2380: i/o timeout)
2017-03-29 19:13:21.038708 E | etcdhttp: got unexpected response error (etcdserver: request timed out)
  • Unable to log in to the master in 1c over ssh; it never connects.
  • Notice that 1a also has nearly 64K TCP connections reported in netstat.
    Thinking that 1c has exhausted all of its ssh connections, I reboot it via the AWS console.
  • 1b and 1c are now able to communicate with each other, but not 1a.
  • Also no longer able to ssh into 1a.
  • I ultimately needed to reboot all 3 masters to get quorum again.
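
For anyone hitting the same thing, these are roughly the checks referenced above (a rough sketch, assuming etcdctl v2 and the default kops client port 4001; exact flags may differ by etcd version):

# does the local etcd member see a leader, and is the cluster healthy?
etcdctl --endpoints=http://127.0.0.1:4001 cluster-health

# how many connections are stuck in CLOSE_WAIT on this host?
netstat -tan | grep -c CLOSE_WAIT
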
@ajohnstone
Contributor

ajohnstone commented Mar 30, 2017

We had exactly the same problem recently, but ended up failing over the whole cluster.
We rebooted 2 of the 3 nodes that had issues, and ended up with a split-brain type scenario, whereby you would see different pods available in the pool depending on which master you hit.

I forced one node to be removed from etcd due to the split-brain scenario, and removed its persistent storage. I never managed to restore the cluster and resolve this before we failed over the whole cluster.

@justinsb
Member

Ouch - sorry about the problems.

The 64K connections... what are they? Were they in CLOSE_WAIT by any chance? Or which ports were they going to?

I'm not sure that we actually have a 64K limit on connections, but possibly if they are all loopback connections...

@gclyatt
Author

gclyatt commented Mar 30, 2017

I was in a hurry today, but I was able to find a transcript from when this happened a couple of weeks back.
There were 58K connections in CLOSE_WAIT on these 7 ports:
30976
31119
31607
31652
31724
31855
32060

Example:
tcp6 1 0 172.16.2.55:31119 172.16.2.184:50019 CLOSE_WAIT -
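
For reference, the per-port breakdown came from something along these lines (a rough sketch; the awk field numbers assume the netstat output format shown above):

netstat -tan | awk '$6 == "CLOSE_WAIT" { n = split($4, a, ":"); print a[n] }' | sort | uniq -c | sort -rn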

On the 2nd time this happened, a few weeks ago, I also tried clearing the persistent etcd datastore on one of the masters and ran into the issue that @ajohnstone experienced. I ended up migrating everything to a new cluster.

@egalano

egalano commented Mar 30, 2017

I also had this take down our entire cluster earlier this week, which prompted me to file issue #2216.

@justinsb
Member

Wow, OK. I'm not sure why this is suddenly more of a problem, but this sounds like kubernetes/kubernetes#43212.

I don't think that an external etcd cluster would be immune to this @egalano - though now I understand better where that request was coming from!

So there is a fix for that, and it looks like it did make it into 1.6.0, but it hasn't yet been backported. I just proposed cherry-picks to 1.4 and 1.5 this morning.

The "workaround" is to restart kube-proxy on the machine, or to remove any ELB services with no pods. What happens is that inbound connections - including ELB health checks - leak connections whenever a service has no pods. So it's typically "fine" for a few minutes, but if you have a service that actually has no pods you will gradually leak connections. Restarting kube-proxy will close those connections. More gory details in the PR & the issue :-)

@gclyatt
Author

gclyatt commented Mar 30, 2017

Thanks Justin!

We do have the occasional ELB service with no pods on this cluster.
I'll be more diligent about removing them from now on.

@gileshinchcliff

@justinsb I'm having difficulty figuring out whether the fix for this has been backported to 1.5.* yet. Any chance you could advise?

@justinsb
Member

justinsb commented May 11, 2017

I believe the fix is in two parts, and so has not been backported to 1.5.

The workaround is not to have LoadBalancer/NodePort services with no endpoints for prolonged periods of time.

The fix needs both kubernetes/kubernetes#43415 and kubernetes/kubernetes#43972, and I believe neither has been backported to 1.5.

@gileshinchcliff

Thanks, that helps. Are there plans for these to be cherry-picked?

@justinsb
Member

justinsb commented May 11, 2017

Just reopened kubernetes/kubernetes#43858, which is the first cherry-pick. (I had closed it when the challenges that needed the second PR came to light; we'll need both.)

@negz

negz commented Jun 2, 2017

Just chiming in to say we're seeing this too on 1.5.7.

@chrislovecnm
Contributor

This appears to be an upstream issue, which I think is resolved.

Closing, as it appears to be solved. Please reopen if needed.
