
Leader Election Causes Downtime in Kops Rolling Updates #333

Open
LeeHampton opened this issue Jul 1, 2020 · 0 comments

LeeHampton commented Jul 1, 2020

This is a duplicate of an issue I created on kops. I can't tell whether this is related to the etcd client, or to etcd-manager and how it handles load balancing during leader election, or whether it's something else on the k8s/kops side, but I figured I'd post here to be thorough.

This is an HA 3-master cluster on AWS using m5.2xlarge instances with attached gp2 EBS volumes for the etcd disks. I'm running etcd 3.4.3 with etcd-manager.

This isn't fully deterministic, but after a master gets taken down in a rolling update I often see some variation of the following.

The API server logs some etcd-related errors, either a leader-election error or a timeout error:

E0629 18:45:05.248843       1 status.go:71] apiserver received an error that is not an metav1.Status : rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}

or

apiserver received an error that is not an metav1.Status rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
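
For reference, those two messages correspond to rpctypes.ErrTimeout and rpctypes.ErrLeaderChanged in the etcd v3 Go client (the import paths below are the 3.4 ones; they moved in 3.5). Here's a rough sketch, with a hypothetical endpoint and key and with TLS config omitted, of how a client could treat these as transient election-related errors and retry them, which is roughly the behavior I'd hope for from the callers that are hitting them:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/etcdserver/api/v3rpc/rpctypes"
)

func main() {
	// Hypothetical endpoint; a kops cluster would also need TLS client
	// certs here (clientv3.Config.TLS), omitted for brevity.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-a.internal.example.com:4001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer cli.Close()

	// Retry reads when the error is one of the transient election-related
	// errors from the apiserver logs above; fail on anything else.
	var resp *clientv3.GetResponse
	for attempt := 1; attempt <= 5; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err = cli.Get(ctx, "/registry/health-probe") // hypothetical key
		cancel()
		if err == nil {
			break
		}
		if err == rpctypes.ErrLeaderChanged || err == rpctypes.ErrTimeout {
			log.Printf("attempt %d: transient etcd error %q, retrying", attempt, err)
			time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
			continue
		}
		log.Fatalf("non-transient etcd error: %v", err)
	}
	if err != nil {
		log.Fatalf("still failing after retries: %v", err)
	}
	log.Printf("got %d keys", resp.Count)
}
```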

Often this causes the API server to restart in an unhealthy state:

healthz check failed
[-]etcd failed: reason withheld

But this seems to resolve quickly and the API server gets back to normal. However, during this time, services that rely on the kube API server get timeout errors when trying to connect to it.
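
To put a number on that window, I've been polling /healthz on the API server around the rolling update; the verbose output lists the same per-check lines as the "[-]etcd failed" entry above. A minimal sketch, assuming a hypothetical API endpoint, anonymous access to /healthz (default RBAC, but it may differ per cluster), and skipping CA verification for brevity:

```go
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical API server address; in kops this is the api.<cluster>
	// load balancer endpoint.
	const url = "https://api.cluster.example.com/healthz?verbose"

	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skipping CA verification keeps the sketch short; a real probe
		// should trust the cluster CA instead.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for {
		start := time.Now()
		resp, err := client.Get(url)
		switch {
		case err != nil:
			log.Printf("unreachable after %s: %v", time.Since(start), err)
		case resp.StatusCode != http.StatusOK:
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			log.Printf("unhealthy (%d) in %s:\n%s", resp.StatusCode, time.Since(start), body)
		default:
			resp.Body.Close()
			log.Printf("ok in %s", time.Since(start))
		}
		time.Sleep(2 * time.Second)
	}
}
```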

On the etcd side, I see the usual logs about a peer member not being reachable while that member is down. However, I also see some timeout warnings in the etcd logs:

2020-06-30 16:25:38.979648 W | etcdserver: server is likely overloaded
2020-06-30 16:25:38.979653 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 68.908672ms, to 607a5729c01d3c7d)

And things like this:

etcdserver: read-only range request "key:\"/registry/configmaps/cert-manager/cert-manager-cainjector-leader-election-core\" " with result "range_response_count:1 size:523" took too long (5.896109403s) to execute

These, too, resolve after a few minutes and don't seem to come back.
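
Since both the missed heartbeats and the slow range reads point at disk latency or general overload, I've also been checking etcd's own metrics around the rollout. A small sketch, assuming a plaintext --listen-metrics-urls endpoint is configured (the port below is hypothetical; the regular client port works too but needs TLS client certs):

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical plaintext metrics endpoint (etcd --listen-metrics-urls).
	const metricsURL = "http://127.0.0.1:2382/metrics"

	resp, err := http.Get(metricsURL)
	if err != nil {
		log.Fatalf("scrape: %v", err)
	}
	defer resp.Body.Close()

	// Metrics that show whether slow disk / overload is behind the missed
	// heartbeats: WAL fsync and backend commit latency histograms, plus the
	// counter of leader changes this member has seen.
	wanted := []string{
		"etcd_disk_wal_fsync_duration_seconds",
		"etcd_disk_backend_commit_duration_seconds",
		"etcd_server_leader_changes_seen_total",
	}

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for _, prefix := range wanted {
			if strings.HasPrefix(line, prefix) {
				log.Println(line)
				break
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatalf("read metrics: %v", err)
	}
}
```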

It seems like any time an etcd leader election happens, we get at least 30 seconds of downtime and some odd startup issues. I'm not sure what to do here. A 3-node etcd cluster should survive losing one of its members, but currently if one node goes down, Kubernetes becomes essentially unavailable. This is a problem for our cluster because it relies on the Kubernetes API server for a lot of functionality.
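
To confirm that the 30+ seconds really is the election window and not something else, I've been polling each member's view of the leader during the rolling update using the etcd client's Status call. A rough sketch, again with hypothetical member endpoints and TLS config omitted:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Hypothetical endpoints for the three etcd-main members; TLS config
	// (clientv3.Config.TLS) omitted for brevity.
	endpoints := []string{
		"https://etcd-a.internal.example.com:4001",
		"https://etcd-b.internal.example.com:4001",
		"https://etcd-c.internal.example.com:4001",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer cli.Close()

	// Poll each member's view of the leader. The stretch where members are
	// unreachable or report leader=0 (no leader) is the window the
	// apiserver experiences as "leader changed" / request timeouts.
	for {
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			st, err := cli.Status(ctx, ep)
			cancel()
			if err != nil {
				log.Printf("%s: unreachable: %v", ep, err)
				continue
			}
			log.Printf("%s: leader=%x term=%d", ep, st.Leader, st.RaftTerm)
		}
		time.Sleep(time.Second)
	}
}
```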
