Kubernetes API Server healthcheck not failing if etcd is partitioned #72796
TLDR: One etcd server started timing out all writes because of a non-transitive netsplit, but the Kubernetes apiserver healthcheck did not fail, leaving 1/3 of K8s clients without a working control plane until manual intervention.
Anything else we need to know?:
It looks like the etcdctl endpoint health command does a random read after fetching the list of members (https://github.com/etcd-io/etcd/blob/master/etcdctl/ctlv3/command/ep_command.go#L118). We should probably do the same?
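To make the suggestion concrete, here is a rough sketch of a healthcheck that issues one read with a deadline and fails if the read does not complete. It is not the apiserver's actual code: the quorumReader interface, checkHealth, the two fake readers, and the key name "health" are all hypothetical names chosen for illustration. In a real check the reader would presumably wrap a read through the etcd client, which by default has to go through quorum and so should stall on a member that is cut off from the rest of the cluster.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// quorumReader is a hypothetical abstraction over the storage client.
// A real implementation would presumably wrap an etcd client read.
type quorumReader interface {
	Get(ctx context.Context, key string) error
}

// checkHealth issues a single read bounded by a timeout, so a member that
// still accepts connections but cannot make progress is reported unhealthy.
func checkHealth(r quorumReader, timeoutMillis int) error {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// The key name is an assumption; the key does not need to exist,
	// only the round trip matters.
	if err := r.Get(ctx, "health"); err != nil {
		return fmt.Errorf("etcd read failed: %w", err)
	}
	return nil
}

// healthyReader simulates a member that answers immediately.
type healthyReader struct{}

func (healthyReader) Get(ctx context.Context, key string) error { return nil }

// partitionedReader simulates a member cut off from quorum:
// the request hangs until the caller's deadline expires.
type partitionedReader struct{}

func (partitionedReader) Get(ctx context.Context, key string) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	fmt.Println("healthy member:", checkHealth(healthyReader{}, 100))
	fmt.Println("partitioned member:", checkHealth(partitionedReader{}, 100))
}
```

The point of the deadline is that a partitioned member may never return an error on its own; without a bounded read the healthcheck would either hang or, as in this issue, keep reporting success because it never exercises the request path.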