Kubernetes API Server healthcheck not failing if etcd is partitioned #72796

Open
lorenz opened this Issue Jan 10, 2019 · 6 comments


lorenz commented Jan 10, 2019

What happened:
I'm running a K8s 1.13 cluster and a partial netsplit happened: a defective linecard dropped only TCP packets above a certain size between two masters, while the third could still communicate with both (so the fault made the network non-transitive). etcd runs on the physical network, but Kubernetes traffic runs over a WireGuard overlay for security, with routing provided by OSPF (so internal communication was not affected, since it is tunnelled over UDP). etcd experienced some temporary issues because the connections were dropping only certain packets (above ~1000 bytes), but it quickly dropped one host out of the quorum, at which point it was degraded but all remaining hosts could communicate freely.

Each master node runs an etcd instance and an apiserver. All apiservers are anycasted inside the overlay to balance the load and provide redundancy, and the announcement is coupled to the health check, so if any apiserver fails no traffic is routed to it. The problem was that the apiserver did not fail its health check even though all etcd write requests timed out. The non-working apiserver was still reachable and got routed its share of the anycast traffic, since that traffic goes over the UDP overlay. Thus ~1/3 of all workloads and nodes were stuck on a non-working apiserver until I nuked it.

TLDR: One etcd server started timing out all writes due to a non-transitive netsplit and the Kubernetes apiserver healthcheck did not fail, causing 1/3 of K8s clients to not have a working control plane until manual intervention.
What you expected to happen:
The health check should've failed (in my opinion; I'm happy to discuss if you think otherwise).
How to reproduce it (as minimally and precisely as possible):
Set up 3 etcds with corresponding apiservers, introduce a split between two of them (it shouldn't have to be a packet-size-dependent drop; a full disconnect should reproduce it cleanly), and observe that all three apiservers still report healthy even though at least one of them no longer works (a probe sketch follows below).
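A minimal sketch (in Go) of the observation step, assuming the three masters are reachable from the test machine: it compares what each apiserver's /healthz reports with whether a quorum write against each etcd member succeeds. The addresses, ports, and the etcd client import path are placeholders/assumptions, and TLS/auth setup is stubbed out.

```go
// probe.go: check each apiserver's /healthz and, alongside that, attempt a
// quorum write against each etcd member. During the split, the partitioned
// member's writes time out while the apiserver in front of it still answers
// ok on /healthz.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // assumed import path; older releases use go.etcd.io/etcd/clientv3
)

func main() {
	// Placeholder addresses for the three masters.
	apiservers := []string{"https://10.0.0.1:6443", "https://10.0.0.2:6443", "https://10.0.0.3:6443"}
	etcds := []string{"https://10.0.0.1:2379", "https://10.0.0.2:2379", "https://10.0.0.3:2379"}

	httpc := &http.Client{
		Timeout: 5 * time.Second,
		// Test-only: skip cert verification; add a bearer token or client cert
		// if /healthz is not open to anonymous requests on your cluster.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	for _, a := range apiservers {
		resp, err := httpc.Get(a + "/healthz")
		if err != nil {
			fmt.Printf("%s /healthz: %v\n", a, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s /healthz: %d %s\n", a, resp.StatusCode, body)
	}

	for _, e := range etcds {
		// Client-cert TLS config omitted; set clientv3.Config.TLS if your etcd requires it.
		cli, err := clientv3.New(clientv3.Config{Endpoints: []string{e}, DialTimeout: 2 * time.Second})
		if err != nil {
			fmt.Printf("%s dial: %v\n", e, err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err = cli.Put(ctx, "healthcheck-probe", "x") // quorum write: fails on the partitioned member
		cancel()
		cli.Close()
		fmt.Printf("%s quorum write: err=%v\n", e, err)
	}
}
```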

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.13.0
  • Cloud provider or hardware configuration: Physical x86_64 servers
  • OS (e.g. from /etc/os-release): Custom
  • Kernel (e.g. uname -a): Custom, based on 4.19. No patches affect this functionality.
  • Install tools: Custom
  • Others:

lorenz commented Jan 11, 2019

/sig api-machinery

(maybe cluster-ops?)

dims (Member) commented Jan 11, 2019

cc @jpbetz

dims (Member) commented Jan 14, 2019

Looks like the etcdctl endpoint health command does a quorum read of an arbitrary key after fetching the list of members - https://github.com/etcd-io/etcd/blob/master/etcdctl/ctlv3/command/ep_command.go#L118 - we should probably do the same?
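A minimal sketch of what that could look like on the apiserver side, assuming a configured clientv3.Client is already available; the package, function name, and import path below are hypothetical, not the actual Kubernetes code:

```go
// Sketch of a storage health check that fails when etcd has lost quorum:
// a linearizable Get must reach a quorum of members, so it times out on a
// partitioned member even if the local etcd process is still up.
package storagehealth // hypothetical package name

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // assumed import path; older trees use go.etcd.io/etcd/clientv3
)

// EtcdHealthCheck returns a func suitable for wiring into the apiserver's
// /healthz handlers: it errors unless a quorum read succeeds within timeout.
func EtcdHealthCheck(client *clientv3.Client, timeout time.Duration) func() error {
	return func() error {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		// Any key works; the point is the linearizable (quorum) round trip,
		// not the value. "health" mirrors what etcdctl endpoint health reads.
		if _, err := client.Get(ctx, "health"); err != nil {
			return fmt.Errorf("etcd quorum read failed: %v", err)
		}
		return nil
	}
}
```

Because the Get has to reach a quorum of members, a split like the one described above makes it time out against the partitioned member, so the check fails and (in this setup) the anycast announcement would be withdrawn.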


lorenz commented Jan 14, 2019

Looks sane and would have prevented my issue. Should I PR this?

dims (Member) commented Jan 14, 2019

@lorenz sure. thanks

lorenz referenced a pull request that will close this issue Jan 14, 2019

Open

Fix etcd healthcheck for consensus failures #72896

fedebongio (Contributor) commented Jan 14, 2019
