Get rid of leader election for layer-2 mode #195

Closed
danderson opened this Issue Mar 14, 2018 · 6 comments

@danderson
Member

danderson commented Mar 14, 2018

Layer 2 mode (ARP and NDP) requires a single owner machine for each service IP. Currently we do this by running a leader election and having the winner own all IPs.

This is a little sub-optimal in several ways:

  • Leader election in k8s is kinda expensive in terms of control plane qps
  • Electing a machine regardless of what it's running means we are forced to use externalTrafficPolicy=Cluster, so we lose source IP information
  • We cannot shard traffic load by IP (arguably this is a feature, but it's not a particularly compelling one)

So, here's a proposal: let's get rid of the leader election and replace it with a deterministic node selection algorithm. The controller logic remains unchanged (it still allocates an IP). On the speaker, we would do the following (a rough sketch in code follows the list):

  • Based on the Endpoints object for the service, construct a list of nodes that have a pod for that service. In Python pseudocode, that list would be [x.node for x in endpoints]
  • Hash the node names and service name together, to produce a service-dependent (but deterministic) set of hashes for the nodes.
  • Do a weighted alphabetical sort of the hashes, such that the first element of the list is the alphabetically first hash with the largest number of local pods
  • Pick that first element as the "owner" of this service, and make it announce that IP.
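
Here's a minimal Python sketch of that selection, just to make the intent concrete (the SHA-256 hash, the name separator, and the function name are illustrative assumptions; the real speaker would derive the node list from the Endpoints object rather than take it as a plain list):

    import hashlib
    from collections import Counter

    def select_announcer(service_name, endpoint_nodes):
        """Deterministically pick the node that should announce this service's IP.

        endpoint_nodes is the [x.node for x in endpoints] list from above:
        one entry per serving pod, naming the node it runs on. Every speaker
        evaluates this independently and arrives at the same answer.
        """
        pod_counts = Counter(endpoint_nodes)
        if not pod_counts:
            return None  # no eligible node; nobody announces the IP

        def sort_key(node):
            # Hash the node name and service name together, so the choice is
            # service-dependent but still deterministic across speakers.
            h = hashlib.sha256(f"{service_name}/{node}".encode()).hexdigest()
            # The "weighted alphabetical sort": prefer nodes with more local
            # pods, then break ties with the alphabetically first hash.
            return (-pod_counts[node], h)

        return min(pod_counts, key=sort_key)

    # Two services backed by the same nodes will often pick different
    # announcers, because the hash input includes the service name.
    nodes = ["node-a", "node-b", "node-c"]
    print(select_announcer("web", nodes))
    print(select_announcer("api", nodes))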

This algorithm gives us a few properties:

  • Each IP can be owned by a different node. In fact, due to the service-dependent hashing, it's likely that services will be distributed fairly uniformly throughout the cluster.
  • Services will prefer to attach to nodes that have multiple serving pods for the service, to better distribute load.
  • There is no explicit locking; similar to consistent hashing in Maglev, each speaker simply arrives at the same conclusion independently.
  • We can once again allow externalTrafficPolicy: Local for layer2 mode services, which removes one of the major downsides of using ARP/NDP today (no client IP visibility).
  • One downside is that split-brain becomes more likely: if a node gets cut off from the control plane, it may not realize that conditions around it have changed, so we might end up with multiple machines thinking they own an IP. We can either accept this as a tradeoff (ARP and NDP behave somewhat gracefully in the presence of a split brain), or keep a concept similar to leader election that just pings a lease in the cluster; speakers who don't see that lease increasing stop all advertisement, on the assumption that they've lost communication with the control plane. That, however, has the significant downside that the cluster stops all announcements if the control plane goes down, rather than gracefully keeping the last known state announced. I think I would prefer just accepting the split brain in that case.

@miekg @mdlayher Thoughts on this proposal?

@danderson danderson added this to To Do in Layer 2 mode via automation Mar 14, 2018

@miekg

Collaborator

miekg commented Mar 14, 2018

@mdlayher

Collaborator

mdlayher commented Mar 15, 2018

This all seems reasonable to me. My only big concern would be ending up with hotspots due to whatever the hashing algorithm is doing.

@steven-sheehy

steven-sheehy commented May 3, 2018

This would definitely solve two of our problems with MetalLB layer 2 mode: HA of IP announcing (we only have one master, so leader election isn't possible if the master is down) and source IP visibility. Any idea what release this is targeted for?

@danderson

Member

danderson commented May 3, 2018

My rough plan is to have BGPv6 support and a fix for this bug land in 0.7.

Unfortunately there's no timeline for when that'll happen, since I'm a lone developer working in my spare time :(

@mrbobbytables

mrbobbytables commented Jun 26, 2018

We just encountered this issue today ourselves.

The proposal looks good, and I am in favor of just letting ARP/NDP sort it out. This seems more in line with Kubernetes as a whole, with workloads and services not being completely dependent on the control plane being available.

re: @mdlayher's thoughts on hotspots -- I can't speak for everyone's use-cases, but for us this is a specific need where we map services to 'edge-nodes'. We are already targeting specific nodes with these deployments, so it's somewhat deterministic already.
One possibility to refine this would be to check for an optional annotation on the service where a node, or an ordered list of nodes, could be supplied to function as a manual version of the 'node selection algorithm'. This would allow some level of operator control or override over which nodes announce the IP.
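
Purely as a hypothetical illustration of that annotation idea (the annotation key example.com/announce-node-order and the helper name are invented here; this reuses the select_announcer sketch from the proposal above):

    def select_announcer_with_override(service_name, endpoint_nodes, annotations):
        """Honor an operator-supplied node ordering before falling back to hashing."""
        eligible = set(endpoint_nodes)
        override = annotations.get("example.com/announce-node-order", "")
        for node in (n.strip() for n in override.split(",")):
            # Only honor nodes that actually have a ready local pod for the service.
            if node and node in eligible:
                return node
        # No usable override: fall back to the deterministic hashed selection.
        return select_announcer(service_name, endpoint_nodes)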

@danderson danderson changed the title from "Consider getting rid of leader election for layer-2 mode" to "Get rid of leader election for layer-2 mode" Jun 27, 2018

@danderson danderson added this to the v0.7.0 milestone Jun 27, 2018

@danderson danderson self-assigned this Jun 27, 2018

danderson added a commit to danderson/metallb that referenced this issue Jun 28, 2018

Remove all leader election code. google#195
With this change alone, all speakers will always announce all layer2 IPs.
This will work, kinda-sorta, but it's obviously not right. The followup
change to implement a new leader selection algorithm will come separately.

danderson added a commit to danderson/metallb that referenced this issue Jul 13, 2018

Distribute layer2 announcements across eligible nodes. google#195
Now, instead of one node owning all layer2 announcements, each
service selects one eligible node (i.e. with a local ready pod)
as the announcer. There is per-service perturbation such that
even multiple services pointing to the same pods will tend to
spread their announcers across eligible nodes.

danderson added a commit to danderson/metallb that referenced this issue Jul 13, 2018

danderson added a commit to danderson/metallb that referenced this issue Jul 13, 2018

danderson added a commit to danderson/metallb that referenced this issue Jul 21, 2018

Remove all leader election code. google#195
With this change alone, all speakers will always announce all layer2 IPs.
This will work, kinda-sorta, but it's obviously not right. The followup
change to implement a new leader selection algorithm will come separately.

danderson added a commit to danderson/metallb that referenced this issue Jul 21, 2018

Distribute layer2 announcements across eligible nodes. google#195
Now, instead of one node owning all layer2 announcements, each
service selects one eligible node (i.e. with a local ready pod)
as the announcer. There is per-service perturbation such that
even multiple services pointing to the same pods will tend to
spread their announcers across eligible nodes.

danderson added a commit to danderson/metallb that referenced this issue Jul 21, 2018

danderson added a commit to danderson/metallb that referenced this issue Jul 21, 2018

Layer 2 mode automation moved this from To Do to Done Jul 21, 2018

danderson added a commit that referenced this issue Jul 21, 2018

Remove all leader election code. #195
With this change alone, all speakers will always announce all layer2 IPs.
This will work, kinda-sorta, but it's obviously not right. The followup
change to implement a new leader selection algorithm will come separately.

danderson added a commit that referenced this issue Jul 21, 2018

Distribute layer2 announcements across eligible nodes. #195
Now, instead of one node owning all layer2 announcements, each
service selects one eligible node (i.e. with a local ready pod)
as the announcer. There is per-service perturbation such that
even multiple services pointing to the same pods will tend to
spread their announcers across eligible nodes.

danderson added a commit that referenced this issue Jul 21, 2018

@anandsinghkunwar

anandsinghkunwar commented Aug 16, 2018

Does it make sense to also include the service namespace in the hash, alongside the node name and service name?
