
Get rid of leader election for layer-2 mode #195

Closed
danderson opened this issue Mar 14, 2018 · 12 comments · Fixed by #276

Comments

@danderson
Contributor

Layer 2 modes (ARP and NDP) require a single owner machine for each service IP. Currently we do this by running leader election and having the winner own all IPs.

This is a little sub-optimal in several ways:

  • Leader election in k8s is kinda expensive in terms of control plane QPS
  • Electing a machine regardless of what it's running means we are forced to use externalTrafficPolicy=Cluster, so we lose source IP information
  • We cannot shard traffic load by IP (arguably this is a feature, but it's not a particularly compelling one)

So, here's a proposal: let's get rid of the leader election and replace it with a deterministic node selection algorithm. The controller logic remains unchanged (it allocates an IP). On the speaker, we would do the following (a rough sketch in Go follows this list):

  • Based on the Endpoints object for the service, construct a list of nodes that have a pod for that service. In Python pseudocode, that list would be [x.node for x in endpoints]
  • Hash the node names and service name together, to produce a service-dependent (but deterministic) set of hashes for the nodes.
  • Do a weighted alphabetical sort of the hashes, such that the first element of the list is the alphabetically first hash with the largest number of local pods
  • Pick that first element as the "owner" of this service, and make it announce that IP.
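
Here is a minimal Go sketch of that selection step, under the assumptions above. The `Endpoint` type and the `pickAnnouncer` function are illustrative names for this sketch, not MetalLB's actual API, and SHA-256 stands in for whatever hash ends up being used:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Endpoint is a simplified stand-in for one ready address in a service's
// Endpoints object; only the node it runs on matters here.
type Endpoint struct {
	Node string
}

// pickAnnouncer deterministically selects which node should announce the
// service IP: count local ready pods per node, hash each node name together
// with the service name, then take the node with the most local pods,
// breaking ties by the alphabetically smallest hash.
func pickAnnouncer(serviceName string, endpoints []Endpoint) string {
	// "[x.node for x in endpoints]", weighted by how many pods each node has.
	localPods := map[string]int{}
	for _, ep := range endpoints {
		localPods[ep.Node]++
	}

	type candidate struct {
		node string
		hash string
		pods int
	}
	var candidates []candidate
	for node, pods := range localPods {
		// Service-dependent but deterministic hash of node name + service name.
		sum := sha256.Sum256([]byte(node + "#" + serviceName))
		candidates = append(candidates, candidate{
			node: node,
			hash: hex.EncodeToString(sum[:]),
			pods: pods,
		})
	}
	if len(candidates) == 0 {
		return "" // no eligible nodes, nothing to announce
	}

	// Weighted alphabetical sort: most local pods first, then smallest hash.
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].pods != candidates[j].pods {
			return candidates[i].pods > candidates[j].pods
		}
		return candidates[i].hash < candidates[j].hash
	})
	return candidates[0].node
}

func main() {
	// node-b has two local ready pods, so it wins regardless of hash order.
	eps := []Endpoint{{Node: "node-a"}, {Node: "node-b"}, {Node: "node-b"}}
	fmt.Println(pickAnnouncer("default/my-service", eps))
}
```

Every speaker that runs this over the same Endpoints data arrives at the same answer independently, which is what removes the need for any explicit coordination.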

This algorithm has a few notable properties:

  • Each IP can be owned by a different node. In fact, due to the service-dependent hashing, services are likely to distribute uniformly throughout the cluster.
  • Services will prefer to attach to nodes that have multiple serving pods for the service, to better distribute load.
  • There is no explicit locking; as with consistent hashing in Maglev, each speaker simply arrives at the same conclusion independently.
  • We can once again allow externalTrafficPolicy: Local for layer2 mode services, which removes one of the major downsides of using ARP/NDP today (no client IP visibility).
  • One downside is that split-brain becomes more likely: if a node gets cut off from the control plane, it may not realize that conditions around it have changed, so we might end up with multiple machines thinking that they own an IP. We can either accept this as a tradeoff (ARP and NDP behave somewhat gracefully in the presence of a split brain), or keep a concept similar to leader election that just pings a lease in the cluster, where speakers who don't see that lease increasing stop all advertisement on the assumption that they've lost communication with the control plane. That, however, has the significant downside that the cluster stops all announcements if the control plane goes down, rather than gracefully continuing to announce the last known state. I think I would prefer just accepting the split brain in that case.

@miekg @mdlayher Thoughts on this proposal?

@miekg
Contributor

miekg commented Mar 14, 2018 via email

@mdlayher
Contributor

This all seems reasonable to me. My only big concern would be ending up with hotspots due to whatever the hashing algorithm is doing.

@steven-sheehy

This would definitely solve two of our problems with MetalLB layer 2 mode: HA of IP announcing (we only have one master, so leader election can't happen if the master is down) and source IP visibility. Any idea what release this is targeted for?

@danderson
Contributor Author

My rough plan is to have BGPv6 support and a fix for this bug in 0.7.

Unfortunately there's no timeline for when that'll happen, since I'm a lone developer working in my spare time :(

@mrbobbytables

We just encountered this issue today ourselves.

The proposal looks good, and I am in favor of just letting ARP/NDP sort it out. This seems more in line with Kubernetes as a whole, with workloads and services not being completely dependent on the control plane being available.

re: @mdlayher's thoughts on hotspots -- I cannot speak for everyone's use cases, but for us this is a specific need where we map services to 'edge-nodes'. We are already targeting specific nodes with these deployments, so it's somewhat deterministic already.
One possibility to refine this would be to check for an optional annotation on the service where a node, or an ordered list of nodes, could be supplied to function as a manual version of the 'node selection algorithm'. This would allow some level of operator control or override regarding which nodes announce the IP.

@danderson danderson changed the title Consider getting rid of leader election for layer-2 mode Get rid of leader election for layer-2 mode Jun 27, 2018
@danderson danderson added this to the v0.7.0 milestone Jun 27, 2018
@danderson danderson self-assigned this Jun 27, 2018
danderson added a commit that referenced this issue Jul 21, 2018
With this change alone, all speakers will always announce all layer2 IPs.
This will work, kinda-sorta, but it's obviously not right. The followup
change to implement a new leader selection algorithm will come separately.
danderson added a commit that referenced this issue Jul 21, 2018
Now, instead of one node owning all layer2 announcements, each
service selects one eligible node (i.e. with a local ready pod)
as the announcer. There is per-service perturbation such that
even multiple services pointing to the same pods will tend to
spread their announcers across eligible nodes.
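
As an aside, here is a minimal, self-contained Go illustration of the per-service perturbation that commit message describes; the function name and the choice of SHA-256 are assumptions for this sketch, not the actual implementation. Ignoring pod counts for brevity, each service hashes every eligible node name together with its own name and picks the smallest hash, so different services tend to land on different nodes:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// announcerFor picks, per service, the eligible node with the smallest
// hash of node name + service name. Coupling the hash to the service name
// means services backed by the same pods tend to spread their announcers
// across the eligible nodes. (Illustrative only.)
func announcerFor(service string, nodes []string) string {
	best, bestHash := "", ""
	for _, n := range nodes {
		sum := sha256.Sum256([]byte(n + "#" + service))
		h := hex.EncodeToString(sum[:])
		if best == "" || h < bestHash {
			best, bestHash = n, h
		}
	}
	return best
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	for _, svc := range []string{"default/web", "default/api", "default/db"} {
		fmt.Println(svc, "->", announcerFor(svc, nodes))
	}
}
```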
danderson added a commit that referenced this issue Jul 21, 2018
@anandsinghkunwar

Does it make sense to include the service namespace in the nodename + servicename hash as well?

@xeor

xeor commented Feb 13, 2019

Should this work now, i.e. getting the real IP of the requester?

I am using:

  • arp mode (also tried bgp mode)
  • LoadBalancer with externalTrafficPolicy: "Local"
  • LoadBalancer > Ingress (nginx-controller) > Service (ClusterIP) > k8s.gcr.io/echoserver:1.4
  • MetalLB 0.7.3 from Helm
  • nginx-ingress 0.22.0 from Helm

I'm still getting x-real-ip=10.40.0.0. Am I missing something? I've been fiddling with this a lot, so I'm wondering whether this issue is solved or not.

@zerkms
Contributor

zerkms commented Feb 13, 2019

@xeor check the ingress access logs first; do they show the real IP address there?

@xeor

xeor commented Feb 13, 2019

Just checked its logs during all my experimenting. It shows 10.40.0.0 every time.

@zerkms
Contributor

zerkms commented Feb 13, 2019

What is even 10.40.0.0?

@danderson
Contributor Author

Please use the mailing list for support questions, not a closed bug :)

@xeor

xeor commented Feb 13, 2019

Sorry, but I wanted to confirm that this was actually fixed and should work. If it does, I'll continue my quest to figure it out. :)
