Support VRRP-ish #12
Comments
Past trauma with VRRP prevents me from thinking clearly... but a couple of questions:
Why does this need leader election? A ReplicaSet with 1 replica should do it? The lifecycle is managed by the k8s control plane - any short downtime shouldn't matter?
Why?
The canonical ARP lib is: https://github.com/mdlayher/arp
sigh k8s networking... do we need the MAC address of the node or of the pod?
re: DaemonSet, you're right, a single-replica Deployment is enough. We might have to add leader election later to avoid issues with "phantom" replicas, but that can be a separate bug. We need the machine's MAC address. From the network's POV, the pod network and all the stuff k8s does with virtual networking doesn't exist. We just need to convince the network to send the VIP traffic to the physical node, and from there kube-proxy's netfilter rules do the rest. To get the MAC address, we will need a small dance:
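Whatever the exact dance ends up being, the node-local half is just reading the hardware address off a network interface, which Go's standard library can do directly. A minimal sketch (the interface name `eth0` and the fallback scan are illustrative, not from this thread):

```go
package main

import (
	"fmt"
	"net"
)

// nodeMAC returns the hardware (MAC) address of the named interface.
// The name would come from configuration in a real speaker.
func nodeMAC(name string) (net.HardwareAddr, error) {
	ifi, err := net.InterfaceByName(name)
	if err != nil {
		return nil, err
	}
	return ifi.HardwareAddr, nil
}

func main() {
	// Fallback: scan all interfaces and report every non-loopback one
	// that actually has a MAC address.
	ifaces, err := net.Interfaces()
	if err != nil {
		panic(err)
	}
	for _, ifi := range ifaces {
		if ifi.Flags&net.FlagLoopback == 0 && len(ifi.HardwareAddr) > 0 {
			fmt.Printf("%s is-at %s\n", ifi.Name, ifi.HardwareAddr)
		}
	}
}
```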
Re: BPF program, our ARP traffic has to convince 2 separate consumers: switches in the L2 segment, and end hosts. Switches just need to learn which port to use for the destination MAC; for that we use unsolicited ARP responses. But unsolicited ARP is not necessarily enough to make end hosts work. They are not required to cache information from unsolicited responses, or they could evict the cache entry in a large network if they're not yet talking to the VIP. So I think we need to speak normal request/response ARP as well, to ensure that clients can find the VIP. WDYT?
Oh, it looks like mdlayher's ARP package doesn't need to do BPF magic to listen for ARP; it's a supported protocol family. I thought we would have to do the same hacks I did for listening to DHCP efficiently, using x/net/bpf.
sgtm; but I (still) don't understand the BPF requirement. If we made an ARP listener that just programs the host's ARP cache, wouldn't that also do it? Or is BPF doing the same thing, only more easily?
BPF just lets you filter a raw AF_PACKET socket (that receives all traffic on an interface) in-kernel, so it's more efficient. But mdlayher's ARP package lets you listen for ARP only, so it's not necessary. |
Tentatively marking this for v0.2.0. It's possible I will get impatient and push out 0.2.0 while kubecon and CoreDNS stuff occupy your time. If that happens, we can release 0.3 with this and RIP support as the major features :) |
ok, I have the (very minimal) https://github.com/miekg/karp which allows me to play with sending unsolicited ARPs. But (again) I'm wondering how to test - my router doesn't seem to accept unsolicited ARPs.
It's possible you'll have to support both solicited and unsolicited responses before things will work well. Check in wireshark (a) if the unsolicited packet is being transmitted, and (b) if you're seeing arp-who-has requests for your VIP. If you are, you need to implement responses to those as well so that end hosts know where to forward stuff.
Ok, successfully spoofed an ARP, flipped some arguments, but I'll need to double-check what the right thing to do is. Actually writing to the ARP cache requires an ioctl on a socket; that should be fun to do in Go :)
Um, why do you need to write to the ARP cache? You only need to convince other machines to send you traffic, not the local machine? |
ah, true, you're right. Misread the second half of your initial comment. Well, then. ship it :)
Ok, https://github.com/miekg/arp-speaker/ is coming along nicely. I figured that I don't have to do any unsolicited ARPs, just start responding when we see an ARP request for the new virtual IP whenever a VIP is going to be announced. This will use the MAC addr of the node we're running on. I do need one new config item I think (next to the ability to switch protocol), which is the interface we should use.
Next up:
Unsolicited ARP is still necessary when you're doing a live failover. Say you have an arp-speaker on node A, it has sent some ARP responses, and then it goes down. arp-speaker on node B takes over, but the upstream switches are still forwarding the VIP traffic to the port of node A. The clients all have the IP→MAC mapping in their ARP cache, so they are just transmitting immediately without sending ARP requests... And then the switch forwards those ethernet frames to node A instead of node B. I think the unsolicited ARP on failover is required to teach the switches that the VIP has moved to a new egress port.
Ack, noted. None of my Linux machines seem to accept these ARP responses. But it's easy to add a goroutine that just spams the network with these every N seconds.
Yeah, end hosts probably ignore the ARP spam, it's purely to help dumb L2 switches discover that the VIP has changed ports. |
Initial code has been merged; see #28 for the follow-up TODO list.
If we're going all multiprotocol with RIP, we should also implement VRRP. Or more specifically, "traffic steering using unsolicited ARP responses", not actually VRRP.
The way VRRP normally works is that the speakers ping each other over the network, and whoever wins takes ownership of the virtual router MAC address. This works well for stateless routers where failover can be transparent, but for endpoints, having a virtual MAC that you need to teach the kernel about is cumbersome and doesn't really help you that much.
What we really want is to just make the LB IPs portable, and we can do that with unsolicited ARP response pinging. The general idea: a new arp-speaker (could be a Deployment or a DaemonSet) runs leader election through kubernetes. The winning leader sends periodic unsolicited ARP responses saying that "service-ip is-at node-mac-addr". It additionally runs an AF_PACKET socket (with an appropriate BPF program attached) to listen for ARP who-has for service IPs, and responds to those as well.
The net result is that the local L2 segment will send service IP traffic to the elected cluster node, which will then LB.
This routing mode only really works with the cluster LB policy.