Support VRRP-ish #12

Closed
danderson opened this Issue Nov 30, 2017 · 19 comments

@danderson
Member

danderson commented Nov 30, 2017

If we're going all multiprotocol with RIP, we should also implement VRRP. Or more specifically, "traffic steering using unsolicited ARP responses", not actually VRRP.

The way VRRP normally works is that the speakers ping each other over the network, and whoever wins takes ownership of the virtual router MAC address. This works well for stateless routers where failover can be transparent, but for endpoints, having a virtual MAC that you need to teach the kernel about is cumbersome and doesn't really help you that much.

What we really want is to just make the LB IPs portable, and we can do that with unsolicited ARP response pinging. The general idea: a new arp-speaker (could be a Deployment or a DaemonSet) runs leader election through Kubernetes. The winning leader sends periodic unsolicited ARP responses saying that "service-ip is-at node-mac-addr". It additionally runs an AF_PACKET socket (with an appropriate BPF program attached) to listen for ARP who-has requests for service IPs, and responds to those as well.

The net result is that the local L2 segment will send service IP traffic to the elected cluster node, which will then LB.

This routing mode only really works with the cluster LB policy.
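
As a rough illustration of the "service-ip is-at node-mac-addr" announcement described above, here is a minimal sketch that builds the raw Ethernet frame for an unsolicited ARP reply, using only the Go standard library. The function name `gratuitousARP`, the example VIP, and the example MAC are assumptions made for this sketch and do not come from MetalLB. Transmitting the frame would still need a raw socket (or an ARP library), which is left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	etherTypeARP  = 0x0806 // Ethernet payload is ARP
	etherTypeIPv4 = 0x0800 // ARP is resolving IPv4 addresses
	hwEthernet    = 1      // ARP hardware type: Ethernet
	opReply       = 2      // ARP operation: reply ("is-at")
)

// gratuitousARP builds a 42-byte Ethernet frame holding an unsolicited ARP
// reply that tells the whole L2 segment "vip is-at mac".
// The name and signature are hypothetical, chosen for this sketch.
func gratuitousARP(vip [4]byte, mac [6]byte) [42]byte {
	broadcast := [6]byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	var f [42]byte

	// Ethernet header: destination, source, ethertype.
	copy(f[0:6], broadcast[:])
	copy(f[6:12], mac[:])
	binary.BigEndian.PutUint16(f[12:14], etherTypeARP)

	// ARP header: hardware type, protocol type, address lengths, operation.
	binary.BigEndian.PutUint16(f[14:16], hwEthernet)
	binary.BigEndian.PutUint16(f[16:18], etherTypeIPv4)
	f[18] = 6 // hardware address length
	f[19] = 4 // protocol address length
	binary.BigEndian.PutUint16(f[20:22], opReply)

	// Sender and target IP are both the VIP, following the usual
	// gratuitous ARP convention.
	copy(f[22:28], mac[:])
	copy(f[28:32], vip[:])
	copy(f[32:38], broadcast[:])
	copy(f[38:42], vip[:])
	return f
}

func main() {
	vip := [4]byte{192, 0, 2, 10}                      // example service IP
	mac := [6]byte{0x02, 0x00, 0x00, 0x00, 0x00, 0x01} // example node MAC
	frame := gratuitousARP(vip, mac)
	fmt.Printf("% x\n", frame[:])
}
```

Fixed-size arrays keep the sketch free of error handling; real code would more likely accept `net.IP` and `net.HardwareAddr` and validate their lengths.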

@miekg

Collaborator

miekg commented Dec 4, 2017

Past trauma with VRRP prevents me from thinking clearly, but a couple of questions:

The general idea: a new arp-speaker (could be a deployment or a daemonset) runs leader election through kubernetes.

Why does this need leader election? A ReplicaSet with 1 replica should do it? Its life cycle is managed by the k8s control plane, so any short downtime shouldn't matter?

The winning leader sends periodic unsolicited ARP responses saying that "service-ip is-at node-mac-addr".

  • How do you get a MAC address of a node?
    kubectl describe pod coredns-4025136029-ctj3k -n kube-system shows the node
    kubectl describe node gke-coredns-cluster-default-pool-d327f287-g3hl describes the pod. Then arp-lookup from the InternalIP?

It additionally runs an AF_PACKET socket (with appropriate BPF program attached) to listen for ARP who-has for service IPs, and responds to those as well.

Why?

@miekg

Collaborator

miekg commented Dec 4, 2017

Canonical arp lib is: https://github.com/mdlayher/arp

@miekg

Collaborator

miekg commented Dec 4, 2017

sigh k8s networking... do we need the MAC address of the node or of the pod?
(I need to read more)

@danderson

Member

danderson commented Dec 4, 2017

re: DaemonSet, you're right, a single-replica deployment is enough. We might have to add leader election later to avoid issues with "phantom" replicas, but that can be a separate bug.

We need the machine's MAC address. From the network's POV, the pod network and all the stuff k8s does with virtual networking doesn't exist. We just need to convince the network to send the VIP traffic to the physical node, and from there kube-proxy's netfilter rules do the rest.

To get the MAC address, we will need a small dance:

  1. Run the deployment with hostNetwork: true, so that the pod has direct access to the machine's network stack. Otherwise, by default it gets sandboxed and low-level protocols like ARP don't get forwarded to the real network.
  2. Using the pod Downward API, tell the deployment the IP of the node it's running on. The BGP speaker already does this, so we can just copy/paste.
  3. In the binary, use net.Interfaces() to list network interfaces, find the one that owns the node IP, and use that interface's HardwareAddr (and also only do ARP stuff on that interface).
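
Step 3 of the list above could be sketched roughly as follows. This assumes the node IP arrives in an environment variable populated through the Downward API (for example from `status.hostIP`); the variable name `NODE_IP` and the function name `interfaceForIP` are made up for this sketch. It only runs meaningfully with `hostNetwork: true`, since otherwise the pod sees its sandboxed interfaces rather than the machine's.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// interfaceForIP returns the network interface that has nodeIP assigned.
// The name is hypothetical; only net.Interfaces and Interface.Addrs are
// real standard-library calls.
func interfaceForIP(nodeIP string) (*net.Interface, error) {
	ip := net.ParseIP(nodeIP)
	if ip == nil {
		return nil, fmt.Errorf("invalid node IP %q", nodeIP)
	}
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, fmt.Errorf("listing interfaces: %w", err)
	}
	for i := range ifaces {
		addrs, err := ifaces[i].Addrs()
		if err != nil {
			return nil, fmt.Errorf("listing addresses of %s: %w", ifaces[i].Name, err)
		}
		for _, a := range addrs {
			// Addrs usually yields *net.IPNet; *net.IPAddr is handled
			// as well in case a platform reports bare addresses.
			var have net.IP
			switch v := a.(type) {
			case *net.IPNet:
				have = v.IP
			case *net.IPAddr:
				have = v.IP
			}
			if have != nil && have.Equal(ip) {
				return &ifaces[i], nil
			}
		}
	}
	return nil, fmt.Errorf("no interface owns %s", ip)
}

func main() {
	// NODE_IP is an assumed variable name, expected to be injected
	// through the Downward API.
	iface, err := interfaceForIP(os.Getenv("NODE_IP"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("announce on %s using MAC %s\n", iface.Name, iface.HardwareAddr)
}
```

Returning the whole `*net.Interface` rather than just the MAC keeps the interface name and index available, which helps with restricting ARP traffic to that one interface.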

Re: BPF program, our ARP traffic has to convince 2 separate consumers: switches in the L2 segment, and end hosts.

For switches, they just need to learn the port to use for the destination MAC. For that we use unsolicited ARP responses.

But unsolicited ARP is not necessarily enough to make end hosts work. They are not required to cache information from unsolicited responses, or they could evict the cache entry in a large network if they're not yet talking to the VIP. So, I think we need to speak normal request/response ARP as well, to ensure that the clients can find the VIP. WDYT?
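
For illustration, the request/response half could look something like the sketch below: given a raw Ethernet frame, decide whether it is an ARP who-has for the VIP and, if so, build the unicast is-at reply. The names `replyTo` and `whoHas` are hypothetical, and receiving and sending the frames (via a raw socket or an ARP library) is left out.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	etherTypeARP  = 0x0806
	etherTypeIPv4 = 0x0800
	hwEthernet    = 1
	opRequest     = 1 // who-has
	opReply       = 2 // is-at
	frameLen      = 42
)

// whoHas builds a broadcast ARP request; it exists only to give the
// example some input.
func whoHas(target, sender [4]byte, senderMAC [6]byte) []byte {
	f := make([]byte, frameLen)
	copy(f[0:6], []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff})
	copy(f[6:12], senderMAC[:])
	binary.BigEndian.PutUint16(f[12:14], etherTypeARP)
	binary.BigEndian.PutUint16(f[14:16], hwEthernet)
	binary.BigEndian.PutUint16(f[16:18], etherTypeIPv4)
	f[18], f[19] = 6, 4
	binary.BigEndian.PutUint16(f[20:22], opRequest)
	copy(f[22:28], senderMAC[:])
	copy(f[28:32], sender[:])
	copy(f[38:42], target[:]) // target hardware address stays zero
	return f
}

// replyTo returns an is-at reply and true if frame is an IPv4-over-Ethernet
// ARP request for vip; otherwise it returns nil and false.
func replyTo(frame []byte, vip [4]byte, mac [6]byte) ([]byte, bool) {
	if len(frame) < frameLen ||
		binary.BigEndian.Uint16(frame[12:14]) != etherTypeARP ||
		binary.BigEndian.Uint16(frame[14:16]) != hwEthernet ||
		binary.BigEndian.Uint16(frame[16:18]) != etherTypeIPv4 ||
		frame[18] != 6 || frame[19] != 4 ||
		binary.BigEndian.Uint16(frame[20:22]) != opRequest ||
		!bytes.Equal(frame[38:42], vip[:]) {
		return nil, false
	}
	r := make([]byte, frameLen)
	copy(r[0:6], frame[22:28])   // Ethernet destination: the requester
	copy(r[6:12], mac[:])        // Ethernet source: this node
	copy(r[12:20], frame[12:20]) // same ethertype, types, and lengths
	binary.BigEndian.PutUint16(r[20:22], opReply)
	copy(r[22:28], mac[:])       // sender hardware address: this node
	copy(r[28:32], vip[:])       // sender IP: the VIP
	copy(r[32:38], frame[22:28]) // target hardware address: the requester
	copy(r[38:42], frame[28:32]) // target IP: the requester
	return r, true
}

func main() {
	vip := [4]byte{192, 0, 2, 10}
	mac := [6]byte{0x02, 0, 0, 0, 0, 0x01}
	req := whoHas(vip, [4]byte{192, 0, 2, 99}, [6]byte{0x02, 0, 0, 0, 0, 0x99})
	if reply, ok := replyTo(req, vip, mac); ok {
		fmt.Printf("% x\n", reply)
	}
}
```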

@danderson

Member

danderson commented Dec 4, 2017

Oh, it looks like mdlayher's ARP package doesn't need to do BPF magic to listen for ARP, it's a supported protocol family. I thought we would have to do the same hacks I did for listening to DHCP efficiently, using x/net/bpf.

@miekg

Collaborator

miekg commented Dec 4, 2017

But unsolicited ARP is not necessarily enough to make end hosts work. They are not required to cache information from unsolicited responses, or they could evict the cache entry in a large network if they're not yet talking to the VIP. So, I think we need to speak normal request/response ARP as well, to ensure that the clients can find the VIP. WDYT?

sgtm; but I (still) don't understand the BPF requirement. If we make an ARP listener that just programs the host's ARP cache, wouldn't that also do it? Or is BPF doing the same thing, only easier?

@danderson

Member

danderson commented Dec 4, 2017

BPF just lets you filter a raw AF_PACKET socket (that receives all traffic on an interface) in-kernel, so it's more efficient.

But mdlayher's ARP package lets you listen for ARP only, so it's not necessary.

@danderson danderson added this to the v0.2.0 milestone Dec 5, 2017

@danderson

Member

danderson commented Dec 5, 2017

Tentatively marking this for v0.2.0. It's possible I will get impatient and push out 0.2.0 while KubeCon and CoreDNS stuff occupy your time. If that happens, we can release 0.3 with this and RIP support as the major features :)

@miekg

Collaborator

miekg commented Dec 5, 2017

OK, I have the (very minimal) https://github.com/miekg/karp, which allows me to play with sending unsolicited ARPs. But (again) I'm wondering how to test: my router doesn't seem to accept unsolicited ARPs.

@danderson

Member

danderson commented Dec 5, 2017

It's possible you'll have to support both solicited and unsolicited responses before things will work well. Check in Wireshark (a) if the unsolicited packet is being transmitted, and (b) if you're seeing arp-who-has requests for your VIP. If you are, you need to implement responses to those as well so that end hosts know where to forward stuff.

@miekg

Collaborator

miekg commented Dec 5, 2017

OK, successfully spoofed an ARP, flipped some arguments, but I'll need to double-check what the right thing to do is.

Actually writing to the ARP cache requires an ioctl on a socket; that should be fun to do in Go :)

@danderson

Member

danderson commented Dec 5, 2017

Um, why do you need to write to the ARP cache? You only need to convince other machines to send you traffic, not the local machine?

@miekg

Collaborator

miekg commented Dec 5, 2017

Ah, true, you're right. Misread the second half of your initial comment.

Well, then. ship it :)

@miekg

Collaborator

miekg commented Dec 9, 2017

Ok, https://github.com/miekg/arp-speaker/ is coming along nicely.

I figured that I don't have to do any unsolicited ARPs, just start responding when you see an ARP request for the new virtual IP whenever a VIP is going to be announced. This will use the MAC address of the node we're running on.

I do need one new config item I think (next to the ability to switch protocol), which is the interface we should use.

@miekg

Collaborator

miekg commented Dec 9, 2017

Next up:

  • integrate with metallb
  • actually compile metallb locally
  • find out if minikube on my laptop can send ARPs on my LAN
@danderson

Member

danderson commented Dec 9, 2017

Unsolicited ARP is still necessary when you're doing a live failover. Say you have an arp-speaker on node A, it has sent some ARP responses, and then it goes down. arp-speaker on node B takes over, but the upstream switches are still forwarding the VIP traffic to the port of node A. The clients all have the IP→MAC mapping in their ARP cache, so they are just transmitting immediately without sending ARP requests... And then the switch forwards those Ethernet frames to node A instead of node B.

I think the unsolicited ARP on failover is required to teach the switches that the VIP has moved to a new egress port.

@miekg

Collaborator

miekg commented Dec 9, 2017

Ack, noted. None of my Linux machines seem to accept these ARP responses. But it is easy to add a goroutine that just spams the network with these every N seconds.
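
A goroutine along those lines might look like the sketch below. The helper name `startAnnouncer` and the `send` callback are assumptions for illustration; in a real speaker, `send` would transmit the unsolicited ARP reply for each announced VIP. The loop sends once immediately, which also covers the failover case, and then once per interval until stopped.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// startAnnouncer runs send immediately and then once per interval in a
// background goroutine. The returned stop function cancels the loop and
// waits for the goroutine to exit. interval must be positive, because
// time.NewTicker panics otherwise.
func startAnnouncer(interval time.Duration, send func() error) (stop func()) {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			if err := send(); err != nil {
				// Keep going: a single failed announcement is not fatal.
				log.Printf("announce failed: %v", err)
			}
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()

	return func() {
		cancel()
		<-done
	}
}

func main() {
	stop := startAnnouncer(100*time.Millisecond, func() error {
		// Placeholder for writing the unsolicited ARP frame to the wire.
		fmt.Println("192.0.2.10 is-at 02:00:00:00:00:01")
		return nil
	})
	time.Sleep(350 * time.Millisecond)
	stop()
}
```

Having `stop` wait for the goroutine means that once it returns, no further announcements go out, which matters when a node gives up ownership of the VIP.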

@danderson

Member

danderson commented Dec 9, 2017

Yeah, end hosts probably ignore the ARP spam, it's purely to help dumb L2 switches discover that the VIP has changed ports.

@miekg

Collaborator

miekg commented Dec 10, 2017

Initial code has been merged; see #28 for the follow-up TODO list.

@miekg miekg closed this Dec 10, 2017
