Rewrite concepts/layer2.md to account for #195 and #257.
danderson committed Jul 21, 2018
1 parent f5268e5 commit 4d621fc
Showing 2 changed files with 92 additions and 83 deletions.
13 changes: 6 additions & 7 deletions website/content/concepts/_index.md
@@ -49,13 +49,12 @@ this: ARP, NDP, or BGP.

### Layer 2 mode (ARP/NDP)

In layer 2 mode, one machine in the cluster takes ownership of the service, and
uses standard address discovery protocols
([ARP](https://en.wikipedia.org/wiki/Address_Resolution_Protocol) for IPv4,
[NDP](https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol) for IPv6) to
make the service IP reachable on the local network. From the LAN's point of
view, the announcing machine simply has multiple IP addresses.

The [layer 2 mode]({{% relref "layer2.md" %}}) sub-page has more
details on the behavior and limitations of layer 2 mode.
162 changes: 86 additions & 76 deletions website/content/concepts/layer2.md
@@ -3,89 +3,99 @@ title: MetalLB in layer 2 mode
weight: 1
---

In layer 2 mode, one node assumes the responsibility of advertising a service to
the local network. From the network's perspective, it simply looks like that
machine has multiple IP addresses assigned to its network interface.

Under the hood, MetalLB responds to
[ARP](https://en.wikipedia.org/wiki/Address_Resolution_Protocol) requests for
IPv4 services, and
[NDP](https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol) requests for
IPv6.
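
Layer 2 mode is selected per address pool in MetalLB's configuration. As a
point of reference, here is a minimal configuration sketch in the ConfigMap
format MetalLB reads; the pool name and address range are placeholders for
your own network:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: my-ip-space        # placeholder pool name
      protocol: layer2         # announce this pool's IPs via ARP/NDP
      addresses:
      - 192.168.1.240/28       # example range; must be free on your LAN
```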

The major advantage of the layer 2 mode is its universality: it will work on any
Ethernet network, with no special hardware required, not even fancy routers.

## Load-balancing behavior

In layer 2 mode, all traffic for a service IP goes to one node. From there,
`kube-proxy` spreads the traffic to all the service's pods.
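
The service itself typically needs no MetalLB-specific configuration: any
`LoadBalancer` service is eligible. As a sketch, a hypothetical service like
the following would get an IP from the pool, have one node announce it, and
have `kube-proxy` balance the incoming traffic across the matching pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx              # hypothetical example service
spec:
  type: LoadBalancer       # asks MetalLB for an external IP
  selector:
    app: nginx             # kube-proxy spreads traffic across these pods
  ports:
  - port: 80
    targetPort: 80
```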

In that sense, layer 2 does not implement a load-balancer. Rather, it implements
a failover mechanism so that a different node can take over should the current
leader node fail for some reason.

If the leader node fails, failover is automatic: the old leader's lease times
out after 10 seconds, at which point another node becomes the leader and takes
over ownership of the service IP.

## Limitations

Layer 2 mode has two main limitations you should be aware of: single-node
bottlenecking, and potentially slow failover.

As explained above, in layer 2 mode a single leader-elected node receives all
traffic for a service IP. This means that your service's ingress bandwidth is
limited to the bandwidth of a single node. This is a fundamental limitation of
using ARP and NDP to steer traffic.

In the current implementation, failover between nodes depends on cooperation
from the clients. When a failover occurs, MetalLB sends a number of gratuitous
layer 2 packets (a bit of a misnomer - it should really be called "unsolicited
layer 2 packets") to notify clients that the MAC address associated with the
service IP has changed.

Most operating systems handle "gratuitous" packets correctly, and update their
neighbor caches promptly. In that case, failover happens within a few
seconds. However, some systems either don't implement gratuitous handling at
all, or have buggy implementations that delay the cache update.

All modern versions of major OSes (Windows, Mac, Linux) implement layer 2
failover correctly, so the only situation where issues may happen is with older
or less common OSes.

To minimize the impact of planned failover on buggy clients, you should keep the
old leader node up for a couple of minutes after flipping leadership, so that it
can continue forwarding traffic for old clients until their caches refresh.

During an unplanned failover, the service IP will be unreachable until the
buggy clients refresh their cache entries.

If you encounter a situation where layer 2 mode failover is slow (more than
about 10s), please [file a bug](https://github.com/google/metallb/issues/new)!
We can help you investigate and determine if the issue is with the client, or a
bug in MetalLB.

## Comparison to Keepalived

MetalLB's layer 2 mode has a lot of similarities to Keepalived, so if you know
Keepalived, much of this will sound familiar. However, there are also a few
differences worth mentioning. If you aren't familiar with Keepalived, you can
skip this section.

Keepalived uses the Virtual Router Redundancy Protocol (VRRP). Instances of
Keepalived continuously exchange VRRP messages with each other, both to select a
leader and to notice when that leader goes away.

MetalLB, on the other hand, relies on Kubernetes to know when pods and nodes go
up and down. It doesn't need to speak a separate protocol to elect a leader;
instead, it lets Kubernetes do most of the work of deciding which pods are
healthy and which nodes are ready.

Keepalived and MetalLB "look" the same from the client's perspective: the
service IP address seems to migrate from one machine to another when failovers
happen, and the rest of the time it just looks like machines have more than one
IP address.

Because it doesn't use VRRP, MetalLB isn't subject to some of the limitations of
that protocol. For example, the VRRP limit of 255 load-balancers per network
doesn't exist in MetalLB: you can have as many load-balanced IPs as you want, as
long as there are free IPs in your network. MetalLB also requires less
configuration than VRRP; for example, there are no Virtual Router IDs to
configure.

On the flip side, because MetalLB relies on Kubernetes for information instead
of a standard network protocol, it cannot interoperate with third-party
VRRP-aware routers and infrastructure. This is working as intended: MetalLB is
specifically designed to provide load balancing and failover _within_ a
Kubernetes cluster, and in that scenario interoperability with third-party LB
software is out of scope.
