add ipvs proxy mode design proposal
杜军 authored and committed Jun 7, 2017
1 parent 3853b8d commit 26252e0
contributors/design-proposals/ipvs-proxy.md (+152 lines)

# Alpha Version IPVS Load Balancing Mode in Kubernetes

## Summary

There has been a lot of discussion about enabling IPVS as the in-cluster service load balancing mode. IPVS performs better than iptables, and it supports more sophisticated load balancing algorithms (least load, least connections, locality, weighted) as well as other useful features (e.g. health checking, retries, etc.).

This page summarizes what is expected in the alpha version of IPVS load balancing support, including Kubernetes user-visible behavior changes, build and deployment changes, and the planned test validation.

## Kubernetes behavior change

### Changes to kube-proxy startup parameter

In addition to the existing userspace and iptables modes, ipvs mode is configured via `--proxy-mode=ipvs`. In the alpha version, it implicitly uses IPVS NAT (masq) mode.

A new kube-proxy parameter `--ipvs-scheduler` will be added to specify the IPVS load balancing algorithm. The supported values are listed below; if the parameter is not configured, `rr` is the default, and if it is configured with an unsupported value, kube-proxy will exit with an error message. An example invocation is shown after the list.

- rr: round-robin
- lc: least connection
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue

For more details on these schedulers, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm).
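
A minimal example of starting kube-proxy in ipvs mode with an explicitly chosen scheduler; other required flags are omitted and the values are for illustration only:

```shell
kube-proxy --proxy-mode=ipvs --ipvs-scheduler=lc
```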

In order to clean up inactive rules (including both iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` as deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.
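
A rough sketch of the cleanup behavior described above; the ipvsadm line shows the net effect on the node rather than the actual implementation:

```shell
# Remove proxy rules (both iptables and ipvs) and exit.
kube-proxy --cleanup-proxyrules
# The effect on ipvs is equivalent to clearing the whole table:
ipvsadm --clear
```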

### Change to build

No changes. The IPVS implementation relies on the [docker/libnetwork](https://godoc.org/github.com/docker/libnetwork/ipvs) ipvs library, which is a pure-Go implementation.

### Change to deployment

Installing the IPVS kernel modules is outside the scope of Kubernetes; it is assumed that IPVS is installed before running Kubernetes. When kube-proxy starts with the IPVS proxy mode, it will validate whether IPVS is installed on the node; if it is not installed, kube-proxy will exit with an error message.

Some of the existing deployment scripts might need updating to install ipvsadm, which is beyond the scope of the alpha version. For example:

```shell
apt-get install ipvsadm
```
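
A minimal sketch of how a node might load and verify the IPVS kernel modules before starting kube-proxy; the exact module set required is an assumption:

```shell
# Load the core IPVS module plus the schedulers we expect to use.
modprobe -a ip_vs ip_vs_rr ip_vs_lc ip_vs_sh
# Verify the modules are present.
lsmod | grep '^ip_vs'
```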

## Other design considerations

### IPVS setup and network topology

IPVS is a replacement for iptables as the load balancer; it is assumed that the reader of this proposal is familiar with the iptables load balancer mode. We will create a dummy interface (perhaps called `kube0`) and assign all service cluster IPs to it. In the alpha version, we will implicitly use NAT mode.
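
A minimal sketch of the interface setup, assuming the dummy interface is named `kube0`:

```shell
# Create the dummy interface that will hold service cluster IPs.
ip link add kube0 type dummy
ip link set kube0 up
```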

We will create one or more ipvs services for each kubernetes service. The VIP of an ipvs service corresponds to an accessible IP (such as the cluster IP, external IP, node IP, ingress IP, etc.) of the kubernetes service. Each destination of an ipvs service corresponds to a kubernetes service endpoint.
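
For illustration, the equivalent ipvsadm commands for a kubernetes service with cluster IP 10.96.0.10 on port 80 and two endpoints (all addresses are examples):

```shell
# Virtual service keyed on the cluster IP, using the default rr scheduler.
ipvsadm -A -t 10.96.0.10:80 -s rr
# One destination per service endpoint, in NAT (masquerading) mode.
ipvsadm -a -t 10.96.0.10:80 -r 10.244.1.5:80 -m
ipvsadm -a -t 10.96.0.10:80 -r 10.244.2.7:80 -m
```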

### Port remapping

There are three proxy modes in ipvs: NAT, IPIP, and DR. Only NAT mode supports port remapping, which is needed when a service port differs from its endpoints' target port.
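
A small illustration of port remapping under NAT mode, where the service port (80) differs from the endpoint target port (8080); addresses are examples:

```shell
# The service listens on port 80; the backend pod listens on 8080.
ipvsadm -A -t 10.96.0.20:80 -s rr
ipvsadm -a -t 10.96.0.20:80 -r 10.244.1.6:8080 -m
```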

### NodePort type service support

For a NodePort type service, the IPVS proxy will take every accessible IP on a host as a virtual IP of an ipvs service. Specifically, accessible IPs exclude those bound to `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are the IPs bound to `eth{i}`.

When people visit `NodeIP:NodePort`, the traffic hits the VIP of the ipvs service. The IPVS proxy does not need to assign the node IP to the dummy interface.
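
A sketch for a NodePort service, assuming the node's `eth0` address is 192.168.1.10 and the node port is 30080 (both example values):

```shell
# Virtual service keyed on the node IP and node port; nothing is added to
# the dummy interface because 192.168.1.10 already lives on eth0.
ipvsadm -A -t 192.168.1.10:30080 -s rr
ipvsadm -a -t 192.168.1.10:30080 -r 10.244.1.5:8080 -m
```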

### LoadBalancer type service support

For a LoadBalancer type service, the IPVS proxy will take the load balancer's ingress IP as the virtual IP of the ipvs service.

Since there is a need to support access control for the ingress IP, the IPVS proxy will make use of iptables to do access control, with a rule like:

```shell
-I KUBE-SERVICES -d {ingress.IP} -p {protocol} --dport {service.Port} -s {LoadBalancerSource} -j ACCEPT
```

When a packet reaches the end of the firewall chain and has not been DNATed, the IPVS proxy will drop it with an iptables rule like:

```shell
-A KUBE-SERVICES -d {ingress.IP} -p {protocol} --dport {service.Port} -j KubeMarkDropChain
```

### Support service cluster IP

The IPVS proxy will assign the service cluster IP to the dummy interface on each cluster node. Specifically, it will create an alias on the dummy interface for each cluster IP. Deleting a service will trigger the deletion of its alias.
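
Roughly, the alias lifecycle for a single service, assuming the dummy interface `kube0` and an example cluster IP of 10.96.0.10:

```shell
# Service created: add its cluster IP as an alias on the dummy interface.
ip addr add 10.96.0.10/32 dev kube0
# Service deleted: remove the alias again.
ip addr del 10.96.0.10/32 dev kube0
```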

### Support service external IP

The IPVS proxy will not create an alias on the dummy interface for the external IPs of a service. It only creates an ipvs service whose VIP corresponds to the external IP.

### Session affinity

ipvs supports client IP session affinity, which it calls a persistent connection. It corresponds to the `-p` flag in ipvsadm, like:

```shell
ipvsadm -A -t 10.244.1.100:8080 -s rr -p [timeout]
```

When a service specifies session affinity, the ipvs proxy will assign a timeout value (180 min by default) to the ipvs service.
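
For instance, the 180 min default corresponds to a persistence timeout of 10800 seconds (addresses are examples):

```shell
ipvsadm -A -t 10.244.1.100:8080 -s rr -p 10800
```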

### Sync period

Similar to the iptables proxy, the IPVS proxy will run a full sync loop every 10 seconds by default. In addition, every update to a kubernetes service or endpoint will trigger an ipvs service and destination update.

### Support Only NodeLocal Endpoints

Similar to the iptables proxy, the IPVS proxy supports the Only NodeLocal Endpoints feature. When a service has the annotation, the ipvs proxy will proxy traffic only to endpoints local to the node.
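
For reference, this feature is currently driven by a service annotation; the exact key below is the beta annotation in use at the time of writing and is listed here as an assumption:

```shell
kubectl annotate service my-service service.beta.kubernetes.io/external-traffic=OnlyLocal
```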

### Network policy

For IPVS NAT mode to work, **all packets from the real servers to the client must go through the director**. This means the ipvs proxy should do SNAT in an L3 overlay network (such as flannel) for cross-host communication. When a container uses a cluster IP to reach an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.
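
As a rough illustration of why this breaks network policy: masquerading rewrites the pod source IP to the node IP before the packet reaches the backend, so policies that match on pod source addresses no longer see the original client. The pod CIDR below is an example:

```shell
# With --masquerade-all, traffic through cluster IPs is SNATed, roughly like:
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 -j MASQUERADE
```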

## Test validation

**Functionality tests: all traffic below should be reachable**

* Traffic accessing a service IP (including cluster IP, external IP, node IP, ingress IP)

- container -> serviceIP -> container (same host)

- container -> serviceIP -> container (cross host)

- container -> serviceIP -> container (same container)

- host -> serviceIP -> container (same host)

- host -> serviceIP -> container (cross host)


* Traffic between container and host (not via service IP)

- container -> container (same host)

- container -> container (cross host)

- container -> container (same container)

- host -> container (same host)

- host -> container (cross host)

- container -> host (same host)

- container -> host (cross host)


## TODO

Currently, we have only tested the flannel L3 overlay. We will test more network plugins other than flannel.
