
support ipvs mode for kube-proxy #692

Closed
wants to merge 1 commit into from

Conversation

@m1093782566
Member

m1093782566 commented Jun 7, 2017

Implement IPVS-based in-cluster service load balancing. It can provide a performance improvement and other benefits to kube-proxy compared with the iptables and userspace modes. Besides, it also supports more sophisticated load balancing algorithms than iptables (least connections, weighted, hash, and so on).
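As a rough illustration of the difference between two of these scheduling algorithms, here is a toy sketch (not kube-proxy or IPVS code; all names are hypothetical):

```python
import itertools

class Backend:
    """A toy stand-in for an IPVS real server."""
    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.active_conns = 0

def make_round_robin(backends):
    # rr: each new connection goes to the next backend in turn
    cycle = itertools.cycle(backends)
    def pick():
        b = next(cycle)
        b.active_conns += 1
        return b
    return pick

def least_connection(backends):
    # wlc-style: pick the backend with the fewest active connections
    # relative to its weight
    b = min(backends, key=lambda x: x.active_conns / x.weight)
    b.active_conns += 1
    return b
```

With IPVS the kernel applies these policies directly; the sketch only shows why a least-connection scheduler can spread load more evenly than iptables' random/round-robin matching.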

related issue: kubernetes/kubernetes#17470 kubernetes/kubernetes#44063

related PR: kubernetes/kubernetes#46580 kubernetes/kubernetes#48994

@thockin @quinton-hoole @wojtek-t

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 7, 2017

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dhilipkumars

Member

dhilipkumars commented Jun 7, 2017

@m1093782566 Thank you for reacting quickly. Could we add some of the statistics we collected during our experiments? For example, a table comparing iptables vs. IPVS with a few thousand services created.

@m1093782566

Member Author

m1093782566 commented Jun 7, 2017

@dhilipkumars I think @haibinxie has the original statistics.

@m1093782566

Member Author

m1093782566 commented Jun 7, 2017

IPVS vs. IPTables Latency to Add Rules

Measured with iptables and ipvsadm; observations:

  • In iptables mode, the latency to add rules increases significantly as the number of services increases

  • In IPVS mode, the latency to add a VIP and backend IPs does not increase as the number of services increases

| number of services | 1   | 5000  | 20000  |
|--------------------|-----|-------|--------|
| number of rules    | 8   | 40000 | 160000 |
| iptables           | 2ms | 11min | 5hours |
| ipvs               | 2ms | 2ms   | 2ms    |

@dhilipkumars I am not sure whether these statistics are sufficient.

@spiffxp

Member

spiffxp commented Jun 7, 2017


### Network policy

For IPVS NAT mode to work, **all packets from the real servers to the client must go through the director**. This means the ipvs proxy should do SNAT in an L3 overlay network (such as flannel) for cross-host network communication. When a container requests a cluster IP to reach an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.

@cmluciano

cmluciano Jun 7, 2017

Member

Breaking a soon-to-be GA feature seems like an edge case that we probably cannot overlook. Even with something in alpha form, I think we should be sure that all existing functionality is supported.

@haibinxie

haibinxie Jun 8, 2017

SNAT is required for cross-host communication, so there should be a compromise in between, probably an awareness note in the release notes. I am also open to any solution/fix we can work towards.

@ddysher

ddysher Jun 9, 2017

Contributor

Apart from being a soon-to-be-GA feature, network policy is important enough that people would probably trade performance for it. A draft idea is to not do SNAT; rather, on each host, add a new routing table and fwmark all service traffic into it. The routing table would take care of routing the packet back. There are a lot of caveats; I don't know if it will work at all.

@thockin

thockin Jun 11, 2017

Member

When I prototyped IPVS (and I just re-did my tests), I didn't see a need to SNAT. I tcpdumped it at both ends.

```
client pod has IP P
service has IP S
backend has IP B

P -> S packet
  leaves pod netns to root netns
  IPVS rewrites destination to B (packet is src:P, dst:B)
  arrives at B's node's root netns
  forwarded to B
response is src:B, dst:P
  arrives at P's node's root netns
  conntrack converts packet to src:S, dst:P
```

This works (on GCP, at least) because the only route to P is back to P's node. I can see how it might not work in some environments, but frankly, if this can't work in most environments, we should maybe not put it in kube-proxy.

Preserving client IP has been a huge concern, and I am not inclined to throw that away. WRT NetworkPolicy, this doesn't break the API, but it does end up breaking almost every implementation (in a really not-obviously-fixable way).
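The flow above can be modeled with a tiny conntrack-style table (a toy sketch with made-up addresses, not real kernel behavior):

```python
# P = client pod, S = service VIP, B = backend. IPVS DNATs P->S into P->B,
# and the conntrack entry rewrites the B->P reply back to S->P, so no SNAT
# is needed as long as the reply is routed back via P's node.
P, S, B = "10.244.0.5", "10.102.0.1", "10.244.2.7"

conntrack = {}  # (client, backend) -> original service address

def ipvs_dnat(src, dst, backend):
    conntrack[(src, backend)] = dst  # remember the original destination
    return (src, backend)            # packet is now src:P, dst:B

def reverse_nat(src, dst):
    # On the reply path, restore the service address as the source
    orig = conntrack.get((dst, src))
    return (orig, dst) if orig else (src, dst)

forward = ipvs_dnat(P, S, B)   # P -> S becomes P -> B
reply = reverse_nat(B, P)      # B -> P becomes S -> P
```

The key assumption, as noted, is routing symmetry: the reply must traverse the node holding the conntrack entry.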

@m1093782566

m1093782566 Jun 13, 2017

Author Member

I am not sure whether the L3 overlay (flannel with the vxlan backend) is the issue. Maybe I need the reverse traffic to go through the IPVS proxy so that it reaches the source pod (after reversing the DNAT). I will re-test it in my environment.

@m1093782566

m1093782566 Jun 13, 2017

Author Member

@murali-reddy do you have any idea?

@murali-reddy

murali-reddy Jun 14, 2017

Agree with @thockin, there is no need for SNAT, at least not for all traffic paths. So when a pod accesses a cluster IP or node port, IPVS does DNAT. On the reverse path, the route to the source pod is pretty much always back through the source pod's node (I can't think of any pod networking solution where that's not true).

Where it gets tricky is when a node port is accessed from outside the cluster. The node on which the destination pod is running may route traffic directly back to the client through its default gateway. To prevent that, we need to SNAT the traffic, so the return traffic goes through the same node through which the client accessed the node port. We do lose the source IP in this case. But AFAIK this is not unique to IPVS; even the iptables kube-proxy has to do this SNAT.

FWIW, I have implemented logic in kube-router to deal with external clients accessing node ports. I just tested Flannel VXLAN + IPVS service proxy + network policy and I don't see any issue, and no reason to do SNAT. Please test it with your POC and see if you can remove this restriction.

@m1093782566

m1093782566 Jun 14, 2017

Author Member

I re-tested in my environment (Flannel VXLAN + IPVS service proxy) and found something different.

There is an ipvs service with two destinations, which are on different hosts.

```
[root@SHA1000130405 home]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.102.128.4:3080 rr
  -> 10.244.0.235:8080            Masq    1      0          0
  -> 10.244.2.123:8080            Masq    1      0          0
```

No SNAT rules were applied, and I curled the VIP from inside each container.

```
# In container 10.244.0.235
$ curl 10.102.128.4:3080
# get response from 10.244.2.123:8080
```

I tcpdumped on the other host to see what the source IP was.

```
$ tcpdump -i flannel.1
20:44:48.021765 IP 10.244.0.235.36844 > 10.244.2.123.webcache: Flags [S], seq 519767460, win 28200, options [mss 1410,sackOK,TS val 416
20:44:48.021998 IP 10.244.2.123.webcache > 10.244.0.235.36844: Flags [S.], seq 1131844123, ack 519767461, win 27960, options [mss 1410,76,nop,wscale 7], length 0
```

The output shows that the source IP was not changed.

So, it seems there is no need to do SNAT for cross-host communication.

However, when IPVS schedules the request back to the originating container itself, no response is returned. It seems the packet is dropped when the traffic reaches the originating container.

I can confirm this has nothing to do with same-host vs. cross-host; the issue is whether the request hits the originating container itself.

When I applied SNAT rules as the iptables proxy does, the container can reach itself via the VIP.

```
iptables -t nat -A PREROUTING -s 10.244.0.235 -j KUBE-MARK-MASQ
```

Unfortunately, the source container then loses its IP: the source IP becomes flannel.1's IP.

Anyway, it's good news to me that SNAT is unnecessary for cross-host communication, so it won't break network policy, though I need more knowledge to fix the hairpin issue mentioned above.

@murali-reddy

murali-reddy Jun 14, 2017

Again, this is nothing specific to IPVS; please search for hairpin-related issues with kube-proxy/kubelet: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip

- container -> host (cross host)


## TODO

@cmluciano

cmluciano Jun 7, 2017

Member

I'm not sure if the TODO is necessary since it should be part of the PR steps.

@m1093782566

m1093782566 Jul 24, 2017

Author Member

Fixed. Thanks.

@m1093782566

Member Author

m1093782566 commented Jun 8, 2017

@spiffxp Yes. @haibinxie wrote the doc; I translated it to a markdown file and added some details.

Contributor

ddysher left a comment

Thanks for putting this together.

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes

@ddysher

ddysher Jun 8, 2017

Contributor

It's odd to see a proposal just for the alpha version. What's the plan for beta and stable?

@thockin

thockin Jun 11, 2017

Member

Yeah, the proposal is not "for alpha". alpha is just a milestone.

@m1093782566

m1093782566 Jun 15, 2017

Author Member

Okay, will fix.


For more details about it, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm)

In order to clean up inactive rules (including iptables and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` as deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.

@ddysher

ddysher Jun 8, 2017

Contributor

Is there absolutely no way? We tainted some cluster nodes with ipvs for external load balancing; if cleanup clears all rules, things will break. Using ipvs to load balance external traffic is not uncommon, I think.

--cleanup-proxyrules will clear all ipvs service in a host.

@m1093782566

m1093782566 Jun 8, 2017

Author Member

@ddysher Do you have any good ideas?

@ddysher

ddysher Jun 9, 2017

Contributor

Not that I know of. It's probably OK to just call this out in the documentation or the flag's help text. Users with existing ipvs services should use it with caution. If one just wants to clean up kubernetes-related ipvs services, just start the proxy and clear any unwanted k8s services.

@fisherxu

fisherxu Jun 9, 2017

Member

Yes, we can add this to the documentation or the flag comment.

@thockin

thockin Jun 11, 2017

Member

Can you only clean rules with RFC 1918 addresses? Or only rules in the service IP range?

@m1093782566

m1093782566 Jun 15, 2017

Author Member

Can you only clean rules with RFC 1918 addresses?

[m1093782566] It still has a possibility of clearing ipvs rules created by other processes, although the possibility is low.

Or only rules in the service IP range?

[m1093782566] Kube-apiserver knows the service cluster IP range through its `--service-cluster-ip-range` parameter; however, kube-proxy knows nothing about it. I don't suggest adding a new `--service-cluster-ip-range` flag to kube-proxy, since it would easily conflict with kube-apiserver's flag. And even if kube-proxy cleared all ipvs rules in the service cluster IP range, it could still leave some ipvs rules behind on external IPs, external LB ingress IPs, and node IPs.

Clearing all ipvs rules with RFC 1918 addresses is probably easier to implement and has a lower possibility of clearing a user's existing ipvs rules.

@fisherxu What do you think about it?
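The RFC 1918 heuristic discussed here could look something like this (a sketch of the idea only; the flag semantics were still under discussion at this point):

```python
import ipaddress

# Only flush IPVS virtual services whose VIP falls in an RFC 1918 private
# range, lowering the chance of deleting a user's own ipvs rules.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def should_cleanup(vip):
    ip = ipaddress.ip_address(vip)
    return any(ip in net for net in RFC1918)
```

For example, a cluster VIP like `10.102.128.4` would be flushed, while a public VIP an operator configured by hand would be left alone; as noted above, this still risks clearing non-kubernetes rules that happen to use private addresses.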


In order to clean up inactive rules (including iptables and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` as deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.

### Change to build

@ddysher

ddysher Jun 8, 2017

Contributor

The doc linked somewhere uses seesaw; I suppose we changed to libnetwork afterwards?

@m1093782566

m1093782566 Jun 8, 2017

Author Member

Yes. We changed to libnetwork to avoid the cgo and libnl dependencies.


### IPVS setup and network topology

IPVS is a replacement for iptables as the load balancer; it is assumed the reader of this proposal is familiar with the iptables load balancer mode. We will create a dummy interface and assign all service cluster IPs to it (perhaps called `kube0`). In the alpha version, we will implicitly use NAT mode.

@ddysher

ddysher Jun 8, 2017

Contributor

iptables is already widely used now; it's better to say "an alternative to" instead of "replacement", IMO.

@m1093782566

m1093782566 Jul 24, 2017

Author Member

Fixed. Thanks for reminding.


IPVS is a replacement for iptables as the load balancer; it is assumed the reader of this proposal is familiar with the iptables load balancer mode. We will create a dummy interface and assign all service cluster IPs to it (perhaps called `kube0`). In the alpha version, we will implicitly use NAT mode.

We will create one or more ipvs services for each kubernetes service. The VIP of an ipvs service corresponds to an accessible IP (such as the cluster IP, an external IP, a node IP, or an ingress IP) of the kubernetes service. Each destination of an ipvs service corresponds to a kubernetes service endpoint.

@ddysher

ddysher Jun 8, 2017

Contributor

Can we be more specific about "some ipvs services"? I suppose it's "one ipvs service for each kubernetes service, port, and protocol combination"?

@m1093782566

m1093782566 Jul 24, 2017

Author Member

Suppose the kubernetes service is of NodePort type. Then the ipvs proxier will create two ipvs services: one for NodeIP:NodePort and the other for ClusterIP:Port.

Of course, I will explain this in the doc. Thanks.

@m1093782566

m1093782566 Jul 28, 2017

Author Member

Update:

Since the ipvs proxier will fall back on iptables when supporting NodePort-type services, I should give another example.

Note that the relationship between a kubernetes service and ipvs services is 1:N. The address of each ipvs service corresponds to one of the service's access IPs, such as the cluster IP, an external IP, or LB.ingress.IP. If a kubernetes service has more than one access IP (for example, an external-IP-type service has two access IPs, the cluster IP and the external IP), then the ipvs proxier will create two ipvs services: one for the cluster IP and the other for the external IP.

ipvsadm -A -t 10.244.1.100:8080 -s rr -p [timeout]
```

When a service specifies session affinity, the ipvs proxy will assign a timeout value (180min by default) to the ipvs service.

@ddysher

ddysher Jun 8, 2017

Contributor

180s?

@m1093782566

m1093782566 Jun 8, 2017

Author Member

The current default value for the iptables proxy is 180min; see https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L205


## Other design considerations

### IPVS setup and network topology

@ddysher

ddysher Jun 8, 2017

Contributor

The design mixes ipvs and iptables rules. Can we have a section dedicated to explaining the interaction between ipvs and iptables, and which is responsible for which requirements?

@thockin

thockin Jun 11, 2017

Member

+1 - this needs to detail every sort of flow and every feature of Services.

@m1093782566

m1093782566 Jul 24, 2017

Author Member

Yes, I created a section "when fall back on iptables" to explain the interaction between ipvs and iptables. Thanks!

@m1093782566

Member Author

m1093782566 commented Jun 8, 2017

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes

@dhilipkumars

dhilipkumars Jun 8, 2017

Member

Please add the author's name

@m1093782566

m1093782566 Jul 24, 2017

Author Member

Fixed. Thanks!


### NodePort type service support

For NodePort-type services, the IPVS proxy will take all accessible IPs on a host as virtual IPs of the ipvs service. Specifically, accessible IPs exclude those of `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are the IPs bound to `eth{i}`.

@ddysher

ddysher Jun 8, 2017

Contributor

Sorry, this comment somehow got lost.

Why do we enforce the interface name? E.g., assuming `ethX` won't work with predictable network interface names on newer systemd.

@thockin

thockin Jun 11, 2017

Member

Correct. This is a challenging design constraint. Do we need to use IPVS for NodePorts or can that fall back on iptables?

@m1093782566

m1093782566 Jul 23, 2017

Author Member

What about taking the addresses of all network interfaces in the UP state (except the `vethxxx` ones) as the node IPs?

@m1093782566

m1093782566 Jul 28, 2017

Author Member

Update:

As discussed in the sig-network meeting, the IPVS proxier will fall back on iptables when supporting NodePort-type services.

Member

thockin left a comment

Thanks for the doc. This needs a LOT more detail, though. I flagged a few big issues, but I don't feel like this is covering the depth we need.

Is this built around exec ipvsadm?

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)

How will you use/expose the different algorithms?

I am sure I have more questions. This is one of the most significant changes in the history of kube-proxy and kubernetes services. Please spend some time helping me not be scared of it. :)



### Sync period

Similar to the iptables proxy, the IPVS proxy will do a full sync loop every 10 seconds by default. Besides, every update to a kubernetes service or endpoint will trigger an ipvs service and destination update.

@thockin

thockin Jun 11, 2017

Member

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

@m1093782566

m1093782566 Jun 15, 2017

Author Member

Changing session affinity will call the UpdateService API, which directly sends an update command to the kernel via socket communication and won't take the service down.

@m1093782566

m1093782566 Jun 16, 2017

Author Member

@dhilipkumars did a test and found that ipvs updates did not disrupt the service or even existing connections.



## Test validation

@thockin

thockin Jun 11, 2017

Member

I tried to enumerate everything you have to test:

pod -> pod, same VM
pod -> pod, other VM
pod -> own VM, own hostPort
pod -> own VM, other hostPort
pod -> other VM, other hostPort

pod -> own VM
pod -> other VM
pod -> internet
pod -> http://metadata

VM -> pod, same VM
VM -> pod, other VM
VM -> same VM hostPort
VM -> other VM hostPort

pod -> own clusterIP, hairpin
pod -> own clusterIP, same VM, other pod, no port remap
pod -> own clusterIP, same VM, other pod, port remap
pod -> own clusterIP, other VM, other pod, no port remap
pod -> own clusterIP, other VM, other pod, port remap
pod -> other clusterIP, same VM, no port remap
pod -> other clusterIP, same VM, port remap
pod -> other clusterIP, other VM, no port remap
pod -> other clusterIP, other VM, port remap
pod -> own node, own nodePort, hairpin
pod -> own node, own nodePort, policy=local
pod -> own node, own nodePort, same VM
pod -> own node, own nodePort, other VM
pod -> own node, other nodePort, policy=local
pod -> own node, other nodePort, same VM
pod -> own node, other nodePort, other VM
pod -> other node, own nodeport, policy=local
pod -> other node, own nodeport, same VM
pod -> other node, own nodeport, other VM
pod -> other node, other nodeport, policy=local
pod -> other node, other nodeport, same VM
pod -> other node, other nodeport, other VM
pod -> own external LB, no remap, policy=local
pod -> own external LB, no remap, same VM
pod -> own external LB, no remap, other VM
pod -> own external LB, remap, policy=local
pod -> own external LB, remap, same VM
pod -> own external LB, remap, other VM

VM -> same VM nodePort, policy=local
VM -> same VM nodePort, same VM
VM -> same VM nodePort, other VM
VM -> other VM nodePort, policy=local
VM -> other VM nodePort, same VM
VM -> other VM nodePort, other VM

VM -> external LB

public -> nodeport, policy=local
public -> nodeport, policy=global
public -> external LB, no remap, policy=local
public -> external LB, no remap, policy=global
public -> external LB, remap, policy=local
public -> external LB, remap, policy=global

public -> nodeport, manual backend
public -> external LB, manual backend

@m1093782566

m1093782566 Jul 24, 2017

Author Member

super!

@haibinxie


haibinxie commented Jun 12, 2017

@thockin
Let me know if this helps.

Is this built around exec ipvsadm?
[Haibin Michael Xie] This is built on top of libnetwork, which talks to the kernel via socket communication, not on top of ipvsadm.

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?
[Haibin Michael Xie] Do you mean OS updates? I don't know the full picture of how iptables handles this; IMO there is no difference from iptables here.

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)
[Haibin Michael Xie] There is a periodic full resync plus an in-memory-cache-based diff. This is already handled for iptables, and there is no difference in this regard.
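The "periodic full resync + in-memory diff" idea can be sketched as follows (illustrative pseudocode in Python, not the actual implementation; the `ops` list stands in for netlink calls made via libnetwork):

```python
ops = []  # stands in for netlink messages sent to the kernel

def sync_proxy_rules(desired, current):
    """desired/current: map (vip, port, proto) -> set of endpoints."""
    for key, endpoints in desired.items():
        if key not in current:
            ops.append(("add-service", key))
        for ep in endpoints - current.get(key, set()):
            ops.append(("add-destination", key, ep))
        for ep in current.get(key, set()) - endpoints:
            ops.append(("del-destination", key, ep))
    # Stale services (e.g. deleted while the proxy was down) are removed
    # because the full resync compares against everything in the kernel.
    for key in set(current) - set(desired):
        ops.append(("del-service", key))
```

This also answers the crash-recovery question above: service A, deleted while the proxy was down, shows up in `current` but not in `desired` on the next full resync and is cleaned up.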

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)
[Haibin Michael Xie] Same as above: libnetwork uses a socket to talk to the kernel, which is very efficient.

How will you use/expose the different algorithms?
[Haibin Michael Xie] If you mean the LB algorithm, it is already mentioned in the proposal: kube-proxy gets a new parameter, --ipvs-scheduler.

@m1093782566

Member Author

m1093782566 commented Jun 14, 2017

@thockin

I re-tested and found something different: SNAT is not required for cross-host communication, so it won't break network policy. That's a really big finding for me :)

I found that the packet is dropped when a container visits the VIP and the real backend is itself. I have no idea why yet, but I will try to find out.

@m1093782566

Member Author

m1093782566 commented Jun 15, 2017

I will update the proposal and try to fix the review comments in newer proposal.

@thockin I will add pseudo-code explaining how the resync loop works. Thanks.

@m1093782566 m1093782566 force-pushed the m1093782566:ipvs-proxy branch 2 times, most recently from c0a5b3e to ebb958f Jun 16, 2017
@m1093782566

Member Author

m1093782566 commented Jun 16, 2017

@thockin @ddysher @cmluciano @murali-reddy I updated the proposal according to the review comments and added more details. PTAL.

Any comments are welcomed. Thanks.

/cc @haibinxie @ThomasZhou

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 16, 2017

@m1093782566: GitHub didn't allow me to request PR reviews from the following users: haibinxie.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

@thockin @ddysher @cmluciano I update the proposals according review comments and add more details. PTAL.

Any comments are welcomed. Thanks.

/cc @haibinxie

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@m1093782566 m1093782566 force-pushed the m1093782566:ipvs-proxy branch 2 times, most recently from 934152c to a552531 Jun 16, 2017
@m1093782566

Member Author

m1093782566 commented Jun 16, 2017

@thockin

How do you prevent dropped packets during updates?

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

According to @dhilipkumars's test results, ipvs updates did not disrupt the service or even existing connections:

```
sudo ipvsadm -L -n
[sudo] password for d:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.11.12.13:6379 wlc persistent 10
  -> 10.192.0.1:32768             Masq    1      1          0
```

The real backend service is redis; an alpine container connects to the service:

```
docker run --net=host -it redis:3.0-alpine redis-cli -h 10.11.12.13 -p 6379
10.11.12.13:6379> info Clients
# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
```

In a parallel session, if I update the scheduler algorithm or the persistence timeout, the service is not disrupted. ipvsadm and libnetwork's ipvs package work on the same principle, firing netlink messages to the kernel, so the behaviour should be the same.

Updating the persistence timeout still causes no disruption:

```
$ sudo ipvsadm -E -t 10.11.12.13:6379 -p 60
$ sudo ipvsadm -L -n --persistent-conn
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port            Weight    PersistConn ActiveConn InActConn
  -> RemoteAddress:Port
TCP  10.11.12.13:6379 wlc persistent 60
  -> 10.192.0.1:32768             1         1           1          0
```

What other parameters should we test?

@m1093782566

Member Author

m1093782566 commented Jul 28, 2017

As discussed in the sig-network meeting, this design proposal is updated to:

  • fall back on iptables when supporting NodePort-type services

  • clear all ipvs rules on a kubernetes node when the user calls kube-proxy to flush proxy rules

m1093782566
@m1093782566 m1093782566 force-pushed the m1093782566:ipvs-proxy branch from dd36260 to a64684e Jul 28, 2017
@haibinxie


haibinxie commented Jul 31, 2017

@thockin Could you confirm whether @m1093782566's comment above is the right thing to do? Is anything else left before closing it? Please expect me to keep bothering you until it's closed :)

If you get a chance, we could have a quick phone call to review and address the issues on this; we want to commit to releasing the feature in 1.8.

@m1093782566


m1093782566 commented Aug 12, 2017

@danwinship Do you have interest in reviewing this design proposal? Any comments are welcome. :)

@m1093782566


m1093782566 commented Aug 14, 2017

Hi @thockin @ddysher

I just came up with an idea about implementing NodePort type services via ipvs.

Can we take all the IP addresses whose ADDRTYPE matches dst-type LOCAL as the addresses of the ipvs service? For example,

[root@100-106-179-225 ~]# ip route show table local type local
100.106.179.225 dev eth0  proto kernel  scope host  src 100.106.179.225 
127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1 
127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1 
172.16.0.0 dev flannel.1  proto kernel  scope host  src 172.16.0.0 
172.17.0.1 dev docker0  proto kernel  scope host  src 172.17.0.1 
192.168.122.1 dev virbr0  proto kernel  scope host  src 192.168.122.1 

Then, [100.106.179.225, 127.0.0.0/8, 127.0.0.1, 172.16.0.0, 172.17.0.1, 192.168.122.1] would be the addresses of the ipvs service for a NodePort service.
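The enumeration step described here can be sketched as a small parser over the `ip route show table local type local` output (an illustrative helper only; `local_addresses` is a hypothetical name, and kube-proxy would presumably read this via netlink rather than shelling out):

```python
# Sketch: derive candidate IPVS service addresses for a NodePort
# service from `ip route show table local type local` output.
# Illustrative helper, not kube-proxy's implementation.

def local_addresses(route_output):
    """Return the destination address/prefix of each local route line."""
    return [line.split()[0]
            for line in route_output.splitlines() if line.strip()]

sample = """\
100.106.179.225 dev eth0  proto kernel  scope host  src 100.106.179.225
127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1
127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1
172.16.0.0 dev flannel.1  proto kernel  scope host  src 172.16.0.0
172.17.0.1 dev docker0  proto kernel  scope host  src 172.17.0.1
192.168.122.1 dev virbr0  proto kernel  scope host  src 192.168.122.1
"""

print(local_addresses(sample))
# → ['100.106.179.225', '127.0.0.0/8', '127.0.0.1',
#    '172.16.0.0', '172.17.0.1', '192.168.122.1']
```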

I assume the KUBE-NODEPORTS chain created by the iptables proxier does the same thing? For example,

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

I am not opposed to implementing NodePort services via iptables; I just want to know if the approach mentioned above makes sense. Or am I wrong?

Looking forward to receiving your opinions.

@m1093782566


m1093782566 commented Aug 14, 2017

By doing this, I think we can remove the design constraint of assuming the node IP is the address of eth{x}?

@m1093782566


m1093782566 commented Aug 14, 2017

@feiskyer Do you have bandwidth to take a look at this proposal? Thanks :)

@feiskyer


feiskyer commented Aug 14, 2017

Using a list of IP addresses for NodePort has potential problems, e.g. IP addresses may change or new NICs may be added later. And I don't think watching for changes to IP addresses and NICs is a good idea.

Maybe using iptables for nodePort services is a better choice.

@m1093782566


m1093782566 commented Aug 15, 2017

Glad to receive your feedback, @feiskyer

And I don't think watching the changes of ip addresses and nics is a good idea.

Yes, I agree. Thanks for your thoughts.

@luxas


luxas commented Oct 8, 2017

ping @kubernetes/sig-network-feature-requests
Any movement here lately?

@haibinxie


haibinxie commented Oct 9, 2017

@luxas Alpha shipped in 1.8; we are working on the beta release in 1.9.


@cmluciano


cmluciano commented Nov 10, 2017

/keep-open

@spiffxp


spiffxp commented Dec 14, 2017

/lifecycle frozen
@cmluciano I'm keeping this open on your behalf, if this is no longer relevant to keep open please /remove-lifecycle frozen

@cmluciano


cmluciano commented Jan 3, 2018

@m1093782566 Is there a PR that supersedes this one?

@m1093782566


m1093782566 commented Jan 4, 2018

@cmluciano

NO.

This PR is the only design proposal. The IPVS proxier already reached beta while this document is still pending, unfortunately.

@m1093782566


m1093782566 commented Jan 8, 2018

/close
