LoadBalancer's healthcheck packet will be dropped in ipvs mode #79783

Open
imroc opened this issue Jul 4, 2019 · 16 comments · May be fixed by #79976 or #85956

@imroc (Contributor) commented Jul 4, 2019

What happened:

Hi, I'm from the TKE team. In TencentCloud TKE, the source IP of the LB's health check packets is the LB's own IP (the Service's external IP), and the LB does not do SNAT. With kube-proxy in IPVS mode on 1.12, the LB's health check packets to the NodePort are dropped by the node, so the LB considers the node unhealthy. I tested 1.10 and found no such problem. Because the latest Kubernetes version supported in TKE is 1.12, I haven't tested anything newer than 1.12, but I didn't find any related issues, so I think the latest version has this problem as well.

Comparing 1.10 and 1.12, I found that 1.12 adds the Service's external IP to kube-ipvs0:

$ ip a show kube-ipvs0 | grep -A2 170.106.134.124
    inet 170.106.134.124/32 brd 170.106.134.124 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
  • 170.106.134.124 is the Service's external IP (the LB's IP)
  • the LB's IP has been added to the kube-ipvs0 device

The kernel then automatically adds a local route to the local table:

$ ip route show table local | grep 170.106.134.124
local 170.106.134.124 dev kube-ipvs0  proto kernel  scope host  src 170.106.134.124
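
(This is generic kernel behaviour rather than anything kube-proxy does explicitly; a minimal sketch to reproduce it on any Linux host, using a throwaway interface name and test IP of my own choosing:)

$ ip link add test-dummy0 type dummy          # scratch dummy interface
$ ip link set test-dummy0 up
$ ip addr add 192.0.2.10/32 dev test-dummy0   # bind a test VIP to it
$ ip route show table local | grep 192.0.2.10 # the kernel has added the local route by itself
local 192.0.2.10 dev test-dummy0  proto kernel  scope host  src 192.0.2.10
$ ip link del test-dummy0                     # clean up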

In 1.10, the LB's IP is not on kube-ipvs0. It is because of this difference that, in 1.12, a packet whose source IP is the LB's IP cannot reach cbr0. I think it is related to an optimization: packets sent to the Service's external IP are routed directly to the pod via IPVS, without actually passing through the LB. Checking the IPVS rules in 1.12, we can see:

$ ipvsadm -ln | grep -A1 170.106.134.124
TCP  170.106.134.124:80 rr
  -> 172.16.0.70:80               Masq    1      0          0

Checking the reference page, we can see that kube-proxy will "iterate LB's ingress IPs, create an ipvs service whose address corresponds to the LB's ingress IP".

You may ask how I can prove that the LB's health check packets are dropped rather than redirected to the pod. I created a Service with type=LoadBalancer and externalTrafficPolicy=Local whose endpoint has only one pod, then captured packets on cbr0 on the node that hosts the Service's pod: no packets from the LB were captured. Since the Service's pod is only on that node, a normally forwarded packet would have to reach cbr0.
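
(The capture was along these lines; the interface name and service port are the ones from this cluster, and the exact tcpdump invocation is just an illustration:)

$ tcpdump -nn -i cbr0 host 170.106.134.124
# expected, if forwarding worked: TCP probes from 170.106.134.124 to the pod IP on port 80
# observed: no packets from the LB at all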

If we delete the LB IP's route from the local table, the packets are forwarded normally again:

$ ip route del table local local 170.106.134.124 dev kube-ipvs0  proto kernel  scope host  src 170.106.134.124

Or run this command in an infinite loop to keep deleting the IP from kube-ipvs0 (kube-proxy keeps re-adding it):

$ ip addr del 170.106.134.124/32 dev kube-ipvs0
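
(Spelled out, since kube-proxy re-adds the address on each sync, the workaround has to run in a loop; a rough sketch:)

$ while true; do ip addr del 170.106.134.124/32 dev kube-ipvs0 2>/dev/null; sleep 1; done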

Is it possible that because the incoming packet's source IP is found in the kernel's local table, it is considered a local IP and the kernel drops the packet?

I guess other cloud vendors haven't encountered this problem because the source IP of their LB's health check packets is an internal VIP rather than the LB's own IP.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4-tke.4", GitCommit:"8433e31fd12cae66e59d634f4e7ab6cc13c93a4a", GitTreeState:"clean", BuildDate:"2019-06-25T05:19:07Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4-tke.3", GitCommit:"ab6e1c10a35382c2ec70036e0e51c201eb3fc3f8", GitTreeState:"clean", BuildDate:"2019-06-18T12:12:29Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial
  • Kernel (e.g. uname -a):
Linux VM-1-3-ubuntu 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
@imroc imroc added the kind/bug label Jul 4, 2019
@imroc (Contributor, Author) commented Jul 4, 2019

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network and removed needs-sig labels Jul 4, 2019

@imroc (Contributor, Author) commented Jul 4, 2019

It’s a bit long, but I hope you can read it patiently.

@imroc (Contributor, Author) commented Jul 7, 2019

After a deep dive into this problem, I found the root cause.

Let me tell the long story. It's a bit detailed, but it helps us understand the problem.

A tiny IPVS design flaw

There are the following two contradictory sentences in the IPVS proposal:

  1. IPVS proxier will NOT bind LB's ingress IP to the dummy interface
  2. Iterate LB's ingress IPs, create an ipvs service whose address corresponding LB's ingress IP

It means the LB IP will not be bound to kube-ipvs0, yet IPVS rules will still be configured for the LB IP, which is contradictory: if we do not bind the LB IP to the dummy interface (kube-ipvs0), the kernel will not automatically create a local route for it, so a packet sent to the LB from a container in the cluster will be routed to the real LB instead of being forwarded directly to a pod by IPVS, and the IPVS rule for the LB IP is useless. (IPVS works on the INPUT chain in netfilter; it requires every VIP to be configured on a network interface so that the kernel sends packets destined for that VIP from the PREROUTING chain to the INPUT chain. Otherwise, if the IP is not in the local routing table, they go to the FORWARD chain.)
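
(A quick way to check which path a given destination will take is ip route get; the outputs below are illustrative, and 198.51.100.1 is just a stand-in for any IP with no local route:)

$ ip route get 170.106.134.124
local 170.106.134.124 dev lo src 170.106.134.124
    # local route exists -> delivered locally, PREROUTING -> INPUT, IPVS can intercept it
$ ip route get 198.51.100.1
198.51.100.1 via 172.16.0.1 dev eth0 src 172.16.0.5
    # no local route -> takes the FORWARD path, IPVS on INPUT never sees it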

I chatted with the author of the IPVS proposal, @m1093782566, who confirmed that there is indeed such a problem in the design. The implementation also had this problem at first, and PR #63066 fixed it.

How was this problem discovered?

It was because of issue #59976: some bare-metal clusters use MetalLB to create Services of type LoadBalancer, but in some real network environments the LB IP is unreachable from pods for various reasons. Packets from a pod to the LB did not go through IPVS, because the LB IP was not configured on kube-ipvs0 and so there was no local route for it; as a result, packets from the pod to the LB entered the FORWARD chain instead of the INPUT chain. PR #63066 solved this by binding the LB IP to kube-ipvs0.

The fix raises another problem

That fix did the right thing, but the change also raises another problem, which is this issue and is not usually noticed: our LB performs a health check on the NodePort, and the source IP of the probe packet is the LB's own IP. After PR #63066, the LB IP is configured on kube-ipvs0 and the kernel automatically creates a local route for it. When a probe packet from the LB arrives at the NodePort, the kernel finds that its source IP is a local IP and drops the packet, because Linux has a restriction: the source IP of a packet arriving on a non-loopback interface cannot be a local IP. Therefore the probe packet from the LB never reaches the pod, no reply can be sent back to the LB, and the LB concludes that the NodePort is unhealthy.
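
(This check is the kernel's "martian source" validation; the dropped probes can be made visible by enabling martian logging. The exact log text varies by kernel version, so treat the sample line as approximate:)

$ sysctl -w net.ipv4.conf.all.log_martians=1
$ dmesg | grep -i martian
IPv4: martian source <nodeport destination IP> from 170.106.134.124, on dev eth0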

How to solve it?

I tried enabling accept_local. The accept_local kernel parameter allows the source IP of an incoming packet to be a local IP (sysctl -w net.ipv4.conf.all.accept_local=1), but this only lets the incoming packet be forwarded normally; the reply packet is still unable to get back. The destination IP of the reply is the LB IP, which the kernel finds to be a local IP, so the reply enters the INPUT chain. The IPVS kernel module checks the destination ip:port and finds no corresponding real server (the destination port of the reply is the source port of the probe packet, a random port that is not in the IPVS load-balancing list), so it does not process the packet. The kernel then tries to deliver the packet via the interface that holds the destination IP, which is kube-ipvs0, but that is a dummy interface in the down state, so the packet is discarded and the reply to the LB's probe never reaches the LB.
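
(The last hop of that failure is easy to confirm; the output below is trimmed and the exact flags may differ per setup:)

$ ip -d link show kube-ipvs0
kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN ...
    dummy ...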

So we can't solve this just by modifying kernel parameters. The solution I'm thinking of is to add a kube-proxy parameter that controls whether the LB IP is bound to kube-ipvs0. Does anyone have a better suggestion?

@imroc (Contributor, Author) commented Jul 8, 2019

@lbernail I heard that you use IPVS a lot in production; could you join this discussion? Rolling back that PR could be a temporary solution, but we need a better long-term one.

cc @Lion-Wei

@imroc (Contributor, Author) commented Jul 8, 2019

/area ipvs

@lbernail (Contributor) commented Jul 8, 2019

In several implementations, load-balanced packets reach the nodes with the load balancer IP or external IP as the target (this is the case with the ILB on GCP and with MetalLB, for instance). For IPVS to work in these cases, those IPs need to be considered local, which is why they are added to the dummy interface.

The problem you describe makes perfect sense. It would also impact AWS, because packets are SNATed to the load balancer IP. However, AWS doesn't set this EXTERNAL-IP in the status field, only the load balancer hostname:

    ingress:
    - hostname: internal-xxxx.us-east-1.elb.amazonaws.com

I don't see an easy solution here. Would it be possible for the TKE controller to not set the ExternalIP and only set a hostname? That should solve your issue.
Otherwise, the only solution I can see that addresses both use cases would be to add a flag to kube-proxy.

@m1093782566 / @andrewsykim what do you think?

@imroc (Contributor, Author) commented Jul 9, 2019

Let me summarize the candidate solutions I've come up with so far.

Add A Flag to kube-proxy

Because the root cause of this problem is that the LB's IP is bound to kube-ipvs0, we could add a flag to kube-proxy to disable this behavior. However, @m1093782566 worries that kube-proxy will be overwhelmed by too many low-level flags; for instance, @lbernail added the low-level flag --ipvs-strict-arp in #75295 to enable strict ARP, in order to avoid breaking some CNI plugins after the change introduced in #70530.

Service Controller Writes No ExternalIP

As @lbernail said, having the service controller write only a hostname and no ExternalIP in the Service's status field, as AWS does, would solve the problem. But I think this approach can make people uncomfortable: we can't get the LB IP directly, and using DNS to resolve the LB's hostname depends on the cloud vendor's LB implementation, which would require automatically adding a DNS record for each LB IP; TencentCloud does not support this for LAN LBs. In addition, some applications that manage Kubernetes clusters may also rely on the IP in the Service status field.

Let the LB Use a Reserved IP as the Source IP for Health Checks

Using an internal reserved IP rather than the LB's own IP would also solve this problem, but it requires cross-team collaboration, there may be some uncertainty, and the change may have other impacts. More importantly, some cloud vendors or self-built LBs do SNAT, so if their LB IP is written in the Service's status field, packets from normal client requests will also never reach the pod. (By the way, the reason our LB does not do SNAT is that in TencentCloud the interworking between the LB and cloud VMs is achieved through a tunnel on the physical machine, so the reply packet goes back through the LB even if the destination IP is not the LB IP.)

At present, I personally think the best solution is the first one: add a flag to kube-proxy to control whether the LB IP is bound to kube-ipvs0, because it's easy to implement and spares users complicated and uncomfortable adaptation work.

@imroc (Contributor, Author) commented Jul 10, 2019

We could add --ipvs-exclude-external-ip to kube-proxy to control whether IPVS forwarding for the LoadBalancer's IP is disabled. I think that makes more sense and is not a low-level flag. I'm going to submit a PR for this; what do you think? @m1093782566 @lbernail
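
(For clarity, a sketch of what the proposed invocation might look like; this flag does not exist yet, and the name and behavior are exactly the proposal above, nothing more:)

$ kube-proxy --proxy-mode=ipvs --ipvs-exclude-external-ip=true   # proposed flag, hypothetical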

@vllry (Contributor) commented Jul 25, 2019

/triage unresolved

@andrewsykim (Member) commented Jul 25, 2019

Wondering if we can fix this somehow with iptables, maybe masquerade only the health check node ports at PREROUTING?

/remove-triage unresolved
(fix being discussed #79976)

@njuicsgz (Contributor) commented Aug 31, 2019

I am looking forward to this PR to fix my problem.
I used kube node IPs as a Service's external IPs with kube-proxy in IPVS mode, so those node IPs ended up on kube-ipvs0. This causes problems like this issue:

    1. container1 cannot reach container2 directly by container IP, because a packet going container1 -> flannel.1 -> eth0 (container2's node) never reaches container2's node's eth0, since that node IP is also a Service's external IP.
    2. node1 cannot reach node2 if node2's IP is some Service's external IP.

When I use Spring Cloud/Dubbo or another third-party service discovery framework, containers need to be reachable directly by their IPs, so problem 1 affects me.

@andrewsykim (Member) commented Sep 9, 2019

FYI I'm going to raise this issue in the next SIG Network call

@Sh4d1 (Contributor) commented Nov 7, 2019

@andrewsykim any news on this issue?

@andrewsykim (Member) commented Nov 7, 2019

@Sh4d1 sorry, this fell off my radar -- I'll try to follow up on this soon. Thanks

@MatthiasLohr commented Dec 31, 2019

We ran into the same issue with MetalLB (layer 2 configuration). While debugging I found the problem with IPVS, which is how I came across this issue.

@imroc thank you very much for opening this issue and providing a pull request, but unfortunately the discussion on it has stalled. Is there any other idea around for solving this issue? How have you worked around it so far?
