LoadBalancer's healthcheck packet will be dropped in ipvs mode #79783
Hi, I'm from the TKE team. In TencentCloud TKE, the LB health check packet's source IP is the LB's own IP (the Service's ExternalIP), and the LB does not do SNAT. When kube-proxy runs in IPVS mode on 1.12, the LB's health check packet to the NodePort is dropped by the node, so the LB considers the node unhealthy. I tested 1.10 and found no such problem. Because the latest k8s version supported in TKE is 1.12, I haven't tested anything newer than 1.12, but since I didn't find any related issues, I think the latest version has this problem as well.
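For anyone reproducing this, a quick way to confirm the node really is running kube-proxy in IPVS mode (assuming the default metrics port 10249; adjust if your deployment differs):

```
# Ask kube-proxy which proxy mode it is running in
$ curl -s http://localhost:10249/proxyMode
ipvs

# Confirm IPVS virtual servers exist on the node (requires ipvsadm)
$ ipvsadm -ln | head
```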
I compared 1.10 and 1.12: in 1.12, kube-proxy adds the Service's external IP to the kube-ipvs0 dummy interface:

```
$ ip a show kube-ipvs0 | grep -A2 22.214.171.124
    inet 22.214.171.124/32 brd 22.214.171.124 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
```
The kernel then automatically adds a local route to the local table:

```
$ ip route show table local | grep 22.214.171.124
local 22.214.171.124 dev kube-ipvs0 proto kernel scope host src 22.214.171.124
```
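For context, the local table has the highest priority and is consulted before the main table on every lookup, which is why this kernel-added route matters:

```
# The local table is consulted first on every route lookup
$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
```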
In 1.10, by contrast, the LB's IP is not bound to kube-ipvs0; only the IPVS rules exist:

```
$ ipvsadm -ln | grep -A1 22.214.171.124
TCP  22.214.171.124:80 rr
  -> 172.16.0.70:80               Masq    1      0          0
```
Checking the IPVS proxier reference, we can see that kube-proxy binds the Service's addresses (ClusterIP, ExternalIPs and the LoadBalancer ingress IP) to the kube-ipvs0 dummy interface so that IPVS can handle traffic addressed to them.
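A quick way to see everything kube-proxy has bound there on a node running in IPVS mode:

```
# List every ClusterIP / ExternalIP / LoadBalancer IP bound to the dummy interface
$ ip -4 addr show dev kube-ipvs0 | grep inet
```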
You may ask how I can prove the LB's health check packet was dropped rather than redirected to the pod. I created a Service with type=LoadBalancer and externalTrafficPolicy=Local whose endpoints consist of a single pod, then captured traffic on cbr0 on the node that hosts that pod. No packet from the LB was ever captured there; since the Service's pod only exists on this node, a normally forwarded packet would have to cross cbr0 to reach the pod.
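Roughly, the capture looks like this (22.214.171.124 being the LB IP from the example above; interface names depend on your network plugin):

```
# Packets from the LB do arrive on the node's primary interface...
$ tcpdump -i eth0 -nn host 22.214.171.124

# ...but nothing from the LB ever shows up on the bridge in front of the pod
$ tcpdump -i cbr0 -nn host 22.214.171.124
```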
If we delete the LB's local route from the local table, forwarding works normally again:

```
$ ip route del table local local 22.214.171.124 dev kube-ipvs0 proto kernel scope host src 22.214.171.124
```
Another workaround is an infinite loop that keeps deleting the IP from kube-ipvs0 (kube-proxy keeps re-adding it):

```
$ ip addr del 22.214.171.124/32 dev kube-ipvs0
```
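A rough sketch of that loop, useful only for debugging since kube-proxy will keep re-adding the address:

```
# Keep removing the LB IP as kube-proxy re-adds it; a debugging workaround, not a fix
while true; do
  ip addr del 22.214.171.124/32 dev kube-ipvs0 2>/dev/null
  sleep 1
done
```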
Is it possible that, when the incoming packet's source IP is found in the kernel's local table, the kernel considers it a local IP and therefore drops the packet?
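One way I can think of to check that hypothesis with standard iproute2/sysctl tools (this is my assumption, not a confirmed diagnosis): ask the kernel how it classifies the LB IP, and enable martian logging to see whether the probes are rejected during source validation.

```
# Once the local route exists, the LB IP is classified as a local address
$ ip route get 22.214.171.124

# Log packets dropped with a "martian source", then watch dmesg while the LB probes the NodePort
$ sysctl -w net.ipv4.conf.all.log_martians=1
$ dmesg -w | grep -i martian
```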
I guess the reason other cloud vendors haven't hit this problem is that the source IP of their LB's health check packets is an internal VIP rather than the LB's own IP.
After a deep dive into this problem, I found the root cause.
Let me tell the long story. It's a bit detailed, but it helps us understand the problem.
A tiny IPVS design flaw
There are the following two contradictory sentences in the IPVS proxier proposal:
It means that the LB IP will not be bound to kube-ipvs0, which contradicts the other statement.
I chatted with the author of the IPVS proposal, @m1093782566, who confirmed that there is indeed such a problem in the design; the implementation also had this problem at first, and PR #63066 fixed it.
How was this problem discovered?
It came from issue #59976: some bare-metal clusters use MetalLB to create Services of type LoadBalancer, but in some real network environments the LB IP is unreachable from pods for various reasons. The packet from a pod to the LB did not go through IPVS, because the LB IP was not configured on kube-ipvs0.
The fix raises another problem
This fix did the right thing, but the change also raises another problem, like this issue, which is not usually noticed: our LB performs a health check against the NodePort, and the probe's source IP is the LB's own IP. After PR #63066, the LB IP is configured on kube-ipvs0 and gets a kernel-added local route, so the node treats the probe's source address as local and drops the packet.
How to solve it?
I tried enabling the related kernel parameters, but that didn't help.
So we can't solve this just by tweaking kernel parameters. The solution I'm thinking of is to let kube-proxy support a parameter that controls whether the LB IP is bound to kube-ipvs0.
In several implementations load-balanced packets reach the nodes with the Load-balancer IP or External IP as target (it's the case with ILB on GCP and metalLB for instance). For IPVS to work in these cases, these IPs need to be considered local and that's why they are added to the dummy interface.
The problem you describe makes perfect sense. It would also impact AWS, because packets are SNATed to the load balancer IP; however, AWS doesn't set this IP in the service status (it only sets a hostname), so it isn't affected.
I don't see an easy solution there. Would it be possible for the TKE controller to not set the ExternalIP and only set a hostname? That should solve your issue.
Let me summarize what candidate solutions I've come up with so far.
Add A Flag to kube-proxy
Because the root cause of this problem is that the LB's IP is bound to kube-ipvs0, we could add a flag to kube-proxy that lets users decide whether the LB IP gets bound to the dummy interface.
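Just to make the idea concrete, here is a sketch of what such a knob could look like; the flag name below is purely hypothetical and does not exist in kube-proxy today.

```
# HYPOTHETICAL flag, for illustration only -- not an existing kube-proxy option
kube-proxy --proxy-mode=ipvs \
  --ipvs-bind-lb-ip=false   # hypothetical: do not bind LoadBalancer ingress IPs to kube-ipvs0
```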
Service Controller Write No ExternalIP
As @lbernail said, having the service-controller write no ExternalIP but only a hostname in the Service's status field, as AWS does, would solve the problem. But I think this approach can feel uncomfortable: we can't get the LB IP directly, and if we resolve the LB's hostname via DNS to obtain the IP, we depend on the cloud vendor's LB implementation to add a DNS record for every LB IP automatically, which TencentCloud does not support for LAN (internal) LBs. In addition, some applications that manage Kubernetes clusters may also rely on the IP in the Service status field.
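For reference, the difference between the two behaviours is visible in the Service status; a quick way to inspect it (the service name below is a placeholder):

```
# Check whether the cloud provider populated an IP or a hostname for the LB
$ kubectl get svc my-lb-service -o jsonpath='{.status.loadBalancer.ingress}'
# hostname style (AWS): [{"hostname":"xxxx.elb.amazonaws.com"}]
# IP style:             [{"ip":"22.214.171.124"}]
```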
Let LB Use Reserved IP as Source IP for Health Check
If the LB uses an internal reserved IP rather than its own IP as the health check source, that also solves this problem, but it requires cross-team collaboration, carries some uncertainty, and the change may have other impacts. More importantly, some cloud vendors or self-built LBs do SNAT, so if their LB IP is written into the Service's status field, packets from normal client requests would also never reach the pod. (By the way, the reason our LB does not do SNAT is that interworking between the LB and cloud VMs in TencentCloud is achieved through a tunnel on the physical machine, so the reply packet will go through the LB even if the destination IP is not the LB IP.)
At present, I personally think the best solution is the first one: add a flag to kube-proxy to control whether the LB IP is bound to kube-ipvs0.
I am looking forward to this PR to fix my problem.
When I use Spring Cloud/Dubbo or other third-party service discovery frameworks, containers need to be accessed directly by their IP, so (1) becomes a problem for me.
We ran into the same issue with MetalLB (Layer 2 configuration). While debugging I found the problem with IPVS and came across this issue.
@imroc thank you very much for opening this issue and providing a pull request, but unfortunately the discussion on it has stalled. Is there any other idea around for solving this issue? How have you solved it for yourself so far?