kube-proxy iptables asymmetric routing on bare metal multi interface #101910
It turns out that even when everything is working correctly, these requests have asymmetric routing:
The only reason they sometimes fail is that ARP resolution for this path fails:
I could make sure all nodes keep their ARP tables populated by pinging each other, but this doesn't seem like a great solution. Ideally masquerading (or something similar) could avoid the asymmetric routing altogether.
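For illustration, that ping-based workaround would amount to a periodic job like this on every node (peer addresses are placeholders):

```sh
# Hypothetical sketch only: keep the neighbour/ARP entries for the other nodes'
# internal addresses populated by pinging them periodically.
for peer in 10.10.10.1 10.10.10.3; do
    ping -c 1 -W 1 "$peer" > /dev/null 2>&1
done
```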
MetalLB does not affect this, but HA control might. More specifically, for this to happen the kube-api address must be routed; for HA that would be to the address of an HA proxy. So I think this must be corrected by routing setup, which is not done by K8s.
Masquerade is for SNAT, not DNAT. The DNAT rule is correct. Masquerading (= SNAT to the address of the outgoing interface) is done by another rule:
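For illustration only, kube-proxy's masquerade machinery is roughly of this shape (the exact rules vary by kube-proxy version; 0x4000 is the default mark):

```sh
# Packets selected for SNAT are first marked in KUBE-MARK-MASQ, then
# masqueraded in KUBE-POSTROUTING (both in the nat table):
iptables -t nat -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
iptables -t nat -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
```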
This is without
Thanks, that's helpful. In this case the only masq rule that seems to apply is:
That is, traffic coming from the control node itself, which doesn't seem sufficient. I think that's right: it only affects traffic from a host or hostNetwork pod to a service backed by a hostNetwork pod on a different node, such as calico accessing the kubernetes service. However, it happens even without HA (since calico still uses the kubernetes service in that case). An easy workaround would be to tell calico to use the real (HA) API server endpoint rather than the service, but I'm not sure how to do that, and it wouldn't fix the general problem. I'm also not sure how I would change the routing to make this work, aside from using the external interface as the node IP instead. The system routes (without k8s) are quite simple:
so when a hostNetwork process generates a packet to 10.96.0.1 (the service IP), it decides to use the external host source IP, 80.80.80.2, and the kube-proxy rules then turn that into a packet from 80.80.80.2 to 10.10.10.1, which is routed over INTIF (but EXTIF would be incorrect too).
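For illustration, the kernel's source-address selection for the service IP can be checked with `ip route get` (output abbreviated; addresses and interface names are the placeholders used above):

```sh
# With only a default route via EXTIF, the external address is chosen as source:
ip route get 10.96.0.1
# 10.96.0.1 via <default-gw> dev EXTIF src 80.80.80.2 ...
```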
@dylex what if you "indicate" to the nodes to use the INTIF for the services subnet?
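For example, something along these lines (assuming the default 10.96.0.0/12 service CIDR; the interface name and source address are placeholders):

```sh
# Route the service CIDR via the internal interface so the kernel picks the
# internal address as the source before kube-proxy DNATs the packet:
ip route add 10.96.0.0/12 dev INTIF src 10.10.10.2
```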
@aojea Oh, that's a good idea, to fix the source IP for the original packet. It does seem to fix requests from the host. I'll try it, thanks!
I've tried it locally and it did work.
Yes, this worked, and while it'd be nice if this were better documented, I suppose it's in this ticket now so people can find it. There are a few issues with running on multi-homed bare-metal clusters regardless (kubernetes/enhancements#1665 touches on some of them), including the lack of published node ExternalIPs, but I understand this isn't a common or well-supported deployment scenario, so we can close.
yeah, this just needs some ❤️
What happened:
From non-controller nodes, calls from calico to the kubernetes API randomly fail with `dial tcp 10.96.0.1:443: i/o timeout`. This can be reproduced from the host with `curl -k https://10.96.0.1/`, which times out sometimes. It looks like these requests are generated with a source IP of the external interface (80.80.80.X), then DNAT'd to the internal IP of the control node (10.10.10.X) and sent over the internal interface. The control node never responds (or if it did, the reply would go asymmetrically over the external interface, 10.10.10.X -> 80.80.80.X, and likely be dropped by iptables state rules).
Eventually requests go through somehow and things work, but often the next time a pod is set up there are more errors and delays. Everything else in the cluster (inter-pod communication, services, load balancing) is working perfectly.
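For illustration, the one-way traffic can be observed on the control node with something like the following (interface name and addresses are placeholders from the description above):

```sh
# SYNs from the worker's external address arrive on the internal interface,
# but no replies leave on it:
tcpdump -ni INTIF host 80.80.80.2 and tcp port 443
```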
What you expected to happen:
kube-proxy iptables rules should masquerade these packets to the right interface, not just DNAT. Turning on masqueradeAll fixes this but seems a bit overkill. (I feel like there must be some obvious configuration I'm missing.)
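For reference, a sketch of that masqueradeAll workaround, assuming a kubeadm-style cluster where kube-proxy runs as a DaemonSet configured from the kube-proxy ConfigMap:

```sh
# Set iptables.masqueradeAll: true in the kube-proxy configuration, then
# restart kube-proxy so it regenerates its rules:
kubectl -n kube-system edit configmap kube-proxy        # set masqueradeAll: true under iptables:
kubectl -n kube-system rollout restart daemonset kube-proxy
```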
How to reproduce it (as minimally and precisely as possible):
Multi-node bare metal cluster where each node is on two networks: external (80.80.80.X) and internal (10.10.10.X). kubelet and calico are configured to use the internal network (`--node-ip=10.10.10.X`, `IP_AUTODETECTION_METHOD=interface=int`). The default gateway is on the external network.
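For illustration, the per-node addressing and (pre-Kubernetes) routing look roughly like this (interface names, addresses, and the gateway are placeholders):

```sh
ip addr add 80.80.80.2/24 dev EXTIF              # external network
ip addr add 10.10.10.2/24 dev INTIF              # internal network, used as --node-ip
ip route add default via 80.80.80.1 dev EXTIF    # default gateway is on the external side
```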
Anything else we need to know?:

A more complete description of the problem is here: https://discuss.kubernetes.io/t/multi-network-cluster-broken-without-masquerade-all/13671 (the cluster has since been upgraded to 1.20 with no improvement).
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `cat /etc/os-release`): CentOS 7
- Kernel (e.g. `uname -a`): `Linux k8s-2 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux`