kube-proxy iptables asymmetric routing on bare metal multi interface #101910

Closed
dylex opened this issue May 11, 2021 · 12 comments
Labels: kind/bug, needs-triage, sig/network

Comments


dylex commented May 11, 2021

What happened:

On non-controller nodes, calls from calico to the kubernetes API randomly fail with `dial tcp 10.96.0.1:443: i/o timeout`. This can be reproduced from the host: `curl -k https://10.96.0.1/` sometimes times out.

It looks like these requests are generated with a source IP on the external interface (80.80.80.X), then DNAT'd to the internal IP of the control node (10.10.10.X) and sent over the internal interface. The control node never responds (or if it does, the reply would go back asymmetrically over the external interface, 10.10.10.X -> 80.80.80.X, and likely be dropped by iptables state rules).

Eventually requests go through somehow and things work, but often the next time a pod is set up there are more errors and delays. Everything else in the cluster (inter-pod communication, services, load balancing) is working perfectly.

Chain KUBE-SERVICES:
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
Chain KUBE-SVC-NPX46M4PTMTKRN6Y:
KUBE-SEP-OJTUAWRS4ZQWCYFI  all  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */
Chain KUBE-SEP-OJTUAWRS4ZQWCYFI:
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ tcp to:10.10.10.1:6443
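
For reference, a minimal sketch of how these chains can be dumped on a node (the hashed KUBE-SVC-/KUBE-SEP- chain names are specific to this cluster):

# list the relevant kube-proxy chains from the nat table
iptables -t nat -L KUBE-SERVICES -n | grep 'default/kubernetes'
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n
iptables -t nat -L KUBE-SEP-OJTUAWRS4ZQWCYFI -n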

What you expected to happen:

kube-proxy iptables rules should masquerade these packets to the right interface, not just DNAT. Turning on masqueradeAll fixes this but seems a bit overkill. (I feel like there must be some obvious configuration I'm missing.)
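
(A sketch, only to make that knob concrete: masqueradeAll lives in the kube-proxy configuration, which in a kubeadm cluster is the kube-proxy ConfigMap.)

# sketch of the masqueradeAll workaround mentioned above -- shown for
# reference, not as the recommended fix
kubectl -n kube-system edit configmap kube-proxy
#   mode: "iptables"
#   iptables:
#     masqueradeAll: true
# restart kube-proxy so the change is picked up
kubectl -n kube-system rollout restart daemonset kube-proxy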

How to reproduce it (as minimally and precisely as possible):

Multi-node bare metal cluster where each node is on two networks: external (80.80.80.X) and internal (10.10.10.X). kubelet and calico are configured to use internal (--node-ip=10.10.10.X, IP_AUTODETECTION_METHOD=interface=int). Default gateway is on external.
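
A minimal reproduction sketch from a non-controller node (addresses as above; the tcpdump commands just make the DNAT'd traffic visible on each interface):

# intermittently times out
curl -k --max-time 5 https://10.96.0.1/
# in parallel, watch both interfaces for the DNAT'd apiserver traffic
tcpdump -ni INTIF 'tcp port 6443'
tcpdump -ni EXTIF 'tcp port 6443'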

Anything else we need to know?:

A more complete description of the problem is here: https://discuss.kubernetes.io/t/multi-network-cluster-broken-without-masquerade-all/13671 (cluster has since been upgraded to 1.20 with no improvement)

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux k8s-2 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm v1.20.6
  • Network plugin and version (if this is a network-related bug): calico v3.18.1
  • Others: currently HA control nodes and metallb, but problem happens without these
@dylex dylex added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2021
k8s-ci-robot (Contributor) commented:

@dylex: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dylex commented May 11, 2021

/sig Network

k8s-ci-robot (Contributor) commented:

@dylex: The label(s) sig/networking cannot be applied, because the repository doesn't have them.

In response to this:

/sig Networking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dylex commented May 11, 2021

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2021

dylex commented May 11, 2021

It turns out that even when everything is working correctly, these requests have asymmetric routing:

INTIF: IP 80.80.80.2.58699 > 10.10.10.1.6443: Flags [S] (DNAT from 80.80.80.2 -> 10.96.0.1:443)
EXTIF: IP 10.10.10.1.6443 > 80.80.80.2.58699: Flags [S.]
etc.

They only fail sometimes because ARP resolution for this path doesn't work:

EXTIF: ARP, Request who-has 80.80.80.2 tell 10.10.10.1

I could make sure all nodes keep their ARP tables populated by pinging each other, but that doesn't seem like a great solution. Ideally masquerading or something similar could avoid the asymmetric routing altogether.
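
A sketch of how the failing resolution can be observed on the control node, plus a cruder neighbour-table workaround than periodic pings (interface names and addresses as above; not a recommended fix):

# watch ARP on the external interface; the stuck request shows up as
# "who-has 80.80.80.2 tell 10.10.10.1"
tcpdump -nei EXTIF arp
# check for neighbour entries stuck in INCOMPLETE/FAILED
ip neigh show dev EXTIF
# crude alternative to pinging: pin the entry by hand
# (<peer-mac> is a placeholder for the other node's MAC address)
ip neigh replace 80.80.80.2 lladdr <peer-mac> nud permanent dev EXTIF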

@dylex dylex changed the title from "kube-proxy iptables fails to masquerade on bare metal multi interface" to "kube-proxy iptables asymmetric routing on bare metal multi interface" May 11, 2021

uablrek commented May 12, 2021

currently HA control nodes and metallb, but problem happens without these

Metallb does not affect this, but HA control might. More specifically, for this to happen the kube-apiserver address must be routed; for HA that would be to the address of an HA proxy.

So I think this must be corrected by routing setup, which is not done by K8s.

rules should masquerade these packets to the right interface, not just DNAT

Masquerade is for SNAT, not DNAT. The DNAT rule is correct. Masquerading (= SNAT to the address of the outgoing interface) is done by another rule:

Chain KUBE-SVC-NPX46M4PTMTKRN6Y (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    9   540 KUBE-MARK-MASQ  tcp  --  *      *      !11.0.0.0/16          12.0.0.1             /* default/kubernetes:https cluster IP */ tcp dpt:443

This is without masqueradeAll and, as you can see, pods as source (11.0.0.0/16) are excluded from masquerading. When you turn on masqueradeAll that condition is removed; I think that is why it works for you with masqueradeAll.
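
For completeness, a sketch of the mechanics (output abridged; 0x4000 is kube-proxy's default masquerade mark): KUBE-MARK-MASQ only marks the packet, and the actual SNAT happens later in nat POSTROUTING for marked packets.

iptables -t nat -S KUBE-MARK-MASQ
#   -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
iptables -t nat -S KUBE-POSTROUTING
#   ... -m mark --mark 0x4000/0x4000 ... -j MASQUERADE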


dylex commented May 12, 2021

Thanks, that's helpful. In this case the only masq rule that seems to apply is:

Chain KUBE-SEP-OJTUAWRS4ZQWCYFI
KUBE-MARK-MASQ  all  --  10.10.10.1           0.0.0.0/0            /* default/kubernetes:https */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ tcp to:10.10.10.1:6443

That is, it only matches traffic coming from the control node itself, which doesn't seem sufficient.

I think that's right: it only affects traffic from a host or a hostNetwork pod to a service backed by a hostNetwork pod on a different node, such as calico accessing the kubernetes service. However, it does happen even without HA (since calico at least still uses the kubernetes service in that case). An easy workaround would be to tell calico to use the real (HA) API server endpoint rather than the service, but I'm not sure how to do that, and it doesn't fix the general problem.
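
(One possible route for that workaround, untested in this thread and an assumption about calico's client configuration: client-go honours the KUBERNETES_SERVICE_HOST/KUBERNETES_SERVICE_PORT environment variables, so they could be overridden on the calico-node DaemonSet to point at the real apiserver.)

# assumption/sketch only: point calico-node at the real apiserver endpoint
# instead of the 10.96.0.1 service VIP; whether calico-node honours these
# variables depends on its datastore configuration
kubectl -n kube-system set env daemonset/calico-node \
  KUBERNETES_SERVICE_HOST=10.10.10.1 \
  KUBERNETES_SERVICE_PORT=6443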

I'm not sure how I would change routing to make this work, aside from using the external interface as the node IP instead. The system routes (without k8s) are quite simple:

0.0.0.0         80.80.80.254   0.0.0.0           UG        0 0          0 EXTIF
10.10.10.0      0.0.0.0        255.255.255.0     U         0 0          0 INTIF

so when a hostNetwork process generates a packet to 10.96.0.1 (the service IP), the kernel picks the external host address, 80.80.80.2, as the source, and the kube-proxy rules then produce a packet from 80.80.80.2 to 10.10.10.1, which is routed over INTIF (though EXTIF would be incorrect too).
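
The kernel's choice can be confirmed directly (a sketch; output abridged, addresses as above):

# with only the two routes above, 10.96.0.1 falls through to the default
# route, so the source address comes from EXTIF
ip route get 10.96.0.1
#   10.96.0.1 via 80.80.80.254 dev EXTIF src 80.80.80.2 ...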


aojea commented May 14, 2021

@dylex what if you "indicate" the nodes to use the INTIF for the services subnet

ip route add SERVICE_NET dev INTIF
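
A concrete form of that suggestion (a sketch; 10.96.0.0/12 is the default kubeadm service CIDR and is an assumption here, so substitute the cluster's actual --service-cluster-ip-range):

# route the whole service CIDR via the internal interface so the kernel
# picks the INTIF address as source before kube-proxy DNATs the packet
ip route add 10.96.0.0/12 dev INTIF
ip route get 10.96.0.1
#   10.96.0.1 dev INTIF src 10.10.10.X ...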


dylex commented May 14, 2021

@aojea Oh, that's a good idea, to fix the source ip for the original packet. It does seem to fix requests from the host. I'll try it, thanks!


aojea commented May 26, 2021

@aojea Oh, that's a good idea, to fix the source ip for the original packet. It does seem to fix requests from the host. I'll try it, thanks!

I've tried it locally and it did work.
Did it work for you? Can we close?


dylex commented May 26, 2021

Yes, this worked, and while it'd be nice if this were better documented, I suppose it's in this ticket now so people can find it. There are a few remaining issues with running multi-homed bare-metal clusters (kubernetes/enhancements#1665 touches on some of them), including the lack of published node ExternalIPs, but I understand this isn't a common or well-supported deployment scenario, so we can close.

@dylex dylex closed this as completed May 26, 2021

aojea commented May 26, 2021

but I understand this isn't a common or well-supported deployment scenario, so we can close.

yeah, this just needs some ❤️
