kube-proxy iptables asymmetric routing on bare metal multi interface #101910

Closed
dylex opened this issue May 11, 2021 · 12 comments
Labels: kind/bug, needs-triage, sig/network

Comments


dylex commented May 11, 2021

What happened:

On non-controller nodes, calls from calico to the kubernetes API randomly fail with `dial tcp 10.96.0.1:443: i/o timeout`. This can be reproduced from the host: `curl -k https://10.96.0.1/` sometimes times out.

It looks like these requests are generated with a source IP on the external interface (80.80.80.X), then DNAT'd to the internal IP of the control node (10.10.10.X) and sent over the internal interface. The control node never responds (or if it does, the reply would go back asymmetrically over the external interface, 10.10.10.X -> 80.80.80.X, and likely be dropped by iptables state rules).

Eventually requests go through somehow and things work, but often the next time a pod is set up there are more errors and delays. Everything else in the cluster (inter-pod communication, services, load balancing) is working perfectly.

Chain KUBE-SERVICES:
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
Chain KUBE-SVC-NPX46M4PTMTKRN6Y:
KUBE-SEP-OJTUAWRS4ZQWCYFI  all  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */
Chain KUBE-SEP-OJTUAWRS4ZQWCYFI:
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ tcp to:10.10.10.1:6443
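
For reference, a minimal sketch of how these chains can be dumped on a node (the hashed KUBE-SVC-/KUBE-SEP- chain names are specific to this cluster):

# list the relevant kube-proxy chains from the nat table
iptables -t nat -L KUBE-SERVICES -n | grep 'default/kubernetes'
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n
iptables -t nat -L KUBE-SEP-OJTUAWRS4ZQWCYFI -n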

What you expected to happen:

kube-proxy iptables rules should masquerade these packets to the right interface, not just DNAT. Turning on masqueradeAll fixes this but seems a bit overkill. (I feel like there must be some obvious configuration I'm missing.)
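
(A sketch, only to make that knob concrete: masqueradeAll lives in the kube-proxy configuration, which in a kubeadm cluster is the kube-proxy ConfigMap.)

# sketch of the masqueradeAll workaround mentioned above -- shown for
# reference, not as the recommended fix
kubectl -n kube-system edit configmap kube-proxy
#   mode: "iptables"
#   iptables:
#     masqueradeAll: true
# restart kube-proxy so the change is picked up
kubectl -n kube-system rollout restart daemonset kube-proxy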

How to reproduce it (as minimally and precisely as possible):

Multi-node bare metal cluster where each node is on two networks: external (80.80.80.X) and internal (10.10.10.X). kubelet and calico are configured to use internal (--node-ip=10.10.10.X, IP_AUTODETECTION_METHOD=interface=int). Default gateway is on external.
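
A minimal reproduction sketch from a non-controller node (addresses as above; the tcpdump commands just make the DNAT'd traffic visible on each interface):

# intermittently times out
curl -k --max-time 5 https://10.96.0.1/
# in parallel, watch both interfaces for the DNAT'd apiserver traffic
tcpdump -ni INTIF 'tcp port 6443'
tcpdump -ni EXTIF 'tcp port 6443'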

Anything else we need to know?:

A more complete description of the problem is here: https://discuss.kubernetes.io/t/multi-network-cluster-broken-without-masquerade-all/13671 (cluster has since been upgraded to 1.20 with no improvement)

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux k8s-2 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm v1.20.6
  • Network plugin and version (if this is a network-related bug): calico v3.18.1
  • Others: currently HA control nodes and metallb, but problem happens without these
@dylex dylex added the kind/bug Categorizes issue or PR as related to a bug. label May 11, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 11, 2021
k8s-ci-robot (Contributor) commented:

@dylex: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dylex commented May 11, 2021

/sig Network

k8s-ci-robot (Contributor) commented:

@dylex: The label(s) sig/networking cannot be applied, because the repository doesn't have them.

In response to this:

/sig Networking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dylex commented May 11, 2021

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 11, 2021

dylex commented May 11, 2021

It turns out that even when everything is working correctly, these requests have asymmetric routing:

INTIF: IP 80.80.80.2.58699 > 10.10.10.1.6443: Flags [S] (DNAT from 80.80.80.2 -> 10.96.0.1:443)
EXTIF: IP 10.10.10.1.6443 > 80.80.80.2.58699: Flags [S.]
etc.

They only fail sometimes because ARP resolution for this path doesn't work:

EXTIF: ARP, Request who-has 80.80.80.2 tell 10.10.10.1

I could make sure all nodes keep their ARP tables populated by pinging each other, but that doesn't seem like a great solution. Ideally masquerading or something similar could avoid the asymmetric routing altogether.
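
A sketch of how the failing resolution can be observed on the control node, plus a cruder neighbour-table workaround than periodic pings (interface names and addresses as above; not a recommended fix):

# watch ARP on the external interface; the stuck request shows up as
# "who-has 80.80.80.2 tell 10.10.10.1"
tcpdump -nei EXTIF arp
# check for neighbour entries stuck in INCOMPLETE/FAILED
ip neigh show dev EXTIF
# crude alternative to pinging: pin the entry by hand
# (<peer-mac> is a placeholder for the other node's MAC address)
ip neigh replace 80.80.80.2 lladdr <peer-mac> nud permanent dev EXTIF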

@dylex dylex changed the title from "kube-proxy iptables fails to masquerade on bare metal multi interface" to "kube-proxy iptables asymmetric routing on bare metal multi interface" May 11, 2021

uablrek commented May 12, 2021

currently HA control nodes and metallb, but problem happens without these

Metallb does not affect this, but HA control might. More specifically, for this to happen the kube-apiserver address must be routed; for HA that would be to the address of an HA proxy.

So I think this must be corrected by routing setup, which is not done by K8s.

rules should masquerade these packets to the right interface, not just DNAT

Masquerade is for SNAT, not DNAT. The DNAT rule is correct. Masquerading (= SNAT to the address of the outgoing interface) is done by another rule:

Chain KUBE-SVC-NPX46M4PTMTKRN6Y (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    9   540 KUBE-MARK-MASQ  tcp  --  *      *      !11.0.0.0/16          12.0.0.1             /* default/kubernetes:https cluster IP */ tcp dpt:443

This is without masqueradeAll and, as you can see, pods as source (11.0.0.0/16) are excluded from masquerading. When you turn on masqueradeAll that condition is removed; I think that is why it works for you with masqueradeAll.
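
For completeness, a sketch of the mechanics (output abridged; 0x4000 is kube-proxy's default masquerade mark): KUBE-MARK-MASQ only marks the packet, and the actual SNAT happens later in nat POSTROUTING for marked packets.

iptables -t nat -S KUBE-MARK-MASQ
#   -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
iptables -t nat -S KUBE-POSTROUTING
#   ... -m mark --mark 0x4000/0x4000 ... -j MASQUERADE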


dylex commented May 12, 2021

Thanks, that's helpful. In this case the only masq rule that seems to apply is:

Chain KUBE-SEP-OJTUAWRS4ZQWCYFI
KUBE-MARK-MASQ  all  --  10.10.10.1           0.0.0.0/0            /* default/kubernetes:https */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/kubernetes:https */ tcp to:10.10.10.1:6443

That is, it only matches traffic coming from the control node itself, which doesn't seem sufficient.

I think that's right: it only affects traffic from a host or a hostNetwork pod to a service backed by a hostNetwork pod on a different node, such as calico accessing the kubernetes service. However, it does happen even without HA (since calico at least still uses the kubernetes service in that case). An easy workaround would be to tell calico to use the real (HA) API server endpoint rather than the service, but I'm not sure how to do that, and it doesn't fix the general problem.
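
(One possible route for that workaround, untested in this thread and an assumption about calico's client configuration: client-go honours the KUBERNETES_SERVICE_HOST/KUBERNETES_SERVICE_PORT environment variables, so they could be overridden on the calico-node DaemonSet to point at the real apiserver.)

# assumption/sketch only: point calico-node at the real apiserver endpoint
# instead of the 10.96.0.1 service VIP; whether calico-node honours these
# variables depends on its datastore configuration
kubectl -n kube-system set env daemonset/calico-node \
  KUBERNETES_SERVICE_HOST=10.10.10.1 \
  KUBERNETES_SERVICE_PORT=6443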

I'm not sure how I would change routing to make this work, aside from using the external interface as the node IP instead. The system routes (without k8s) are quite simple:

0.0.0.0         80.80.80.254   0.0.0.0           UG        0 0          0 EXTIF
10.10.10.0      0.0.0.0        255.255.255.0     U         0 0          0 INTIF

so when a hostNetwork process generates a packet to 10.96.0.1 (the service IP), the kernel picks the external host address, 80.80.80.2, as the source, and the kube-proxy rules then produce a packet from 80.80.80.2 to 10.10.10.1, which is routed over INTIF (though EXTIF would be incorrect too).
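
The kernel's choice can be confirmed directly (a sketch; output abridged, addresses as above):

# with only the two routes above, 10.96.0.1 falls through to the default
# route, so the source address comes from EXTIF
ip route get 10.96.0.1
#   10.96.0.1 via 80.80.80.254 dev EXTIF src 80.80.80.2 ...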


aojea commented May 14, 2021

@dylex what if you "indicate" the nodes to use the INTIF for the services subnet

ip route add SERVICE_NET dev INTIF
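
A concrete form of that suggestion (a sketch; 10.96.0.0/12 is the default kubeadm service CIDR and is an assumption here, so substitute the cluster's actual --service-cluster-ip-range):

# route the whole service CIDR via the internal interface so the kernel
# picks the INTIF address as source before kube-proxy DNATs the packet
ip route add 10.96.0.0/12 dev INTIF
ip route get 10.96.0.1
#   10.96.0.1 dev INTIF src 10.10.10.X ...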


dylex commented May 14, 2021

@aojea Oh, that's a good idea, to fix the source ip for the original packet. It does seem to fix requests from the host. I'll try it, thanks!


aojea commented May 26, 2021

@aojea Oh, that's a good idea, to fix the source ip for the original packet. It does seem to fix requests from the host. I'll try it, thanks!

I've tried it locally and it did work.
Did it work for you? Can we close?


dylex commented May 26, 2021

Yes, this worked, and while it'd be nice if this were better documented, I suppose it's in this ticket now so people can find it. There are a few remaining issues with running multi-homed bare-metal clusters (kubernetes/enhancements#1665 touches on some of them), including the lack of published node ExternalIPs, but I understand this isn't a common or well-supported deployment scenario, so we can close.

@dylex dylex closed this as completed May 26, 2021

aojea commented May 26, 2021

but I understand this isn't a common or well-supported deployment scenario, so we can close.

yeah, this just needs some ❤️
