
Pod service endpoint unreachable from same host #87426

Closed
eraclitux opened this issue Jan 21, 2020 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@eraclitux

eraclitux commented Jan 21, 2020

What happened:
Establishing TCP/UDP traffic to a ClusterIP fails when the connection is load balanced via iptables to a pod on the same host.

What you expected to happen:
conntrack shows that the UDP datagram is DNATted to 10.200.1.37, with source 10.200.1.36 (the host has podCIDR: "10.200.1.0/24").

    [NEW] udp      17 30 src=10.200.1.36 dst=10.32.0.10 sport=45956 dport=53 [UNREPLIED] src=10.200.1.37 dst=10.200.1.36 sport=53 dport=45956
[DESTROY] udp      17 src=10.200.1.36 dst=10.32.0.10 sport=57957 dport=53 [UNREPLIED] src=10.200.1.37 dst=10.200.1.36 sport=53 dport=57957

From my understanding, because pods have a /24 mask, the reply from .37 doesn't go back through cnio0 but directly to .36, breaking the DNAT. This tcpdump shows it:

15:10:27.464509 IP 10.200.1.36.42897 > 10.32.0.10.53: 16896+ A? pippo.it. (26)
15:10:27.464587 IP 10.200.1.36.42897 > 10.32.0.10.53: 16896+ AAAA? pippo.it. (26)
15:10:27.464777 IP 10.200.1.37.53 > 10.200.1.36.42897: 16896 ServFail- 0/0/0 (26)
15:10:27.464841 IP 10.200.1.37.53 > 10.200.1.36.42897: 16896 ServFail- 0/0/0 (26)

How to reproduce it (as minimally and precisely as possible):

  • create ClusterIP with a single pod endpoint
  • from a pod on the same host, open a TCP connection or send a UDP datagram; the communication will fail (a minimal repro sketch follows this list)
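
A minimal repro sketch, assuming both pods land on the same node (names and images are illustrative, not taken from the original report):

# backend pod plus a ClusterIP Service with that single endpoint
kubectl run web --image=nginx --port=80 --restart=Never
kubectl expose pod web --port=80 --name=web-svc

# client pod on the same node (pin it there with spec.nodeName or a nodeSelector);
# the request hangs / times out when the endpoint is local to the node
kubectl run client --image=busybox --restart=Never -it --rm -- \
  wget -qO- -T 5 http://web-svc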

Anything else we need to know?:
I'm not able to assign a /32 subnet to pods; both:

    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.200.1.0/32"}]
        ]
    }

and:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
...
podCIDR: "10.200.1.0/32"

don't work. I ended up changing the subnet manually to see if my hypothesis was right.
This fixed the problem because it forced the packets to flow back through cnio0 to the DNAT tracked by conntrack:

ip netns exec cni-a6afeeee-a34b-8e24-de62-26ffa93a4bd8 ip a add 10.200.1.37/32 dev eth0
ip netns exec cni-a6afeeee-a34b-8e24-de62-26ffa93a4bd8 ip a del 10.200.1.37/24 dev eth0

Am I doing something wrong?
Why is it not possible to assign a /32 subnet to pods?
Is there a cleaner solution?

Even if the conditions are different, the problem could be similar to #87263

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2",
GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2",
GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Virtualbox VMs
  • OS (e.g: cat /etc/os-release):
    Ubuntu 18.04.3 LTS
  • Kernel (e.g. uname -a):
    Linux worker-0 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Manual installation following https://github.com/kelseyhightower/kubernetes-the-hard-way
  • Network plugin and version (if this is a network-related bug):
    L2 networks and linux bridging
  • Others:
    CNI conf:
{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.200.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

iptables-save.txt

@eraclitux eraclitux added the kind/bug Categorizes issue or PR as related to a bug. label Jan 21, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2020
@eraclitux
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 21, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jan 21, 2020
@uablrek
Contributor

uablrek commented Feb 6, 2020

This is likely not a k8s fault but an effect of how DNS resolution works. I have had this problem too: DNS queries fail if the DNS server happens to be on the same node. The problem is that the reply comes back but does not have the ClusterIP as source, which is the destination in the query. The local resolver discards the reply since it has an "invalid" source.

I think you can fix the problem by specifying --masquerade-all to kube-proxy;

      --masquerade-all                               If using the pure iptables proxy, SNAT all traffic sent via Service cluster IPs (this not commonly needed)

but I have not verified this.

BTW I solved my problem by setting up local coredns in main netns on all nodes.
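
For reference, a minimal sketch of the two ways this option can be set, assuming kube-proxy runs as a systemd service as in kubernetes-the-hard-way (paths are illustrative):

# option 1: pass the flag on the kube-proxy command line
kube-proxy --proxy-mode=iptables --masquerade-all=true --kubeconfig=/var/lib/kube-proxy/kubeconfig

# option 2: set the equivalent field in the KubeProxyConfiguration file passed via --config:
#   iptables:
#     masqueradeAll: true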

@uablrek
Contributor

uablrek commented Feb 6, 2020

Now I have tested and yes, --masquerade-all seems to fix the problem (proxy-mode=iptables assumed)

@danwinship
Contributor

/assign @satyasm

@k8s-ci-robot
Contributor

@danwinship: GitHub didn't allow me to assign the following users: satyasm.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @satyasm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@satyasm
Contributor

satyasm commented Feb 6, 2020

/assign @satyasm

@uablrek
Contributor

uablrek commented Feb 6, 2020

This problem occurs when a CNI-plugin that provides L2 connectivity between pods on the same node is used, like the "bridge" CNI-plugin used in this issue (which I am also using). When I switch to Calico, which uses L3 and routes traffic back through the node, it works fine.

The bridge CNI-plugin sends the reply directly to the client pod on the same node (L2) and thus bypasses the connection tracker in the main netns.

@eraclitux
Author

eraclitux commented Feb 7, 2020

Thanks for the reply @uablrek. Unfortunately --masquerade-all does not solve the problem. The issue is not L7 but L2/L3; here is an example TCP connection that breaks the conntrack table:

[NEW] tcp      6 120 SYN_SENT src=10.200.1.56 dst=10.32.0.22 sport=41081 dport=80 [UNREPLIED] src=10.200.1.54 dst=10.200.1.56 sport=80 dport=41081
[DESTROY] tcp      6 src=10.200.1.56 dst=10.32.0.22 sport=41081 dport=80 [UNREPLIED] src=10.200.1.54 dst=10.200.1.56 sport=80 dport=41081

What do you think of my solution of assigning a /32 to pods? The host-local IPAM actually doesn't permit assigning a /32:
https://github.com/containernetworking/plugins/blob/43716656062566049f0b61d891c97d4a9cc6888d/plugins/ipam/host-local/backend/allocator/range.go#L33
Is there another Kubernetes configuration to do so (i.e. force the subnet to /32 even if IPAM returns a /24)?

@athenabot

@satyasm
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@uablrek
Contributor

uablrek commented Mar 8, 2020

Here is a sequence-diagram to illustrate the problem (plantuml source masquerade-all.puml.txt);

[sequence diagram: masquerade-all]

The difference for --masquerade-all (or "iptables.masqueradeAll:true/false" if a config file for kube-proxy is used) is the source filter in the KUBE-SERVICES chain. Check with iptables -t nat -L -nv;

Without masqueradeAll;

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    1    60 KUBE-MARK-MASQ  all  --  *      *      !11.0.0.0/16          0.0.0.0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst

With masqueradeAll;

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    1    60 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst

For CNI-plugins that do not provide L2 connectivity within the k8s node there is no problem. Here is the setup inside a POD with "Calico";

/ # ip addr show dev eth0
12: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1440 qdisc noqueue 
    link/ether 2a:9a:66:1e:91:6e brd ff:ff:ff:ff:ff:ff
    inet 11.0.246.65/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 1100::b00:f640/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::289a:66ff:fe1e:916e/64 scope link 
       valid_lft forever preferred_lft forever
/ # ip route
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link 

@athenabot

@satyasm
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@thockin
Member

thockin commented Mar 19, 2020

ping @satyasm

@satyasm
Contributor

satyasm commented Mar 19, 2020

/unassign @satyasm
/label help-wanted

@k8s-ci-robot
Contributor

@satyasm: The label(s) /label help-wanted cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash

In response to this:

/unassign @satyasm
/label help-wanted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@satyasm
Contributor

satyasm commented Mar 19, 2020

I have not had a chance to look at this or add any more details on top of what is already in the discussion above. Un-assigning myself to remove confusion so other can help. Thanks!

@thockin
Member

thockin commented Mar 19, 2020

@eraclitux There's something weird going on. The podCIDR with /24 is per-node. You do not want to change that to /32. Your TCPDump shows the servfail from the other pod, so it does seem to be making a connection.

I also reject the assertion that DNS on same-node doesn't work. It works just fine for me, and has worked forever.

Running an interactive pod on the same node as the only DNS replica:

$ dig +search +short +identify kubernetes.default
10.1.0.1 from server 10.1.0.10 in 0 ms.

So there's SOMETHING ELSE happening. The conntrack record seems correct - why is it not reversing the tracking?

@thockin thockin self-assigned this Mar 19, 2020
@uablrek
Contributor

uablrek commented Mar 20, 2020

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

@uablrek
Contributor

uablrek commented Mar 20, 2020

I have observed something; Even with an L2 network I do not see this problem when I install with kubeadm(?!), only when K8s is installed with something like https://github.com/kelseyhightower/kubernetes-the-hard-way as in this issue.

@uablrek
Contributor

uablrek commented Mar 20, 2020

Here is a tcpdump for a DNS query from a POD when the DNS-server POD is on the same node with L2 networking (cni-bridge) and masquerade-all=false. The DNS ClusterIP is 12.0.0.5;

/ # nslookup www.google.se 12.0.0.5
...
/ # tcpdump -ni eth0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:24:34.812986 IP 11.0.2.3.33306 > 12.0.0.5.53: 46097+ A? www.google.se. (31)
09:24:34.813309 IP 11.0.2.3.33306 > 12.0.0.5.53: 62080+ AAAA? www.google.se. (31)
09:24:34.813926 IP 11.0.2.2.53 > 11.0.2.3.33306: 46097* 1/0/0 A 216.58.207.227 (60)
09:24:34.835490 IP 11.0.2.2.53 > 11.0.2.3.33306: 62080 1/0/0 AAAA 2a00:1450:400f:80c::2003 (72)

And the corresponding conntrack entry in main netns on the same node;

vm-002 ~ # conntrack -L | grep 12.0.0.5
conntrack v1.4.5 (conntrack-tools): udp      17 24 src=11.0.2.3 dst=12.0.0.5 sport=33306 dport=53 [UNREPLIED] src=11.0.2.2 dst=11.0.2.3 sport=53 dport=33306 mark=0 use=1

@eraclitux
Author

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

Exactly @uablrek, as shown by my tcpdump the response packet arrives BUT its source address is not DNATted back, so the receiver discards it.

The podCIDR with /24 is per-node. You do not want to change that to /32

Can you elaborate on this, @thockin? It seems that assigning a /32 to pods is how the AWS VPC CNI works (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/cni-proposal.md) and this solves the problem for me.
Thank you.

@uablrek
Contributor

uablrek commented Mar 20, 2020

The same with masquerade-all=true;

/ # nslookup www.google.se 12.0.0.5
Server:         12.0.0.5
Address:        12.0.0.5:53

Name:   www.google.se
Address: 216.58.207.227

Non-authoritative answer:
Name:   www.google.se
Address: 2a00:1450:400f:80c::2003
/ # tcpdump -ni eth0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:39:36.011641 IP 11.0.0.3.47030 > 12.0.0.5.53: 1899+ A? www.google.se. (31)
09:39:36.011876 IP 11.0.0.3.47030 > 12.0.0.5.53: 2533+ AAAA? www.google.se. (31)
09:39:36.015935 IP 12.0.0.5.53 > 11.0.0.3.47030: 1899* 1/0/0 A 216.58.207.227 (60)
09:39:36.027979 IP 12.0.0.5.53 > 11.0.0.3.47030: 2533 1/0/0 AAAA 2a00:1450:400f:80c::2003 (72)

Corresponding conntrack entry in main netns on the same node;

vm-002 ~ # conntrack -L | grep 12.0.0.5
udp      17 28 src=11.0.0.3 dst=12.0.0.5 sport=47030 dport=53 src=11.0.0.2 dst=11.0.0.1 sport=53 dport=31141 [ASSURED] mark=0 use=1

Where 11.0.0.1 is the address of the cbr0 bridge device on the node and 11.0.0.2 is the address of the coredns POD.

@uablrek
Contributor

uablrek commented Mar 20, 2020

@eraclitux You can't assign a /32 address in an L2 network. Packets will likely go out from the POD if you set a default route to eth0, which in this case is a veth, i.e. point-to-point. But nothing will find its way back. You must then set corresponding routes in the main netns to the "other side" of the veth. What you have then done is transform the L2 network (using ARP) into your own version of an L3 network. Then IMHO you should switch to a maintained CNI-plugin that uses L3, e.g. Calico.

BTW, please make another try with masquerade-all and verify that KUBE-SERVICES really has the rule without the src-filter, as described above in #87426 (comment). I think when you tried it was not set, perhaps due to a precedence issue between the CLI option --masquerade-all and the;

iptables:
  masqueradeAll: true

in the kube-proxy config file (or ConfigMap if kube-proxy is in a POD).
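
A quick way to verify which variant is active (a sketch; it just looks for the source filter on the masquerade rule):

# with masqueradeAll the KUBE-MARK-MASQ rule in KUBE-SERVICES should match
# source 0.0.0.0/0 instead of the negated cluster CIDR (e.g. "!11.0.0.0/16")
iptables -t nat -L KUBE-SERVICES -nv | grep KUBE-MARK-MASQ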

@thockin
Member

thockin commented Mar 20, 2020

@uablrek

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

I believe you that you're seeing this, but I am asserting that this is not what we want and not "acceptable". As you said you are seeing it in some installs and not others, there's clearly SOMETHING misconfigured - it's taking a shortcut and bypassing conntrack, which is not what we need.

Looking at kubenet code, I see that we set /proc/sys/net/bridge/bridge-nf-call-iptables - can you check that?

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go#L162-L172

I can't reproduce it, so I'm just shooting in the dark...
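
For reference, a quick way to check that on the node (a sketch, nothing issue-specific assumed):

# 1 means bridged IPv4 traffic is passed through iptables/conntrack
sysctl net.bridge.bridge-nf-call-iptables

# if the key is missing entirely, the br_netfilter module is probably not loaded
lsmod | grep br_netfilter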

The podCIDR with /24 is per-node. You do not want to change that to /32

Can you elaborate on this, @thockin? It seems that assigning a /32 to pods is how the AWS VPC CNI works (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/cni-proposal.md) and this solves the problem for me.

"podCIDR" is a per-node field, not a per-pod field. If you set the node's podCIDR to /32 it only has 1 IP to use.

@uablrek
Contributor

uablrek commented Mar 21, 2020

@thockin Bullseye! Pretty good for shooting in the dark 😃

I removed "masquerade-all" and added;

modprobe br-netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=1

on node start-up, and both TCP to a local POD and DNS queries to a local server work perfectly.
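
To make that persistent across reboots, a minimal sketch assuming a systemd-based distro such as the Ubuntu 18.04 nodes used here (file names are illustrative):

# load the module at boot
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
# apply the sysctl now and on every boot
echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/99-bridge-nf.conf
sudo sysctl --system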

I will raise an issue on https://github.com/kelseyhightower/kubernetes-the-hard-way referring to this issue.

@eraclitux Please try the sysctl above and forget all about "masquerade-all". If it works, please close this issue.

@eraclitux
Author

@uablrek I can confirm that enabling br-netfilter and configuring it at startup fixes the conntrack table.
Thank you @uablrek and @thockin for helping on this! 🍺

@joppino

joppino commented Apr 14, 2022

When a pod contacts a Service clusterIP that maps back to that same pod, --masquerade-all is needed anyway.

Environment:

  • Baremetal k8s
  • Networking: calico
  • kube-proxy mode: iptables
  • br_netfilter is loaded

POC:

Create a nginx pod, then create a Service.
If you try to contact the Service clusterIP (using, for example, its mapped hostname) from the pod it maps to, it will be unreachable. br_netfilter is still needed, as it solves the same-host problem, but it doesn't solve the same-pod situation.
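
A sketch of that same-pod case (names are illustrative; assumes the container image ships curl):

kubectl run web --image=nginx --port=80 --restart=Never
kubectl expose pod web --port=80 --name=web-svc
# the backing pod contacts its own Service clusterIP; without hairpin NAT
# (e.g. --masquerade-all, or hairpin mode on the bridge port) this times out
kubectl exec web -- curl -s --max-time 5 http://web-svc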
