
Pod service endpoint unreachable from same host #87426

Closed
eraclitux opened this issue Jan 21, 2020 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@eraclitux

eraclitux commented Jan 21, 2020

What happened:
Establishing TCP/UDP traffic to a ClusterIP fails when the connection is load balanced via iptables to a pod on the same host.

What you expected to happen:
conntrack shows that the UDP datagram is DNATted to 10.200.1.37, with source 10.200.1.36 (the host has podCIDR: "10.200.1.0/24").

    [NEW] udp      17 30 src=10.200.1.36 dst=10.32.0.10 sport=45956 dport=53 [UNREPLIED] src=10.200.1.37 dst=10.200.1.36 sport=53 dport=45956
[DESTROY] udp      17 src=10.200.1.36 dst=10.32.0.10 sport=57957 dport=53 [UNREPLIED] src=10.200.1.37 dst=10.200.1.36 sport=53 dport=57957

From my understanding, because pods have a /24 mask, the reply from .37 doesn't go back through cnio0 but directly to .36, breaking the DNAT. This tcpdump shows it:

15:10:27.464509 IP 10.200.1.36.42897 > 10.32.0.10.53: 16896+ A? pippo.it. (26)
15:10:27.464587 IP 10.200.1.36.42897 > 10.32.0.10.53: 16896+ AAAA? pippo.it. (26)
15:10:27.464777 IP 10.200.1.37.53 > 10.200.1.36.42897: 16896 ServFail- 0/0/0 (26)
15:10:27.464841 IP 10.200.1.37.53 > 10.200.1.36.42897: 16896 ServFail- 0/0/0 (26)

How to reproduce it (as minimally and precisely as possible):

  • create ClusterIP with a single pod endpoint
  • from a pod on the same host, open a TCP connection or send a UDP datagram; the communication will fail (a minimal repro sketch follows this list)
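
A minimal repro sketch, assuming both pods land on the same node (names and images are illustrative, not taken from the original report):

# backend pod plus a ClusterIP Service with that single endpoint
kubectl run web --image=nginx --port=80 --restart=Never
kubectl expose pod web --port=80 --name=web-svc

# client pod on the same node (pin it there with spec.nodeName or a nodeSelector);
# the request hangs / times out when the endpoint is local to the node
kubectl run client --image=busybox --restart=Never -it --rm -- \
  wget -qO- -T 5 http://web-svc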

Anything else we need to know?:
I'm not able to assign a /32 subnet to pods; both:

    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.200.1.0/32"}]
        ]
    }

and:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
...
podCIDR: "10.200.1.0/32"

don't work. I ended up changing the subnet manually to see if my hypothesis was right.
This fixed the problem because it forced the packets to flow back through cnio0 to the DNAT tracked by conntrack:

ip netns exec cni-a6afeeee-a34b-8e24-de62-26ffa93a4bd8 ip a add 10.200.1.37/32 dev eth0
ip netns exec cni-a6afeeee-a34b-8e24-de62-26ffa93a4bd8 ip a del 10.200.1.37/24 dev eth0

Am I doing something wrong?
Why is it not possible to assign a /32 subnet to pods?
Is there a cleaner solution?

Even if the conditions are different, the problem could be similar to #87263

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2",
GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2",
GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Virtualbox VMs
  • OS (e.g: cat /etc/os-release):
    Ubuntu 18.04.3 LTS
  • Kernel (e.g. uname -a):
    Linux worker-0 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Manual installation following https://github.com/kelseyhightower/kubernetes-the-hard-way
  • Network plugin and version (if this is a network-related bug):
    L2 networks and linux bridging
  • Others:
    CNI conf:
{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "10.200.1.0/24"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}

iptables-save.txt

@eraclitux eraclitux added the kind/bug Categorizes issue or PR as related to a bug. label Jan 21, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 21, 2020
@eraclitux
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 21, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jan 21, 2020
@uablrek
Contributor

uablrek commented Feb 6, 2020

This is likely not a k8s fault but an effect of how DNS resolution works. I have had this problem too: DNS queries fail if the DNS server happens to be on the same node. The problem is that the reply comes back but does not have the ClusterIP as source, which is the destination in the query. The local resolver discards the reply since it has an "invalid" source.

I think you can fix the problem by specifying --masquerade-all to kube-proxy;

      --masquerade-all                               If using the pure iptables proxy, SNAT all traffic sent via Service cluster IPs (this not commonly needed)

but I have not verified this.

BTW I solved my problem by setting up local coredns in main netns on all nodes.
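
For reference, a minimal sketch of the two ways this option can be set, assuming kube-proxy runs as a systemd service as in kubernetes-the-hard-way (paths are illustrative):

# option 1: pass the flag on the kube-proxy command line
kube-proxy --proxy-mode=iptables --masquerade-all=true --kubeconfig=/var/lib/kube-proxy/kubeconfig

# option 2: set the equivalent field in the KubeProxyConfiguration file passed via --config:
#   iptables:
#     masqueradeAll: true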

@uablrek
Contributor

uablrek commented Feb 6, 2020

Now I have tested and yes, --masquerade-all seems to fix the problem (proxy-mode=iptables assumed)

@danwinship
Contributor

/assign @satyasm

@k8s-ci-robot
Contributor

@danwinship: GitHub didn't allow me to assign the following users: satyasm.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @satyasm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@satyasm
Contributor

satyasm commented Feb 6, 2020

/assign @satyasm

@uablrek
Contributor

uablrek commented Feb 6, 2020

This problem occurs when a CNI-plugin that provides L2 connectivity between pods on the same node is used, like the "bridge" CNI-plugin used in this issue (which I am also using). When I switch to Calico, which uses L3 and routes traffic back through the node, it works fine.

The bridge CNI-plugin sends the reply directly to the client pod on the same node (L2) and thus bypasses the connection tracker in the main netns.

@eraclitux
Author

eraclitux commented Feb 7, 2020

Thanks for the reply @uablrek. Unfortunately --masquerade-all does not solve the problem. The issue is not L7 but L2/L3; here is an example TCP connection that breaks the conntrack table:

[NEW] tcp      6 120 SYN_SENT src=10.200.1.56 dst=10.32.0.22 sport=41081 dport=80 [UNREPLIED] src=10.200.1.54 dst=10.200.1.56 sport=80 dport=41081
[DESTROY] tcp      6 src=10.200.1.56 dst=10.32.0.22 sport=41081 dport=80 [UNREPLIED] src=10.200.1.54 dst=10.200.1.56 sport=80 dport=41081

What do you think of my solution of assigning a /32 to pods? The host-local IPAM actually doesn't permit assigning a /32:
https://github.com/containernetworking/plugins/blob/43716656062566049f0b61d891c97d4a9cc6888d/plugins/ipam/host-local/backend/allocator/range.go#L33
Is there another Kubernetes configuration to do so (i.e. force the subnet to /32 even if IPAM returns a /24)?

@athenabot

@satyasm
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@uablrek
Contributor

uablrek commented Mar 8, 2020

Here is a sequence-diagram to illustrate the problem (plantuml source masquerade-all.puml.txt);

[sequence diagram: masquerade-all]

The difference for --masquerade-all (or "iptables.masqueradeAll:true/false" if a config file for kube-proxy is used) is the source filter in the KUBE-SERVICES chain. Check with iptables -t nat -L -nv;

Without masqueradeAll;

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    1    60 KUBE-MARK-MASQ  all  --  *      *      !11.0.0.0/16          0.0.0.0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst

With masqueradeAll;

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    1    60 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst

For CNI-plugins that do not provide L2 connectivity within the k8s node there is no problem. Here is the setup inside a POD with "Calico";

/ # ip addr show dev eth0
12: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1440 qdisc noqueue 
    link/ether 2a:9a:66:1e:91:6e brd ff:ff:ff:ff:ff:ff
    inet 11.0.246.65/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 1100::b00:f640/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::289a:66ff:fe1e:916e/64 scope link 
       valid_lft forever preferred_lft forever
/ # ip route
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link 

@athenabot

@satyasm
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@thockin
Member

thockin commented Mar 19, 2020

ping @satyasm

@satyasm
Contributor

satyasm commented Mar 19, 2020

/unassign @satyasm
/label help-wanted

@k8s-ci-robot
Contributor

@satyasm: The label(s) /label help-wanted cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash

In response to this:

/unassign @satyasm
/label help-wanted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@satyasm
Contributor

satyasm commented Mar 19, 2020

I have not had a chance to look at this or add any more details on top of what is already in the discussion above. Un-assigning myself to remove confusion so other can help. Thanks!

@thockin
Member

thockin commented Mar 19, 2020

@eraclitux There's something weird going on. The podCIDR with /24 is per-node. You do not want to change that to /32. Your TCPDump shows the servfail from the other pod, so it does seem to be making a connection.

I also reject the assertion that DNS on same-node doesn't work. It works just fine for me, and has worked forever.

Running an interactive pod on the same node as the only DNS replica:

$ dig +search +short +identify kubernetes.default
10.1.0.1 from server 10.1.0.10 in 0 ms.

So there's SOMETHING ELSE happening. The conntrack record seems correct - why is it not reversing the tracking?

@thockin thockin self-assigned this Mar 19, 2020
@uablrek
Contributor

uablrek commented Mar 20, 2020

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

@uablrek
Contributor

uablrek commented Mar 20, 2020

I have observed something; Even with an L2 network I do not see this problem when I install with kubeadm(?!), only when K8s is installed with something like https://github.com/kelseyhightower/kubernetes-the-hard-way as in this issue.

@uablrek
Contributor

uablrek commented Mar 20, 2020

Here is a tcpdump for a DNS query from a POD when the DNS-server POD is on the same node with L2 networking (cni-bridge) and masquerade-all=false. The DNS ClusterIP is 12.0.0.5;

/ # nslookup www.google.se 12.0.0.5
...
/ # tcpdump -ni eth0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:24:34.812986 IP 11.0.2.3.33306 > 12.0.0.5.53: 46097+ A? www.google.se. (31)
09:24:34.813309 IP 11.0.2.3.33306 > 12.0.0.5.53: 62080+ AAAA? www.google.se. (31)
09:24:34.813926 IP 11.0.2.2.53 > 11.0.2.3.33306: 46097* 1/0/0 A 216.58.207.227 (60)
09:24:34.835490 IP 11.0.2.2.53 > 11.0.2.3.33306: 62080 1/0/0 AAAA 2a00:1450:400f:80c::2003 (72)

And the corresponding conntrack entry in main netns on the same node;

vm-002 ~ # conntrack -L | grep 12.0.0.5
conntrack v1.4.5 (conntrack-tools): udp      17 24 src=11.0.2.3 dst=12.0.0.5 sport=33306 dport=53 [UNREPLIED] src=11.0.2.2 dst=11.0.2.3 sport=53 dport=33306 mark=0 use=1

@eraclitux
Author

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

Exactly @uablrek, as shown by my tcpdump the response packet arrives BUT its source address is not DNATted back, so the receiver discards it.

The podCIDR with /24 is per-node. You do not want to change that to /32

Can you elaborate on this, @thockin? It seems that assigning a /32 to pods is how the AWS VPC CNI works (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/cni-proposal.md) and this solves the problem for me.
Thank you.

@uablrek
Contributor

uablrek commented Mar 20, 2020

The same with masquerade-all=true;

/ # nslookup www.google.se 12.0.0.5
Server:         12.0.0.5
Address:        12.0.0.5:53

Name:   www.google.se
Address: 216.58.207.227

Non-authoritative answer:
Name:   www.google.se
Address: 2a00:1450:400f:80c::2003
/ # tcpdump -ni eth0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:39:36.011641 IP 11.0.0.3.47030 > 12.0.0.5.53: 1899+ A? www.google.se. (31)
09:39:36.011876 IP 11.0.0.3.47030 > 12.0.0.5.53: 2533+ AAAA? www.google.se. (31)
09:39:36.015935 IP 12.0.0.5.53 > 11.0.0.3.47030: 1899* 1/0/0 A 216.58.207.227 (60)
09:39:36.027979 IP 12.0.0.5.53 > 11.0.0.3.47030: 2533 1/0/0 AAAA 2a00:1450:400f:80c::2003 (72)

Corresponding conntrack entry in main netns on the same node;

vm-002 ~ # conntrack -L | grep 12.0.0.5
udp      17 28 src=11.0.0.3 dst=12.0.0.5 sport=47030 dport=53 src=11.0.0.2 dst=11.0.0.1 sport=53 dport=31141 [ASSURED] mark=0 use=1

Where 11.0.0.1 is the address of the cbr0 bridge device on the node and 11.0.0.2 is the address of the coredns POD.

@uablrek
Contributor

uablrek commented Mar 20, 2020

@eraclitux You can't assign a /32 address in an L2 network. Packets will likely go out from the POD if you set a default route to eth0, which in this case is a veth, i.e. point-to-point. But nothing will find its way back. You must then set corresponding routes in the main netns to the "other side" of the veth. What you have then done is transform the L2 network (using ARP) into your own version of an L3 network. Then IMHO you should switch to a maintained CNI-plugin that uses L3, e.g. Calico.

BTW, please make another try with masquerade-all and verify that KUBE-SERVICES really has the rule without the src-filter, as described above in #87426 (comment). I think when you tried it was not set, perhaps due to a precedence issue between the CLI option --masquerade-all and the;

iptables:
  masqueradeAll: true

in the kube-proxy config file (or ConfigMap if kube-proxy is in a POD).
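
A quick way to verify which variant is active (a sketch; it just looks for the source filter on the masquerade rule):

# with masqueradeAll the KUBE-MARK-MASQ rule in KUBE-SERVICES should match
# source 0.0.0.0/0 instead of the negated cluster CIDR (e.g. "!11.0.0.0/16")
iptables -t nat -L KUBE-SERVICES -nv | grep KUBE-MARK-MASQ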

@thockin
Member

thockin commented Mar 20, 2020

@uablrek

DNS queries to a server on the same node do not work for the same reason that TCP doesn't work: the query is sent to the DNS service address, but the reply arrives with the pod address as source (because it is not NAT'ed back) and is rejected by the local resolver. But the reply does arrive.

I believe you that you're seeing this, but I am asserting that this is not what we want and not "acceptable". As you said you are seeing it in some installs and not others, there's clearly SOMETHING misconfigured - it's taking a shortcut and bypassing conntrack, which is not what we need.

Looking at kubenet code, I see that we set /proc/sys/net/bridge/bridge-nf-call-iptables - can you check that?

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go#L162-L172

I can't reproduce it, so I'm just shooting in the dark...
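
For reference, a quick way to check that on the node (a sketch, nothing issue-specific assumed):

# 1 means bridged IPv4 traffic is passed through iptables/conntrack
sysctl net.bridge.bridge-nf-call-iptables

# if the key is missing entirely, the br_netfilter module is probably not loaded
lsmod | grep br_netfilter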

The podCIDR with /24 is per-node. You do not want to change that to /32

Can you elaborate on this, @thockin? It seems that assigning a /32 to pods is how the AWS VPC CNI works (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/cni-proposal.md) and this solves the problem for me.

"podCIDR" is a per-node field, not a per-pod field. If you set the node's podCIDR to /32 it only has 1 IP to use.

@uablrek
Contributor

uablrek commented Mar 21, 2020

@thockin Bullseye! Pretty good for shooting in the dark 😃

I removed "masquerade-all" and added;

modprobe br-netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=1

on node start-up, and both TCP to a local POD and DNS queries to a local server work perfectly.
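
To make that persistent across reboots, a minimal sketch assuming a systemd-based distro such as the Ubuntu 18.04 nodes used here (file names are illustrative):

# load the module at boot
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
# apply the sysctl now and on every boot
echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/99-bridge-nf.conf
sudo sysctl --system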

I will raise an issue on https://github.com/kelseyhightower/kubernetes-the-hard-way referring to this issue.

@eraclitux Please try the sysctl above and forget all about "masquerade-all". If it works, please close this issue.

@eraclitux
Author

@uablrek I can confirm that enabling br-netfilter and configuring it at startup fixes the conntrack table.
Thank you @uablrek and @thockin for helping on this! 🍺

@joppino

joppino commented Apr 14, 2022

When a pod contacts a Service clusterIP that maps back to that same pod, --masquerade-all is needed anyway.

Environment:

  • Baremetal k8s
  • Networking: calico
  • kube-proxy mode: iptables
  • br_netfilter is loaded

POC:

Create a nginx pod, then create a Service.
If you try to contact the Service clusterIP (using, for example, its mapped hostname) from the pod it maps to, it will be unreachable. br_netfilter is still needed, as it solves the same-host problem, but it doesn't solve the same-pod situation.
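
A sketch of that same-pod case (names are illustrative; assumes the container image ships curl):

kubectl run web --image=nginx --port=80 --restart=Never
kubectl expose pod web --port=80 --name=web-svc
# the backing pod contacts its own Service clusterIP; without hairpin NAT
# (e.g. --masquerade-all, or hairpin mode on the bridge port) this times out
kubectl exec web -- curl -s --max-time 5 http://web-svc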
