
NodePort accessible only from host where POD is running #89632

Closed
cfabio opened this issue Mar 29, 2020 · 34 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/duplicate Indicates an issue is a duplicate of other open issue.

Comments

cfabio commented Mar 29, 2020

What happened:
I have a Kubernetes cluster with 1 master and 2 slave nodes; the node IP addresses are 192.168.122.110, .111, and .112/24.
Deployment and Service defined as follows:

apiVersion: v1
kind: Service
metadata:
  name: test
  labels:
    app: test
spec:
  type: NodePort
  ports:
    - port: 1883
      name: eclipse-mosquitto
      protocol: TCP
      targetPort: 1883
      nodePort: 31883
  selector:
    app: test

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: eclipse-mosquitto
          image: eclipse-mosquitto
          ports:
          - containerPort: 1883

What you expected to happen:
I would expect to be able to reach the service exposed on port 31883 from every node IP of the cluster; this worked fine in Kubernetes 1.15 and 1.16.
Since Kubernetes 1.17.0 this is no longer the case: the service exposed on port 31883 is only accessible on the IP address of the physical node where the pod is actually running.

How to reproduce it (as minimally and precisely as possible):
The Kubernetes cluster is created with the kubeadm init --pod-network-cidr=10.244.0.0/16 command. The network plugin is flannel, installed from the YAML definition file provided here: https://github.com/coreos/flannel#deploying-flannel-manually
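
Condensed, the reproduction looks roughly like this (a sketch; kube-flannel.yml and test.yaml are assumed filenames for the flannel manifest linked above and the Service/Deployment shown above):

# on the master, create the cluster and install flannel
kubeadm init --pod-network-cidr=10.244.0.0/16
kubectl apply -f kube-flannel.yml

# deploy the test Service and Deployment
kubectl apply -f test.yaml

# from a machine outside the cluster, probe the NodePort on every node
nmap 192.168.122.110,111,112 -p 31883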

Anything else we need to know?:
I tried going back to Kubernetes versions 1.15.4 and 1.16.8, and the same service/deployment definition magically starts to work again.
With Kubernetes 1.18.0, running nmap against all 3 nodes gives this output:

$ nmap 192.168.122.110,111,112 -p 31883
Starting Nmap 7.80 ( https://nmap.org ) at 2020-03-29 17:01 CEST
Nmap scan report for 192.168.122.110
Host is up (0.00017s latency).

PORT      STATE    SERVICE
31883/tcp filtered unknown

Nmap scan report for 192.168.122.111
Host is up (0.00036s latency).

PORT      STATE SERVICE
31883/tcp open  unknown

Nmap scan report for 192.168.122.112
Host is up (0.00021s latency).

PORT      STATE    SERVICE
31883/tcp filtered unknown

Nmap done: 3 IP addresses (3 hosts up) scanned in 0.23 second

I am not really sure if this is actually a bug in Kubernetes, flannel or some kind of incompatibility between some other component of the system.

Environment:

  • Kubernetes version (use kubectl version): 1.18.0 (same issue also on version 1.17.4)
  • Cloud provider or hardware configuration: on-premise virtual machines
  • OS (e.g: cat /etc/os-release): CentOS Linux release 7.7.1908
  • Kernel (e.g. uname -a): Linux centos 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): flannel
  • Others:
@cfabio cfabio added the kind/bug Categorizes issue or PR as related to a bug. label Mar 29, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 29, 2020
neolit123 (Member):

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 29, 2020
uablrek (Contributor) commented Mar 31, 2020

Proxy-mode ipvs or iptables?
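
One way to check on a kubeadm cluster (a sketch; the pod name is a placeholder, and the "Using ... Proxier" log line is an assumption about kube-proxy's startup output):

# what the ConfigMap says (an empty mode defaults to iptables)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"

# what a running kube-proxy pod actually chose
kubectl -n kube-system logs kube-proxy-xxxxx | grep -i proxier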

uablrek (Contributor) commented Mar 31, 2020

It seems I can reproduce this problem, but not consistently: I can reach the NodePort on some nodes, not only the one where the pod is executing, while on other nodes it doesn't work.

I never get this problem with proxy-mode=ipvs, only with iptables.

@cfabio If possible try with proxy-mode=ipvs.

/assign

uablrek (Contributor) commented Mar 31, 2020

Weirdest finding today? Well, here it is:

On nodes where no pod has ever executed, forwarding to the NodePort doesn't work (?!)

Accessing the NodePort from within a node where no pod has executed works, but traffic from an external source to the NodePort via such a node does not.

Tested on K8s v1.16.7, v1.17.3, v1.18.0.

So it is not bound to K8s > v1.16.x

@cfabio Could it be that when you tested on k8s < v1.17, pods had already been executing on the nodes?

Story

I accidentally loaded a deployment with many replicas for a test and thought I would just scale it down to 1, but then ... it worked! Access to the NodePort worked via all nodes. I then loaded an alpine daemonset completely unrelated to the test app, and access to the NodePort worked via all nodes. Finally I added the alpine daemonset, removed it again, waited until no alpine pods were running (and 5 extra seconds after that), then loaded the test app, and it still worked fine!

The alpine manifest;

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alpine-daemonset
spec:
  selector:
    matchLabels:
      app: alpine
  template:
    metadata:
      labels:
        app: alpine
    spec:
      containers:
      - name: alpine
        image: library/alpine:latest
        command: ["tail", "-f", "/dev/null"]

@cfabio When you have the problem can you load this manifest and see if it helps?
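
The corresponding commands would be something like this (a sketch; alpine-daemonset.yaml is an assumed filename):

kubectl apply -f alpine-daemonset.yaml
# wait until one alpine pod is running on every node
kubectl get pods -o wide -l app=alpine
# then probe the NodePort again from outside the cluster
nmap 192.168.122.110,111,112 -p 31883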

By coincidence (I think) the header for this issue is very accurate 😄

uablrek (Contributor) commented Mar 31, 2020

My first idea was that some kernel module was not loaded, but the lsmod output was identical.

cfabio (Author) commented Mar 31, 2020

I am currently playing with IPVS to see if it makes any difference.

@cfabio Can it be that when you tested on k8s < v1.17 pods had been executing on the nodes?

Before rolling back to v1.15.* and v1.16.* I killed the cluster (drain, delete nodes, remove-etcd-member, cleaned up iptables, etc) and started again from scratch.

I'll run your Alpine daemonset and let you know about that.

cfabio (Author) commented Mar 31, 2020

@uablrek here is what I did:

  • configured a cluster with kube-proxy in iptables mode and ran my usual MQTT broker pod; as expected, the NodePort was open only on the host where the pod is running.
    Your Alpine DaemonSet runs just fine, but the issue with the MQTT broker is still there.
  • killed the cluster and rebuilt it with kube-proxy in ipvs mode; even in this case the NodePort is accessible only on the IP of the host where the pod is running.
    The kube-proxy logs are full of this error though:
E0331 13:11:06.676253       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 244 2 90 0 0 0 0 0 0 0 0 0 0 0 0]
E0331 13:11:06.676265       1 proxier.go:1533] Failed to sync endpoint for service: 192.168.122.112:31883/TCP, err: parseIP Error ip=[10 244 2 90 0 0 0 0 0 0 0 0 0 0 0 0]

This might have to do with me doing something wrong when changing KubeProxy config from iptables to ipvs though.

uablrek (Contributor) commented Mar 31, 2020

The error printouts are caused by this problem:
#89520

It is not present in k8s v1.17.x.

uablrek (Contributor) commented Mar 31, 2020

An advantage of ipvs is that it is easier to troubleshoot. When you use proxy-mode=ipvs, run:

ipvsadm -Ln

on nodes where forwarding doesn't work.

uablrek (Contributor) commented Mar 31, 2020

This is how it looks on my test system. My test server uses port 5001 instead of 1883; otherwise I took your manifests more or less as-is:

vm-002 ~ # ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.2:31883 rr
  -> 11.0.3.2:5001                Masq    1      0          0         
TCP  192.168.1.2:31883 rr
  -> 11.0.3.2:5001                Masq    1      0          1         
TCP  12.0.0.1:443 rr
  -> 192.168.1.1:6443             Masq    1      0          0         
TCP  12.0.144.99:1883 rr
  -> 11.0.3.2:5001                Masq    1      0          0         
TCP  12.0.162.229:443 rr
  -> 11.0.2.2:4443                Masq    1      0          0         
TCP  127.0.0.1:31883 rr
  -> 11.0.3.2:5001                Masq    1      0          0         

cfabio (Author) commented Mar 31, 2020

Here is the output of the ipvsadm -Ln command executed on a node where the NodePort is filtered:

Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  127.0.0.1:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  172.17.0.1:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  172.18.0.1:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  172.19.0.1:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  192.168.122.111:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  10.96.0.1:443 rr
  -> 192.168.122.110:6443         Masq    1      1          0         
TCP  10.96.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0         
  -> 10.244.0.3:53                Masq    1      0          0         
TCP  10.96.0.10:9153 rr
  -> 10.244.0.2:9153              Masq    1      0          0         
  -> 10.244.0.3:9153              Masq    1      0          0         
TCP  10.107.158.203:1883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
TCP  10.244.1.0:31883 rr
  -> 10.244.2.3:1883              Masq    1      0          0         
UDP  10.96.0.10:53 rr
  -> 10.244.0.2:53                Masq    1      0          0         
  -> 10.244.0.3:53                Masq    1      0          0         

I am using CentOS 7, which has a very old 3.10.* kernel; that also seems to be the case for bug #89520.

uablrek (Contributor) commented Mar 31, 2020

Looks good actually. The entry in ipvs is there with 2 inactive connections. Your problem seems to be different from the one I found 😞

cfabio (Author) commented Mar 31, 2020

The way I have set up kube-proxy to work with IPVS is the following:

  • kubectl edit configmap kube-proxy -n kube-system and changed mode: "" to mode: ipvs.
  • kubectl delete po -n kube-system kube-proxy-*** and deleted every kube-proxy pod.
  • kubectl logs kube-proxy-*** | grep -i "using ipvs" checked that the freshly started kube-proxy pods actually use ipvs.
  • rebooted the machines of the cluster just to be safe.

Does this procedure make sense to you?

uablrek (Contributor) commented Mar 31, 2020

Yes, looks OK, but I might have missed something. With kubeadm, use a config file with the --config param and add at the end:

---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
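
That is, append the block above to the kubeadm configuration file and pass it at init time (a sketch; kubeadm-config.yaml is an assumed filename):

# generate a starting point, append the KubeProxyConfiguration block above, then init
kubeadm config print init-defaults > kubeadm-config.yaml
kubeadm init --config kubeadm-config.yaml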

cfabio (Author) commented Mar 31, 2020

@uablrek manually changing the kube-proxy configuration by editing the configmap, or rebuilding the cluster from scratch using kubeadm --config, doesn't seem to make any difference; in both cases the kube-proxy pod logs are spammed with the error:

E0331 14:38:40.106113       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 244 2 2 0 0 0 0 0 0 0 0 0 0 0 0]
E0331 14:38:40.106142       1 proxier.go:1533] Failed to sync endpoint for service: 172.18.0.1:31883/TCP, err: parseIP Error ip=[10 244 2 2 0 0 0 0 0 0 0 0 0 0 0 0]

This is the yml file used to build the cluster with kubeadm --config:

apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta2
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: k8s.gcr.io
kind: ClusterConfiguration
kubernetesVersion: v1.18.0
networking:
  dnsDomain: cluster.local
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
scheduler: {}
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs

At this point the only thing I can think of is trying an OS with a more up-to-date Linux kernel, or staying with Kubernetes v1.16.8 until these issues are ironed out.

uablrek (Contributor) commented Mar 31, 2020

The "parseIP Error" is in k8s v1.18.0 only. It should work for v1.17.3 for instance.

uablrek (Contributor) commented Apr 1, 2020

@cfabio Since the problem I had when reproducing this issue does not seem to be the same as yours, I can't take this much further. Since the problem does not appear in other clusters, it must be something in your environment. I don't have time to duplicate your setup, but I can give some advice for troubleshooting.

At least the ipvs setup seems OK, so the next step may be some traffic monitoring with tcpdump. On a node where forwarding does not work, it is interesting to see whether the request is forwarded. The port should be NAT'ed, so trace with:

tcpdump -ni <outgoing-if> port 1883

Where the "outgoing-if" is the interface your CNI-plugin (flannel) uses to forward packets to other nodes. Compare with a trace where it works.

If traffic never leaves the receiving node the problem is there. However there are multiple steps that can fail. Packets may reach the POD but return traffic may fail, etc
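
With flannel's vxlan backend the outgoing interface is usually flannel.1 (an assumption about this setup), so the comparison could look like:

# node where forwarding fails: is the DNAT'ed request sent towards the pod's node?
tcpdump -ni flannel.1 port 1883

# node where the pod runs: does that request ever arrive?
tcpdump -ni any port 1883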

cfabio (Author) commented Apr 1, 2020

@uablrek thanks for the suggestion.
If I may ask, what operating system did you use for your tests?
The issue I am facing with NodePort (iptables and ipvs) might be related to CentOS 7.

uablrek (Contributor) commented Apr 1, 2020

I use an image I built myself, bare-minimum and BusyBox based. It removes interference from other things and gives control over tool versions; e.g. my kernel is linux-5.6. By the way, I tried to go back down to linux-3.10, but then cri-o (my CRI plugin, which replaces docker) stopped working because some fs-overlay feature was missing, so I dropped that track.

uablrek (Contributor) commented Apr 1, 2020

But the 3.10 kernel in CentOS 7 is very old. So IMHO it is worth upgrading just to rule out kernel-version problems at least.

cfabio (Author) commented Apr 2, 2020

I set up a brand new cluster of 3 CentOS 8 nodes (1 master, 2 slaves) and tried every combination of Kubernetes versions 1.18.0, 1.17.4, and 1.17.3 with ipvs and iptables; the situation is exactly the same as on CentOS 7: same issues, same errors.
Network configuration is:

  • master 192.168.122.120
  • slave1 192.168.122.121
  • slave2 192.168.122.122

The chosen backend is ipvs.
I configured a pod with the Eclipse Mosquitto MQTT broker as described in the OP; it gets executed on node slave1 and gets IP address 10.244.1.2.
The physical nodes (master and slave2) can ping 10.244.1.2, and nmap 10.244.1.2 -p 1883 confirms that the MQTT broker is listening on port 1883.
When checking port 31883 from outside the cluster with nmap, port 31883 is reported open only on the physical node where the pod is executed (slave1).
I did what you suggested and ran tcpdump on a physical node (slave2) where port 31883 is reported closed; the output is the following:

# tcpdump -ni flannel.1 port 1883
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:35:37.955774 IP 10.244.2.0.58212 > 10.244.1.2.mqtt: Flags [S], seq 4005778623, win 64240, options [mss 1460,sackOK,TS val 2731514499 ecr 0,nop,wscale 7], length 0

which seems correct, as traffic is being routed to the IP address and port of the MQTT broker pod (10.244.1.2.mqtt).
What is interesting is that, if I open a shell into the MQTT broker pod and run tcpdump inside it, nothing shows up, not a single packet.
Same result if I run tcpdump -ni any port 1883 on the physical host where the pod is in execution (slave1).
slave1 ipvs rules:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  centos8-k8s-slave1:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  centos8-k8s-slave1:https rr
  -> 192.168.122.120:sun-sr-https Masq    1      1          0         
TCP  centos8-k8s-slave1:domain rr
  -> 10.244.0.2:domain            Masq    1      0          0         
  -> 10.244.0.3:domain            Masq    1      0          0         
TCP  centos8-k8s-slave1:9153 rr
  -> 10.244.0.2:9153              Masq    1      0          0         
  -> 10.244.0.3:9153              Masq    1      0          0         
TCP  centos8-k8s-slave1:mqtt rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  centos8-k8s-slave1:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  centos8-k8s-slave1:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  localhost:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
UDP  centos8-k8s-slave1:domain rr
  -> 10.244.0.2:domain            Masq    1      0          0         
  -> 10.244.0.3:domain            Masq    1      0          0         

slave2 ipvs rules:

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  centos8-k8s-slave2:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      2          0         
TCP  centos8-k8s-slave2:https rr
  -> 192.168.122.120:sun-sr-https Masq    1      1          0         
TCP  centos8-k8s-slave2:domain rr
  -> 10.244.0.2:domain            Masq    1      0          0         
  -> 10.244.0.3:domain            Masq    1      0          0         
TCP  centos8-k8s-slave2:9153 rr
  -> 10.244.0.2:9153              Masq    1      0          0         
  -> 10.244.0.3:9153              Masq    1      0          0         
TCP  centos8-k8s-slave2:mqtt rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  centos8-k8s-slave2:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
TCP  localhost:31883 rr
  -> 10.244.1.2:mqtt              Masq    1      0          0         
UDP  centos8-k8s-slave2:domain rr
  -> 10.244.0.2:domain            Masq    1      0          0         
  -> 10.244.0.3:domain            Masq    1      0          0 
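
Since the SYN leaves flannel.1 on slave2 but never shows up on slave1, one more check (a sketch; flannel's default vxlan UDP port 8472 and eth0 as the node's physical interface are assumptions) is to capture the encapsulated packets on the sending node's physical NIC and look for checksum problems:

# on slave2, watch the vxlan-encapsulated traffic leaving the node
tcpdump -ni eth0 -vv udp port 8472
# "bad udp cksum" on the outer packets would point at a checksum/TX-offload problem on the vxlan device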

thockin (Member) commented Apr 2, 2020

Both IPVS and iptables modes use iptables for NodePort forwarding, I believe. But there are no other reports of this happening, so I don't buy that it is broken IN GENERAL.

I am (obviously) more inclined to blame flannel, but before I do that maybe we can rule out the iptables parts.

Can you provide a full iptables-save (feel free to redact IPs if you need, just replace them with unique identifiers :) from a working node and a failing node?

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Apr 2, 2020
@thockin thockin self-assigned this Apr 2, 2020
rikatz (Contributor) commented Apr 3, 2020

@cfabio please take a look into #88986

So far, if this is the same case as I'm seeing (CentOS 7.7, kernel 3.10.0-1062.*, flannel + vxlan), you should disable TX offloading at the source of the communication and it will work.

I've been looking at ClusterIP cases, but I can confirm that this also happens with NodePort (I ran a test with your case).

@thockin do you mind if I take this issue and aggregate it into #88986?

whites11 commented Apr 4, 2020

I think I'm facing this same issue on two different clusters.
Not using flannel (we use AWS CNI and Azure CNI).
kube-proxy in iptables mode.

Playing around with tcpdump, it seems that the TCP handshake is not answered on a broken node:

# tcpdump -i any "port 80 or port 30010"
06:52:32.994186 IP worker2.58138 > worker1.30010: Flags [S], seq 3335691952, win 64240, options [mss 1418,nop,nop,sackOK,nop,wscale 7], length 0
06:52:34.056953 IP worker2.58138 > worker1.30010: Flags [S], seq 3335691952, win 64240, options [mss 1418,nop,nop,sackOK,nop,wscale 7], length 0

worker1 is a node without the pod running.
worker2 is another node in the cluster.
30010 is my nodeport
80 is the containerPort.

The TCP SYN packet arrives at worker1 but never gets an answer, nor does it get sent elsewhere, AFAIU.

I also tried to tcpdump on all interfaces on the node where the only instance of the pod is running, and I get zero packets on either port (30010 or 80).
I compared the iptables-save output from a node that has the pod and one that does not.
On the broken node, I see these rules:

-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/nginx-ingress-controller:http" -m tcp --dport 30010 -j KUBE-XLB-5URCD7LMTHSEGXBZ
-A KUBE-XLB-5URCD7LMTHSEGXBZ -m comment --comment "kube-system/nginx-ingress-controller:http has no local endpoints" -j KUBE-MARK-DROP

While in the working one I see:

-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/nginx-ingress-controller:http" -m tcp --dport 30010 -j KUBE-XLB-5URCD7LMTHSEGXBZ
-A KUBE-XLB-5URCD7LMTHSEGXBZ -m comment --comment "Balancing rule 0 for kube-system/nginx-ingress-controller:http" -j KUBE-SEP-WGDPEU6D5NJVWF7U
-A KUBE-SEP-WGDPEU6D5NJVWF7U -s 10.0.132.147/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-WGDPEU6D5NJVWF7U -p tcp -m tcp -j DNAT --to-destination 10.0.132.147:80

My only concern is about the KUBE-MARK-DROP from the first output. I'd expect the packet to be forwarded to one of the other nodes, not to be dropped.
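
The KUBE-XLB-* chains appear to be what kube-proxy installs for Services with externalTrafficPolicy: Local, and with that policy a node that has no local endpoint drops NodePort traffic by design instead of forwarding it to another node. A quick check (the service name and namespace are taken from the rules above):

kubectl -n kube-system get svc nginx-ingress-controller -o jsonpath='{.spec.externalTrafficPolicy}'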

whites11 commented Apr 4, 2020

For what it's worth, I can't reproduce this issue with a multi-node kind cluster on Kubernetes v1.16.3 or v1.17.0.

cfabio (Author) commented Apr 4, 2020

@thockin iptables-save from slave1 (the one where the POD is in execution):

# Generated by iptables-save v1.4.21 on Sat Apr  4 16:01:29 2020
*mangle
:PREROUTING ACCEPT [34057:20971889]
:INPUT ACCEPT [34043:20970013]
:FORWARD ACCEPT [1:60]
:OUTPUT ACCEPT [30134:2322377]
:POSTROUTING ACCEPT [29688:2292643]
:KUBE-KUBELET-CANARY - [0:0]
COMMIT
# Completed on Sat Apr  4 16:01:29 2020
# Generated by iptables-save v1.4.21 on Sat Apr  4 16:01:29 2020
*filter
:INPUT ACCEPT [182:52274]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [137:15752]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -o br-968244e70a0e -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-968244e70a0e -j DOCKER
-A FORWARD -i br-968244e70a0e ! -o br-968244e70a0e -j ACCEPT
-A FORWARD -i br-968244e70a0e -o br-968244e70a0e -j ACCEPT
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A FORWARD -s 10.244.0.0/16 -j ACCEPT
-A FORWARD -d 10.244.0.0/16 -j ACCEPT
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i br-968244e70a0e ! -o br-968244e70a0e -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i docker_gwbridge ! -o docker_gwbridge -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-968244e70a0e -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o docker_gwbridge -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
# Completed on Sat Apr  4 16:01:29 2020
# Generated by iptables-save v1.4.21 on Sat Apr  4 16:01:29 2020
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.19.0.0/16 ! -o br-968244e70a0e -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
-A POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER -i br-968244e70a0e -j RETURN
-A DOCKER -i docker_gwbridge -j RETURN
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-SERVICES ! -s 10.244.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
COMMIT
# Completed on Sat Apr  4 16:01:29 2020

iptables-save from slave2:

# Generated by iptables-save v1.4.21 on Sat Apr  4 16:02:41 2020
*mangle
:PREROUTING ACCEPT [34529:21104949]
:INPUT ACCEPT [34516:21103133]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [30632:2366423]
:POSTROUTING ACCEPT [30174:2336057]
:KUBE-KUBELET-CANARY - [0:0]
COMMIT
# Completed on Sat Apr  4 16:02:41 2020
# Generated by iptables-save v1.4.21 on Sat Apr  4 16:02:41 2020
*filter
:INPUT ACCEPT [92:18439]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [63:10736]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -o br-968244e70a0e -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-968244e70a0e -j DOCKER
-A FORWARD -i br-968244e70a0e ! -o br-968244e70a0e -j ACCEPT
-A FORWARD -i br-968244e70a0e -o br-968244e70a0e -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A FORWARD -s 10.244.0.0/16 -j ACCEPT
-A FORWARD -d 10.244.0.0/16 -j ACCEPT
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i docker_gwbridge ! -o docker_gwbridge -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i br-968244e70a0e ! -o br-968244e70a0e -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o docker_gwbridge -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o br-968244e70a0e -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
# Completed on Sat Apr  4 16:02:41 2020
# Generated by iptables-save v1.4.21 on Sat Apr  4 16:02:41 2020
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-LOAD-BALANCER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODE-PORT - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
-A POSTROUTING -s 172.19.0.0/16 ! -o br-968244e70a0e -j MASQUERADE
-A POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER -i docker_gwbridge -j RETURN
-A DOCKER -i br-968244e70a0e -j RETURN
-A KUBE-FIREWALL -j KUBE-MARK-DROP
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODE-PORT -p tcp -m comment --comment "Kubernetes nodeport TCP port for masquerade purpose" -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SERVICES ! -s 10.244.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
COMMIT
# Completed on Sat Apr  4 16:02:41 2020

@rikatz disabling TX and RX offload for the flannel.1 interface on every node of the cluster (ansible centos7-k8s-local -u root -a "ethtool --offload flannel.1 rx off tx off") seems to be an effective workaround on CentOS 7 with Kubernetes 1.18.0, flannel and IPVS.
I wonder what the implications of disabling TX offload are.
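
To confirm the change took effect on a node, something like this should do (exact feature names vary slightly between ethtool versions):

ethtool -k flannel.1 | grep -E 'rx-checksumming|tx-checksumming'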

whites11 commented Apr 4, 2020

It would be awesome to have an iptables-save example from a cluster not facing this issue, to compare the iptables rules.

whites11 commented Apr 5, 2020

I realised that changing the Service's externalTrafficPolicy from Local to Cluster makes the networking work again. It is not really a solution in my case, but it might help explain the behaviour.
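
The switch itself is a one-liner (a sketch; the service name and namespace are placeholders):

kubectl -n <namespace> patch svc <service-name> -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'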

rikatz (Contributor) commented Apr 6, 2020

@rikatz disabling TX and RX offload for the flannel.1 interface on every node of the cluster (ansible centos7-k8s-local -u root -a "ethtool --offload flannel.1 rx off tx off") seems to be an effective workaround on CentOS 7 with Kubernetes 1.18.0, flannel and IPVS.
I wonder what the implications of disabling TX offload are.

Yes, this is a known issue with the RH/CentOS 7 kernel (3.10.0-1062) with vxlan and TX offload, so the only workaround right now is to disable TX offload, wait for a kernel update, or use another OS.

If this solves the problem, then it is a known problem. I'll mark this issue as a duplicate of #88986 and close it; if you want, you can also follow that other issue, and as soon as an updated kernel version is released we can test and post there.

But to reinforce: in this case it is not a Kubernetes problem, but a CNI problem with the current CentOS 7.7 kernel.

Thank you!

/remove-triage unresolved
/triage duplicate
/close

k8s-ci-robot (Contributor):

@rikatz: The label(s) triage/ cannot be applied, because the repository doesn't have them

@k8s-ci-robot k8s-ci-robot added triage/duplicate Indicates an issue is a duplicate of other open issue. and removed triage/unresolved Indicates an issue that can not or will not be resolved. labels Apr 6, 2020
k8s-ci-robot (Contributor):

@rikatz: Closing this issue.

rikatz (Contributor) commented Apr 6, 2020

Oh, and please feel free to reopen if you don't think it's the same case :)

cfabio (Author) commented Apr 6, 2020

I am not sure this can really be considered "closed"; there are two issues here, present on both CentOS 7 and CentOS 8:

  • the kube-proxy iptables backend is broken in all Kubernetes versions from 1.17.0 up to 1.18.0.
  • the kube-proxy ipvs backend also has issues, considering it only works with TX offload disabled.

For the first one there isn't even a workaround, aside from downgrading to Kubernetes 1.16.x.

In the next few days I will try Debian or Fedora with an up-to-date kernel.

baokiemanh:

Maybe this issue is related to the firewall settings of the cloud provider.

In my case, I set up a cluster (with kubeadm) on AWS EC2 instances and got the same issue. After I changed the Security Group (firewall) settings to allow all traffic, everything went back to normal.

I never got this issue when setting up a cluster on bare metal.

Hope it helps!
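
For clusters on a cloud provider, the node firewall has to allow the NodePort range (30000-32767 by default). On AWS that could look like this (a sketch; the security group ID and source CIDR are placeholders):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 30000-32767 --cidr 10.0.0.0/8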
