
dnsPolicy in hostNetwork not working as expected #87852

Closed
rdxmb opened this issue Feb 5, 2020 · 22 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@rdxmb

rdxmb commented Feb 5, 2020

What happened:
In Kubernetes 1.17, pods running with hostNetwork: true are not able to get DNS answers from the CoreDNS Service, especially when using the strongly recommended dnsPolicy: ClusterFirstWithHostNet.

I also noticed that the CoreDNS Service does not always seem to be reachable from the host itself.

What you expected to happen:
The CoreDNS Service is reachable from within a pod on the host network, especially when using dnsPolicy: ClusterFirstWithHostNet. Also, the CoreDNS Service is reachable from the host, as it is in Kubernetes 1.15.

How to reproduce it (as minimally and precisely as possible):

# kubectl -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   97d
# dig @10.96.0.10 kubernetes.io

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.96.0.10 kubernetes.io
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
# cat dns-pods-in-host-network.yaml
# kubectl apply -f dns-pods-in-host-network.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: cluster-first
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  hostNetwork: true
  dnsPolicy: ClusterFirst
---
apiVersion: v1
kind: Pod
metadata:
  name: cluster-first-with-hostnet
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
root@master:/tmp# kubectl exec -ti cluster-first -- nslookup kubernetes.io
Server:         1.1.1.1
Address:        1.1.1.1#53

Non-authoritative answer:
Name:   kubernetes.io
Address: 147.75.40.148

root@master:/tmp# kubectl exec -ti cluster-first-with-hostnet -- nslookup kubernetes.io
;; connection timed out; no servers could be reached

command terminated with exit code 1
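
The difference between the two pods is visible in the resolv.conf each one gets. A quick check (a sketch; the expected values assume the kube-dns ClusterIP 10.96.0.10 and the host resolver 1.1.1.1 shown above):

kubectl exec -ti cluster-first -- cat /etc/resolv.conf                  (ClusterFirst with hostNetwork falls back to the host's resolvers, hence 1.1.1.1)
kubectl exec -ti cluster-first-with-hostnet -- cat /etc/resolv.conf     (ClusterFirstWithHostNet should show nameserver 10.96.0.10, which is what then times out)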

Anything else we need to know?:
I noticed this on three small clusters running Kubernetes 1.17, each with 1 master and 2 or 3 nodes. Most of them were upgraded from lower Kubernetes versions (e.g. starting from 1.13 -> 1.14 -> 1.15 -> 1.16 -> 1.17).

Environment:

  • Kubernetes version (use kubectl version): 1.17
  • Cloud provider or hardware configuration: BareMetal, mostly running on VMware
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a): Linux eins 4.15.0-74-generic #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): flannel
@rdxmb rdxmb added the kind/bug Categorizes issue or PR as related to a bug. label Feb 5, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 5, 2020
@neolit123
Member

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 5, 2020
@rdxmb
Author

rdxmb commented Feb 5, 2020

/sig Network

@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Feb 5, 2020
@danwinship
Contributor

/assign @aojea

@aojea
Member

aojea commented Feb 8, 2020

@rdxmb based on your description it seems that the Kubernetes DNS Service is not able to resolve or forward external domain queries:

root@master:/tmp# kubectl exec -ti cluster-first-with-hostnet -- nslookup kubernetes.io
;; connection timed out; no servers could be reached

command terminated with exit code 1

The fact that this is happening on clusters being upgraded from 1.13 makes me wonder if there could be some issue on the cluster DNS upgrade.

What DNS are you using kube-dns or CoreDNS?

Can you check if there are any errors on the DNS pods?
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#check-for-errors-in-the-dns-pod

If there are no errors we should debug the DNS queries as explained here
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#are-dns-queries-being-received-processed
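
A rough sketch of those checks (the pod name reuses the reproduction above; enabling the log plugin is only needed for the second page):

kubectl logs --namespace=kube-system -l k8s-app=kube-dns
kubectl -n kube-system edit configmap coredns              (add "log" to the Corefile so queries are logged)
kubectl exec -ti cluster-first-with-hostnet -- nslookup kubernetes.default
kubectl logs --namespace=kube-system -l k8s-app=kube-dns -f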

@MansM
Contributor

MansM commented Feb 12, 2020

I have roughly the same issue.
Using the netdata Helm chart, the pods have hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet.
Only the pod that is on the same host as the netdata master (where the Service is pointing to) can reach the Service.

kubectl get pods -n netdata -o wide
NAME                  READY   STATUS      RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
netdata-master-0      1/1     Running     4          22h     10.244.2.20   r710bot               <none>           <none>
netdata-slave-cqrm4   1/1     Running     0          14m     10.0.2.20     masterpi              <none>           <none>
netdata-slave-f29hg   1/1     Running     0          94s     10.0.2.23     r710bot               <none>           <none> <--- this one can reach master
netdata-slave-j8zs7   1/1     Running     0          81s     10.0.2.24     r710top.localdomain   <none>           <none>
test                  0/1     Completed   0          6h27m   10.244.1.31   r710top.localdomain   <none>           <none>

Logs are pretty clear:

2020-02-12 15:11:52: netdata INFO  : STREAM_SENDER[r710top.localdomain] : STREAM r710top.localdomain [send to netdata:19999]: connecting...
2020-02-12 15:11:57: netdata ERROR : STREAM_SENDER[r710top.localdomain] : Cannot resolve host 'netdata', port '19999': Try again (errno 22, Invalid argument)
2020-02-12 15:11:57: netdata ERROR : STREAM_SENDER[r710top.localdomain] : STREAM r710top.localdomain [send to netdata:19999]: failed to connect

Using the Service IP I can't curl it; using the pod IP I can curl it directly.
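
For example (the port is the netdata port from the logs below; the pod IP is netdata-master-0's from the listing, and the Service ClusterIP is a placeholder):

curl -sS http://<netdata service ClusterIP>:19999/     (times out from the other nodes)
curl -sS http://10.244.2.20:19999/                     (works)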

@aojea
Member

aojea commented Feb 12, 2020

I have roughly the same issue.
...
Only the pod that is on the same host as the netdata master (where the Service is pointing to) can reach the Service.
....

Using the Service IP I can't curl it; using the pod IP I can curl it directly.

That sounds more like a connectivity problem, maybe related to the CNI, maybe related to iptables: you are not able to access Services from the nodes 🤷‍♂

@MansM
Contributor

MansM commented Feb 12, 2020

I have roughly the same issue.
...
Only the pod that is on the same host as the netdata master (where the Service is pointing to) can reach the Service.
....
Using the Service IP I can't curl it; using the pod IP I can curl it directly.

That sounds more like a connectivity problem, maybe related to the CNI, maybe related to iptables: you are not able to access Services from the nodes 🤷‍♂

If I do a kubectl run and the pod ends up on the same node: zero issues. I will try adjusting the pod to do DNS via the pod IP directly and see if that connects.

@MansM
Contributor

MansM commented Feb 13, 2020

I have roughly the same issue.
...
Only the pod that is on the same host as the netdata master (where the Service is pointing to) can reach the Service.
....
Using the Service IP I can't curl it; using the pod IP I can curl it directly.

That sounds more like a connectivity problem, maybe related to the CNI, maybe related to iptables: you are not able to access Services from the nodes 🤷‍♂

If I do a kubectl run and the pod ends up on the same node: zero issues. I will try adjusting the pod to do DNS via the pod IP directly and see if that connects.

That didn't work.

@athenabot

@aojea
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@jhohertz

jhohertz commented Feb 14, 2020

I'm seeing the same thing when trying to run kiam on 1.17. I've seen the issue from at least rc.2 through 1.17.3, but wasn't sure at the time where the problem was.

Ticket I logged w/ the kiam folks: uswitch/kiam#378
Ticket I logged w/ the kops folks: kubernetes/kops#8562

As this seems networking related, I will note that we are running the Canal CNI and kube-proxy in IPVS mode.

(Correction: we were on Canal, not Calico, which means the mentioned flannel issue is likely at the root of it.)

@MansM
Contributor

MansM commented Feb 15, 2020

I did find a workaround: switching the flannel backend to host-gw instead of vxlan (see the sketch after the steps below):
flannel-io/flannel#1245 (comment)

kubectl edit cm -n kube-system kube-flannel-cfg
  • replace vxlan with host-gw
  • save
  • not sure if needed, but I did it anyway: kubectl delete pods -l app=flannel -n kube-system
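
For reference, a sketch of the relevant part of the kube-flannel-cfg ConfigMap; the Network value and the rest of the file will differ per cluster:

net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "host-gw"
    }
  }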

@rdxmb
Author

rdxmb commented Feb 17, 2020

@aojea

@rdxmb based on your description it seems that the Kubernetes DNS Service is not able to resolve or forward external domain queries

No. The problem seems to be the routing via the Kubernetes Service, not the DNS itself. Have a look at the following output:

# k -n kube-system get po -o wide | grep coredns
coredns-6955765f44-fhjkm       1/1     Running     0          24d    10.244.2.69    drei   <none>      
coredns-6955765f44-ll6m9       1/1     Running     1          24d    10.244.3.182   vier   <none>      
# dig @10.244.2.69 kubernetes.io

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.244.2.69 kubernetes.io
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20419
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kubernetes.io.                 IN      A

;; ANSWER SECTION:
kubernetes.io.          30      IN      A       147.75.40.148

;; Query time: 9 msec
;; SERVER: 10.244.2.69#53(10.244.2.69)
;; WHEN: Mon Feb 17 17:44:30 CET 2020
;; MSG SIZE  rcvd: 71
# dig @10.244.3.182 kubernetes.io

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.244.3.182 kubernetes.io
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2658
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;kubernetes.io.                 IN      A

;; ANSWER SECTION:
kubernetes.io.          30      IN      A       147.75.40.148

;; Query time: 12 msec
;; SERVER: 10.244.3.182#53(10.244.3.182)
;; WHEN: Mon Feb 17 17:44:39 CET 2020
;; MSG SIZE  rcvd: 71

BUT

# k -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   109d
# dig @10.96.0.10 google.de
^C# dig @10.96.0.10 kubernetes.io

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.96.0.10 kubernetes.io
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
# k -n kube-system describe svc kube-dns 
Name:              kube-dns
Namespace:         kube-system
Labels:            k8s-app=kube-dns
                   kubernetes.io/cluster-service=true
                   kubernetes.io/name=KubeDNS
Annotations:       prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP:                10.96.0.10
Port:              dns  53/UDP
TargetPort:        53/UDP
Endpoints:         10.244.2.69:53,10.244.3.182:53
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         10.244.2.69:53,10.244.3.182:53
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         10.244.2.69:9153,10.244.3.182:9153
Session Affinity:  None
Events:            <none>
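
One way to narrow this down, assuming kube-proxy is running in its default iptables mode, would be to check whether the Service VIP is programmed on the node at all, e.g.:

iptables-save -t nat | grep 10.96.0.10

If the KUBE-SERVICES/KUBE-SVC rules show up there, the VIP is programmed and the failure is more likely in the encapsulated return path, which points at the CNI.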

@rdxmb
Author

rdxmb commented Feb 17, 2020

# for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2

@aojea
Member

aojea commented Feb 17, 2020

No. The problem seems to be the routing via the Kubernetes Service, not the DNS itself. Have a look at the following output:

then ... it is not a problem in Kubernetes ... it has to be something external or related to the CNI plugin, right?

@mariusgrigoriu

then ... it is not a problem in Kubernetes ... it has to be something external or related to the CNI plugin, right?

Maybe, though it would be good to understand what changed in 1.17 that's causing issues with flannel and vxlan so we get to a root cause.

@aojea
Member

aojea commented Feb 18, 2020

Maybe, though it would be good to understand what changed in 1.17 that's causing issues with flannel and vxlan so we get to a root cause.

I agree with you, but having two issues open in parallel will not help focus the investigation, and since this seems to be a flannel-specific issue I will close this one in favor of flannel-io/flannel#1243.

Please feel free to reopen if there are more CNIs affected or there is any evidence that this is a Kubernetes issue

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

Maybe, though it would be good to understand what changed in 1.17 that's causing issues with flannel and vxlan so we get to a root cause.

I agree with you, but having two issues open in parallel will not help focus the investigation, and since this seems to be a flannel-specific issue I will close this one in favor of flannel-io/flannel#1243.

Please feel free to reopen if there are more CNIs affected or there is any evidence that this is a Kubernetes issue

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@elinesterov

I see exactly the same issue on GCP with Calico using vxlan for networking. I had to switch to ipip to solve it. I didn't have much time to figure out the vxlan issue, so expect to bump into it in virtual environments that use vxlan under the hood. I'm not sure whether it is related to Kubernetes 1.17; I didn't test earlier versions in that environment.

@LuckySB

LuckySB commented Apr 12, 2020

I noticed that packets directed to the Service address receive the iptables 0x4000 mark, the same mark remains on the packet after it is encapsulated in vxlan, and the outgoing vxlan packet therefore passes through MASQUERADE again.
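
A sketch of how one might observe this on a node (0x4000 is kube-proxy's default masquerade mark; exact chain names may vary):

iptables-save -t nat | grep 0x4000          (KUBE-MARK-MASQ / KUBE-POSTROUTING rules)
tcpdump -ni flannel.1 udp port 53           (check whether DNS replies come back over the vxlan interface)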

@LuckySB

LuckySB commented Apr 12, 2020

The workaround from issue #88986 also helps:

ethtool --offload flannel.1 rx off tx off
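
Note that this has to be applied on every node running flannel and typically does not survive the interface being recreated. To verify it took effect, something like this should work:

ethtool --show-offload flannel.1 | grep checksumming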
