
kube-dns does not work due to DNS server lookup loop (/etc/resolv.conf contains 127.0.0.01 as the upstream nameserver) kubedns-masq and sidecar containers crash after doing nslookup @kubernetes/sig-network-bugs /sig area/dns #49411

Closed
jayeshnazre opened this Issue Jul 21, 2017 · 17 comments


jayeshnazre commented Jul 21, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug
@kubernetes/sig-network-bugs
/sig area/dns

What happened:
The kubedns-masq and sidecar containers crash after doing nslookup on logical names that do not exist in kube-dns.

What you expected to happen:
The crashing of containers is not expected behavior.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
See details below.
Environment:

  • Kubernetes version (use kubectl version):
    1.6.4
  • Cloud provider or hardware configuration:
    Local machine running a K8s cluster on VMware Workstation 12 Pro (single node)
  • OS (e.g. from /etc/os-release):
    Ubuntu 17.04
  • Kernel (e.g. uname -a):

Linux ubuntumaster 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
  • Others:

I am facing a strange issue. Here are my Kubernetes details:
Kubernetes version: 1.6.4
OS: Ubuntu 17.04
I use the YAML files from the following link to install kube-dns:
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns/

Step 1

I then try to launch busybox using:
kubectl run -i --tty busybox --image=busybox -- sh

Now when I do nslookup kubernetes it works.
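
For reference, a minimal sketch of that check from inside the busybox shell started above (the fully qualified name assumes the default cluster.local domain, which may differ in other clusters):

/ # nslookup kubernetes
/ # nslookup kubernetes.default.svc.cluster.local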
Now when I look at the Docker logs for my sidecar, I see the following:

ERROR: logging before flag.Parse: I0721 17:15:25.067730 1 main.go:48] Version v1.14.3-4-gee838f6
ERROR: logging before flag.Parse: I0721 17:15:25.067981 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0721 17:15:25.068191 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0721 17:15:25.068398 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse:

Step 2
If I do an nslookup against a name (say nslookup ABCD) that I know does not exist in kube-dns (as an A record), then I see the following logs in my dnsmasq container:

I0718 17:26:33.738368 1 nanny.go:108] dnsmasq[13]: Maximum number of concurrent DNS queries reached (max: 150)

And at the same time I see the following in the sidecar:

ERROR: logging before flag.Parse: I0720 02:09:37.975054 1 main.go:48] Version v1.14.3-4-gee838f6
ERROR: logging before flag.Parse: I0720 02:09:37.975140 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0720 02:09:37.975160 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0720 02:09:37.975200 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: W0720 02:12:50.090595 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:36473->127.0.0.1:53: i/o timeout
ERROR: logging before flag.Parse: W0720 02:13:03.607740 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:56334->127.0.0.1:53: i/o timeout
ERROR: logging before flag.Parse: W0720 02:13:10.609651 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:39976->127.0.0.1:53: i/o timeout
ERROR: logging before flag.Parse: W0720 02:13:23.644035 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:57226->127.0.0.1:53: i/o timeout

Step 3
Now here is the bummer: after 5 or 10 minutes my sidecar and dnsmasq containers both crash, and new containers get created.

A few other details about my K8s cluster:
I am using VMware Workstation 12 Pro to run my cluster on one node, I have enabled RBAC, and I am using client certificates for authentication.

SOS. Can someone point me in the right direction? I have spent a lot of time trying to figure this one out.
Thanks in advance.

Contributor

k8s-merge-robot commented Jul 21, 2017

@jayeshnazre
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in the method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

@jayeshnazre jayeshnazre changed the title from kubedns-masq and sidecar containers crash after doing nslookup to kubedns-masq and sidecar containers crash after doing nslookup @kubernetes/sig-network-bugs /sig area/dns Jul 21, 2017

jayeshnazre commented Jul 21, 2017

@kubernetes/sig-network-bugs
/sig area/dns

Contributor

k8s-ci-robot commented Jul 21, 2017

@jayeshnazre: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-bugs.

In response to this:

@kubernetes/sig-network-bugs
/sig area/dns

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jayeshnazre commented Jul 22, 2017

Anyone facing this issue?

Member

m1093782566 commented Jul 25, 2017

ping @bowei

Member

bowei commented Jul 25, 2017

Can you post your /etc/resolv.conf from the VM and from inside the container? I suspect you have a DNS lookup loop (resolv.conf references 127.0.0.1 as the upstream server).
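
A hedged sketch of how to gather both files (the label selector and the kubedns container name follow the standard kube-dns addon manifests and are assumptions here):

# On the VM / node
cat /etc/resolv.conf
# Find the kube-dns pod and read the resolv.conf it inherited
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system exec -it <kube-dns-pod-name> -c kubedns -- cat /etc/resolv.conf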

jayeshnazre commented Jul 26, 2017

Thank you all for responding. I believe I found the issue with my timeouts, or rather came up with a solution that avoids them:

  1. First, Ubuntu 17.04 is not using dnsmasq. My K8s master and worker nodes are running in VMware Workstation 12 Pro in my simulated test environment.
  2. Ubuntu 17.04 uses "systemd-resolved" instead of dnsmasq.
  3. kube-dns inherits the contents of the node's "/etc/resolv.conf", something the maintainers of these pods should document at the following site (https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns ), as it takes a lot of research and googling to find such details hidden in forum comments. My two cents.
  4. systemd-resolved on my host listens on 127.0.0.53:53 for DNS queries. As soon as I uninstall systemd-resolved and install dnsmasq, my node has an entry of 127.0.0.1 in /etc/resolv.conf instead of 127.0.0.53. This gets inherited by the kube-dns pods, and for some reason they are able to forward unresolved queries to my host dnsmasq. Earlier, kube-dns inherited the 127.0.0.53 IP from the node's "/etc/resolv.conf" and for some reason it was not able to talk to the node's systemd-resolved at that IP.

@bowei So now it is working, but can you explain why it was not working with systemd-resolved installed on the host? You are correct that there is a DNS loop occurring, but I am not sure why that is not the case with dnsmasq. Your help will make the picture clear to me and others. Thanks.
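
A hedged diagnostic sketch (not from this thread) for seeing which local stub resolver the node is running and what the pods will inherit; ss and systemd-resolve are standard tools on Ubuntu 17.04:

# Which process is bound to port 53 on the node
sudo ss -lntup | grep ':53 '
# What systemd-resolved thinks the real upstream nameservers are
systemd-resolve --status
# What pods using the Default DNS policy will inherit
cat /etc/resolv.conf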

jayeshnazre commented Jul 26, 2017

One last thing: I still see the following in the sidecar logs inside the Docker container, but at least the crashing has stopped.

ERROR: logging before flag.Parse: I0726 15:54:59.427544 1 main.go:48] Version v1.14.3-4-gee838f6
ERROR: logging before flag.Parse: I0726 15:54:59.427854 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0726 15:54:59.427892 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0726 15:54:59.427951 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}

Member

bowei commented Jul 26, 2017

As mentioned above, please post your VM and kube-dns container /etc/resolv.conf.

jayeshnazre commented Jul 26, 2017

@bowei
Section 1
Contents of /etc/resolv.conf from the node:

root@ubuntumaster:~# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.
nameserver 127.0.0.1
nameserver 192.168.164.2

Section 2
Contents of /etc/resolv.conf from kube-dns:
/ # cat /etc/resolv.conf
nameserver 127.0.0.1
nameserver 192.168.164.2
/ #

Section 3
As you can see, kube-dns inherits from the node; 192.168.164.2 is the nameserver of my VMware Workstation Pro 12, and I am using it to allow internet access to all cluster pods. With the above, the sidecar and dnsmasq pods do not crash, but I am not sure why the DNS loop stopped after replacing systemd-resolved with dnsmasq on the node.
@bowei Thanks for the links. Can you shed some light on how the DNS loop was eliminated?

Member

bowei commented Jul 26, 2017

You need to delete nameserver 127.0.0.1 from your VM's /etc/resolv.conf and restart the kubelet. If you leave the entry in your resolv.conf, you will generate a loop for at least some of the queries.

The loop was not eliminated; half the queries will still go into a loop. However, the other half will be successful, and it may appear to be working.

I would close this issue as it appears to be a configuration issue, not a Kubernetes bug.
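
A minimal sketch of that fix on the node, assuming the kubelet runs as a systemd service (keep only the real upstream nameserver; 192.168.164.2 is just the example value from this cluster):

# /etc/resolv.conf on the node after removing the loopback entry
nameserver 192.168.164.2

# Restart the kubelet so newly created pods inherit the cleaned file
sudo systemctl restart kubelet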

jayeshnazre commented Jul 26, 2017

@bowei If I remove the nameserver 127.0.0.1 entry from my VM, how will my node (which is the VM) use dnsmasq locally? By default, Ubuntu 17.04 adds that nameserver entry when we install dnsmasq. Unless you are saying that I cannot have dnsmasq, or for that matter systemd-resolved, running on my VM.

Member

bowei commented Jul 27, 2017

This is a current limitation of DnsPolicy: Default. I filed an issue here: #49675

For now, you should not put a local resolver on the outer VM.
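
As a hedged aside (not suggested in this thread): the kubelet also has a --resolv-conf flag, so an alternate file containing only the real upstream could be used as the basis for pod DNS while the node keeps its local resolver; the file path below is only an example.

# /etc/kubernetes/kubelet-resolv.conf (example path) -- only the real upstream, no loopback entry
nameserver 192.168.164.2

# Start the kubelet with the alternate file as the basis for pod DNS resolution
kubelet --resolv-conf=/etc/kubernetes/kubelet-resolv.conf ...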

jayeshnazre commented Jul 28, 2017

Thanks @bowei for helping me with this. For others who may end up with this problem, this is what I did on Ubuntu 17.04:
step-a) I disabled systemd-resolved (Ubuntu 17.04 does not come with dnsmasq, but if you have another Linux version/distro that comes with dnsmasq, then uninstall dnsmasq).
step-b) I then created a "tail" file in the /etc/resolvconf/resolv.conf.d/ folder for the resolvconf daemon to build /etc/resolv.conf from (see the sketch at the end of this comment).
step-c) I then edited /etc/NetworkManager/NetworkManager.conf to include dns=none; I did this so NetworkManager does not edit the resolvconf daemon's config files.

In the end, I ended up with /etc/resolv.conf contents of:

nameserver 192.168.164.2

which is my VMware DNS server.
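
For reference, a minimal sketch of what steps b and c amount to (the paths are the standard Ubuntu locations; the nameserver value is this reporter's VMware DNS and is only an example):

# /etc/resolvconf/resolv.conf.d/tail -- appended verbatim to the generated /etc/resolv.conf
nameserver 192.168.164.2

# /etc/NetworkManager/NetworkManager.conf -- keep NetworkManager from rewriting resolvconf's files
[main]
dns=none

# Regenerate /etc/resolv.conf and restart NetworkManager to pick up the change
sudo resolvconf -u
sudo systemctl restart NetworkManager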

Member

bowei commented Jul 29, 2017

Could you change the title of the bug to "kube-dns does not work due to DNS server lookup loop (/etc/resolv.conf contains 127.0.0.01 as the upstream nameserver)" so the discussion in this issue is discoverable via search engines?

@jayeshnazre jayeshnazre changed the title from kubedns-masq and sidecar containers crash after doing nslookup @kubernetes/sig-network-bugs /sig area/dns to kube-dns does not work due to DNS server lookup loop (/etc/resolv.conf contains 127.0.0.01 as the upstream nameserver)kubedns-masq and sidecar containers crash after doing nslookup @kubernetes/sig-network-bugs /sig area/dns Jul 30, 2017


jayeshnazre commented Jul 30, 2017

jonashackt commented Aug 28, 2018

I had the same problem. To end up with the correct nameserver in /etc/resolv.conf, I found an easier way to go and answered the mentioned Stack Overflow question: https://stackoverflow.com/a/52036125/4964553
