
Ubuntu 18.04 dns problems in pods #448

Closed
giovannicandido opened this issue Jun 24, 2018 · 3 comments · Fixed by #450
Labels: bug (Something isn't working)
Milestone: 1.2.0

Comments

giovannicandido commented Jun 24, 2018

Symptoms:

kubectl port-forward doesn't work. Example:

kubectl -n kube-system port-forward service/tiller-deploy 44134:44134

Results in

E0624 12:06:20.664091 34312 portforward.go:331] an error occurred forwarding 42399 -> 44134: error forwarding port 44134 to pod 255e06439c2da94a4b6a8b1ad2d3d7f4d6d1ba1f82ab6eb2ae519133b1f2bc58, uid : exit status 1: 2018/06/24 15:06:20 socat[22114] E getaddrinfo("localhost", "NULL", {1,2,1,6}, {}): Temporary failure in name resolution

Getting the kubelet log on the worker node

journalctl -u kubelet

Results in something like:

7797 httpstream.go:251] error forwarding port 44134 to pod 255e06439c2da94a4b6a8b1ad2d3d7f4d6d1ba1f82ab6eb2ae519133b1f2bc58, uid : exit status 1: 2018/06/24 15:06:21 socat[22130] E getaddrinfo("localhost", "NULL", {1,2,1,6}, {}): Temporary failure in name resolution
Jun 24 15:09:53 worker0 kubelet[7797]: E0624 15:09:53.133330 7797 httpstream.go:251] error forwarding port 44134 to pod 255e06439c2da94a4b6a8b1ad2d3d7f4d6d1ba1f82ab6eb2ae519133b1f2bc58, uid : exit status 1: 2018/06/24 15:09:53 socat[22418] E getaddrinfo("localhost", "NULL", {1,2,1,6}, {}): Temporary failure in name resolution
Jun 24 15:15:56 worker0 kubelet[7797]: E0624 15:15:56.303891 7797 httpstream.go:251] error forwarding port 44134 to pod

This points to getaddrinfo("localhost") failing, which means the pod is not able to resolve localhost.

Other commands that use port-forward, like helm version and other helm commands (helm interacts with the tiller server using port forwards), have the same symptoms.
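
A quick way to check whether cluster DNS is affected is a throwaway lookup pod; a sketch, where the pod name and image are arbitrary choices:

# run a one-off busybox pod and try a cluster-internal lookup
kubectl run dns-test --rm -it --restart=Never --image=busybox -- nslookup kubernetes.default
# on an affected cluster the lookup fails or times out instead of
# returning the kubernetes service IP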

Cause

Ubuntu 18.04 uses systemd-resolved, which changes /etc/resolv.conf to point at a local stub DNS resolver (127.0.0.53). Kubelet needs to be started with the flag --resolv-conf=/run/systemd/resolve/resolv.conf on these systems.
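
For illustration, on a stock Ubuntu 18.04 host the two files differ roughly like this (the upstream address below is a placeholder):

$ cat /etc/resolv.conf
nameserver 127.0.0.53        # local systemd-resolved stub, unreachable from pod network namespaces
$ cat /run/systemd/resolve/resolv.conf
nameserver 192.0.2.1         # placeholder: the real (e.g. DHCP-provided) resolver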

Possible Solutions

This has been addressed by kubernetes/kubeadm#787 and will probably land in Kubernetes 1.11. As a workaround, there are two easy fixes:

  1. Create a systemd drop-in to override the kubelet unit (see the sketch below this list). Example:
[Service]
Environment='KUBELET_DNS_ARGS=--cluster-dns=172.31.0.10 --cluster-domain=cluster.local --resolv-conf=/run/systemd/resolve/resolv.conf'

Restart kubelet:

systemctl daemon-reload
systemctl restart kubelet
  2. Replace /etc/resolv.conf with a symlink to /run/systemd/resolve/resolv.conf (back up the original first)

Do the same on all machines.
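
A minimal sketch of both workarounds, run as root on each node. The drop-in directory /etc/systemd/system/kubelet.service.d/ and the file name 20-resolv-conf.conf are assumptions (kubeadm installs its own drop-in there), and the --cluster-dns address must match your cluster's DNS service IP (172.31.0.10 in the example above):

# Workaround 1: systemd drop-in overriding KUBELET_DNS_ARGS
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-resolv-conf.conf <<'EOF'
[Service]
Environment='KUBELET_DNS_ARGS=--cluster-dns=172.31.0.10 --cluster-domain=cluster.local --resolv-conf=/run/systemd/resolve/resolv.conf'
EOF
systemctl daemon-reload
systemctl restart kubelet

# Workaround 2: point /etc/resolv.conf at the real resolver list (back up first)
cp /etc/resolv.conf /etc/resolv.conf.bak
ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf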

Pharos Installer

I suggest that pharos-cluster check for the existence of the file on Ubuntu 18.04 and perform one of the fixes above. We need to re-check after the kubeadm fix is released to make sure they do not conflict (possibly resulting in the flag being added twice). Check the pull request to see how it was fixed on the kubeadm side.

giovannicandido (Author) commented:

Update: after restarting kubelet you must also recreate the kube-dns pods, otherwise DNS queries will keep failing:

kubectl -n kube-system get pods -l k8s-app=kube-dns

Delete all of them. You may run kubectl -n kube-system delete pods -l k8s-app=kube-dns, or delete them one by one.
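
A minimal sketch of that step, waiting for the replacements to come back up:

# delete the kube-dns pods so they are recreated with the corrected resolv.conf
kubectl -n kube-system delete pods -l k8s-app=kube-dns
# the Deployment recreates them; watch until they are Running again
kubectl -n kube-system get pods -l k8s-app=kube-dns -w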

@jakolehm jakolehm added this to the 1.2.0 milestone Jun 25, 2018
@jakolehm jakolehm added the bug Something isn't working label Jun 25, 2018
SpComb (Contributor) commented Jun 25, 2018

Seems like the issue isn't necessarily specific to Ubuntu 18.04 and systemd-resolved, but any configuration where /etc/resolv.conf contains localhost as a resolver will cause the kube-dns pod to use localhost as the upstream? Because kube-dns runs in a pod network namespace, the kube-dns upstream queries will loop back to itself and fail...

The fix will however need to be specific to the local resolver in use... for the systemd-resolved case we can assume that the real upstream resolvers are available at /run/systemd/resolve/resolv.conf... but on e.g. an Ubuntu xenial desktop with NetworkManager, /etc/resolv.conf also contains 127.0.0.1, while the upstream nameservers are only available inside dnsmasq, set dynamically via DBus... they are not available anywhere on the filesystem:

Jun 25 09:43:29 tehobari dnsmasq[2015]: setting upstream servers from DBus
Jun 25 09:44:03 tehobari dnsmasq[2015]: setting upstream servers from DBus
Jun 25 09:44:03 tehobari dnsmasq[2015]: using nameserver 172.28.0.1#53(via wlp4s0)

The upstream kubernetes fix for 1.11 seems to have kubeadm init/join conditionally generate the systemd dropin with a --resolv-conf=/run/systemd/resolve/resolv.conf flag for the kubelet: kubernetes/kubernetes#64665

Our fix for this in pharos 1.3 would be to upgrade to kube 1.11, which would fix this for new installs... to fix this for pharos 1.2, as well as existing 1.2 -> 1.3 upgrades, we will need to detect this configuration and set the flag ourselves.
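
A sketch of what that detection could look like on a host; this is illustrative shell logic, not actual pharos code:

# detect the systemd-resolved stub configuration described in this thread
if grep -q '^nameserver 127\.0\.0\.53$' /etc/resolv.conf \
    && [ -f /run/systemd/resolve/resolv.conf ]; then
  echo "systemd-resolved stub detected; kubelet needs --resolv-conf=/run/systemd/resolve/resolv.conf"
fi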

@SpComb SpComb mentioned this issue Jun 25, 2018
SpComb (Contributor) commented Jun 25, 2018

Confirmed that dnsPolicy: Default pods are broken on Ubuntu bionic / 18.04.

Normal dnsPolicy: ClusterFirst pods work, but the cluster DNS will be broken if the kube-dns pod lands on a bionic node.

I0625 12:18:27.906305       1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0625 12:18:27.907288       1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053]
I0625 12:18:28.281314       1 nanny.go:119] 
W0625 12:18:28.281524       1 nanny.go:120] Got EOF from stdout
I0625 12:18:28.283464       1 nanny.go:116] dnsmasq[9]: started, version 2.78 cachesize 1000
I0625 12:18:28.283637       1 nanny.go:116] dnsmasq[9]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0625 12:18:28.283771       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0625 12:18:28.283872       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0625 12:18:28.284017       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0625 12:18:28.284217       1 nanny.go:116] dnsmasq[9]: reading /etc/resolv.conf
I0625 12:18:28.284353       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0625 12:18:28.284477       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0625 12:18:28.284595       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0625 12:18:28.284684       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.53#53
I0625 12:18:28.284882       1 nanny.go:116] dnsmasq[9]: read /etc/hosts - 7 addresses
I0625 12:18:48.681942       1 nanny.go:116] dnsmasq[9]: Maximum number of concurrent DNS queries reached (max: 150)
I0625 12:18:58.688477       1 nanny.go:116] dnsmasq[9]: Maximum number of concurrent DNS queries reached (max: 150)
I0625 12:19:08.702231       1 nanny.go:116] dnsmasq[9]: Maximum number of concurrent DNS queries reached (max: 150)
I0625 12:19:18.708944       1 nanny.go:116] dnsmasq[9]: Maximum number of concurrent DNS queries reached (max: 150)
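
To reproduce the dnsPolicy: Default breakage described above, a throwaway pod like the following can be used; the pod/image names are arbitrary, and in a mixed cluster it must be pinned to a bionic node (e.g. with a nodeSelector):

# create a pod that inherits the node's resolv.conf (dnsPolicy: Default)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dnspolicy-default-test
spec:
  dnsPolicy: Default
  restartPolicy: Never
  containers:
  - name: test
    image: busybox
    command: ["nslookup", "kubernetes.io"]
EOF
# on an affected node the lookup fails, since the pod inherited the
# node's 127.0.0.53 stub resolver
kubectl logs dnspolicy-default-test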
