Cannot reach Kube DNS from pod #60315

Open
ivantichy opened this Issue Feb 23, 2018 · 11 comments

@ivantichy commented Feb 23, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
Creating a new issue, as the original one was closed without a solution; several people reported the same problem and there was no response on the closed issue's thread.
Cannot reach kube-dns (e.g. nslookup kubernetes.default) from a busybox test pod.
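
For reference, this can be reproduced with the debugging pod from the DNS troubleshooting docs, roughly like this (a sketch; the busybox:1.28 tag is an assumption, any busybox image with nslookup behaves the same):

kubectl run busybox --image=busybox:1.28 --restart=Never -- sleep 3600
kubectl exec -ti busybox -- nslookup kubernetes.default

On an affected node the second command fails with "nslookup: can't resolve 'kubernetes.default'" instead of returning the 10.96.0.1 service IP.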

Details here:
#57558

I would prefer reopening the old issue.

@ivantichy (Author) commented Feb 23, 2018

/sig network

@k8s-ci-robot added sig/network and removed needs-sig labels Feb 23, 2018

@dgabrysch commented Feb 23, 2018

Is this a kubeadm cluster? If yes, check your systemd drop-in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf; there should be a --cluster-dns entry pointing to the service IP of your kube-dns service.
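
A quick way to cross-check the two (a sketch, assuming the default kubeadm paths shown later in this thread):

kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
grep cluster-dns /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

The IP printed by the first command should match the --cluster-dns value found by the second.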

@ivantichy (Author) commented Mar 5, 2018

cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS

@xguerin commented Mar 22, 2018

I'm having the same issue here. It seems to only happen when the container gets scheduled on a worker node. When scheduled on the master node, it works fine. Here is my job configuration:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: n0
spec:
  template:
    spec:
      containers:
      - name: node
        image: busybox
        imagePullPolicy: Always
        command: ["sh", "-c", "sleep 3600"]
      restartPolicy: Never
  backoffLimit: 1
...

I then connect to the container:

/ #  nslookup n0
Server:    10.96.0.10
Address 1: 10.96.0.10

nslookup: can't resolve 'n0'
/ #

I am using Calico, so it might be a problem with that. I used kubeadm and checked that all my configuration matched.
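
One thing worth comparing between the master and a worker in this situation (a sketch; the grep pattern depends on which Calico manifest was applied) is whether the CNI pods are healthy on the workers and whether a CNI config was actually written there:

kubectl get pods -n kube-system -o wide | grep calico
ls /etc/cni/net.d/    # run on the worker node itself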

@xiaoxubeii (Member) commented Mar 29, 2018

Same here. I cannot even ping the IP of the kube-dns pod.

@xguerin commented Mar 29, 2018

I determined that it was a problem with my CNI. With Calico it does not work. But with flannel it does. I did not spend time investigating why.

@CharlyF (Contributor) commented May 21, 2018

Thank you @xguerin, I have spent some time trying to figure this out. Moving to Flannel solved my problem as well.
I want to share my notes here; hopefully someone more familiar with kube-dns will be able to help.

Setup: 5-node cluster (for testing purposes), 1 master + 4 workers.
kubeadm:

kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes:

# kubelet --version
Kubernetes v1.10.2

# kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Runtime, Docker:

# docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

Machines in AWS:

# uname -a
Linux ip-172-29-61-93 4.4.0-1049-aws #58-Ubuntu SMP Fri Jan 12 23:17:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

What happened:
Followed the guidelines in the official doc.

I first tried with Weave.
Initialized the cluster with kubeadm; everything looked good.
kube-dns started and remained Pending until I deployed the CNI (expected).
Deployed the CNI:

kube-system   kube-dns-86f4d74b45-bq4h5                 3/3       Running   0          17m       10.32.0.2       ip-172-29-61-93
kube-system   weave-net-8bz6m                           2/2       Running   0          16m       172.29.61.93    ip-172-29-61-93

Proxy is running:

kube-system   kube-proxy-87gwx                          1/1       Running   0          17m       172.29.61.93    ip-172-29-61-93

Then I joined my nodes (kubeadm join ...)

The kube-proxy DaemonSet pods got installed on them, as well as weave:

kube-system   kube-proxy-cq5km                          1/1       Running            0          26m       172.29.28.132   ip-172-29-28-132
kube-system   kube-proxy-dqqsd                          1/1       Running            0          24m       172.29.62.162   ip-172-29-62-162
kube-system   weave-net-5zwg7                           2/2       Running            0          15m       172.29.28.132   ip-172-29-28-132
kube-system   weave-net-kcjfb                           2/2       Running            1          13m       172.29.62.162   ip-172-29-62-162

At this point, there are no errors in the logs of any of the kube-dns containers or in the kubelet.
Then I set up the debugging pod from the doc on two nodes: one that works and one that does not (representative of the 3 others). One way to pin the pods is sketched after the listing below.

default       busybox                                   1/1       Running            0          11m       10.40.0.1       ip-172-29-62-162
default       busybox-1                                 1/1       Running            0          11m       10.46.0.2       ip-172-29-28-132
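
One way to pin a debug pod to a particular worker for this kind of comparison is roughly the following (a sketch; whether the pods above were placed this way or just left to the scheduler is an assumption, and the doc's plain busybox pod works too):

apiVersion: v1
kind: Pod
metadata:
  name: busybox-1
spec:
  nodeName: ip-172-29-28-132
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
  restartPolicy: Never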

At this point we can investigate.
Looking at the resolv configs on both pods.

root@ip-172-29-61-93:~/kubernetes# kubectl exec -ti busybox-1 -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

root@ip-172-29-61-93:~/kubernetes# kubectl exec -ti busybox -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

They are the same.
However, only one can resolve the main svc.

root@ip-172-29-61-93:~/kubernetes# kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

While on the other pod:

root@ip-172-29-61-93:~/kubernetes# kubectl exec -ti busybox-1 -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

This, according to the documentation, indicates an issue with kube-dns.
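
The same doc also suggests looking at the kube-dns containers themselves, roughly (a sketch; kubedns, dnsmasq and sidecar are the usual container names in the kube-dns deployment):

kubectl logs -n kube-system -l k8s-app=kube-dns -c kubedns
kubectl logs -n kube-system -l k8s-app=kube-dns -c dnsmasq
kubectl logs -n kube-system -l k8s-app=kube-dns -c sidecar

As noted above, those logs were clean here, which points at the path between the workers and the kube-dns pod rather than at kube-dns itself.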

Checking that everything is here for it to work:

root@ip-172-29-61-93:~/kubernetes# kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP   22m

root@ip-172-29-61-93:~/kubernetes# kubectl get ep kube-dns --namespace=kube-system
NAME       ENDPOINTS                   AGE
kube-dns   10.32.0.2:53,10.32.0.2:53   25m
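
Since the service and its endpoint exist, another thing to compare between a "good" and a "bad" worker is whether kube-proxy programmed the DNS service VIP into iptables on each node (a sketch; KUBE-SERVICES is the chain the default iptables proxy mode creates):

iptables -t nat -S KUBE-SERVICES | grep 10.96.0.10

The UDP and TCP port 53 rules should show up on every node.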

The kubelet definition is consistent across my fleet of machines and references the ClusterIP of my kube-dns service.

# systemctl cat kubelet.service --no-pager
# /lib/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS

I then used Calico with the manifest linked in the doc, and later tried the 1.6 one. Same issue.

Some background:

  • I have restarted my kubeadm setup 3 or 4 times to try to identify the problem.
    I started with a version of Docker that was too recent (18.03) and downgraded to 17.03 per the recommendation in the logs.
  • The first time I started the cluster, I made my nodes join prior to deploying the CNI. I think kube-dns got deployed on a machine other than the master (not sure which; I assume ip-172-29-62-162, see below).
  • The listings above only show two workers, but only one out of my 4 was working (out of 5, actually, as it was not working on my master either).
  • After all the resets, the node which hosted the pods that could resolve DNS was always the same (ip-172-29-62-162). It might be the one kube-dns got deployed on at first; my assumption was that this modified the tables on that node.
  • I drained the nodes at each reset (assuming that would clear the local data and the routing tables).
  • The pods deployed on the 3 other workers did not have access to the internet:
# curl www.google.com
curl: (6) Couldn't resolve host 'www.google.com'

The environment variables were injected properly:

# docker exec -ti 019b23e157ba env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=busybox-1
TERM=xterm
KUBERNETES_SERVICE_HOST=10.96.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://10.96.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
HOME=/root

On all the pods (regardless of their nodes).

  • Pinging the kube-dns service does not work, nor does pinging the kube-dns pod directly (from pods on the 3 "bad" workers; it works fine from the 4th).
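
That last point suggests the problem is pod-to-pod connectivity across nodes rather than kube-dns itself; it can be confirmed by querying the kube-dns pod IP directly (a sketch; 10.32.0.2 is the pod IP from the endpoint listing above):

kubectl exec -ti busybox-1 -- nslookup kubernetes.default 10.32.0.2

If that also times out from the "bad" workers while succeeding from the good one, the overlay traffic between those nodes is not getting through.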

Once I switched to flannel, DNS resolved and everything started working.
I hope that helps.

@comphilip commented Jun 27, 2018

@xguerin @CharlyF
After several hours of investigating, I switched to flannel and everything works well.

I had 3 CentOS 7 EC2 instances set up as a K8s cluster via kubeadm with the Calico network. On the master node everything works well, but on the worker nodes DNS is not working and NodePort is not working either: no traffic is forwarded on requests to the NodePort's port.

I checked with ping, nslookup, and tcpdump for several hours and found nothing.
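
For the NodePort side, one quick check is whether kube-proxy programmed the NodePort rules on the worker at all (a sketch; 30080 is a hypothetical NodePort, and KUBE-NODEPORTS is the chain the default iptables proxy mode uses):

iptables -t nat -S KUBE-NODEPORTS | grep 30080
tcpdump -ni any port 30080

If the rule is present and tcpdump shows the incoming SYN but no reply, the traffic is most likely being dropped on the pod network rather than by kube-proxy.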

Then I switched to flannel, and everything works.

@fejta-bot commented Sep 25, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@ivantichy (Author) commented Oct 3, 2018

/remove-lifecycle stale

@fejta-bot commented Feb 24, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
