
node-local-dns crash looping on masters '169.254.20.10:8080: bind: address already in use' #9245

Closed
fred-vogt opened this issue Jun 2, 2020 · 20 comments
Labels: blocks-next, priority/important-soon

fred-vogt commented Jun 2, 2020

node-local-dns crash looping on masters - new in kops 1.18.0-beta.1:

listen tcp 169.254.20.10:8080: bind: address already in use

The node-local-dns health port conflicts with the API server health check sidecar.

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.0-beta.1 (git-ec8022b352)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

kubectl version --short
Client Version: v1.18.3
Server Version: v1.18.3

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
New cluster on ubuntu 18.04.

5. What happened after the commands executed?

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

+ExperimentalClusterDNS

  kubeDNS:
    provider: CoreDNS
    nodeLocalDNS:
      enabled: true

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
Have only seen this with 1.18.0-beta.1.

KOPS Validate

NODE STATUS
NAME						ROLE	READY
ip-10-16-82-143.us-west-2.compute.internal	master	True
ip-10-16-118-80.us-west-2.compute.internal	master	True
ip-10-16-53-225.us-west-2.compute.internal	master	True
...
VALIDATION ERRORS
KIND	NAME					MESSAGE
Pod	kube-system/node-local-dns-4lp6p	system-node-critical pod "node-local-dns-4lp6p" is not ready (node-cache)
Pod	kube-system/node-local-dns-57qj4	system-node-critical pod "node-local-dns-57qj4" is not ready (node-cache)
Pod	kube-system/node-local-dns-twlnp	system-node-critical pod "node-local-dns-twlnp" is not ready (node-cache)

...
node-local-dns-4lp6p 0/1     CrashLoopBackOff   7          15m   10.16.82.143    ip-10-16-82-143.us-west-2.compute.internal
node-local-dns-57qj4 0/1     CrashLoopBackOff   7          15m   10.16.53.225    ip-10-16-53-225.us-west-2.compute.internal
node-local-dns-twlnp 0/1     CrashLoopBackOff   7          15m   10.16.118.80    ip-10-16-118-80.us-west-2.compute.internal 

OS

Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1019-aws x86_64)
...
0 packages can be updated.
0 updates are security updates.

Listening sockets

root@ip-10-16-53-225:~# netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      3562/rpcbind        
tcp        0      0 127.0.0.1:21362         0.0.0.0:*               LISTEN      11487/aws-iam-authe 
tcp        0      0 127.0.0.1:36467         0.0.0.0:*               LISTEN      8395/kubelet        
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      786/systemd-resolve 
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1008/sshd           
tcp        0      0 10.16.53.225:3996       0.0.0.0:*               LISTEN      6638/etcd-manager   
tcp        0      0 10.16.53.225:3997       0.0.0.0:*               LISTEN      6556/etcd-manager   
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      8395/kubelet        
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      6953/kube-proxy     
tcp        0      0 127.0.0.1:9099          0.0.0.0:*               LISTEN      12480/calico-node   
tcp6       0      0 :::2380                 :::*                    LISTEN      7996/etcd           
tcp6       0      0 :::10252                :::*                    LISTEN      7215/kube-controlle 
tcp6       0      0 :::2381                 :::*                    LISTEN      7898/etcd           
tcp6       0      0 :::10255                :::*                    LISTEN      8395/kubelet        
tcp6       0      0 :::111                  :::*                    LISTEN      3562/rpcbind        
tcp6       0      0 :::10256                :::*                    LISTEN      6953/kube-proxy     
tcp6       0      0 :::8080                 :::*                    LISTEN      7459/kube-apiserver 
tcp6       0      0 :::10257                :::*                    LISTEN      7215/kube-controlle 
tcp6       0      0 :::10259                :::*                    LISTEN      7300/kube-scheduler 
tcp6       0      0 :::22                   :::*                    LISTEN      1008/sshd           
tcp6       0      0 :::443                  :::*                    LISTEN      8131/kube-apiserver 
tcp6       0      0 :::4001                 :::*                    LISTEN      7996/etcd           
tcp6       0      0 :::4002                 :::*                    LISTEN      7898/etcd           
tcp6       0      0 :::33379                :::*                    LISTEN      11406/kops-controll 
tcp6       0      0 :::10250                :::*                    LISTEN      8395/kubelet        
tcp6       0      0 :::10251                :::*                    LISTEN      7300/kube-scheduler

Interfaces

root@ip-10-16-53-225:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 02:6b:0a:98:6f:9a brd ff:ff:ff:ff:ff:ff
    inet 10.16.53.225/19 brd 10.16.63.255 scope global dynamic ens5
       valid_lft 1913sec preferred_lft 1913sec
    inet6 fe80::6b:aff:fe98:6f9a/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:91:b0:b1:1e brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
6: calie9481df5a79@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
10: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default 
    link/ether 8a:40:29:d0:cf:e9 brd ff:ff:ff:ff:ff:ff
    inet 100.96.1.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::8840:29ff:fed0:cfe9/64 scope link 
       valid_lft forever preferred_lft forever
23: nodelocaldns: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 16:fd:47:d7:71:56 brd ff:ff:ff:ff:ff:ff
    inet 169.254.20.10/32 brd 169.254.20.10 scope global nodelocaldns
       valid_lft forever preferred_lft forever
    inet 100.64.0.10/32 brd 100.64.0.10 scope global nodelocaldns
       valid_lft forever preferred_lft forever

node-local-dns logs

kubectl logs node-local-dns-4zf8h -n kube-system
2020/06/02 08:40:44 [INFO] Using Corefile /etc/Corefile
2020/06/02 08:40:44 [INFO] Updated Corefile with 0 custom stubdomains and upstream servers /etc/resolv.conf
2020/06/02 08:40:44 [INFO] Using config file:
cluster.local:53 {
    errors
    cache {
      success 9984 30
      denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . /etc/resolv.conf {
      force_tcp
    }
    prometheus :9253
}
2020/06/02 08:40:44 [INFO] Updated Corefile with 0 custom stubdomains and upstream servers /etc/resolv.conf
2020/06/02 08:40:44 [INFO] Using config file:
cluster.local:53 {
    errors
    cache {
      success 9984 30
      denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . 100.71.54.124 {
      force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 100.64.0.10
    forward . /etc/resolv.conf {
      force_tcp
    }
    prometheus :9253
}
2020/06/02 08:40:44 [INFO] Tearing down
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw PREROUTING [-p tcp -d 169.254.20.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw PREROUTING [-p udp -d 169.254.20.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter INPUT [-p tcp -d 169.254.20.10 --dport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter INPUT [-p udp -d 169.254.20.10 --dport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -s 169.254.20.10 --sport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p udp -s 169.254.20.10 --sport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter OUTPUT [-p tcp -s 169.254.20.10 --sport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter OUTPUT [-p udp -s 169.254.20.10 --sport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -d 169.254.20.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p udp -d 169.254.20.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -d 169.254.20.10 --dport 8080 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -s 169.254.20.10 --sport 8080 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw PREROUTING [-p tcp -d 100.64.0.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw PREROUTING [-p udp -d 100.64.0.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter INPUT [-p tcp -d 100.64.0.10 --dport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter INPUT [-p udp -d 100.64.0.10 --dport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -s 100.64.0.10 --sport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p udp -s 100.64.0.10 --sport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter OUTPUT [-p tcp -s 100.64.0.10 --sport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {filter OUTPUT [-p udp -s 100.64.0.10 --sport 53 -j ACCEPT]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -d 100.64.0.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p udp -d 100.64.0.10 --dport 53 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -d 100.64.0.10 --dport 8080 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added back nodelocaldns rule - {raw OUTPUT [-p tcp -s 100.64.0.10 --sport 8080 -j NOTRACK]}
2020/06/02 08:40:44 [INFO] Added interface - nodelocaldns
listen tcp 169.254.20.10:8080: bind: address already in use
fred-vogt changed the title from "node-local-dns crash looping on masters 'listen tcp 169.254.20.10:8080: bind: address already in use'" to "node-local-dns crash looping on masters '169.254.20.10:8080: bind: address already in use'" on Jun 2, 2020
fred-vogt (Author) commented:

Have seen this on 2 different clusters created with KOPS 1.18.0-beta.1.

Will test with 1.18.0-alpha.3 tomorrow.
I thought I've seen node local dns working before today.

rifelpet (Member) commented Jun 2, 2020

Hi @fred-vogt, thanks for the report. This does seem like an issue we'll need to fix for Kops 1.18.0.

This is the listening port we'll need to change in order to not conflict with kube-apiserver:

health {{ or .KubeDNS.NodeLocalDNS.LocalIP "169.254.20.10" }}:8080

@mazzy89, since you implemented it, do you have any concerns with that port being updated? We'd probably pick a new port and add it here.
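
For illustration only, picking a new port would amount to swapping the hard-coded 8080 in that template line for a currently unused host port (the value below is purely hypothetical, not a decision; whatever is chosen would also need a wellknownports entry):

health {{ or .KubeDNS.NodeLocalDNS.LocalIP "169.254.20.10" }}:3998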

rifelpet added the blocks-next and priority/important-soon labels on Jun 2, 2020
mazzy89 (Contributor) commented Jun 2, 2020

Thank you @rifelpet for pinging me on this. I brought that port in from the upstream node-local-dns cache manifest. I need to check and verify whether changing that port would have any impact on the component itself. I'll look into it and reply in a while.

fred-vogt (Author) commented Jun 2, 2020

@rifelpet, @mazzy89 - FYI, this doesn't happen with kops 1.18.0-alpha.3.

Version 1.18.0-alpha.3 (git-27aab12b2)

If there is more info to collect that is helpful I can provide it.

Thanks all.
Would love to use this feature for 1.18 GA.

johngmyers (Member) commented:

The kube-apiserver-healthcheck sidecar that this conflicts with was added after 1.18.0-alpha.3.

For reference, node-local-dns isn't in the 1.17 branch, so this doesn't affect that branch despite kube-apiserver-healthcheck being backported there.

johngmyers (Member) commented:

/milestone v1.18

k8s-ci-robot (Contributor) commented:

@johngmyers: You must be a member of the kubernetes/kops-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Kops Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

johngmyers (Member) commented Jun 2, 2020

Both node-local-dns and kube-apiserver-healthcheck should be using ports registered in pkg/wellknownports/wellknownports.go if they're exposing ports on the host network. Probably neither one should be using port 8080 on the host network.
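
For reference, entries in that file are just named port constants; registrations for these two components might look roughly like this (a sketch only, with hypothetical names and values, not the actual contents of wellknownports.go):

// pkg/wellknownports/wellknownports.go (illustrative sketch, not real entries)
const (
	// KubeAPIServerHealthCheck: hypothetical host-network port for the healthcheck sidecar.
	KubeAPIServerHealthCheck = 3990
	// NodeLocalDNSHealthCheck: hypothetical host-network port for node-local-dns's health endpoint.
	NodeLocalDNSHealthCheck = 3998
)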

rifelpet added this to the v1.18 milestone on Jun 2, 2020
mazzy89 (Contributor) commented Jun 2, 2020

I've checked and tried to find out how node-local-dns uses port 8080. It is configured in the upstream manifest https://github.com/kubernetes/kubernetes/blob/a4e7db7cc3505bada9f98f6f0f7f21306cf217e2/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml#L70 for health check purposes. I haven't found any other reference to it in the upstream code base so far, though.
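
From memory of that manifest, the referenced line is the health directive in the node-local-dns ConfigMap's Corefile, roughly:

    health __PILLAR__LOCAL__DNS__:8080

where __PILLAR__LOCAL__DNS__ is the placeholder that node-cache substitutes with the configured link-local IP at startup.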

rifelpet (Member) commented Jun 2, 2020

I'm guessing the health check port is only used by the livenessProbe, so that would be the only other reference we would need to update when changing the port. Perhaps the docs can be updated to mention the health and prometheus ports in case users want to consume them.
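
For reference, the probe in question in the upstream DaemonSet looks roughly like this (paraphrased from memory; exact field values may differ), so a port change would need to land both in the Corefile health directive and here:

        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8080
          initialDelaySeconds: 60
          timeoutSeconds: 5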

jim-barber-he (Contributor) commented:

This problem has also occurred for me using the newly released Kops 1.17.0 with a Debian 10 (buster) cloud image and Kubernetes 1.17.6, so by the looks of things it's not limited to kops 1.18.

jim-barber-he (Contributor) commented:

This problem does not happen with kops 1.17.0-beta.2.
The release notes covering 1.17.0-beta.2 to 1.17.0 show that the kube-apiserver healthcheck-via-sidecar-container patch was pulled in and enabled.

johngmyers (Member) commented:

@jim-barber-he I don't see nodelocaldns in the 1.17 branch. It appears to have been added to 1.18 in #8780 and I see no evidence of backporting.

So how is it you are getting nodelocaldns with kops 1.17.0?

johngmyers (Member) commented:

@fred-vogt I'm aware that kube-apiserver-healthcheck is in 1.17. But nodelocaldns is not, so there's no problem.

fred-vogt (Author) commented:

Oops. Removed that comment.

jim-barber-he (Contributor) commented:

> @jim-barber-he I don't see nodelocaldns in the 1.17 branch. It appears to have been added to 1.18 in #8780 and I see no evidence of backporting.
>
> So how is it you are getting nodelocaldns with kops 1.17.0?

It's still possible to install nodelocaldns into your cluster external to kops and have it all work with a few tweaks to their config file (apart from the find-and-replace on their placeholder vars, we also added a toleration so it starts on the master nodes too).
For earlier versions of nodelocaldns we used to have to set spec.kubelet.clusterDNS to 169.254.20.10, but with the 1.18 version of nodelocaldns we've been using, that isn't necessary anymore.
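
Roughly, the tweaks described above amount to something like the following (a sketch, not the exact manifests in use): a master toleration in the DaemonSet pod spec,

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule

and, for those earlier nodelocaldns versions, in the kops cluster spec:

  kubelet:
    clusterDNS: 169.254.20.10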

We've been doing that since Kube 1.15 for a healthier cluster; it's an essential part of our Kubernetes infrastructure.
Having this problem in 1.17 is a show-stopper for us.

johngmyers (Member) commented:

If you're installing something into a cluster external to kops, you can adjust the port it uses so it doesn't conflict.
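
For an externally managed install, that would mean, for example, changing the health directive in the Corefile to a free host port (the value here is arbitrary) and updating the DaemonSet's livenessProbe port to match:

    health 169.254.20.10:8081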

olemarkus (Member) commented:

I had a look at the manifest for nodelocaldns. It is mostly doing the right thing: it binds to port 8080 on the IP you specify as LocalIP. The conflict arises because the kube-apiserver health check binds to port 8080 on all addresses. As far as I can tell, it would be more suitable if the health check bound to localhost:8080 instead.
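
A minimal standalone illustration of why the specific-address bind fails once a wildcard listener owns the port (using 127.0.0.1 to stand in for 169.254.20.10 so it runs anywhere; this only demonstrates the Linux bind semantics, it is not kops code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Like the healthcheck sidecar: a wildcard listener on port 8080 (shows up as :::8080).
	wildcard, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	defer wildcard.Close()

	// Like node-cache's health endpoint: bind the same port on a specific address.
	_, err = net.Listen("tcp", "127.0.0.1:8080")
	fmt.Println(err) // on Linux: "listen tcp 127.0.0.1:8080: bind: address already in use"
}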

johngmyers (Member) commented:

Fixed by #9373
/close

k8s-ci-robot (Contributor) commented:

@johngmyers: Closing this issue.

In response to this:

Fixed by #9373
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
