
Altering autoscaling group in AWS makes DNS faulty #4050

Closed
BobbyJohansen opened this issue Dec 13, 2017 · 16 comments

@BobbyJohansen

BobbyJohansen commented Dec 13, 2017


------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

kops version 1.7.1

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

  3. What cloud provider are you using?
    AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?
    I changed the launch configuration in the autoscaling group for my nodes (workers) and terminated the old instances, resulting in new instances being spawned. (I upgraded the instance type.)

  5. What happened after the commands executed?
    The nodes auto-joined the cluster, but there are intermittent DNS issues when attempting to resolve DNS names from some of my pods.

I checked the dns-controller in the kube-system namespace and here is what I noticed (I have replaced my DNS zone name with cluster.dns below):

Found multiple zones for name "cluster.dns", won't manage zone (To fix: provide zone mapping flag with ID of zone)
dnscontroller.go:611] Update desired state: node/ip-X-X-X-X.us-west-2.compute.internal: [{A node/ip-X-X-X-X.us-west-2.compute.internal/internal X.X.X.X true} {A node/role=node/internal 10.1.54.136 true} {A node/role=node/ ip-X-X-X-X.us-west-2.compute.internal true} {A node/role=node/ ip-X-X-X-X.us-west-2.compute.internal true}]
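
If the duplicate-zone warning is what keeps records from being managed, a hedged sketch of pinning dns-controller to a single hosted zone (flag syntax assumed from the dns-controller docs; the zone ID below is a placeholder):

# find the ID of the hosted zone you actually want managed
aws route53 list-hosted-zones-by-name --dns-name cluster.dns
# add a name/ID zone mapping to the dns-controller container args, e.g. --zone=cluster.dns/ZXXXXXXXXXXXX
kubectl -n kube-system edit deployment dns-controller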

I also noticed this error in the sidecar of one of the kube-dns pods:

ERROR: logging before flag.Parse: W1212 23:30:15.361488 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:58316->127.0.0.1:53: read: connection refused

I am using kube-router as my CNI, and the kube-dns-autoscaler also throws some errors:

E1213 01:33:16.911664 1 autoscaler_server.go:86] Error while getting cluster status: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: i/o timeout
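
That timeout suggests pods on the new nodes cannot reach the cluster API service at all, which would also explain flaky DNS. A quick hedged check (the busybox image tag is an assumption; the service and ConfigMap names are the Kubernetes defaults):

# 100.64.0.1 should be the ClusterIP of the default kubernetes service, and kube-dns should have endpoints
kubectl get svc kubernetes
kubectl -n kube-system get endpoints kube-dns
# standard in-cluster DNS check from a throwaway pod
kubectl run dns-debug --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default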

  6. What did you expect to happen?
    DNS should be fine. I am guessing I am ignorant of something that manages DNS in my cluster.
@thomasjungblut

We actually see similar behaviour on our newly upgraded 1.8.4 cluster. We have a cluster-autoscaler running, and every time a node is terminated it stays around as a zombie.

We solve it by deleting the node that can't be contacted via:

kubectl delete node XXX

To me it looks like Kubernetes doesn't pick up that a node was terminated, though.
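
A hedged one-liner version of that workaround, assuming the zombie nodes show up as NotReady in kubectl get nodes:

kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}' | xargs -r kubectl delete node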

@BobbyJohansen
Author

kubectl get nodes does not show any phantom nodes, though. It shows all the new nodes that the autoscaling group created.

@thomasjungblut

Yeah, maybe it is also a slightly different issue, since we see "no route to host" rather than refused connections or timeouts.

@BobbyJohansen
Author

It is almost like kube-dns, kube-dns-autoscaler, and dns-controller couldn't handle new nodes coming into the system. I moved over to the new nodes in a rolling fashion. I must be missing some understanding of how Kubernetes handles DNS under the hood. I just assumed it would detect the new nodes and scale the number of kube-dns pods accordingly. However, my bet is that the autoscaler can't scale because of the timeout issue; that IP does not seem to exist in my cluster.

@BobbyJohansen
Author

I attempted to scale up the DNS pods by setting "min": 3 in the dns-autoscaler ConfigMap, which scaled them up, but with no effect on my pods attempting to resolve a host using DNS.
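
For reference, a hedged sketch of that change (the ConfigMap name and "linear" key are assumed from the default cluster-proportional-autoscaler setup; values are illustrative):

kubectl -n kube-system edit configmap kube-dns-autoscaler
#   data:
#     linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":3}'
kubectl -n kube-system get deployment kube-dns   # confirm the replica count actually changed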

@BobbyJohansen
Author

Also feels very similar to this issue: kubernetes/kubeadm#193

@thomasjungblut

Thanks for the link, Bobby. I think we may have forgotten to update the kops AMI when upgrading; there was a fix for iptables and forwarding in #3958.

@BobbyJohansen
Author

Oh, interesting! Can I resolve this in my current clusters manually, or should I rebuild my clusters using kops master?

@thomasjungblut

I believe updating your instance groups to use the new AMI should be enough to pick it up.
Here's the manifest with the latest AMIs:
https://github.com/kubernetes/kops/blob/master/channels/stable#L16
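
A hedged sketch of bumping the image on an instance group (the cluster name and image string are placeholders; take the actual image from the stable channel manifest linked above):

kops edit ig nodes --name $CLUSTER_NAME
#   spec:
#     image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-YYYY-MM-DD
kops update cluster $CLUSTER_NAME --yes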

@chrislovecnm
Contributor

Actually nodeup in kops 1.8 should fix it. Kops is backwards compatible. I would test rolling the cluster with kops 1.8 in dev.
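
A hedged sketch of that test against a dev cluster (the cluster name is a placeholder):

kops version                                     # confirm the client is 1.8.x
kops update cluster $CLUSTER_NAME --yes
kops rolling-update cluster $CLUSTER_NAME        # preview which nodes would be replaced
kops rolling-update cluster $CLUSTER_NAME --yes  # then actually roll them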

@BobbyJohansen
Author

BobbyJohansen commented Dec 13, 2017

So I should just upgrade to kops 1.8 and recreate? Thank you guys for responding, I really appreciate it.

@BobbyJohansen
Author

Will try switching AMIs and then moving to kops 1.8

@chrislovecnm
Contributor

You need kops 1.8; that has the fix. The AMI does not matter, but personally I would do both. You can do a rolling update. But I am not certain that is your problem. Test ;)

@BobbyJohansen
Author

Update: upgrading the AMIs on all of the worker nodes did not resolve the issue. Moving to a 1.8 cluster and kops 1.8.

@BobbyJohansen
Author

BobbyJohansen commented Dec 13, 2017

Do you guys know if the errors in the kube-dns sidecar are benign or not? I have seen opinions both ways:

ERROR: logging before flag.Parse: I1213 22:19:02.887520 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
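
One hedged way to check, using the Prometheus port and path from the startup log above (the pod name is a placeholder):

kubectl -n kube-system port-forward <kube-dns-pod> 10054:10054
# in another terminal, look at the probe counters the sidecar exports
curl -s http://127.0.0.1:10054/metrics | grep -i probe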

@BobbyJohansen
Author

This can be resolved
