
Altering autoscaling group in AWS makes DNS faulty #4050

Closed
BobbyJohansen opened this issue Dec 13, 2017 · 16 comments

@BobbyJohansen

BobbyJohansen commented Dec 13, 2017


------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

kops version 1.7.1

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.

  3. What cloud provider are you using?
    AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?
    I changed the launch configuration in the autoscaling group for my nodes (workers) and terminated the old instances, resulting in new instances being spawned. (I upgraded the instance type.)

  5. What happened after the commands executed?
    The nodes auto-joined the cluster, but there are intermittent DNS issues when attempting to resolve DNS names from some of my pods.

I checked the dns-controller in the kube-system namespace and here is what I noticed (I have replaced my DNS zone name with cluster.dns below):

Found multiple zones for name "cluster.dns", won't manage zone (To fix: provide zone mapping flag with ID of zone)
dnscontroller.go:611] Update desired state: node/ip-X-X-X-X.us-west-2.compute.internal: [{A node/ip-X-X-X-X.us-west-2.compute.internal/internal X.X.X.X true} {A node/role=node/internal 10.1.54.136 true} {A node/role=node/ ip-X-X-X-X.us-west-2.compute.internal true} {A node/role=node/ ip-X-X-X-X.us-west-2.compute.internal true}]
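
If the duplicate-zone warning is what keeps records from being managed, a hedged sketch of pinning dns-controller to a single hosted zone (flag syntax assumed from the dns-controller docs; the zone ID below is a placeholder):

# find the ID of the hosted zone you actually want managed
aws route53 list-hosted-zones-by-name --dns-name cluster.dns
# add a name/ID zone mapping to the dns-controller container args, e.g. --zone=cluster.dns/ZXXXXXXXXXXXX
kubectl -n kube-system edit deployment dns-controller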

I also noticed this error in the sidecar of one of the kube-dns pods:

ERROR: logging before flag.Parse: W1212 23:30:15.361488 1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:58316->127.0.0.1:53: read: connection refused

I am using kube-router as my CNI, and the kube-dns-autoscaler also throws some errors:

E1213 01:33:16.911664 1 autoscaler_server.go:86] Error while getting cluster status: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: i/o timeout
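
That timeout suggests pods on the new nodes cannot reach the cluster API service at all, which would also explain flaky DNS. A quick hedged check (the busybox image tag is an assumption; the service and ConfigMap names are the Kubernetes defaults):

# 100.64.0.1 should be the ClusterIP of the default kubernetes service, and kube-dns should have endpoints
kubectl get svc kubernetes
kubectl -n kube-system get endpoints kube-dns
# standard in-cluster DNS check from a throwaway pod
kubectl run dns-debug --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default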

  6. What did you expect to happen?
    DNS should be fine. I am guessing I am ignorant of something that manages DNS in my cluster.
@thomasjungblut

We actually see similar behaviour on our newly upgraded 1.8.4 cluster. We have a cluster-autoscaler running, and every time a node is terminated it stays around as a zombie.

We solve it by deleting the node that can't be contacted via:

kubectl delete node XXX

To me it looks like Kubernetes doesn't pick up that a node was terminated, though.
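
A hedged one-liner version of that workaround, assuming the zombie nodes show up as NotReady in kubectl get nodes:

kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}' | xargs -r kubectl delete node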

@BobbyJohansen
Author

kubectl get nodes does not show any phantom nodes, though. It shows all the new nodes that the autoscaling group created.

@thomasjungblut

Yeah, maybe it is also a slightly different issue, since we see "no route to host" rather than refused connections or timeouts.

@BobbyJohansen
Author

It is almost like kube-dns, kube-dns-autoscaler, and dns-controller couldn't handle new nodes coming into the system. I moved over to the new nodes in a rolling fashion. I must be missing some understanding of how Kubernetes handles DNS under the hood. I just assumed it would detect the new nodes and scale the number of kube-dns pods accordingly. However, my bet is that the autoscaler can't scale because of the timeout issue; that IP does not seem to exist in my cluster.

@BobbyJohansen
Author

I attempted to scale up the DNS pods by setting "min": 3 in the dns-autoscaler ConfigMap, which scaled them up, but with no effect on my pods attempting to resolve a host using DNS.
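
For reference, a hedged sketch of that change (the ConfigMap name and "linear" key are assumed from the default cluster-proportional-autoscaler setup; values are illustrative):

kubectl -n kube-system edit configmap kube-dns-autoscaler
#   data:
#     linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":3}'
kubectl -n kube-system get deployment kube-dns   # confirm the replica count actually changed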

@BobbyJohansen
Author

Also feels very similar to this issue: kubernetes/kubeadm#193

@thomasjungblut

Thanks for the link, Bobby. I think we may have forgotten to update the kops AMI when upgrading; there was a fix for iptables and forwarding in #3958.

@BobbyJohansen
Author

Oh, interesting! Can I resolve this in my current clusters manually, or should I rebuild my clusters using kops master?

@thomasjungblut

I believe updating your instance groups to use the new AMI should be enough to pick it up.
Here's the manifest with the latest AMIs:
https://github.com/kubernetes/kops/blob/master/channels/stable#L16
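
A hedged sketch of bumping the image on an instance group (the cluster name and image string are placeholders; take the actual image from the stable channel manifest linked above):

kops edit ig nodes --name $CLUSTER_NAME
#   spec:
#     image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-YYYY-MM-DD
kops update cluster $CLUSTER_NAME --yes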

@chrislovecnm
Contributor

Actually nodeup in kops 1.8 should fix it. Kops is backwards compatible. I would test rolling the cluster with kops 1.8 in dev.
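
A hedged sketch of that test against a dev cluster (the cluster name is a placeholder):

kops version                                     # confirm the client is 1.8.x
kops update cluster $CLUSTER_NAME --yes
kops rolling-update cluster $CLUSTER_NAME        # preview which nodes would be replaced
kops rolling-update cluster $CLUSTER_NAME --yes  # then actually roll them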

@BobbyJohansen
Author

BobbyJohansen commented Dec 13, 2017

So I should just upgrade to kops 1.8 and recreate? Thank you guys for responding, I really appreciate it.

@BobbyJohansen
Author

Will try switching AMIs and then moving to kops 1.8

@chrislovecnm
Contributor

You need kops 1.8; that has the fix. The AMI does not matter, but personally I would do both. You can do a rolling update. But I am not certain that is your problem. Test ;)

@BobbyJohansen
Author

Update: upgrading the AMIs on all of the worker nodes did not resolve the issue. Moving to a 1.8 cluster and kops 1.8.

@BobbyJohansen
Author

BobbyJohansen commented Dec 13, 2017

Do you guys know if the errors in the kube-dns sidecar are benign or not? I have seen opinions both ways:

ERROR: logging before flag.Parse: I1213 22:19:02.887520 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
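
One hedged way to check, using the Prometheus port and path from the startup log above (the pod name is a placeholder):

kubectl -n kube-system port-forward <kube-dns-pod> 10054:10054
# in another terminal, look at the probe counters the sidecar exports
curl -s http://127.0.0.1:10054/metrics | grep -i probe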

@BobbyJohansen
Author

This can be resolved
