Altering autoscale group in AWS makes DNS faulty #4050
We actually see similar behaviour on our newly upgraded 1.8.4 cluster. We have a cluster-autoscaler running, and every time a node is terminated it stays around as a zombie. We work around it by deleting the node that can't be contacted, roughly as in the sketch below.
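A minimal sketch of that cleanup (the node name is hypothetical):

```sh
# List nodes and spot any stuck in NotReady after the EC2 instance is gone
kubectl get nodes

# Delete the stale Node object so it stops lingering as a zombie
kubectl delete node ip-10-0-1-23.ec2.internal
```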
To me it looks like Kubernetes doesn't pick up a terminated node, though.
kubectl get nodes, though, does not show any phantom nodes. It shows all the new nodes that the autoscaling group created.
Yeah, maybe it is a slightly different issue, since we see "no route to host" rather than refused connections or timeouts.
It is almost as if kube-dns, kube-dns-autoscaler, and dns-controller couldn't handle new nodes coming into the system. I moved over to the new nodes in a rolling fashion. I must be missing some understanding of how Kubernetes handles DNS under the hood; I assumed it would detect the new nodes and scale the number of kube-dns pods accordingly. My bet is that the scaler can't scale because of the timeout issue: that IP does not seem to exist in my cluster.
I attempted to scale up the DNS pods by setting "min":3 in the kube-dns-autoscaler ConfigMap, which scaled them up but had no effect on my pods attempting to resolve hosts via DNS (see the sketch below).
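A minimal sketch of that ConfigMap change, assuming the standard cluster-proportional-autoscaler layout for kube-dns-autoscaler (the coresPerReplica/nodesPerReplica values here are illustrative):

```sh
# Raise the minimum kube-dns replica count via the autoscaler's ConfigMap
kubectl -n kube-system patch configmap kube-dns-autoscaler \
  --patch '{"data":{"linear":"{\"coresPerReplica\":256,\"nodesPerReplica\":16,\"min\":3}"}}'

# Confirm the kube-dns Deployment was scaled up
kubectl -n kube-system get deployment kube-dns
```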
Also feels very similar to this issue: kubernetes/kubeadm#193
Thanks for the link, Bobby. I think we may have forgotten to update the kops AMI when upgrading; there was a fix for iptables and forwarding in #3958.
Oh, interesting! Can I resolve this in my current clusters manually, or should I rebuild my clusters using kops master?
I believe if you update your instance groups to include the new AMI, that should be enough to get it.
Actually, nodeup in kops 1.8 should fix it. kops is backwards compatible. I would test rolling the cluster with kops 1.8 in dev; a sketch of that workflow follows below.
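A minimal sketch of that rolling-update workflow, assuming an instance group named nodes; the cluster name, state-store bucket, and AMI shown are hypothetical:

```sh
# Point kops at the cluster's state store (bucket name is hypothetical)
export KOPS_STATE_STORE=s3://my-kops-state-store

# Edit the instance group and set spec.image to the newer AMI (image shown is illustrative)
kops edit ig nodes --name my.cluster.example.com
#   spec:
#     image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02

# Apply the change, then replace the nodes one at a time
kops update cluster my.cluster.example.com --yes
kops rolling-update cluster my.cluster.example.com --yes
```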
So I should just upgrade to kops 1.8 and recreate? Thank you guys for responding; really appreciate it.
Will try switching AMIs and then moving to kops 1.8.
You need kops 1.8; that has the fix. The AMI does not matter, though personally I would do both. You can do a rolling update. But I am not certain that is your problem. Test ;)
Update: upgrading the AMIs on all of the worker nodes did not resolve the issue. Moving to a 1.8 cluster and kops 1.8 next.
Do you guys know if the errors in the kube-dns sidecar are benign or not? I have seen opinions both ways.
This can be resolved.
Thanks for submitting an issue! Please fill in as much of the template below as you can.
------------- BUG REPORT TEMPLATE --------------------
What kops version are you running? The command `kops version` will display this information.

kops version 1.7.1
What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.

What cloud provider are you using?
AWS
What commands did you run? What is the simplest way to reproduce this issue?
I changed the launch configuration in the autoscaling group for my nodes (workers) and terminated the old instances, which caused new instances to be spawned. (I upgraded the instance type.) A sketch of the equivalent steps follows below.
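A minimal sketch of those steps with the AWS CLI; the group, launch-configuration, and instance names are hypothetical:

```sh
# Point the autoscaling group at a launch configuration with the new instance type
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name nodes.my.cluster.example.com \
  --launch-configuration-name nodes-new-instance-type

# Terminate an old instance; the group replaces it with one built
# from the new launch configuration
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity
```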
What happened after the commands executed?
The nodes auto-joined the cluster, but there are intermittent DNS issues when attempting to resolve DNS names from some of my pods (a way to reproduce this from inside the cluster is sketched below).
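A minimal sketch of exercising cluster DNS from a throwaway pod; the pod name, image, and lookup target are my choices, not from the original report:

```sh
# Run a disposable pod and attempt a lookup through cluster DNS
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default

# Check which kube-dns pod IPs back the service
kubectl -n kube-system get endpoints kube-dns
```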
I checked the dns-controller in kube-system, and here is what I noticed (I have replaced my DNS zone name with cluster.dns below):
I also noticed this error in one of the kube-dns sidecar pods:
I am using kube-router as my CNI.
The kube-dns-autoscaler also throws some errors:
DNS should be fine; I am guessing I am ignorant of something that manages DNS in my cluster.