DNS intermittent delays of 5s #56903

Closed
mikksoone opened this issue Dec 6, 2017 · 255 comments
Labels
area/dns kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@mikksoone

mikksoone commented Dec 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
DNS lookups sometimes take 5 seconds.

What you expected to happen:
No delays in DNS.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster in AWS using kops with CNI networking:
kops create cluster --node-count 3 --zones eu-west-1a,eu-west-1b,eu-west-1c --master-zones eu-west-1a,eu-west-1b,eu-west-1c --dns-zone kube.example.com --node-size t2.medium --master-size t2.medium --topology private --networking cni --cloud-labels "Env=Staging" ${NAME}
  2. Install the CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
  3. Run this script in any pod that has curl:
var=1
while true ; do
  # measure only the name-lookup portion of the request time
  res=$( { curl -o /dev/null -s -w %{time_namelookup}\\n  http://www.google.com; } 2>&1 )
  var=$((var+1))
  # report and stop as soon as a lookup takes 1 second or more
  if [[ $res =~ ^[1-9] ]]; then
    now=$(date +"%T")
    echo "$var slow: $res $now"
    break
  fi
done

Anything else we need to know?:

  1. I am encountering this issue in both staging and production clusters, but for some reason the staging cluster has a lot more 5s delays.
  2. Delays happen both for external names (google.com) and for internal ones, such as service.namespace.
  3. Happens on both Kubernetes 1.6 and 1.7, but I did not encounter these issues in 1.5 (though the setup was a bit different - no CNI back then).
  4. Have not tested 1.7 without CNI yet.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 16.04.3 LTS"
  • Kernel (e.g. uname -a):
Linux ingress-nginx-3882489562-438sm 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Similar issues

  1. Kube DNS Latency dns#96 - closed, but seems to be exactly the same issue.
  2. kube-dns: dnsmasq intermittent connection refused #45976 - has some comments matching this issue, but is heading toward fixing the kube-dns up/down scaling problem and is not about the intermittent failures.

/sig network

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 6, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Dec 6, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@cmluciano
Member

@kubernetes/sig-network-misc

@kgignatyev-inspur

I have a similar issue: consistently slow DNS resolution from pods, 20 seconds plus.
from busybox:
time nslookup google.com
Server: 100.64.0.10
Address 1: 100.64.0.10

Name: google.com
Address 1: 2607:f8b0:400a:806::200e
Address 2: 172.217.3.206 sea15s12-in-f14.1e100.net
real 0m 50.03s
user 0m 0.00s
sys 0m 0.00s
/ #

I just created a 1.8.5 cluster in AWS with kops; the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2).

Resolution from the hosts is instantaneous; from pods it is consistently slow.

@ani82

ani82 commented Dec 23, 2017

We observe the same on GKE with version v1.8.4-gke0, with both busybox (latest) and Debian 9:

$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com
Server: 10.39.240.10
Address 1: 10.39.240.10 kube-dns.kube-system.svc.cluster.local

Name: storage.googleapis.com
Address 1: 2607:f8b0:400c:c06::80 vl-in-x80.1e100.net
Address 2: 74.125.141.128 vl-in-f128.1e100.net
real 0m 10.02s
user 0m 0.00s
sys 0m 0.00s

DNS latency varies between 10 and 40s in multiples of 5s.

@thockin
Member

thockin commented Jan 6, 2018

5s pretty much ALWAYS indicates a DNS timeout, meaning some packet got dropped somewhere.
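For context, this matches the stock glibc resolver defaults (see resolv.conf(5)): roughly a 5-second timeout per attempt with 2 attempts, so a single dropped UDP packet surfaces as a ~5s stall. A resolv.conf tuning sketch, with values that are illustrative only and not a recommendation from this thread:

options timeout:2 attempts:3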

@ani82

ani82 commented Jan 9, 2018

Yes, it seems as if the local DNS servers time out instead of answering:

[root@busybox /]# nslookup google.com
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

[root@busybox /]# tcpdump port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:38:10.423547 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:10.424120 IP busybox.46757 > kube-dns.kube-system.svc.cluster.local.domain: 41018+ PTR? 10.240.39.10.in-addr.arpa. (43)
15:38:10.424595 IP kube-dns.kube-system.svc.cluster.local.domain > busybox.46757: 41018 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (95)
15:38:15.423611 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:20.423809 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:25.424247 IP busybox.44496 > kube-dns.kube-system.svc.cluster.local.domain: 63451+ A? google.com.svc.cluster.local. (46)
15:38:30.424508 IP busybox.39936 > kube-dns.kube-system.svc.cluster.local.domain: 14687+ A? google.com.cluster.local. (42)
15:38:35.424767 IP busybox.56675 > kube-dns.kube-system.svc.cluster.local.domain: 37241+ A? google.com.c.retailcatalyst-187519.internal. (61)
15:38:40.424992 IP busybox.35842 > kube-dns.kube-system.svc.cluster.local.domain: 22668+ A? google.com.google.internal. (44)
15:38:45.425295 IP busybox.52037 > kube-dns.kube-system.svc.cluster.local.domain: 6207+ A? google.com. (28)
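The capture above also shows the cluster search-path expansion: with the default ndots:5 in the pod's resolv.conf, google.com is tried against each search domain before the absolute name is queried. A quick check from inside the pod is to query the name with a trailing dot, which marks it as fully qualified and skips the search list:

nslookup google.com.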

@aguerra

aguerra commented Jan 19, 2018

Just in case someone got here because of DNS delays: in our case it was ARP table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
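For anyone hitting the same ARP-table limit, a sketch of the node-level tuning; the sysctl names are standard Linux, while the values are illustrative assumptions to size against your pod density:

sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768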

@lbrictson

We have the same issue in all five of our kops-deployed AWS clusters. We tried moving from weave to flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently.

Our ARP tables are nowhere near full (usually fewer than 100 entries).

@bowei
Member

bowei commented Jan 19, 2018

There are QPS limits on DNS in various places. I think in the past people have hit AWS DNS server QPS limits in some cases; that may be worth checking.

@lbrictson

@bowei sadly this happens in very small clusters as well for us, ones with so few containers that there is no feasible way we'd be hitting the AWS QPS limit.

@mikksoone
Author

Same here: small clusters, no ARP or QPS limits.
dnsPolicy: Default works without delays, but unfortunately this cannot be used for all deployments.

@lbrictson

@mikksoone exact same situation as us then: dnsPolicy: Default fixes the problem entirely, but of course breaks access to services internal to the cluster, which is a no-go for most of ours.
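For reference, a minimal sketch of the dnsPolicy: Default workaround discussed above (pod name and image are placeholders). It makes the pod inherit the node's resolv.conf, which is why cluster-internal service names stop resolving:

apiVersion: v1
kind: Pod
metadata:
  name: dns-default-example
spec:
  dnsPolicy: Default          # use the node's resolver instead of kube-dns
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]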

@vasartori
Contributor

vasartori commented Jan 19, 2018

@bowei We have the same problem here.
But we are not using AWS.

@vasartori
Contributor

It seems to be a problem with glibc.
If you set a timeout in your /etc/resolv.conf, that timeout is respected.

The problem appears on CoreOS Stable (glibc 2.23).

Even with timeout:0 in resolv.conf you still get a 1-second delay.

I've tried disabling IPv6, without success.

@vasartori
Contributor

In my tests, adding this option to /etc/resolv.conf fixed the problem:
options single-request-reopen

But I haven't found a "clean" way to apply it to pods in Kubernetes 1.8. What I do:

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"

@mikksoone Could you try it and see if it solves your problem too?
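A possibly cleaner alternative on newer clusters is the pod-level dnsConfig field (alpha behind the CustomPodDNS feature gate in 1.9, stable in 1.14); a sketch, untested in this thread:

spec:
  dnsConfig:
    options:
    - name: single-request-reopen   # appended to the pod's resolv.conf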

@aca02djr

Also experiencing this on 1.8.6-gke.0 - @vasartori's suggested solution resolved the issue for us too 👍🏻

@mikksoone
Author

Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so.

@lauri-elevant

lauri-elevant commented Feb 5, 2018

We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori !

@sdtokkolabs

I've been having this issue for some time on Kubernetes 1.7 and 1.8: DNS queries were being dropped from time to time.
Yesterday I upgraded my cluster from 1.8.10 to 1.9.6 (kops from 1.8 to 1.9.0-alpha.3) and I started having this same issue ALL THE TIME. The workaround suggested in this issue has no effect, and I can't find any way of stopping it. I've made a small workaround by pinning the most requested (and problematic) DNS names to fixed IPs in /etc/hosts.
Any idea where the real problem is?
I'll test with a brand-new cluster on the same versions and report back.
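A pod-level equivalent of that /etc/hosts pinning is the hostAliases field; a sketch where the address and hostname are placeholders:

spec:
  hostAliases:
  - ip: "203.0.113.10"
    hostnames:
    - "service.example.internal"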

@xiaoxubeii
Member

Same problem, but the strangest thing is that it only appears on some nodes.

@rajatjindal
Contributor

@thockin @bowei

Requesting your feedback, therefore tagging you.

Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

There are multiple issues reported for this in the Kubernetes project, and it would be great to have it resolved for everyone.

@sdtokkolabs

I've tried several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, flannel, canal and weave as network providers, with the lowest incidence on flannel. I've also tried overloading the nodes and splitting them (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is far higher than on a brand-new cluster, and I can't find a way to isolate the problem :(

@zhan849
Contributor

zhan849 commented Apr 14, 2020

We at Pinterest are using kernel 5.0 and the default iptables setup, but are still hitting this issue pretty badly.

Here is a pcap trace clearly showing UDP packets not getting forwarded out to DNS, with the client side hitting 5s timeouts / DNS-level retries. 10.3.253.87 and 10.3.212.90 are user pods and 10.3.23.54.domain is the DNS pod:

07:25:14.196136 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
--
07:25:14.376267 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:19.196210 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:19.376469 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:24.196365 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:24.383758 IP 10.3.212.90.45350 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 39
07:25:26.795923 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345 > 10.3.23.54.domain: 8166+ [1au] A? kubernetes.default. (47)
07:25:26.797035 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345: 8166 NXDomain 0/0/1 (47)
07:25:29.203369 IP 10.3.253.87.57701 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 60
07:25:29.203408 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701 > 10.3.23.54.domain: 52793+ [1au] A? mavenrepo-external.pinadmin.com. (60)
07:25:29.204446 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701: 52793* 10/0/1 CNAME pinrepo-external.pinadmin.com., CNAME internal-vpc-pinrepo-pinadmin-internal-1188100222.us-east-1.elb.amazonaws.com., A 10.1.228.192, A 10.1.225.57, A 10.1.225.205, A 10.1.229.86, A 10.1.227.245, A 10.1.224.120, A 10.1.228.228, A 10.1.229.49 (998)

I ran conntrack -S and there are no insertion failures, which indicates that races 1 and 2 mentioned in this blog are already fixed and we are hitting race 3.

# conntrack -S
cpu=0   	found=175 invalid=0 ignore=83983 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=201
cpu=1   	found=168 invalid=0 ignore=79659 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=299
cpu=2   	found=173 invalid=0 ignore=77880 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4184
cpu=3   	found=161 invalid=0 ignore=78778 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=216
cpu=4   	found=157 invalid=0 ignore=80478 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4245
cpu=5   	found=172 invalid=10 ignore=85572 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=346
cpu=6   	found=146 invalid=0 ignore=85334 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4271
cpu=7   	found=162 invalid=0 ignore=84865 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=230
cpu=8   	found=155 invalid=0 ignore=81691 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=259
cpu=9   	found=164 invalid=1 ignore=81550 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=10  	found=180 invalid=0 ignore=92864 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=270
cpu=11  	found=163 invalid=0 ignore=93113 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=238
cpu=12  	found=171 invalid=0 ignore=80868 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1934
cpu=13  	found=176 invalid=0 ignore=80974 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=2532
cpu=14  	found=174 invalid=0 ignore=91001 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=927
cpu=15  	found=175 invalid=0 ignore=79837 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=585
cpu=16  	found=168 invalid=0 ignore=84899 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=375
cpu=17  	found=172 invalid=0 ignore=84396 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=328
cpu=18  	found=142 invalid=0 ignore=80365 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1012
cpu=19  	found=163 invalid=0 ignore=80193 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=5308
cpu=20  	found=179 invalid=0 ignore=84980 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=565
cpu=21  	found=200 invalid=0 ignore=80537 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=278
cpu=22  	found=153 invalid=0 ignore=83528 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=430
cpu=23  	found=166 invalid=0 ignore=84160 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=24  	found=189 invalid=0 ignore=81400 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=355
cpu=25  	found=183 invalid=0 ignore=82727 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=352
cpu=26  	found=170 invalid=0 ignore=89293 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=275
cpu=27  	found=183 invalid=1 ignore=82717 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=332
cpu=28  	found=188 invalid=0 ignore=83741 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=242
cpu=29  	found=192 invalid=0 ignore=88601 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=396
cpu=30  	found=166 invalid=0 ignore=84152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=329
cpu=31  	found=165 invalid=0 ignore=81369 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=229
cpu=32  	found=170 invalid=0 ignore=84275 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4208
cpu=33  	found=160 invalid=0 ignore=86734 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4122
cpu=34  	found=173 invalid=0 ignore=82152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4231
cpu=35  	found=150 invalid=0 ignore=78019 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4130

There are a few more action items we are trying ATM:

  1. a newer kernel
  2. a single DNS replica (occupying a full node so it has a rather stable runtime), so there is only one rule in iptables
  3. TCP for DNS

Will keep you all updated, but if anyone has already tried any of the three options and has a failure/success story to share, I'd very much appreciate it.
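Regarding option 3 above, one hedged way to force TCP without touching the application is the glibc use-vc resolver option, e.g. via the pod's dnsConfig; note that musl-based images (such as Alpine) ignore it:

spec:
  dnsConfig:
    options:
    - name: use-vc   # glibc 2.14+: send DNS queries over TCP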

/cc @thockin @brb

@szuecs
Member

szuecs commented Apr 14, 2020

@zhan849 why not use a DaemonSet and bypass conntrack?
We run dnsmasq in front of CoreDNS in a DaemonSet pod, and use cloud-init and some systemd units to point resolv.conf (via kubelet) at the node-local dnsmasq running in the host network.
It works great even when we spike DNS traffic enough to OOM-kill CoreDNS; this happened in a Node.js-heavy cluster.
https://github.com/zalando-incubator/kubernetes-on-aws/tree/dev/cluster/manifests/coredns-local
Everything else is in systemd units I can't share.

@rata
Member

rata commented Apr 14, 2020

@szuecs please note that there is upstream support for that, stable in 1.18 (I haven't tried it myself): https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

I guess this does not solve the root cause of the ndots problem, although it is probably amortized. I guess it will just solve the conntrack races, which is the issue causing the intermittent delays.
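Deploying it roughly follows the linked page; a sketch assuming kube-proxy in iptables mode and the default cluster.local domain - verify the exact substitutions against the version of nodelocaldns.yaml you download:

kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain=cluster.local
localdns=169.254.20.10   # link-local listen address for the node-local cache
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml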

@zhan849
Contributor

zhan849 commented Apr 14, 2020

@szuecs good question! Proxying DNS in general is an interesting direction worth poking at; however, we run Kubernetes in a highly customized way and at fairly large scale, so it would be hard to plug and play most open-source solutions. Also, as @rata said, it does not solve the root cause.
The three possible alleviations I posted above fit our current production setup better :)

@prameshj
Contributor


NodeLocal DNSCache uses TCP for all upstream DNS queries (alleviation 3 mentioned in your previous comment), in addition to skipping connection tracking for client-pod-to-node-local-DNS requests. It can be configured so that client pods continue to use the same DNS server IP, so the only change would be deploying the DaemonSet. There are at least a couple of comments in this issue about clusters seeing significant improvements in DNS reliability and performance after deploying NodeLocal DNSCache. Hope that feedback helps.

@brb
Contributor

brb commented Apr 15, 2020

@zhan849 Alternatively, you could use Cilium's kube-proxy implementation, which does not suffer from the conntrack races (it does not use netfilter/iptables).

@szuecs
Member

szuecs commented Apr 15, 2020

@zhan849 I just linked our configuration; we also run Kubernetes at quite a large scale.
Anyway, because of coredns/coredns#2593, you want to run https://github.com/coredns/coredns/releases/tag/v1.6.9 with concurrency limits.

@lx308033262

I have the same problem when I scale my CoreDNS deployment to 2 replicas, but no delay when I have 1 replica.

@lx308033262

Hi guys,

I solved this problem.

It can happen when your etcd container and your coredns container are not on the same node, so you may want to do this:

1. Check which nodes contain your etcd and coredns:
[root@iZ2zed8sfcdxbw95lbf2omZ dns_test]# kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-6955765f44-qvb6l 1/1 Running 0 18h 10.244.2.115 iz2zed8sfcdxbw95lbf2olz
etcd-iz2zed8sfcdxbw95lbf2omz 1/1 Running 2 27d 10.25.142.127 iz2zed8sfcdxbw95lbf2omz

[root@iZ2zed8sfcdxbw95lbf2omZ ~]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.11
Server: 10.96.0.11
Address 1: 10.96.0.11

nslookup: can't resolve 'kubernetes'
command terminated with exit code 1

real 1m0.268s
user 0m0.119s
sys 0m0.039s
2. Change the node your coredns container runs on:
[root@iZ2zed8sfcdxbw95lbf2omZ dns_test]# kubectl edit deploy coredns -n kube-system
nodeName: iz2zed8sfcdxbw95lbf2omz

[root@iZ2zed8sfcdxbw95lbf2omZ src]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.10
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: kubernetes
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

real 0m0.243s
user 0m0.140s
sys 0m0.041s
[root@iZ2zed8sfcdxbw95lbf2omZ src]#

@debMan

debMan commented Jan 3, 2021

We faced the same issue on a small self-managed cluster.
The problem was solved by scaling the CoreDNS deployment down to 1 pod.
It is a strange and unexpected solution, but it solved the problem.

Cluster info:

nodes arch/OS:  amd64/debian
master nodes:   1
worker nodes:   6
deployments:    100
pods:           150
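Assuming the stock deployment name, the equivalent one-liner is:

kubectl -n kube-system scale deployment coredns --replicas=1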

@Mainintowhile

Mainintowhile commented Nov 17, 2021

Similar to this issue: the first several lookups get the right response, and then I get Address: 127.0.0.1 every time.

My solution (version 1.19.0):
1. Use the upstream CoreDNS manifest (sed -f transforms2sed.sed coredns.yaml.base > coredns.yaml).
2. Simplify the config, for example:

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local {
        ttl 0
    }
    forward . /etc/resolv.conf
    loop
    reload
}

PS: I suspect the cache-related part has a bug. Note that the IP setting stays the same, e.g. 10.244.0.1/16 (while the node is perhaps 10.244.0.1/24).
Comparison:
1. Keeping etcd and coredns together resolves perfectly.
2. Just simplifying the config is not stable.

@onedr0p

onedr0p commented May 10, 2023

Alpine 3.18, with its included musl 1.2.4, seems to have finally fixed this issue:

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html
