DNS intermittent delays of 5s #56903
Comments
|
@kubernetes/sig-network-misc |
|
I have a similar issue: consistently slow DNS resolution from pods, 20+ seconds to resolve google.com. I just created a 1.8.5 cluster in AWS with kops, and the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2). Resolution from the hosts is instantaneous; from pods it is consistently slow. |
|
We observe the same on GKE with version v1.8.4-gke0, with both Busybox (latest) and Debian 9: `$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com` (returning `Name: storage.googleapis.com`). DNS latency varies between 10 and 40 s, in multiples of 5 s. |
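A quick way to surface the intermittent nature of these delays is to run the same lookup in a loop and watch the wall-clock times; slow runs show up as multiples of ~5 s. This is only a minimal sketch, assuming the busybox pod from the comment above exists and its shell provides `time`:

```sh
# Run the same lookup repeatedly inside the pod and print how long each run takes.
for i in $(seq 1 100); do
  kubectl exec busybox -- sh -c 'time nslookup storage.googleapis.com >/dev/null' 2>&1 | grep real
done
```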
|
A 5 s delay is pretty much ALWAYS indicative of a DNS timeout (5 s is the resolver's default per-query timeout), meaning some packet got dropped somewhere. |
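For context, the 5 s step size comes from glibc's resolver defaults (`timeout:5`, `attempts:2`), which are implicit and not written into the file, combined with the `ndots:5` option kubelet injects into pod resolv.conf. A quick way to inspect what a pod actually uses — the output shown in comments is only an example, the nameserver IP will differ per cluster:

```sh
# Inspect the pod's resolver config. timeout:5 and attempts:2 are glibc's implicit
# defaults, so each dropped UDP query costs a full 5 s; ndots:5 multiplies the
# number of queries sent for unqualified names.
kubectl exec busybox -- cat /etc/resolv.conf
# nameserver 10.96.0.10                      <- cluster DNS Service IP (example value)
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
```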
|
Yes, it seems as if the local DNS servers time out instead of answering: `[root@busybox /]# nslookup google.com`, `[root@busybox /]# tcpdump port 53` |
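A more targeted capture, if you want to see exactly which pod's queries go unanswered, is to run tcpdump inside the pod's network namespace from the node. A sketch assuming a Docker runtime, with `<container-id>` as a placeholder:

```sh
# Illustrative only: capture a single pod's DNS traffic from the node.
PID=$(docker inspect -f '{{.State.Pid}}' <container-id>)
nsenter -t "$PID" -n tcpdump -ni any udp port 53
```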
|
Just in case someone got here because of DNS delays: in our case it was neighbor (ARP) table overflow on the nodes (`arp -n` showing more than 1000 entries). Increasing the limits solved the problem. |
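For anyone wanting to rule this cause in or out, a minimal sketch of the check and fix described above (the threshold values are illustrative, tune them for your node and pod count):

```sh
# Count neighbor (ARP) entries on the node; the default hard limit is
# net.ipv4.neigh.default.gc_thresh3=1024, and overflowing it drops new entries.
ip neigh | wc -l

# If close to the limit, raise the thresholds.
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
```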
|
We have the same issue in all of our kops-deployed AWS clusters (5). We tried moving from Weave to Flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently. Our ARP tables are nowhere near full (fewer than 100 entries usually). |
|
There are QPS limits on DNS in various places. I think in the past people have hit the AWS DNS server QPS limits in some cases (the VPC-provided resolver throttles each instance at roughly 1024 packets per second per network interface); that may be worth checking. |
|
@bowei sadly this happens in very small clusters as well for us, ones that have so few containers that there is no feasible way we'd be hitting the QPS limit from AWS |
|
Same here, small clusters, no arp nor QPS limits. |
|
@mikksoone exact same situation as us then. `dnsPolicy: Default` fixes the problem entirely, but of course breaks access to services internal to the cluster, which is a no-go for most of ours. |
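For reference, the workaround looks like this in a pod spec (a hypothetical test pod; `dnsPolicy: Default` makes the pod inherit the node's /etc/resolv.conf, which is why cluster-internal service names stop resolving):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-default-test        # hypothetical name
spec:
  dnsPolicy: Default            # use the node's resolver instead of kube-dns
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
EOF
```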
|
@bowei We have the same problem here. |
|
It seems to be a problem with glibc. On CoreOS Stable (glibc 2.23) this problem appears. Even setting the timeout to 0 in resolv.conf, you still get a 1-second delay. I've tried disabling IPv6, without success. |
|
In my tests, using this option in /etc/resolv.conf fixed the problem. @mikksoone could you try whether it solves your problem too? |
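The exact resolv.conf option is not preserved in this extract; the option most often cited in this thread for glibc-based images is `single-request-reopen`, which works around the parallel A/AAAA request race. Purely as an illustration of how such an option can be injected per pod, assuming a Kubernetes version with `dnsConfig` support:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-option-test             # hypothetical name
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: single-request-reopen   # glibc resolv.conf option; no effect on musl (Alpine/Busybox)
  containers:
  - name: app
    image: debian:9
    command: ["sleep", "3600"]
EOF
```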
|
Also experiencing this on 1.8.6-gke.0; @vasartori's suggested solution resolved the issue for us too 👍🏻 |
|
Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so. |
|
We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori ! |
|
I've been having this issue for some time on Kubernetes 1.7 and 1.8; DNS queries were being dropped from time to time. |
|
Same problem, but the strangest thing is that it only appears on some nodes. |
|
Requesting your feedback, therefore tagging you. Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02? There are multiple issues reported for this in the Kubernetes project, and it would be great to have it resolved for everyone. |
|
Tried with several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, Flannel, Canal and Weave as network providers, with the lowest incidence on Flannel. I've also tried overloading the nodes and splitting the nodes (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is way higher than on a brand-new cluster, and I can't find a way to isolate the problem :( |
|
We at Pinterest are using kernel 5.0 and the default iptables setup, but are still hitting this issue pretty badly. Here is a pcap trace that clearly shows UDP packets not getting forwarded out to DNS, with the client side hitting the 5 s timeout / DNS-level retries. 10.3.253.87 and 10.3.212.90 are user pods and 10.3.23.54.domain is the DNS pod:

```
07:25:14.196136 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:14.376267 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:19.196210 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:19.376469 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:24.196365 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:24.383758 IP 10.3.212.90.45350 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 39
07:25:26.795923 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345 > 10.3.23.54.domain: 8166+ [1au] A? kubernetes.default. (47)
07:25:26.797035 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345: 8166 NXDomain 0/0/1 (47)
07:25:29.203369 IP 10.3.253.87.57701 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 60
07:25:29.203408 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701 > 10.3.23.54.domain: 52793+ [1au] A? mavenrepo-external.pinadmin.com. (60)
07:25:29.204446 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701: 52793* 10/0/1 CNAME pinrepo-external.pinadmin.com., CNAME internal-vpc-pinrepo-pinadmin-internal-1188100222.us-east-1.elb.amazonaws.com., A 10.1.228.192, A 10.1.225.57, A 10.1.225.205, A 10.1.229.86, A 10.1.227.245, A 10.1.224.120, A 10.1.228.228, A 10.1.229.49 (998)
```
I did a `conntrack -S`:

```
cpu=0 found=175 invalid=0 ignore=83983 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=201
cpu=1 found=168 invalid=0 ignore=79659 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=299
cpu=2 found=173 invalid=0 ignore=77880 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4184
cpu=3 found=161 invalid=0 ignore=78778 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=216
cpu=4 found=157 invalid=0 ignore=80478 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4245
cpu=5 found=172 invalid=10 ignore=85572 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=346
cpu=6 found=146 invalid=0 ignore=85334 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4271
cpu=7 found=162 invalid=0 ignore=84865 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=230
cpu=8 found=155 invalid=0 ignore=81691 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=259
cpu=9 found=164 invalid=1 ignore=81550 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=10 found=180 invalid=0 ignore=92864 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=270
cpu=11 found=163 invalid=0 ignore=93113 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=238
cpu=12 found=171 invalid=0 ignore=80868 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1934
cpu=13 found=176 invalid=0 ignore=80974 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=2532
cpu=14 found=174 invalid=0 ignore=91001 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=927
cpu=15 found=175 invalid=0 ignore=79837 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=585
cpu=16 found=168 invalid=0 ignore=84899 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=375
cpu=17 found=172 invalid=0 ignore=84396 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=328
cpu=18 found=142 invalid=0 ignore=80365 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1012
cpu=19 found=163 invalid=0 ignore=80193 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=5308
cpu=20 found=179 invalid=0 ignore=84980 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=565
cpu=21 found=200 invalid=0 ignore=80537 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=278
cpu=22 found=153 invalid=0 ignore=83528 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=430
cpu=23 found=166 invalid=0 ignore=84160 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=24 found=189 invalid=0 ignore=81400 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=355
cpu=25 found=183 invalid=0 ignore=82727 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=352
cpu=26 found=170 invalid=0 ignore=89293 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=275
cpu=27 found=183 invalid=1 ignore=82717 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=332
cpu=28 found=188 invalid=0 ignore=83741 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=242
cpu=29 found=192 invalid=0 ignore=88601 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=396
cpu=30 found=166 invalid=0 ignore=84152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=329
cpu=31 found=165 invalid=0 ignore=81369 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=229
cpu=32 found=170 invalid=0 ignore=84275 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4208
cpu=33 found=160 invalid=0 ignore=86734 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4122
cpu=34 found=173 invalid=0 ignore=82152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4231
cpu=35 found=150 invalid=0 ignore=78019 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4130
```

There are a few more action items we are trying ATM:
I will keep you posted, but if anyone has already tried any of the 3 options and has a failure/success story to share, it would be much appreciated. |
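Regarding the `conntrack -S` output above: the counter most often implicated in this class of problem is `insert_failed` (the conntrack DNAT race described in the linked xing.com article). A quick way to check it across CPUs, as a hedged sketch:

```sh
# A growing insert_failed count is the usual signature of the conntrack race;
# in the capture above every CPU reports insert_failed=0, which points away
# from that particular race on this node.
conntrack -S | grep -o 'insert_failed=[0-9]*' | sort | uniq -c
```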
|
@zhan849 why not use a DaemonSet and bypass conntrack? |
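For context, "bypassing conntrack" for DNS usually means NOTRACK rules in the raw table pointed at a node-local resolver address, roughly as sketched below (169.254.20.10 is the link-local IP NodeLocal DNSCache uses by default). This only works when pods talk to a local listener directly, because kube-proxy's Service DNAT itself relies on conntrack:

```sh
# Illustrative only: skip connection tracking for DNS traffic to a node-local resolver.
iptables -t raw -A PREROUTING -d 169.254.20.10 -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -d 169.254.20.10 -p udp --dport 53 -j NOTRACK
```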
|
@szuecs please note that there is upstream support for that, stable in 1.18 (I haven't tried it myself): https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/. I guess this does not solve the root cause of the ndots problem, although it is probably amortized; it will just solve the conntrack races, which are the issue causing the intermittent delays. |
|
@szuecs good question! Proxying DNS is in general an interesting direction worth poking at. However, we run Kubernetes in a highly customized way and at fairly large scale, so it would be hard to plug and play most open-source solutions. Also, as @rata said, it does not solve the root cause. |
NodeLocal DNSCache uses TCP for all upstream DNS queries (alleviation 3 mentioned in your previous comment), in addition to skipping connection tracking for client-pod-to-node-local DNS requests. It can be configured so that client pods continue to use the same DNS server IP, so the only change would be to deploy the DaemonSet. There are at least a couple of comments in this issue about clusters seeing significant improvement in DNS reliability and performance after deploying NodeLocal DNSCache. Hope that feedback helps. |
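For anyone who wants to try it, the deployment is roughly the following, paraphrased from the linked nodelocaldns task page; the manifest URL and the `__PILLAR__*` placeholder names should be double-checked against the current docs, and the exact substitutions differ between iptables and IPVS kube-proxy modes:

```sh
# Rough sketch of the documented install steps; verify against the linked docs.
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
domain=cluster.local
localdns=169.254.20.10

curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml
```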
|
@zhan849 Alternatively, you could use Cilium's kube-proxy implementation, which does not suffer from the conntrack races (it does not use netfilter/iptables). |
|
@zhan849 I just linked our configuration. We also run Kubernetes at quite a large scale... |
|
I have the same problem when I scale my CoreDNS deployment to 2 replicas. |
|
Hi guys, I solved this problem. It can happen when your etcd container and CoreDNS container are not on the same node, so you may want to check this:

```
[root@iZ2zed8sfcdxbw95lbf2omZ ~]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.11
nslookup: can't resolve 'kubernetes'
real	1m0.268s

[root@iZ2zed8sfcdxbw95lbf2omZ src]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.10
Name: kubernetes
real	0m0.243s
```
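More generally, you can list the individual DNS endpoint IPs and query each one directly from a test pod to see which replica (on which node) is slow or unreachable. A small sketch, assuming a running busybox pod and with `<endpoint-ip>` as a placeholder:

```sh
# List the kube-dns/CoreDNS endpoint IPs behind the cluster DNS Service.
kubectl -n kube-system get endpoints kube-dns -o jsonpath='{.subsets[*].addresses[*].ip}'; echo

# Query one endpoint directly, bypassing the Service VIP, to isolate a bad replica.
kubectl exec -it busybox -- nslookup kubernetes.default.svc.cluster.local <endpoint-ip>
```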
|
We faced the same issue on a small self-managed cluster. Cluster info: |
|
Similar to this issue. My solution is: … PS: I suspect that the cache-related part has a bug. |
|
Alpine 3.18, with the included musl 1.2.4, seems to have finally fixed this issue: https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
DNS lookup is sometimes taking 5 seconds.
What you expected to happen:
No delays in DNS.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use `kubectl version`):
Kernel (e.g. `uname -a`):
Similar issues:
/sig network