
DNS intermittent delays of 5s #56903

Closed
mikksoone opened this issue Dec 6, 2017 · 252 comments

@mikksoone commented Dec 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
DNS lookup is sometimes taking 5 seconds.

What you expected to happen:
No delays in DNS.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster in AWS using kops with CNI networking:
kops create cluster --node-count 3 --zones eu-west-1a,eu-west-1b,eu-west-1c --master-zones eu-west-1a,eu-west-1b,eu-west-1c --dns-zone kube.example.com --node-size t2.medium --master-size t2.medium --topology private --networking cni --cloud-labels "Env=Staging" ${NAME}
  2. Install the CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
  3. Run this script in any pod that has curl:
var=1
while true ; do
  res=$( { curl -o /dev/null -s -w %{time_namelookup}\\n  http://www.google.com; } 2>&1 )
  var=$((var+1))
  if [[ $res =~ ^[1-9] ]]; then
    now=$(date +"%T")
    echo "$var slow: $res $now"
    break
  fi
done
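
For reference, the resolver configuration that drives the search-path behaviour discussed later in this thread can be dumped from the same pod. The pod name and exact search suffixes below are illustrative; the ndots:5 default is what matters for pods using the default ClusterFirst DNS policy:

# hypothetical pod name; output shown is typical for a kops/AWS cluster
kubectl exec -it curl-test -- cat /etc/resolv.conf
# nameserver 100.64.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
# options ndots:5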

Anything else we need to know?:

  1. I am encountering this issue in both staging and production clusters, but for some reason the staging cluster has a lot more 5s delays.
  2. Delays happen for both external names (google.com) and internal ones, such as service.namespace.
  3. Happens on both the 1.6 and 1.7 versions of Kubernetes, but I did not encounter these issues in 1.5 (though the setup was a bit different: no CNI back then).
  4. Have not tested 1.7 without CNI yet.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 16.04.3 LTS"
  • Kernel (e.g. uname -a):
Linux ingress-nginx-3882489562-438sm 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Similar issues

  1. kubernetes/dns#96 - closed but seems to be exactly the same
  2. #45976 - has some comments matching this issue, but it is heading in the direction of fixing the kube-dns up/down scaling problem and is not about the intermittent failures.

/sig network

@kgignatyev-inspur commented Dec 10, 2017

I have a similar issue: consistently slow DNS resolution from pods, 20 seconds plus.
from busybox:
time nslookup google.com
Server: 100.64.0.10
Address 1: 100.64.0.10

Name: google.com
Address 1: 2607:f8b0:400a:806::200e
Address 2: 172.217.3.206 sea15s12-in-f14.1e100.net
real 0m 50.03s
user 0m 0.00s
sys 0m 0.00s
/ #

I just created a 1.8.5 cluster in AWS with kops, and the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2).

Resolution from hosts is instantaneous; from pods it is consistently slow.

@ani82 commented Dec 23, 2017

We observe the same on GKE with version v1.8.4-gke.0 and both Busybox (latest) and Debian 9:

$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com
Server: 10.39.240.10
Address 1: 10.39.240.10 kube-dns.kube-system.svc.cluster.local

Name: storage.googleapis.com
Address 1: 2607:f8b0:400c:c06::80 vl-in-x80.1e100.net
Address 2: 74.125.141.128 vl-in-f128.1e100.net
real 0m 10.02s
user 0m 0.00s
sys 0m 0.00s

DNS latency varies between 10 and 40s in multiples of 5s.

@thockin (Member) commented Jan 6, 2018

5s is pretty much ALWAYS an indication of a DNS timeout, meaning some packet got dropped somewhere.

@thockin added the area/dns label Jan 6, 2018

@ani82 commented Jan 9, 2018

Yes, it seems as if the local DNS servers time out instead of answering:

[root@busybox /]# nslookup google.com
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

[root@busybox /]# tcpdump port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:38:10.423547 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:10.424120 IP busybox.46757 > kube-dns.kube-system.svc.cluster.local.domain: 41018+ PTR? 10.240.39.10.in-addr.arpa. (43)
15:38:10.424595 IP kube-dns.kube-system.svc.cluster.local.domain > busybox.46757: 41018 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (95)
15:38:15.423611 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:20.423809 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:25.424247 IP busybox.44496 > kube-dns.kube-system.svc.cluster.local.domain: 63451+ A? google.com.svc.cluster.local. (46)
15:38:30.424508 IP busybox.39936 > kube-dns.kube-system.svc.cluster.local.domain: 14687+ A? google.com.cluster.local. (42)
15:38:35.424767 IP busybox.56675 > kube-dns.kube-system.svc.cluster.local.domain: 37241+ A? google.com.c.retailcatalyst-187519.internal. (61)
15:38:40.424992 IP busybox.35842 > kube-dns.kube-system.svc.cluster.local.domain: 22668+ A? google.com.google.internal. (44)
15:38:45.425295 IP busybox.52037 > kube-dns.kube-system.svc.cluster.local.domain: 6207+ A? google.com. (28)

@aguerra commented Jan 19, 2018

Just in case someone got here because of DNS delays: in our case it was ARP table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
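
For anyone checking the same thing, the neighbour (ARP) table usage and limits can be inspected and raised on the nodes with sysctl; the threshold values below are illustrative, not a recommendation:

# count current ARP entries and show the garbage-collection thresholds
arp -n | wc -l
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3

# raise the limits (example values; persist in /etc/sysctl.d/ to survive reboots)
sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192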

@lbrictson commented Jan 19, 2018

We have the same issue in all of our kops-deployed AWS clusters (5). We tried moving from Weave to Flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently.

Our ARP tables are nowhere near full (usually fewer than 100 entries).

@bowei (Member) commented Jan 19, 2018

There are QPS limits on DNS in various places. I think in the past people have hit the AWS DNS server QPS limits in some cases; that may be worth checking.

@lbrictson commented Jan 19, 2018

@bowei sadly this happens in very small clusters as well for us, ones with so few containers that there is no feasible way we'd be hitting the QPS limit from AWS.

@mikksoone (Author) commented Jan 19, 2018

Same here: small clusters, no ARP or QPS limits.
dnsPolicy: Default works without delays, but unfortunately it cannot be used for all deployments.
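
For readers unfamiliar with the setting, this is a minimal pod spec sketch of what dnsPolicy: Default does (name and image are placeholders); the pod inherits the node's resolv.conf and bypasses kube-dns, which is why cluster-internal names stop resolving:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # placeholder
spec:
  dnsPolicy: Default           # use the node's resolv.conf instead of the cluster DNS service
  containers:
  - name: app
    image: example/app:latest  # placeholder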

@lbrictson commented Jan 19, 2018

@mikksoone exact same situation as us then: dnsPolicy: Default fixes the problem entirely, but of course it breaks access to services internal to the cluster, which is a no-go for most of ours.

@vasartori (Contributor) commented Jan 19, 2018

@bowei We have the same problem here.
But we are not using AWS.

@vasartori (Contributor) commented Jan 22, 2018

It seems to be a problem with glibc.
If you set a timeout in your /etc/resolv.conf, that timeout is respected.

The problem appears on CoreOS Stable (glibc 2.23).

Even with timeout set to 0 in resolv.conf, you still get a 1 second delay.

I've tried disabling IPv6, without success.

@vasartori (Contributor) commented Jan 23, 2018

In my tests, using this option in /etc/resolv.conf
options single-request-reopen

fixed the problem.
But I have not found a "clean" way to set it on pods in Kubernetes 1.8.
What I do:

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c 
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"

@mikksoone Could you check whether it solves your problem too?
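
A note for later readers: on clusters new enough to support the pod-level dnsConfig field, the same resolver option can be injected declaratively instead of via a postStart hook. A minimal sketch (only the relevant part of the pod spec is shown; single-request-reopen is a glibc option, so musl-based images such as Alpine ignore it):

spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: single-request-reopen   # merged into the pod's /etc/resolv.conf options line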

@aca02djr commented Jan 24, 2018

Also experiencing this on 1.8.6-gke.0. @vasartori's suggested workaround resolved the issue for us too 👍🏻

@mikksoone (Author) commented Jan 26, 2018

Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so.

@lauri-elevant commented Feb 5, 2018

We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori !

@sdtokkolabs commented Apr 3, 2018

I've been having this issue for some time on Kubernetes 1.7 and 1.8: DNS queries were being dropped from time to time.
Yesterday I upgraded my cluster from 1.8.10 to 1.9.6 (kops from 1.8 to 1.9.0-alpha.3) and started having this same issue ALL THE TIME. The workaround suggested in this issue has no effect and I can't find any way of stopping it. As a small workaround I have pinned the most requested (and problematic) names to fixed IPs in /etc/hosts.
Any idea where the real problem is?
I'll test with a brand new cluster on the same versions and report back.

@xiaoxubeii (Member) commented Apr 12, 2018

Same problem, but the strangest thing is that it only appears on some nodes.

@rajatjindal (Contributor) commented Apr 14, 2018

@thockin @bowei

Requesting your feedback, therefore tagging you.

This might be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

There are multiple issues reported about this in the Kubernetes project, and it would be great to have it resolved for everyone.

@sdtokkolabs commented Apr 17, 2018

Tried several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, Flannel, Canal and Weave as network providers, with the lowest incidence on Flannel. I've also tried overloading the nodes and splitting the nodes (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is way higher than on a brand new cluster and I can't find a way to isolate the problem :(

@axot commented Nov 6, 2019

@jaygorrell There are several motivations for upgrading to TCP connections; you can read about them at https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0030-nodelocal-dns-cache.md#Motivation

You could also try a custom build of node-local DNS from https://github.com/colopl/k8s-local-dns

@jaygorrell commented Nov 6, 2019

Yeah, I meant that I expected the TCP upgrade to mitigate the other two race conditions, so that nodelocaldns would resolve all 3 races on its own.

The more I debug, the more I think it's just TTL/eviction problems with the cache, so thanks for the modified version... I'll take a look at that.

@prameshj (Contributor) commented Nov 6, 2019

I have NodeLocalDNS configured and working in my cluster - I can dig to verify that requests after the first return in < 1ms until the cache expires. However, I still see the conntrack errors increasing. I understand the other two races were kernel problems but doesn't upgrading coredns requests to tcp address that, so only NodeLocalDNS is needed?

the conntrack errors could be increasing because of DNAT in a different service too.

@jaygorrell commented Nov 6, 2019

the conntrack errors could be increasing because of DNAT in a different service too.

That crossed my mind, but nothing non-standard runs outside of kubernetes on these hosts.

Can anyone at least confirm that after having conntrack -S errors before this change, they no longer have errors?

@jaygorrell commented Nov 6, 2019

I'm still unsure about the conntrack error count, but I think we've figured out the reason we're not seeing an improvement in DNS latency.

Kubernetes defaults to ndots: 5. This essentially guarantees you'll always check search domains first before trying outright. We often refer to services with <service>.default so even if there's a local cache for the service, it's going to try <service>.default.default.svc.cluster.local first since the first search domain is default.svc.cluster.local. For external domains, it's even worse because it's going to try 4 search domains before trying the outright domain.

The cache TTL in this configuration is much lower for failures, so each of those failed search-domain lookups reaches out to the main coredns pods before hitting the good cached value.

We're going to try ndots: 1 because that means we can use <service> or <service>.default.svc.cluster.local or public domains and they will all resolve on the first try. We just have to migrate away from using <service>.default.
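
For anyone who wants to try the same thing per pod rather than by rebuilding images, a sketch using the pod-level dnsConfig field (verify it is available on your cluster version):

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"   # names containing a dot are tried as-is before the search list is applied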

My questions:

  1. The conntrack errors should still be 0, right? Even with all these extra requests to coredns, they're still upgraded to tcp and I believe that avoids all the race conditions. Am I mistaken?
  2. This all makes sense, right? Why does ndots default to such a high value, making it so that any external requests are guaranteed to miss on DNS cache hits, even on the main coredns pods.
  3. Is there any information on how coredns cache evictions work? It seems like items aren't really caching for 30s like they should here. Do success and failures share the same shards?

@chrisohaver (Contributor) commented Nov 6, 2019

Why does ndots default to such a high value

Here's an explanation: #14051 (comment)

Is there any information on how coredns cache evictions work?

Yes, in the CoreDNS cache plugin readme. (it's random)

It seems like items aren't really caching for 30s like they should here.

Items cache for their TTL, with a cap at 30s. In other words, if an item's TTL is 5s, it caches for 5s, not 30s.

Do success and failures share the same shards?

no - success and denial caches are separate caches.
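
For reference, these knobs live in the cache stanza of the Corefile; a minimal sketch with illustrative capacities and TTL caps (check your own Corefile for the actual values):

cache 30 {
    success 9984 30   # capacity and max TTL for positive answers
    denial  9984 5    # separate cache: capacity and max TTL for NXDOMAIN/NODATA answers
}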

@jaygorrell commented Nov 6, 2019

Why does ndots default to such a high value

Here's an explanation: #14051 (comment)

Not sure I follow this response, or maybe it's just stale. It claims:

Without ndots >= 2, DNS lookups like 'mysvc.myns' can't work (e.g. kubernetes.default)
Without ndots >= 3, DNS lookups like 'mysvc.myns.svc' can't work (important once pod DNS lands).
But with svc.cluster.local and cluster.local both in the search domains, respectively, those should resolve fine. And that's just talking about the order -- doesn't ndots only determine whether the name is tried outright before the search domains or not?

In his first example if ndots were 1, it would just try kubernetes.default first, followed by the search domains which would find a hit... not much different than the problems we have with ndots >= 2 anyway since the first search domain fails (duplicate .default.default).

Is there any information on how coredns cache evictions work?

Yes, in the CoreDNS cache plugin readme. (it's random)

Sorry, I meant more about what triggers evictions. Is it only when the cache is full? I'm trying to figure out why things aren't caching for 30s with the default settings. It seems to be either evictions or something I'm not expecting with the minimum cache... dig shows the record has a 300s TTL, which should mean it's cached for 30s given the max TTL set in nodelocaldns.

Do success and failures share the same shards?

no - success and denial caches are separate caches.

Thanks, this helps with troubleshooting.

I did come across this post from @thockin too:
#33554 (comment)

But that doesn't match my experience and what has been rather well documented as a fix for similar latency issues.

@prameshj (Contributor) commented Nov 6, 2019

Sorry, I meant more about what triggers evictions. Is it only when the cache is full? I'm trying to figure out why things aren't caching for 30s with the default settings. It seems to be either evictions or something I'm not expecting with the minimum cache... dig shows the record has a 300s TTL, which should mean it's cached for 30s given the max TTL set in nodelocaldns.

Cache TTL for NXDOMAIN responses is 5s in nodelocaldns in the default yaml. For success records, it is 30s or the record TTL, whichever is smaller. If you are using CoreDNS upstream, I think the kubernetes plugin creates service records with a 5s TTL. Not sure why dig would show 300s. Can you run dig directly against the server ("dig @<server ip>") and check the TTL?

You can use the prefetch plugin in CoreDNS in order to help with the latency increase upon cache miss. What is your latency like during cache hit and miss? Are you seeing an increasing trend every 5s?
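
For reference, prefetch is an option of the same cache stanza; something like the following (numbers are illustrative) refreshes popular entries shortly before they expire, so a cache miss is not taken on the hot path:

cache 30 {
    prefetch 10 60s 15%   # names asked for 10+ times per minute are refetched when ~15% of their TTL remains
}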

Regarding the race conditions... if all the searchpath-expanded queries show up in nodelocaldns and all of them are cache misses, nodelocaldns will reach out to the CoreDNS upstream via the service IP. There is a DNAT step happening there, but that shouldn't cause parallel requests with the same TCP source port. CoreDNS reuses existing TCP connections, but the requests should go out serially. Can you try with prefetch and see if the conntrack errors reduce? In my local tests, I was unable to see a large number of insert_failed errors in "conntrack -S" before nodelocaldns as well as after. I just checked on a test cluster node that runs 5 pods doing 1k QPS each; insert_failed is 0.

@jaygorrell commented Nov 6, 2019

Ah, I think we had some test settings still in place in NodeLocalDNS that threw off the numbers. Reverting back to the official settings does show 5s as the TTL on these records. That helps a bit with regard to timing.

A bit more context that I never gave to start with:

We were mostly in the 20-100ms range before adding NodeLocalDNS, but had a steady increase in insert_failed count as well as services logging EAI_AGAIN errors when trying to talk to other systems. For context, this is an environment with ~200 services that primarily communicate over HTTP.

After adding NodeLocalDNS, it actually doesn't seem like there's much impact on our DNS latency overall, but it's starting to look like that's due to the number of queries that get forwarded to coredns because of ndots and search domains... the saving on the cache hit is just one of 2 to 5 queries. It is worth noting, though, that our DNS metrics from APM are showing a noticeable uptick in latency since adding NodeLocalDNS, but it appears to be only on external systems. I'm a little surprised by that since they've always gone through the same search domains and some of them at least should hit the negative cache.

You can use the prefetch plugin in CoreDNS in order to help with the latency increase upon cache miss. What is your latency like during cache hit and miss? Are you seeing an increasing trend every 5s?

Yeah, exactly. It's usually 1.4ms on miss and 0.1ms on hit. That's just in a simple test from a pod... real services are going through the search domain chain that adds more latency. I'll try to mess with prefetch a bit more and see if that helps... I think a minTTL can help as well but I'd have to add it for the misses too.

So basically we're not seeing any results at all from NodeLocalDNS but the lack of latency reduction is mostly explained and we should be able to improve it with cache settings. The other part is why insert_failed is still incrementing and why services are still reporting DNS failures at a similar frequency (ie. EAI_AGAIN).

Requests to coredns are over tcp so that should be good. NodeLocalDNS is still hit over udp of course, but there's no conntrack in play, right?

Edit: One more question... is there any way to see the current shard sizes to see if/when evictions are happening? I need better visibility there.

@gurumaia commented Nov 29, 2019

We are facing the same issue. Applying the single-request-reopen parameter to our pods' resolv.conf "fixes" the issue, but there is one other piece of information I'd like to add.

We noticed that if we change the DNS address in one of our pods' resolv.conf to one of our coredns pods' addresses, everything works fine, no timeouts. But when we go through the default configuration, which is the coredns service's address, we get the intermittent 5 second delay.

Since the single-request-reopen parameter controls whether one socket is reused for more than one DNS request, it might be that the k8s service implementation somehow gets confused by receiving more than one request through the same socket.
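
A quick way to reproduce that comparison with dig; the two server IPs are placeholders for the kube-dns/CoreDNS ClusterIP and an individual CoreDNS pod IP:

# via the ClusterIP service (DNAT + conntrack in the path)
dig +tries=1 +time=3 www.google.com @10.96.0.10

# directly against one CoreDNS pod (no service DNAT)
dig +tries=1 +time=3 www.google.com @10.244.1.15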

@zhan849 (Contributor) commented Apr 14, 2020

We at Pinterest are using kernel 5.0 and the default iptables setup, but are still hitting this issue pretty badly.

Here is a pcap trace that clearly shows UDP packets not getting forwarded out to DNS while the client side hits 5s timeouts / DNS-level retries. 10.3.253.87 and 10.3.212.90 are user pods and 10.3.23.54.domain is the DNS pod:

07:25:14.196136 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
--
07:25:14.376267 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:19.196210 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:19.376469 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:24.196365 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:24.383758 IP 10.3.212.90.45350 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 39
07:25:26.795923 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345 > 10.3.23.54.domain: 8166+ [1au] A? kubernetes.default. (47)
07:25:26.797035 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345: 8166 NXDomain 0/0/1 (47)
07:25:29.203369 IP 10.3.253.87.57701 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 60
07:25:29.203408 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701 > 10.3.23.54.domain: 52793+ [1au] A? mavenrepo-external.pinadmin.com. (60)
07:25:29.204446 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701: 52793* 10/0/1 CNAME pinrepo-external.pinadmin.com., CNAME internal-vpc-pinrepo-pinadmin-internal-1188100222.us-east-1.elb.amazonaws.com., A 10.1.228.192, A 10.1.225.57, A 10.1.225.205, A 10.1.229.86, A 10.1.227.245, A 10.1.224.120, A 10.1.228.228, A 10.1.229.49 (998)

I did a conntrack -S and there are no insertion failures, which indicates that races 1 and 2 mentioned in this blog are already fixed, and we are hitting race 3.

# conntrack -S
cpu=0   	found=175 invalid=0 ignore=83983 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=201
cpu=1   	found=168 invalid=0 ignore=79659 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=299
cpu=2   	found=173 invalid=0 ignore=77880 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4184
cpu=3   	found=161 invalid=0 ignore=78778 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=216
cpu=4   	found=157 invalid=0 ignore=80478 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4245
cpu=5   	found=172 invalid=10 ignore=85572 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=346
cpu=6   	found=146 invalid=0 ignore=85334 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4271
cpu=7   	found=162 invalid=0 ignore=84865 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=230
cpu=8   	found=155 invalid=0 ignore=81691 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=259
cpu=9   	found=164 invalid=1 ignore=81550 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=10  	found=180 invalid=0 ignore=92864 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=270
cpu=11  	found=163 invalid=0 ignore=93113 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=238
cpu=12  	found=171 invalid=0 ignore=80868 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1934
cpu=13  	found=176 invalid=0 ignore=80974 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=2532
cpu=14  	found=174 invalid=0 ignore=91001 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=927
cpu=15  	found=175 invalid=0 ignore=79837 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=585
cpu=16  	found=168 invalid=0 ignore=84899 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=375
cpu=17  	found=172 invalid=0 ignore=84396 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=328
cpu=18  	found=142 invalid=0 ignore=80365 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1012
cpu=19  	found=163 invalid=0 ignore=80193 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=5308
cpu=20  	found=179 invalid=0 ignore=84980 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=565
cpu=21  	found=200 invalid=0 ignore=80537 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=278
cpu=22  	found=153 invalid=0 ignore=83528 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=430
cpu=23  	found=166 invalid=0 ignore=84160 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=24  	found=189 invalid=0 ignore=81400 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=355
cpu=25  	found=183 invalid=0 ignore=82727 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=352
cpu=26  	found=170 invalid=0 ignore=89293 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=275
cpu=27  	found=183 invalid=1 ignore=82717 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=332
cpu=28  	found=188 invalid=0 ignore=83741 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=242
cpu=29  	found=192 invalid=0 ignore=88601 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=396
cpu=30  	found=166 invalid=0 ignore=84152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=329
cpu=31  	found=165 invalid=0 ignore=81369 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=229
cpu=32  	found=170 invalid=0 ignore=84275 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4208
cpu=33  	found=160 invalid=0 ignore=86734 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4122
cpu=34  	found=173 invalid=0 ignore=82152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4231
cpu=35  	found=150 invalid=0 ignore=78019 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4130

There are a few more action items we are trying ATM:

  1. newer kernel version
  2. use 1 DNS replica (occupying a full node so it has a rather stable runtime) so there is 1 rule in iptables
  3. use TCP for DNS (see the sketch below).

Will keep you guys updated, but if anyone has already tried any of the 3 options and has a failure / success story to share, I'd very much appreciate it.
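
Regarding option 3: for glibc-based images the stub resolver can be forced onto TCP with the use-vc resolver option (musl-based images ignore it). A sketch using the pod-level dnsConfig field, assuming your glibc is recent enough to support the option:

spec:
  dnsConfig:
    options:
    - name: use-vc   # glibc: send DNS queries over TCP instead of UDP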

/cc @thockin @brb

@szuecs (Member) commented Apr 14, 2020

@zhan849 why not use a daemonset and bypass conntrack?
We use dnsmasq in front of coredns in a daemonset pod, and use cloud-init and some systemd units to point resolv.conf via kubelet at the node-local dnsmasq running in hostNetwork.
It works great even when we spike DNS traffic enough to OOM-kill coredns. This happened in a nodejs-heavy cluster.
https://github.com/zalando-incubator/kubernetes-on-aws/tree/dev/cluster/manifests/coredns-local
Everything else is in systemd units I can’t share.

@rata (Member) commented Apr 14, 2020

@szuecs please note that there is upstream support for that, stable in 1.18 (I haven't tried it myself): https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

I guess this does not solve the root cause of the ndots problem, although it is probably amortized. I guess this will just solve the conntrack races, which are the issue causing the intermittent delays.
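
Roughly, deploying it per that page looks like the sketch below; the __PILLAR__ placeholder names come from the upstream nodelocaldns.yaml and may differ by version, so verify against the linked task page:

kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
domain=cluster.local        # your cluster domain
localdns=169.254.20.10      # link-local address the node-local cache will listen on
wget https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml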

@zhan849 (Contributor) commented Apr 14, 2020

@szuecs good question! Proxying DNS in general is an interesting direction worth poking at; however, we run Kubernetes in a highly customized way and at fairly large scale, so it would be hard to plug and play most open-sourced solutions. Also, as @rata said, it does not solve the root cause.
The 3 possible mitigations I posted above fit our current production setup better :)

@prameshj (Contributor) commented Apr 15, 2020

NodeLocal DNSCache uses TCP for all upstream DNS queries (mitigation 3 mentioned in your previous comment), in addition to skipping connection tracking for client-pod-to-node-local-DNS requests. It can be configured so that client pods continue to use the same DNS server IP, so the only change would be deploying the daemonset. There are at least a couple of comments in this issue about clusters seeing a significant improvement in DNS reliability and performance after deploying NodeLocal DNSCache. Hope that feedback helps.
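
The conntrack skip works because the node-local cache installs NOTRACK rules for its listen address; conceptually something like the rules below (illustrative only; the daemonset manages the real rules, which also cover TCP and the reply direction):

iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK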

@brb (Contributor) commented Apr 15, 2020

@zhan849 Alternatively, you could use Cilium's kube-proxy implementation, which does not suffer from the conntrack races (it does not use netfilter/iptables).

@szuecs (Member) commented Apr 15, 2020

@zhan849 I just linked our configuration. We also run Kubernetes at quite a large scale...
Anyway, because of coredns/coredns#2593, you want to run https://github.com/coredns/coredns/releases/tag/v1.6.9 with concurrency limits.

@lx308033262 commented Apr 15, 2020

I have the same problem when I scale my coredns deployment to 2 replicas,
but there is no delay when I have 1 replica.

@lx308033262 commented Apr 16, 2020

Hi guys,

I solved this problem.

This may happen when your etcd container and coredns container are not on the same node.

So you may want to do the following:
1. Check which node contains your etcd and coredns pods:
[root@iZ2zed8sfcdxbw95lbf2omZ dns_test]# kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-6955765f44-qvb6l 1/1 Running 0 18h 10.244.2.115 iz2zed8sfcdxbw95lbf2olz
etcd-iz2zed8sfcdxbw95lbf2omz 1/1 Running 2 27d 10.25.142.127 iz2zed8sfcdxbw95lbf2omz

[root@iZ2zed8sfcdxbw95lbf2omZ ~]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.11
Server: 10.96.0.11
Address 1: 10.96.0.11

nslookup: can't resolve 'kubernetes'
command terminated with exit code 1

real 1m0.268s
user 0m0.119s
sys 0m0.039s
2. Change your coredns container's node:
[root@iZ2zed8sfcdxbw95lbf2omZ dns_test]# kubectl edit deploy coredns -n kube-system
nodeName: iz2zed8sfcdxbw95lbf2omz

[root@iZ2zed8sfcdxbw95lbf2omZ src]# time kubectl exec -it busybox -- nslookup kubernetes 10.96.0.10
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: kubernetes
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

real 0m0.243s
user 0m0.140s
sys 0m0.041s
[root@iZ2zed8sfcdxbw95lbf2omZ src]#
