kube-dns: dnsmasq intermittent connection refused #45976

Open
someword opened this Issue May 17, 2017 · 101 comments

someword commented May 17, 2017

Kubernetes version (use kubectl version):

kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T10:00:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): PRETTY_NAME="Container Linux by CoreOS 1339.0.0 (Ladybug)"
  • Kernel (e.g. uname -a): 4.10.1-coreos
  • Install tools: custom ansible
  • Others: kube-dns related images: gcr.io/google_containers/kubedns-amd64:1.9 and gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1

What happened:
java.net.UnknownHostException: dynamodb.us-east-1.amazonaws.com

What you expected to happen:
Receive a response to the name lookup request.

How to reproduce it (as minimally and precisely as possible):
This is the kicker. We are not able to reproduce this issue on purpose. However we experience this in our production cluster 1 - 500 times a week.

Anything else we need to know:
In the past 2 months or so we have experienced a handful of events where DNS was failing for most/all of our production pods, with each event lasting 5 - 10 minutes. During this time the kube-dns service was healthy, with 3 - 6 available endpoints at all times. We increased our kube-dns pod count to 20 in our 20-node production clusters. This level of provisioning alleviated the DNS issues that were taking down our production services. However, we still experience at least weekly smaller events, lasting anywhere from 1 second to 30 seconds, which affect a small subset of pods. During these events 1 - 5 pods on different nodes across the cluster experience a burst of DNS failures with a much smaller end-user impact.

We enabled query logging in dnsmasq because we were not sure whether the queries made it from the client pod to one of the kube-dns pods. What was interesting is that during the DNS events where query logging was enabled, none of the name lookup requests that resulted in an exception were received by dnsmasq. At this point my colleague noticed these errors coming from dnsmasq-metrics:

ERROR: logging before flag.Parse: W0517 03:19:50.139060 1 server.go:53] Error getting metrics from dnsmasq: read udp 127.0.0.1:36181->127.0.0.1:53: i/o timeout

As near as I can tell, that error is basically a name resolution error from dnsmasq-metrics as it tries to query the dnsmasq container in the same pod for dnsmasq's internal metrics, similar to running dig +short chaos txt cachesize.bind.

All of our DNS events happen at the exact same time that one or more dnsmasq-metrics containers are throwing those errors. We thought we might be exceeding dnsmasq's default limit of 150 concurrent queries, but we do not see any logs indicating that. If we were, we would expect to see this log message:

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

Based on conversations with other cluster operators and users in Slack, I know that other users are experiencing these same problems. I'm hoping that this issue can be used to centralize our efforts and determine whether dnsmasq refusing connections is the problem or a symptom of something else.

ravilr (Contributor) commented May 17, 2017

cc @bowei

We are also seeing this intermittently in our clusters, specifically from Java-based containers. The lookups that fail are for non-cluster domains.

@someword what version of the kube-dns manifest are you running? Specifically, does it include the #41212 change? We are running an older version without that change. Just wondering if that change helps.

someword commented May 17, 2017

@ravilr
We are mostly seeing it from Java apps, but that's because they are the best at logging what happens. It does mostly seem to affect external names as well, though 90% of the DNS lookups done by our Java apps are for off-cluster names. Our Node.js apps see it as well, but they are not instrumented very well so we don't know the exposure.

So we have been running an internal experimental version of the manifest while trying to unravel all of this. The dnsmasq flags that we are currently running are below; I think that's what you were asking for, right? The combination of the below flags AND over-provisioning has helped, but obviously we are still having issues.

- args:
    - --cache-size=1000
    - --no-resolv
    - --server=/cluster.local/ec2.internal/127.0.0.1#10053
    - --server=169.254.169.253
    - --server=8.8.8.8
    - --log-facility=-
    - --log-async
    - --address=/com.cluster.local/com.svc.cluster.local/com.kube-system.svc.cluster.local/OUR_DOMAIN.com.cluster.local/OUR_DOMAIN.com.svc.cluster.local/OUR_DOMAIN.com.kube-system.svc.cluster.local/com.ec2.internal/ec2.internal.kube-system.svc.cluster.local/ec2.internal.svc.cluster.local/ec2.internal.cluster.local/

ravilr (Contributor) commented May 17, 2017

OK, good to know. I was just trying to see if there are any patterns.

Another thing that is definitely related in our case is kube-proxy being up and available at all times. During Kubernetes version upgrades on the nodes (in-place upgrades without a drain), we've observed this happening when kube-proxy gets restarted.

This reminds me of #32749, which helps remove the dependency on some of the kube components for pod DNS resolution of non-cluster-local queries.

cmluciano (Member) commented May 18, 2017

It's interesting that you both noted this problem in Java apps. Can you say whether you are using a specific TTL setting within your library or the JVM?

bowei (Member) commented May 19, 2017

dnsmasq has a limit on the number of concurrent forwarded queries (-0, --dns-forward-max=<queries>). I'm wondering if you may be hitting this limit (and whether we should increase it).

someword commented May 19, 2017

@bowei - We wondered about that as well. We do not have any log messages from dnsmasq like this:
dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)

We tested this on a kube-dns pod by setting the max to a very low number and verified that log messages do get written when we hit the max. We have been tempted to set it to 300 as a test, but from what I've seen dnsmasq would log if this were the reason.
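
If we do run that test, the flag would simply be added to the dnsmasq container args alongside the flags posted above; a minimal sketch (the 300 here is only the test value mentioned above, not a recommendation):

- args:
    - --cache-size=1000
    - --dns-forward-max=300   # raise the default limit of 150 concurrent forwarded queries
    - --log-facility=-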

someword commented May 19, 2017

@cmluciano - We do not pass either of these settings to the JVM: networkaddress.cache.ttl or networkaddress.cache.negative.ttl. I am going to investigate what networkaddress.cache.ttl is set to and see if maybe the Java-based apps are not doing any caching. However, the fact that dnsmasq-metrics, running in the same pod as dnsmasq, gets connection refused when trying to do a DNS lookup against dnsmasq makes me think the issue is in the kube-dns pod itself - whether that is dnsmasq being locked up, or some resource shortfall (ephemeral ports, file descriptors, etc.) causing the attempted UDP connection from dnsmasq-metrics to the dnsmasq container to fail at a layer lower than dnsmasq.

someword commented May 19, 2017

@cmluciano we use OpenJDK, and the default for networkaddress.cache.ttl is 30 seconds according to https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/sun/net/InetAddressCachePolicy.java#L48. I verified this by capturing traffic from a Java app that does a DNS lookup for kinesis.us-east-1.amazonaws.com in a loop: requests hit the wire about every 30 seconds even though the loop runs at 10-second intervals. Increasing this to 60 seconds may lighten the load on the name servers, but dnsmasq would still be refusing queries occasionally.

bowei (Member) commented May 19, 2017

@someword do you know what DNS QPS is hitting dnsmasq? (This can be obtained by measuring the delta of the hits/misses counters from the DNS pod's http://127.0.0.1/metrics endpoint.)

someword commented May 19, 2017

@bowei When I look at the available metrics I don't see a cache hit metric. I have a cachemiss counter
skydns_skydns_dns_cachemiss_count_total{cache="response"} 4.565907e+06

And request totals
skydns_skydns_dns_request_count_total{system="recursive"} 98618
skydns_skydns_dns_request_count_total{system="reverse"} 40638

What's strange is that for this particular pod I have 4 million cache misses but only 98618 requests. I would assume that cache misses have to be a smaller number than total requests. We are just now in the process of getting these metrics into Datadog for visualization across our cluster, and something doesn't seem to be accurate. In this screenshot we are looking at all of our kube-dns pods in a specific production cluster, with the request counter converted to a QPS datapoint for each pod. It shows on average 1/2 to 1 QPS, which seems low.
screenshot 2017-05-19 12 57 47

This screenshot shows the cache misses broken down by second (misses per second). We are averaging 1.3K misses per second but only 13 QPS.
screenshot 2017-05-19 13 04 28

The query count seems low and the miss count seems very high. Does the above make any sense, or am I missing something? I'll capture some traffic and see what sort of DNS query rate I observe.

someword commented May 19, 2017

I looked at a 30 second snapshot of traffic going to a single dnsmasq container and my numbers don't line up with the dnsmasq-metrics QPS.

sysdig -w output.scap -M 30
sysdig -r output.scap "proc.name=dnsmasq and fd.sport=53 and evt.type=recvmsg and evt.dir=>" |wc -l
5120

I did this a few times on a couple of nodes, and the count was in the high 4K to low 5K range each time, so the above seems like a decent representation. Also, we are at a low point in our usage, so I would expect those numbers to be even 20% higher during peak load.

bowei (Member) commented May 19, 2017

The dnsmasq metrics are available in dnsmasq_cache_hits, dnsmasq_cache_misses from the prometheus metrics.

Given ~4k - 5k QPS at dnsmasq and given the CPU request that dnsmasq has, you may need another replica to handle the load. Or I would try increasing the CPU request for dnsmasq.

someword commented May 19, 2017

@bowei - I was hitting port 10055 (skydns) for metrics, and it looks like I want the dnsmasq metrics on port 10054 instead. Doh!

core@ip-10-40-6-21 ~ $ curl -s 172.20.102.3:10054/metrics |grep -E '(misses|hits)' |grep -v ^#
dnsmasq_cache_hits 8.012014e+06
dnsmasq_cache_misses 5.355583e+06

So with 8,012,014 cache hits to 5,355,583 misses, this looks like a pretty healthy caching name server.

We do have 16 kube-dns replicas and have tried running with 30 but still experience dnsmasq refusing connections. When I look at cpu stats for dnsmasq I don't see anything that makes me think it's underpowered. With a 100m cpu limit we have 0 docker.cpu.throttled with a max of 20m for kubernetes.cpu.usage.total.

bowei (Member) commented May 19, 2017

Interesting -- can you check the conntrack (http://conntrack-tools.netfilter.org/conntrack.html) tables on your node (not in the pod)? The way a lot of resolver libraries work is that they bind an ephemeral port to send each request, and each request results in a conntrack entry. If you exceed the conntrack limits, you will start getting dropped packets.

someword commented May 19, 2017

From kube-proxy logs I see these values logged

nf_conntrack version 0.5.0 (65536 buckets, 262144 max)

core@ip-10-40-6-21 ~ $ docker run --net=host --privileged --rm claesjonsson/conntrack -L |wc -l
conntrack v1.4.2 (conntrack-tools): 33461 flow entries have been shown.
33461

It looks like at this point in time my count is under the max, which I gather is either 65536 or 262144. Would conntrack come into play when the dnsmasq-metrics container performs a DNS lookup against the dnsmasq container in the same pod? I was thinking that since the traffic traverses the pod's localhost network it would not go through conntrack, but I've not found specific details covering the traffic path between containers in the same pod over the localhost interface.

I'll check whether we are tracking the size of the conntrack tables, and whether hitting the maximum would be logged, and see if we have anything in our log aggregation system.

bowei (Member) commented May 25, 2017

/assign

someword commented Jun 13, 2017

@bowei - I'm curious if you have any thoughts on this. We have instrumented a variety of additional metrics to refer to when this issue comes up again. In doing so, I'm noticing that a busy UDP-based app running in a pod (not kube-dns) is seeing UDP rcv_buf_errors. For the purpose of gathering supporting data to help determine the cause of the DNS resolution errors, should I only be concerned about UDP packet loss at the physical host level? As I write this, it makes me think I should add a sidecar to the kube-dns pod to gather network stats specific to the kube-dns pod's network namespace.

Sorry if this is getting off topic.

bowei (Member) commented Jun 13, 2017

@someword -- can you open a new issue re: UDP? That sounds like a problem that should be investigated by itself...

someword commented Jun 13, 2017

@bowei - would the issue just be my question about whether tracking UDP metrics within a pod's network namespace is important vs. tracking at the physical machine level?

Also does this current issue provide any benefit or should it be closed?
Thanks.

bowei (Member) commented Jun 13, 2017

It sounds like two issues to me:

  • UDP rcv_buf_errors and packet loss
  • A generic sidecar to gather network stats

The second one may already be filed somewhere. Keep this one open for now.

evanj commented Oct 5, 2017

I've run into this issue, and I believe the root cause in my case is saturating the nf_conntrack limits on the kube-dns node. I have a script and configuration that can reproduce this issue on GKE 1.7.6 if it is helpful. The workaround was to set dnsPolicy: Default on the applications that were doing lots of outbound connections. This improved performance significantly.
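
For anyone else looking at this workaround, a minimal sketch of what it looks like in a pod spec (the pod and image names below are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: outbound-heavy-app        # placeholder name
spec:
  dnsPolicy: Default              # use the node's resolv.conf and skip kube-dns entirely
  containers:
    - name: app
      image: example/app:latest   # placeholder image

Note that this only suits workloads that don't need to resolve cluster-internal service names.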

bowei (Member) commented Oct 5, 2017

@evanj Can you post the script to a gist and link it here?

someword commented Oct 5, 2017

@evanj - Are you hitting the nf_conntrack limits in the kube-dns pod or on the physical instance hosting the pod? We are monitoring nf_conntrack counts and we are not hitting the maximums. I'll check out our dnsPolicy setting though.

evanj commented Oct 5, 2017

I've posted the program and config files, with steps for how to reproduce the issue at the top. I'm hitting the node level nf_conntrack_max limit:

/ # cat /proc/sys/net/netfilter/nf_conntrack_max 
131072
/ # cat /proc/sys/net/netfilter/nf_conntrack_count 
131072

One thing I noticed when I created a brand new cluster, rather than using my existing cluster that has been upgraded: the kube-dns-autoscaler default configuration now has "preventSinglePointFailure":true, which means even a small cluster runs two kube-dns instances, making this a bit harder to reproduce.

Code: https://gist.github.com/evanj/261ffbee061d4309673425b705a78c18
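
For context, that setting lives in the kube-dns-autoscaler ConfigMap in kube-system; a rough sketch of the linear-mode parameters (the numeric values below are illustrative, not the actual GKE defaults):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":1}'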

ilyalukyanov commented Oct 19, 2017

I posted my findings in another issue which seems to be about the same problem (at least related) - #47142

I'd like to also share them here to contribute to the discussion.

In my case dnsPolicy: Default also helped, which is expected since dnsmasq, which seems to be the offender, is no longer used for DNS.

I explored logs and found something that I cannot yet explain.

This is the output when an external host is successfully resolved:

dnsmasq[1]: query[A] subdomain.my-domain.com from 172.20.4.4
dnsmasq[1]: forwarded subdomain.my-domain.com to 127.0.0.1
dnsmasq[1]: reply subdomain.my-domain.com is <CNAME>
dnsmasq[1]: reply subdomain2.my-domain.com is <CNAME>
dnsmasq[1]: reply subdomain2-my-domain-com.something.aws.third-party-domain.net is 11.22.33.44

This is the output when an external host isn't found:

dnsmasq[1]: query[A] subdomain.my-domain.com.default.svc.cluster.local from 172.20.4.4
dnsmasq[1]: forwarded subdomain.my-domain.com.default.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply subdomain.my-domain.com.default.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[A] subdomain.my-domain.com.svc.cluster.local from 172.20.4.4
dnsmasq[1]: forwarded subdomain.my-domain.com.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply subdomain.my-domain.com.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[A] subdomain.my-domain.com.cluster.local from 172.20.4.4
dnsmasq[1]: forwarded subdomain.my-domain.com.cluster.local to 127.0.0.1
dnsmasq[1]: reply subdomain.my-domain.com.cluster.local is NXDOMAIN
dnsmasq[1]: query[A] subdomain.my-domain.com.abcdefghijklmnopqrstuvwxyz.px.internal.cloudapp.net from 172.20.4.4
dnsmasq[1]: forwarded subdomain.my-domain.com.abcdefghijklmnopqrstuvwxyz.px.internal.cloudapp.net to 127.0.0.1
dnsmasq[1]: reply subdomain.my-domain.com.abcdefghijklmnopqrstuvwxyz.px.internal.cloudapp.net is NXDOMAIN

Both cases are for the same external hostname (though it's reproducible with any), same containers, same application and same cluster configuration.

For some reason, depending on something I cannot identify, it decides either to apply the search domains (all returning NXDOMAIN) or to forward the name as-is upstream.

Another thing I'm going to try is to replace Kube-DNS with CoreDNS.

ilyalukyanov commented Oct 20, 2017

Replacing Kube-DNS with CoreDNS resulted in the same behaviour... It looks like the issue isn't with the DNS servers; the issue must be higher up in the Kubernetes DNS middleware.

ApsOps (Contributor) commented May 17, 2018

@joanfont you can set a lower ndots value to get the desired behavior.

hugochinchilla commented May 17, 2018

Hello @ApsOps, I'm a coworker of @joanfont. What you mention is true, but it should not be necessary to change it; there is indeed a problem with how Kubernetes is handling DNS resolution.
In the attached pcap file you can see how, for the same DNS name, Kubernetes sometimes makes the correct query and other times decides to use search domains.

In the pcap you can see the following with more detail, but here is a short version:

Pod asks for arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com

At the host level we see these queries going to the AWS DNS server (10.0.0.2):

07:04:21.227278000  A? arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com -> resolves OK
07:04:17.630330000  A? arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com.svc.cluster.local -> fails

The pod always performs the same query, and dnsmasq sometimes does the right thing (forwarding the query "as is") and other times decides to apply search domains to it. I think the behavior of ndots is consistent and does not explain this problem.

hugochinchilla commented May 17, 2018

Sorry for the noise, I sent the comment a few times by accident while editing it.

ApsOps (Contributor) commented May 17, 2018

dnsmasq is sometimes doing the right thing (forwarding the query "as is") and other times is deciding to apply search domains on it

@hugochinchilla search domains are not added by dnsmasq; they are present in the /etc/resolv.conf of your pods and are applied before the query ever reaches dnsmasq. Whether the resolver tries the search paths before or after making the "as is" query generally depends on the base OS of your containers.

There have been long discussions about the default ndots value of 5: how it benefits cross-namespace lookups, and how it degrades performance when there are more external lookups.

The ndots customization support was recently added and is the right approach at this point IMHO.
#33554 (comment)
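
To make that concrete, with the pod dnsConfig support lowering ndots looks roughly like this in a pod spec (the value of 2 is just an example, and this assumes a cluster version recent enough to have the custom pod DNS feature enabled):

spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # override the default of 5 that kubelet injects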

joekohlsdorf commented May 25, 2018

Regarding ndots: you can add a dot to the end of your domain name; this way it will be treated as an FQDN and the local search path will never be attempted, e.g. arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com..

We see the same intermittent DNS resolution issues in all clusters. In our case it's a Python application and we are failing to resolve external domains. It isn't related to kube-dns autoscaling events because we are running a ridiculously high but fixed number of kube-dns pods. We are also not hitting conntrack limits.

Kubernetes 1.9 on AWS, Networking is kubenet, same results with kube-dns:1.14.9 and kube-dns:1.14.5.

bboreham (Contributor) commented May 26, 2018

Folks here might be interested in the (long) thread about DNS packets getting dropped here:
weaveworks/weave#3287

(Note it is mostly independent of Weave Net: more about conntrack, masquerading and the like)

jsravn (Contributor) commented May 29, 2018

@bboreham To be clear, the linked issue is about SNAT specifically. The linked article refers to the use of host-gw mode on flannel, which masquerades all cross-node pod communication. If you're using a setup that doesn't masquerade everything (like vxlan, cloud routing, etc.), then accessing kube-dns or any service IP will use DNAT only, and is not affected by the described issue.

bboreham (Contributor) commented May 29, 2018

@jsravn could you clarify "the linked issue", "The linked article" and "the described issue" with specific links please? I got lost.

hugochinchilla commented May 29, 2018

@joekohlsdorf that's a quick win!! thank you for the tip.

jsravn (Contributor) commented May 29, 2018

@bboreham The weave issue you linked has the article https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02. This describes a problem with source NATing - if there is a source port collision, it's possible for the packet to get dropped - for UDP and DNS this causes timeouts. When using "host-gw" mode on flannel, every pod connection is source NAT'd on the host VM's IP, making collisions much more likely.

Anyway, I think this may actually explain my own DNS failures. My cluster nodes are all set up with a local dnsmasq which proxies all pod DNS queries. I discovered that, due to the way service IPs work in iptables, this dnsmasq picks the wrong source IP when establishing a connection to the kube-dns pod, and so everything is source NAT'd. I've changed it to force dnsmasq to use the flannel interface for the source IP, which stops the source NATing - I'm hoping that fixes things for me!

bboreham (Contributor) commented May 30, 2018

@jsravn ok, I refute your assertion "the linked issue is about SNAT specifically". E.g. this comment is about DNAT: weaveworks/weave#3287 (comment)

jsravn (Contributor) commented May 30, 2018

@jsravn ok, I refute your assertion "the linked issue is about SNAT specifically". E.g. this comment is about DNAT: weaveworks/weave#3287 (comment)

Cool, I didn't see that. The issue OP and its linked post are about SNAT specifically, but that comment explains that the same race condition can also occur with DNAT.

jaredallard commented Jun 1, 2018

We're seeing this in GKE as well as in a kops-deployed AWS environment. We're starting to move this stack into our production environments, and it's a bit concerning if DNS has transient issues. Reading through this thread, it looks like we don't really have a full picture of what's causing this, do we?

Edit: I've noticed that a pod can sometimes get into a state where DNS will never resolve internal services; deleting that pod fixes the issue.

joekohlsdorf commented Jun 1, 2018

This comment explains the root cause pretty well: weaveworks/weave#3287 (comment)

We have switched our resolvers to TCP and have not seen these issues since. This is probably better than the artificial 4ms delay to avoid the race that was suggested in the weave issue, and it is much easier to implement.

The title of this issue should be updated; it doesn't only affect kube-dns.

YoniTapingo commented Jul 15, 2018

@joekohlsdorf "We have switched our resolvers to TCP" Could you elaborate on how you made the change?

jmcshane commented Jul 16, 2018

I just wanted to jump in on this issue, as the problems I am experiencing in my environment are extremely similar to the OP's: Java containers specifically, no dnsmasq messages about the maximum number of concurrent DNS queries, but packets dropped before the query is sent to the external DNS server.

I have tried adding options single-request-reopen to the resolv.conf file of pods, but after doing this manually and running for a couple of days the issue still occurs. I'm interested in what @jsravn did here, because there are definitely a lot of packets crossing odd network boundaries, and I have the same setup with the node-local dnsmasq.
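
If your cluster is new enough to support the pod dnsConfig field, you can inject that resolver option declaratively instead of editing resolv.conf by hand; a rough sketch (assuming the custom pod DNS feature is enabled on your cluster):

spec:
  dnsConfig:
    options:
      - name: single-request-reopen   # glibc resolver option; not implemented by musl-based images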

joekohlsdorf commented Jul 16, 2018

@YoniTapingo I run this little script from my container entrypoint:

#!/usr/bin/env sh
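# append a blank line and "options use-vc" to resolv.conf so the glibc resolver uses TCP for DNS queries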

echo >> /etc/resolv.conf
echo "options use-vc" >> /etc/resolv.conf

You could also do it in a postStart lifecycle hook if you have root or sudo.

joshbenner commented Jul 16, 2018

I was seeing this a lot, and confirmed I was seeing conntrack-related packet drops. We have applications performing a very large number of lookups, some external, but many internal to the cluster (so default resolver was not an option).

I was able to fix it by setting a very high value for kube-proxy's conntrack command-line parameter:

--conntrack-min=536870912

The nodes seem to have been stable for a couple months with this now.
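
In case it helps anyone else, where that flag goes depends on how kube-proxy is deployed (static pod manifest, DaemonSet, or systemd unit); a rough sketch for a manifest-style deployment (image tag and other details below are placeholders):

containers:
  - name: kube-proxy
    image: gcr.io/google_containers/kube-proxy-amd64:v1.9.0   # placeholder image/tag
    command:
      - kube-proxy
      - --conntrack-min=536870912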

bboreham (Contributor) commented Jul 16, 2018

@joekohlsdorf use-vc is a neat trick, but do note it is a glibc option, not supported by other implementations such as musl (used in alpine base images, for instance).

The most relevant point I could find about DNS TCP support in musl is: https://twitter.com/RichFelker/status/994629795551031296

azman0101 commented Jul 25, 2018

@joshbenner

I was able to fix by setting a very high value on kube-proxy's conntrack commandline parameter:

--conntrack-min=536870912

This value seems very high. How did you determine that this was the right value for your use case?

nf_conntrack_buckets - INTEGER
Size of hash table. If not specified as parameter during module
loading, the default size is calculated by dividing total memory
by 16384 to determine the number of buckets but the hash table will
never have fewer than 32 and limited to 16384 buckets. For systems
with more than 4GB of memory it will be 65536 buckets.
This sysctl is only writeable in the initial net namespace.

So 536870912 corresponds to the default nf_conntrack_buckets sizing of a machine with about 2.19 TB of memory:

536870912 ÷ 4 = 134217728 (nf_conntrack_buckets)
134217728 × 16384 = 2.19 TB of RAM

I wonder why the conntrack table increase described in the blog post https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 didn't work for them:

We had already increased the size of the conntrack table and the Kernel logs were not showing any errors.

But I tested dns-test myself with the default conntrack-min and with --conntrack-min=536870912, and the latter resolves the DNS timeouts.

I also have to test this with our application.

Is there any information about when this bug was introduced? We run without this issue on k8s 1.6.7 ...

joekohlsdorf commented Jul 26, 2018

We have applications performing a very large number of lookups, some external, but many internal to the cluster (so default resolver was not an option).

If you know that you have a lot of external queries, you should verify that you are not hitting any limits of your upstream DNS or cloud provider. I ran into this because someone thought it would be a good idea to not cache NXDOMAIN responses in kube-dns by default. Unfortunately, checking this is a bit tricky because the number of upstream DNS requests isn't (yet) provided as a metric.

dylancaponi commented Sep 25, 2018

I ran into this because someone thought it would be a good idea to not cache NXDOMAIN responses in kube-dns by default.

@joekohlsdorf How do you change that?

fejta-bot commented Dec 24, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

george-angel commented Jan 7, 2019

/remove-lifecycle stale
