DNS Issue #255

Open
MosheMoradSimgo opened this Issue Feb 19, 2017 · 31 comments

@MosheMoradSimgo

MosheMoradSimgo commented Feb 19, 2017

Hi,

We are running Alpine (3.4) in a Docker container on a Kubernetes cluster (GCP).

We have been seeing anomalies where our thread gets stuck for 2.5 seconds.
After some investigation with strace, we saw that DNS resolution times out once in a while.

Here are some examples:

23:18:27 recvfrom(5, "\f\361\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\243\213\360\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000045>
23:18:27 recvfrom(5, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000014>
23:18:27 clock_gettime(CLOCK_REALTIME, {1487114307, 714908396}) = 0 <0.000015>
23:18:27 poll([{fd=5, events=POLLIN}], 1, 2499) = 0 (Timeout) <2.502024>

09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, "\354\211\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\30\220\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000041>
09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, 0x7ffec3d9b0b0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
09:04:27 clock_gettime(CLOCK_REALTIME, {1487149467, 555317749}) = 0 <0.000008>
09:04:27 poll([{fd=5<UDP:[0.0.0.0:36148]>, events=POLLIN}], 1, 2498) = 0 (Timeout) <2.499671>


09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, " B\201\200\0\1\0\1\0\0\0\0\2db\6devone\5*****\3net\0\0\1\0\1\300\f\0\1\0\1\0\0\0\200\0\4h\307\16N", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 53 <0.000011>
09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
09:18:47 clock_gettime(CLOCK_REALTIME, {1487150327, 679292144}) = 0 <0.000005>
09:18:47 poll([{fd=5<UDP:[0.0.0.0:47282]>, events=POLLIN}], 1, 2497) = 0 (Timeout) <2.498797>

And a good example:

08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, "\20j\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\34\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\n\200\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000014>
08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, 0x7ffec3d9aeb0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
08:22:25 clock_gettime(CLOCK_REALTIME, {1487146945, 638264715}) = 0 <0.000010>
08:22:25 poll([{fd=5<UDP:[0.0.0.0:59162]>, events=POLLIN}], 1, 2498) = 1 ([{fd=5, revents=POLLIN}]) <0.000010>

We already had some issues with DNS resolution in an older version (3.3), which appeared to be resolved once we moved to 3.4 (or so we thought).

Is this a known issue?
Does anybody have a solution, workaround, or suggestion for what to do?

Thanks a lot.

@Sartner

Sartner commented Feb 25, 2017

I have the same issue.
Alpine: 3.5
Docker: 1.13.1-cs2

/ # time ping -c 1 dev11
PING dev11 (10.1.100.11): 56 data bytes
64 bytes from 10.1.100.11: seq=0 ttl=63 time=0.211 ms

--- dev11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.211/0.211/0.211 ms
real    0m 2.50s
user    0m 0.00s
sys     0m 0.00s
@not-null

not-null commented Mar 17, 2017

Hi,

With the latest version (3.5), I am experiencing the error below.

fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/community: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/main: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]
  ca-certificates (missing):
    required by: world[ca-certificates]
  curl (missing):
    required by: world[curl]

Can anyone please help me resolve this so I can move forward?

Thanks

@andyshinn andyshinn added the question label May 5, 2017

@andyshinn

Collaborator

andyshinn commented May 5, 2017

The latter two comments don't sound like the same issue. This seems like a Kubernetes-specific thing. Do you know if it happens only to Alpine containers, or does it affect others as well? I've heard of intermittent DNS resolution issues in Kubernetes, but they were not specific to Alpine.

@c24w

c24w commented Jun 2, 2017

We're seeing slow DNS resolution in alpine:3.4 (not in Kubernetes):

$ time docker run --rm alpine:3.4 nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve    

Name:      google.com        
Address 1: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 2: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 4: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net

real    0m2.996s             
user    0m0.010s             
sys     0m0.005s  

Versus Busybox:

$ time docker run --rm busybox nslookup google.com
Server:    10.108.88.10      
Address 1: 10.108.88.10      

Name:      google.com        
Address 1: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 4: 216.58.204.78 lhr25s13-in-f14.1e100.net

real    0m0.545s             
user    0m0.011s             
sys     0m0.007s

Not sure what the null error suggests, but it might be related!

Docker version 17.05.0-ce, build 89658be

@mpashka

mpashka commented Aug 3, 2017

I have an issue with DNS resolution in Alpine.
My /etc/resolv.conf has several search suffixes (6 of them), and during resolution I see that my DNS server only answers the first 6 or 7 requests (this is DNS DoS protection). But according to the strace output, Alpine sends 2 requests for each search suffix.

The Ubuntu Docker image doesn't have this problem - it sends only one request per search suffix.

So is it possible to fix this behaviour and send only 1 request to the DNS server per domain name suffix? This matters because Kubernetes usually adds 3 search suffixes, so if we have more than one search suffix of our own and a DNS server that rate-limits requests from a single IP, we most likely end up with DNS resolution problems. (An illustrative example is below.)
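
For illustration, here is a hypothetical /etc/resolv.conf of the kind Kubernetes generates (the namespace my-ns and the extra company suffixes are made up). With 6 search suffixes and one A plus one AAAA query per suffix, a single lookup of an unqualified name can fan out into up to 12 UDP queries, which is enough to trip a rate limit that only allows 6 or 7 of them:

search my-ns.svc.cluster.local svc.cluster.local cluster.local corp.example.com dev.example.com example.com
nameserver 10.3.240.10
options ndots:5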

@justlooks

justlooks commented Aug 11, 2017

Yes, the latest Alpine image has a problem with DNS resolution. All of my app images built on Alpine have the same problem on Kubernetes v1.7.0.


[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup heapster.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      heapster.kube-system
Address 1: 10.100.249.248 heapster.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup http-svc.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      http-svc.kube-system
Address 1: 10.102.217.7 http-svc.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup ftpserver-service.demo
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'ftpserver-service.demo'
@mpashka

mpashka commented Aug 11, 2017

During my investigation I've found that the problem is with my DNS server.
Some time ago Alpine didn't support the resolv.conf options 'search' and 'domain', but that is not the case any more. They also say that resolution is done in parallel and thus results can differ, but that is not the cause here either.
I've found that Alpine makes 2 requests per name because one is for IPv4 (an A record) and the other is for IPv6 (an AAAA record).
My trouble is related to the DNS server itself. If there are several search domains in resolv.conf and for some of those domains the DNS server reports 'Server failure' (RCODE = 2), then Alpine retries that name. If the DNS server reports 'No such name' (RCODE = 3), then Alpine continues with the next search domain. Ubuntu, on the other hand, doesn't treat 'Server failure' (RCODE = 2) as fatal and simply continues with the other search domains.
You can check the DNS server's RCODE for a specific name with:
# dig @<dns_server> dns_name_to_check
and look at the 'status:' field - it will be NXDOMAIN (which is 'No such name', RCODE = 3) or SERVFAIL.
BTW, nslookup behaves the same way: it respects the RCODE and stops if the DNS server responds with 'Server failure' (RCODE = 2).
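
For example, a quick way to see which search suffix is the one producing the SERVFAIL (the suffixes and server address below are placeholders; substitute the ones from your own resolv.conf):

# query each search suffix directly and print only the response status line
for suffix in my-ns.svc.cluster.local svc.cluster.local cluster.local; do
  dig @10.3.240.10 myservice.$suffix +noall +comments | grep 'status:'
done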

chrishiestand added a commit to zesty-io/redis-k8s-statefulset that referenced this issue Oct 4, 2017

do not build on alpine
testing this without alpine because alpine bug might be
 causing issue where redis does not resolve new ip address:
 gliderlabs/docker-alpine#255

* also pin to 4.0

@blacktop blacktop referenced this issue Nov 4, 2017

Closed

can't issue DNS #6

@zq-david-wang

zq-david-wang commented Apr 10, 2018

I tried this on alpine-docker 3.7, with /etc/resolv.conf as follows:

nameserver 10.254.0.100
search  localdomain  somebaddomain
options ndots:5

My DNS server "10.254.0.100" manages its own domain 'localdomain' and forwards queries for other domains to an external DNS server.
Then when I query google.com, the Alpine DNS client will:

  1. try google.com.localdomain and get an "NXDomain" response
  2. try google.com.somebaddomain and get a "Refused" response; after receiving a "Refused/SERVFAIL" response, the Alpine client keeps retrying "google.com.somebaddomain", resulting in the final failure.

I also tried the centos/ubuntu Docker images; those DNS clients give up after the "Refused/Servfail" response, move on to try "google.com" itself, and get the expected response.

Is retrying the same name after receiving a "Refused/Servfail" response the secure/expected behaviour, or is it a bug in Alpine?
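
If anyone wants to observe this retry behaviour themselves, one simple option (assuming tcpdump is available inside the container or on the node) is to watch the resolver traffic while running the failing lookup; the repeated queries for the same suffix are easy to spot:

# capture DNS traffic on all interfaces, without reverse-resolving addresses
tcpdump -ni any udp port 53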

@KIVagant

KIVagant commented May 11, 2018

We are probably hitting the same issue. Two different containers running in parallel in the same cluster:

  • an image with 3.5.2 works normally; AWS DNS resolves in 0.01s
  • an image with 3.7.0 has a big lag; DNS may take 5 seconds to resolve or may not resolve at all.
@zioalex

zioalex commented May 25, 2018

For the DNS delay, try adding the line:
options single-request
to resolv.conf.
See https://wiki.archlinux.org/index.php/Domain_name_resolution#Hostname_lookup_delayed_with_IPv6

@joshbenner

joshbenner commented May 29, 2018

I don't think musl (which is used by Alpine) has the single-request resolver option.

@zq-david-wang

zq-david-wang commented Jun 11, 2018

I tried the following change and it seems to work. (Tested on my cluster and pushed to davidzqwang/alpine-dns:3.7.)

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 209c20f..abb7da5 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
                        memcpy(canon+l+1, p, z-p);
                        canon[z-p+1+l] = 0;
                        int cnt = name_from_dns(buf, canon, canon, family, &conf);
-                       if (cnt) return cnt;
+                       if (cnt > 0 || cnt == EAI_AGAIN) return cnt;
                }
        }
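
Reading the patch literally, the search-domain loop now only bails out early on a positive result count or on EAI_AGAIN; other non-zero return values for a given search suffix fall through, so the next suffix is still tried.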

@runephilosof

runephilosof commented Jun 12, 2018

I have tested 3.6, 3.7 and edge, and all are affected by https://bugs.busybox.net/show_bug.cgi?id=675.
Alpine 3.7 and edge use "BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary", but if I pull busybox:1.27.2 and test nslookup, it doesn't have the error.
So I am not sure that just upgrading BusyBox will fix the issue.
The BusyBox bug report hints that the libc in use influences the problem.

@krikri90

krikri90 commented Jul 25, 2018

fetch http://mirror.ps.kz/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/main: DNS lookup error
WARNING: Ignoring APKINDEX.1b054110.tar.gz: No such file or directory
fetch http://mirror.ps.kz/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/community: DNS lookup error
WARNING: Ignoring APKINDEX.ce38122e.tar.gz: No such file or directory

I'm getting the above error. How can I fix it?

@sadok-f

sadok-f commented Aug 22, 2018

Hi,

We're running a couple of Docker containers on AWS EC2, with images based on Alpine 3.7.
DNS resolution is very slow; here is an example:

time nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 216.58.207.174 muc11s04-in-f14.1e100.net
Address 2: 2a00:1450:4016:80a::200e muc11s12-in-x0e.1e100.net
real    0m 2.53s
user    0m 0.00s
sys     0m 0.00s

Another test by curl cmd:

time curl https://packagist.org/packages/list.json?vendor=composer  --output list.json
% Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current
                             Dload  Upload   Total   Spent    Left  
Speed
100   174    0   174    0     0     58      0 --:--:--  0:00:03 --:--:--    48
real    0m 3.61s
user    0m 0.01s
sys 0m 0.00s

Interestingly, if we pass the -4 option to curl, forcing it to resolve the address as IPv4 only, the result is much faster, as it should be:

time curl -4 https://packagist.org/packages/list.json?vendor=composer  --output list.json
% Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current
                             Dload  Upload   Total   Spent    Left  
Speed
100   174    0   174    0     0    174      0 --:--:-- --:--:-- --:--:--  1359
real    0m 0.13s
user    0m 0.01s
sys 0m 0.00s

There's a workaround proposed here: #313 (comment)

Is there a release coming soon that fixes this?
Thanks

@bboreham

bboreham commented Aug 22, 2018

FYI, @brb has found some kernel race conditions that relate to this symptom. See https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts for the technical details.

@zhouqiang-cl

zhouqiang-cl commented Aug 26, 2018

I found that if I install bind-tools, everything is OK:
RUN apk add bind-tools

@sebastianfuss

sebastianfuss commented Aug 31, 2018

@zhouqiang-cl
Unfortunately, RUN apk add bind-tools does not solve my name resolution problems. I am running a container with Alpine 3.8 on AWS Fargate and I am getting errors while resolving hostnames.

EDIT:
I also moved to debian stretch slim and my DNS problems seem to be solved.

@jurgenweber

jurgenweber commented Sep 2, 2018

I have converted a few images to Debian Jessie/Stretch slim and my DNS issues went away. Kubernetes 1.9.7 using kops in AWS. This has been bothering us for a long while.

@based64god

based64god commented Sep 13, 2018

I too am seeing musl DNS failures on a bare-metal Kubernetes cluster. The hosts in the cluster are all Ubuntu 18.04 machines using systemd-resolved for local DNS. I can reproduce the issue @sadok-f is having. This is on a Kubernetes 1.11.3 cluster (set up using kubeadm 1.11.3, with the Weave CNI), CoreDNS 1.1.3, and systemd 237 on the host. Swapping images out for Debian stretch slim fixes the issues.

@jstoja

jstoja commented Sep 19, 2018

@zhouqiang-cl @sebastianfuss installing bind-tools just swaps in a statically built binary; it seems to fix only the nslookup command, not the underlying issue.

@chenyongze

chenyongze commented Sep 22, 2018

ERROR: tzdata-2018d-r1: temporary error (try again later)

dgrove-oss added a commit to dgrove-oss/openwhisk that referenced this issue Oct 4, 2018

Switch from alpine to jessie-slim for runner utility images
The Alpine based images have a nasty problem with DNS failures that
tends to surface when running them in Kubernetes.  After a fair amount
of poking around, it seems like the only reliable fix is to not use
Alpine images on Kubernetes until upstream bug fixes in various layers
of the software stack, including the Linux kernel propagate to the
Alpine releases.  For more context,
see:
  gliderlabs/docker-alpine#255
  kubernetes/kubernetes#56903
  https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
@mblaschke

mblaschke commented Oct 8, 2018

I can confirm the issue when running multiple Alpine containers in a Kubernetes cluster. Busybox images are fine; only Alpine is affected.

@swift1911

swift1911 commented Oct 15, 2018

Is there any progress on this issue? In my testing, a newer musl version solves the problem.

@jstoja

jstoja commented Oct 16, 2018

@swift1911 could you share the test you used and the version of Alpine + musl you used? That would be of tremendous help in checking for a fix!

@Mykolaichenko

Mykolaichenko commented Nov 8, 2018

How can we push this forward? It's an extremely big problem!

@ncopa

Collaborator

ncopa commented Nov 8, 2018

Is there any way to reproduce this without using kubernetes?

Alternatively, does anyone have a tcpdump trace that shows exactly what is going on?

hprotzek added a commit to springernature/halfpipe-ml-deploy that referenced this issue Nov 12, 2018

@brb

brb commented Nov 15, 2018

@ncopa You can use the client and the server from https://github.com/brb/conntrack-race to reproduce the issue w/o k8s.

@tecnobrat

tecnobrat commented Nov 19, 2018

I don't know if this will help anyone else, but we found that if we run any Alpine-based Docker image on top of Amazon's ECS AMI, we get a 400ms retry timeout in DNS resolution, and we couldn't find out where it was coming from.

Our resolv.conf looks like:

~ $ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /sbin/dhclient-script
search ec2.internal
nameserver 172.16.0.2

If we use an ubuntu-based image we don't have this issue:

$ sudo iptables -I FORWARD -p udp --sport 53 -j DROP
$ sudo docker run -it bash
bash-4.4# ping tugboat.info
ping: bad address 'tugboat.info'
bash-4.4# ping tecnobrat.com
ping: bad address 'tecnobrat.com'
bash-4.4# exit
exit
[status stage bstolz@ip-172-17-50-25 ~]$ sudo iptables -D FORWARD -p udp --sport 53 -j DROP

[Wireshark screenshot of the DNS queries]

You can see from the Wireshark capture that it sends a request every 400ms instead of every 2 seconds as configured in our resolv.conf.

I'm not sure what's causing it, but it's causing a lot of DNS timeouts for us.

@tecnobrat

tecnobrat commented Nov 19, 2018

I just realized that we have options timeout:2 attempts:5, which means:
2s = 2000ms
2000 / 5 = 400ms

Is Alpine using an OVERALL timeout of 2 seconds and then trying to fit 5 attempts within those 2 seconds, instead of 2 seconds per attempt?

@tecnobrat

tecnobrat commented Nov 19, 2018

I believe this is the case, according to https://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n111

Which means it's fundamentally different from Ubuntu and other glibc-based OSes.
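
For reference, here is a paraphrased sketch of the relevant logic in musl's res_msend.c (variable names simplified; see the linked source for the exact code in your musl version):

/* paraphrased from musl src/network/res_msend.c */
timeout  = 1000 * conf->timeout;       /* "timeout:2"  -> 2000 ms total budget  */
attempts = conf->attempts;             /* "attempts:5"                          */
retry_interval = timeout / attempts;   /* 2000 / 5 = 400 ms between retries     */
/* queries are then re-sent every retry_interval ms until the total timeout
 * expires, i.e. the timeout is an overall budget, not a per-attempt timeout */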
