FS#391 - dnsmasq stops working properly if the fastest upstream DNS server returns a server failure #5503

Closed
openwrt-bot opened this issue Jan 15, 2017 · 9 comments

@openwrt-bot openwrt-bot commented Jan 15, 2017

IronicSven:

**Device problem occurs on**
Reproduced on TP-Link 1043nd v1 and TP-Link Archer C7 v2.

**Software versions of LEDE release, packages, etc.**
Reboot (SNAPSHOT, r2961-5b089e4)
Dnsmasq version 2.76

**Steps to reproduce**
Fastest upstream DNS server returns a server failure.

My provider has been having difficulties with its DNS servers this week. I noticed that if the fastest DNS server returns a server failure, dnsmasq stops working properly because it ignores the replies of the slower DNS servers.

In Google Chrome, ERR_NAME_RESOLUTION_FAILED appears, and nslookup returns: ** server can't find google.com: SERVFAIL

I don't use strict-order and it doesn't matter if the faulty upstream DNS server is the first or the last entry in the config as long as it returns the fastest reply.
I had to delete the upstream DNS server which returns the server failure from my config to get dnsmasq working again.

I was able to create a tcpdump and syslog while the DNS server 83.169.185.162 returned a server failure today.

  • You can see in syslog.txt that the reply messages are missing until I delete 83.169.185.162 from the config.
  • The tcpdump wan.pcap shows that 83.169.185.162 returns the fastest reply with a server failure and that the other DNS servers work properly, but dnsmasq seems to ignore their replies.
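
To check which upstream answers first and with what response code, without going through dnsmasq, each server can be queried directly. The following is a minimal sketch, not part of the original report: the function names are made up for illustration, it only sends a single UDP A query, and it does not verify the response ID or retry on packet loss.

```python
import socket
import struct
import time

# RCODE values from the low 4 bits of the DNS header flags (RFC 1035).
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: recursion desired, one question, class IN."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_rcode(packet):
    """Return the RCODE name found in a DNS response header."""
    if len(packet) < 12:
        raise ValueError("short DNS packet")
    flags = struct.unpack("!H", packet[2:4])[0]
    code = flags & 0x000F
    return RCODE_NAMES.get(code, str(code))


def probe(server, name="google.com", timeout=2.0):
    """Send one UDP query to `server`; return (rcode, seconds) or (None, None) on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_query(name), (server, 53))
        try:
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return None, None
        return parse_rcode(data), time.monotonic() - start
```

Calling `probe("83.169.185.162")` and `probe("8.8.8.8")` from a machine behind the router would show the response code and round-trip time for each upstream, which is the comparison the tcpdump makes visible.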

@openwrt-bot openwrt-bot commented Jan 16, 2017

stintel:

Something similar happens to me from time to time. I'm a Gentoo user, and one of the nameservers for the gentoo.org domain seems to be unreliable. When it is down, it's near impossible for me to resolve anything in said domain.

Someone in OpenWrt also had this problem: he was unable to resolve most records of a domain when connected to his OpenWrt router. The problem did not occur when he was directly connected.

While searching for possible solutions, I came across the --all-servers option:

By default, when dnsmasq has more than one upstream server available, it will send queries to just one server. Setting this flag forces dnsmasq to send all queries to all available servers. The reply from the server which answers first will be returned to the original requester.

Can you test if enabling it helps? It can be enabled in /etc/config/dhcp:

config dnsmasq
	option allservers '1'
	...


@openwrt-bot openwrt-bot commented Jan 16, 2017

IronicSven:

Hi Stijn,

the problem still occurs with the allservers parameter.

My ISP's DNS servers are working fine right now, so to reproduce the DNS server failure I created an Ubuntu VM in VirtualBox, installed a bind9 DNS server and assigned it the static IP address 192.168.3.2 without a gateway address (which means it can't reach the internet to resolve requests).

sven@sven-VirtualBox:~$ nslookup facebook.com 192.168.3.2
Server: 192.168.3.2
Address: 192.168.3.2#53

** server can't find facebook.com: SERVFAIL

For the allservers test, dnsmasq is now using the Google public DNS servers and my faulty DNS server.

root@FlensNet:~# uci set dhcp.@dnsmasq[0].allservers=1
root@FlensNet:~# uci commit dhcp
root@FlensNet:~# /etc/init.d/dnsmasq restart
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again

Syslog shows that the requests are forwarded to all DNS servers as expected but the reply messages are still missing:

Mon Jan 16 17:55:13 2017 daemon.info dnsmasq[1070]: exiting on receipt of SIGTERM
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: started, version 2.76 cachesize 500
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-DNSSEC no-ID loop-detect inotify
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: DNS service limited to local subnets
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq-dhcp[1933]: DHCP, IP range 192.168.3.100 -- 192.168.3.249, lease time 12h
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using local addresses only for domain lan
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: reading /tmp/resolv.conf.auto
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using local addresses only for domain lan
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 8.8.8.8#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 8.8.4.4#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 192.168.3.2#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /etc/hosts - 4 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /tmp/hosts/odhcpd - 0 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /tmp/hosts/dhcp.cfg02411c - 2 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq-dhcp[1933]: read /etc/ethers - 0 addresses
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 query[A] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 query[AAAA] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 query[A] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 query[AAAA] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2

DNS starts to work again after I remove 192.168.3.2 from the DNS servers list.
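
Spotting the missing reply lines by eye gets tedious in longer logs. A rough sketch of a filter for that follows, assuming the log format shown above (serial number and client address after the `dnsmasq[pid]:` prefix). The function name and regular expression are illustrative only, and the sketch assumes serial numbers are not reused within the lines it is given (for example across a dnsmasq restart).

```python
import re

# Matches query-log lines such as:
#   ... dnsmasq[1933]: 1 127.0.0.1/46400 query[A] facebook.com from 127.0.0.1
LINE_RE = re.compile(
    r"dnsmasq\[\d+\]: (?P<serial>\d+) (?P<client>\S+) "
    r"(?P<action>query\[\w+\]|forwarded|reply|cached|config) (?P<name>\S+)"
)


def unanswered_queries(lines):
    """Return {serial: name} for queries that never got a reply/cached/config line."""
    pending = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # DHCP lines, startup messages, etc.
        serial = match.group("serial")
        action = match.group("action")
        if action.startswith("query["):
            pending[serial] = match.group("name")
        elif action in ("reply", "cached", "config"):
            pending.pop(serial, None)
    return pending
```

Fed the lines of the log above, this would list serials 1 to 4 for facebook.com, since none of them is followed by a reply line.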


@openwrt-bot openwrt-bot commented Jan 16, 2017

bjonglez:

Did it start happening after the update to dnsmasq 2.76, in May 2016?

These commits look relevant:

http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=51967f9807665dae403f1497b827165c5fa1084b (introduced in dnsmasq 2.69)
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=4ace25c5d6c30949be9171ff1c524b2139b989d3 (introduced in dnsmasq 2.76)

So, the first commit introduced the issue you see as a "feature" (but there was a bug in the implementation, so it didn't work), while the second commit made the first commit actually work starting from dnsmasq 2.76.

I'm not sure what the right behaviour is, but it does sound strange to treat SERVFAIL as a valid response.


@openwrt-bot openwrt-bot commented Jan 17, 2017

EricLuehrsen:

Simon Kelley took note of this. It might be a necessary though annoying behavior for a stub resolver using DNSSEC.

https://www.mail-archive.com/dnsmasq-discuss@lists.thekelleys.org.uk/msg10901.html


@openwrt-bot openwrt-bot commented Jan 17, 2017

IronicSven:

@baptiste: I can't reproduce the issue with OpenWrt Chaos Calmer 15.05.1 r49389 and Dnsmasq version 2.73. My internet outages started with LEDE and Dnsmasq 2.76.

@eric: I don't use DNSSEC and thus treating SERVFAIL as a valid response sounds strange to me.

I've spent some time trying to add some logging messages and revert the changes mentioned above. I created a patch file in package/network/services/dnsmasq/patches:

--- a/src/forward.c
+++ b/src/forward.c
@@ -821,9 +821,15 @@ void reply_query(int fd, int family, tim
     }

   server = forward->sentto;
+
+  if (option_bool(OPT_LOG) && RCODE(header) == SERVFAIL)
+    my_syslog(LOG_INFO, _("received SERVFAIL"));
+  if (option_bool(OPT_LOG) && RCODE(header) == REFUSED)
+    my_syslog(LOG_INFO, _("received REFUSED"));
+
   if ((forward->sentto->flags & SERV_TYPE) == 0)
     {
-      if (RCODE(header) == REFUSED)
+      if (RCODE(header) == REFUSED || RCODE(header) == SERVFAIL)
         server = NULL;
       else
         {
@@ -853,7 +857,7 @@ void reply_query(int fd, int family, tim
      we get a good reply from another server. Kill it when we've
      had replies from all to avoid filling the forwarding table when
      everything is broken */
-  if (forward->forwardall == 0 || --forward->forwardall == 1 || RCODE(header) != REFUSED)
+  if (forward->forwardall == 0 || --forward->forwardall == 1 || (RCODE(header) != REFUSED && RCODE(header) != SERVFAIL))
     {
       int check_rebind = 0, no_cache_dnssec = 0, cache_secure = 0, bogusanswer = 0;

Now it is working as I would expect. If the fastest DNS server returns SERVFAIL, the next DNS server that returns NOERROR will be used for a valid response.

Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 query[AAAA] bugs.lede-project.org from 127.0.0.1
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.161
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.225
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.8.8
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.4.4
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 192.168.3.2
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: received SERVFAIL
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 24 127.0.0.1/54663 reply bugs.lede-project.org is 148.251.78.235
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 reply bugs.lede-project.org is 2a01:4f8:202:43ea::3
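
The effect of the changed condition in the second hunk can be illustrated with a small model. This is a simplified sketch and not dnsmasq's code: it only mimics the rule "pass a reply to the client if it is the last one outstanding, or if its response code is not one of the codes to skip", and the function name and data layout are invented for the example.

```python
def pick_reply(replies, skip_rcodes):
    """Return the (server, rcode) reply that would be passed to the client.

    `replies` is a list of (server, rcode) tuples in arrival order, one per
    upstream the query was forwarded to. A reply whose rcode is in
    `skip_rcodes` is held back while other replies are still outstanding.
    """
    outstanding = len(replies)
    for server, rcode in replies:
        outstanding -= 1
        if outstanding == 0 or rcode not in skip_rcodes:
            return server, rcode
    return None  # nothing was forwarded


# Arrival order from the test setup: the broken server answers first.
replies = [
    ("192.168.3.2", "SERVFAIL"),
    ("8.8.8.8", "NOERROR"),
    ("8.8.4.4", "NOERROR"),
]

before_patch = pick_reply(replies, {"REFUSED"})
after_patch = pick_reply(replies, {"REFUSED", "SERVFAIL"})
```

With only REFUSED skipped, the model hands the client the SERVFAIL from 192.168.3.2. With SERVFAIL skipped as well, the first NOERROR reply wins, and a failure is only returned once every upstream has failed.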

@devs: Please feel free to use the patch.


@openwrt-bot openwrt-bot commented Feb 4, 2017

dtaht:

A patch much like this was folded into lede a day or three back.


@openwrt-bot openwrt-bot commented Feb 4, 2017

None:

For clarity:

A patch much like that and updating to dnsmasq 2.77test1 was pulled into jow's staging tree. It is not yet in master, much less backported to 17.01.*


@openwrt-bot openwrt-bot commented Feb 7, 2017

IronicSven:

I just tested a self-built image with https://git.lede-project.org/?p=source.git;a=commit;h=3bef96ef18a6fb20401313dfa6e88057d56b16ad and can't reproduce this issue anymore.

I would like to suggest cherry-picking this commit for LEDE v17.01.0 because it will prevent internet outages for users with unreliable DNS servers, like me.

PS: Simon Kelley included the fix in dnsmasq-2.77test2: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=68f6312d4bae30b78daafcd6f51dc441b8685b1e

Thanks for your support.


@openwrt-bot openwrt-bot commented Feb 7, 2017

None:

I already have a pull request in for 2.77test2: lede-project/source#794
