TCP DNS requests fail with "communications error" / "end of file" #824
I think I've narrowed this down to an issue with requests over a WireGuard interface - still debugging, but closing anyway.
I do not believe this has anything to do with WireGuard. Other TCP connections are established fine; the problem is only with pihole-FTL. I believe the configuration is correct, which can be confirmed at https://tricorder.pi-hole.net/5myi8yatj5. Attached are pcaps of a working request from the LAN subnet directly on the ethernet interface (enx), and a failing request over the WireGuard (wg0) interface. The enx interface has IP 10.3.2.7/24, while the wg0 interface has 10.3.2.7/32 with the peer at 10.49.128.1/29. (Attachment: dns-tcp-enx.pcap.gz.) Could the issue be in the "allow all origins" logic?
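For anyone wanting to reproduce such captures, something along these lines should work (a sketch; the actual enx* interface name and the capture filenames are assumptions):

```bash
# Capture DNS traffic on the LAN-facing ethernet interface (the working path)
sudo tcpdump -i enxXXXXXXXXXXXX -w dns-tcp-enx.pcap 'port 53'

# Capture DNS traffic arriving over the WireGuard interface (the failing path)
sudo tcpdump -i wg0 -w dns-tcp-wg0.pcap 'port 53'
```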
Do you see the same happening when using both
The behaviors are:
I believe the Docker-based deployments work correctly because they're behind DNAT rules.
This really looks like a firewall bug TBH. Receiving EOF on a socket means that it was closed. Getting into such a state, where connections are accepted but then immediately closed without receiving or sending any content, should be impossible. Can you double-check a firewall (
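For reference, a quick way to dump the relevant rulesets might look like this (a sketch; which of these applies depends on whether the host uses iptables or nftables):

```bash
# List the filter table rules and chain policies, with packet counters
sudo iptables -L -n -v

# DNAT and other address-rewriting rules live in the nat table
sudo iptables -t nat -L -n -v

# On nftables-based systems, dump the whole ruleset instead
sudo nft list ruleset
```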
The firewalls are entirely open, with a default ACCEPT policy:
Connections to lighttpd work fine over the same link:
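Such a check can be as simple as fetching the admin page over the tunnel (a sketch; the address and the /admin path are assumptions based on a default Pi-hole setup):

```bash
# HTTP to lighttpd over the WireGuard link works, so the tunnel itself is fine
curl -v http://10.3.2.7/admin/
```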
Looking at the pcap, taken on the host running Pi-hole and with the above open firewall rules, you can see the necessary packets arriving and having the correct structure. The ACK, FIN/ACK and RST/ACK packets are all sent to the client before any further packets are received, and based on the packet timings there's an approximately 23 ms delay before the client retries, which is consistent with the RTT of the link. Using strace on the process, this is what happens when making a UDP request that succeeds:
For a failing TCP request we can see the packets arrive; the process seems to enumerate the available interfaces and, if I'm reading this correctly, then proceeds to close the connection.
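For reference, a trace like the one described here can be captured along these lines (a sketch; assumes strace is available and a single pihole-FTL process is running):

```bash
# Attach to the running FTL process and log network-related syscalls with timestamps
sudo strace -f -tt -e trace=network -p "$(pidof pihole-FTL)"
```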
I'm traveling and don't have WS installed locally, so I haven't had a chance to look at your pcaps, sorry. I think your deduction is correct, even though this does not make much sense: dnsmasq doesn't allow TCP/UDP-specific configuration, so it should not behave differently for the two protocols. The one exception to this rule is that no further TCP connections are accepted once the maximum number of TCP workers is reached. However, I think clients then have to wait (connections are not accepted) rather than being rejected by an immediate close. Can you check whether you have multiple (21 ?) FTL processes running (`pidof pihole-FTL | wc -w`)? If not, then the next guess is that you bound to the wildcard address and the check for an interface with the same address as the local address of the TCP connection failed. When dnsmasq concludes that the interface is not allowed, it exits early. Please try
This branch is based on the latest code that will be included in #827 plus a few extra debug lines for you. Check your
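As a quick way to see whether FTL is bound to the wildcard address rather than to specific interface addresses, something like this could be used (a sketch; standard iproute2 tooling assumed):

```bash
# Show listening TCP/UDP sockets on port 53 together with the owning process
sudo ss -tulpn | grep ':53'
```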
There was only a single pihole-FTL process running.

Thanks for adding the debug logging. After switching to the branch I confirmed the PID had changed, then made the request as before. This was the log output in
I then wondered whether having 10.3.2.202/24 on eth0 and 10.3.2.202/32 on wg0 (note the netmask difference) might have been causing an issue, so I removed the /24 IP from eth0 and restarted pihole-FTL while still on the branch, then tried again. The request worked, with the following log output:
So pihole-FTL is struggling with seeing the same IP, albeit with different netmasks, on different interfaces. I know this is an acceptable practice and have used it between host systems and their VPN uplinks forever. What I don't understand in the log output is why it's finding a match against 127.0.0.1:53 when the request came in on wg0 destined for IP 10.3.2.202/32. So I checked again, and this time the logs differed, though nothing had been restarted.
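For completeness, the overlapping address layout being described can be reproduced with plain iproute2 (a sketch; the interface names and the address are taken from the comments above):

```bash
# Same IP on two interfaces, but with different prefix lengths
sudo ip addr add 10.3.2.202/24 dev eth0   # LAN-facing ethernet interface
sudo ip addr add 10.3.2.202/32 dev wg0    # WireGuard interface (host route only)
```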
I then looked at the commit you made to add the logging and found

Hope this helps.
Wait, what?

But `10.3.2.202/24` contains `10.3.2.202/32`.

Checking back with @dschaper just in case I'm missing something obvious here.
Doesn't look like you're missing anything. The /24 contains the /32 host in its range. Unless you set the routes up the right way, /24 traffic is going to go to eth0 and then the default gateway. Get the output from

Put the WireGuard interface on a completely different subnet.
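The routing that actually gets used can be inspected with something like this (a sketch; the 10.3.2.202 address is the one from the comments above):

```bash
# Dump the main routing table to see where 10.3.2.0/24 traffic currently goes
ip route show

# From a client, ask the kernel which route and interface it would pick for the Pi-hole address
ip route get 10.3.2.202
```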
It's entirely acceptable to do this, and in fact the TincVPN documentation recommends it for a similar scenario. Using WireGuard like this is no different.

In the simplest form, consider your default route being defined via a gateway: your own IP is within that default route and you have no issues. Or consider a regular router with multiple NICs. On one side it may have address 10.1.0.1/16 and thus know about and route to your wider 10.1.0.0/16 network. On another NIC it would have the same IP but a more specific subnet, 10.1.0.1/24, so it only routes traffic for 10.1.0.0/24 via the second NIC, and anything else in 10.1.0.0/16 via the first NIC. Essentially, when it needs to make a routing decision, the most specific route in the table wins, which is why you see routes sorted by their netmask in a routing table.

In my case, I'm using the /32 IP on the WireGuard interface because WireGuard manages the routing table for its known hosts, and, as the TincVPN docs state, this "will make things a lot easier to remember and set up", which is important for a mesh VPN created on top of WireGuard.

I appreciate that there's a possible workaround as @dschaper suggested, but I would still classify this as a bug and thus incorrect behaviour. I believe this is apparent from the fact that UDP requests are succeeding, which proves that the routing is entirely correct. This is also supported by the pcaps. I believe the issue may be in the
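A minimal sketch of the two-NIC router example above, assuming hypothetical interfaces nic1 and nic2 that both exist and are up:

```bash
# Same router IP on both NICs, with different prefix lengths
ip addr add 10.1.0.1/16 dev nic1   # wider 10.1.0.0/16 network via the first NIC
ip addr add 10.1.0.1/24 dev nic2   # more specific 10.1.0.0/24 subnet via the second NIC

# Longest-prefix match decides: the /24 connected route wins inside 10.1.0.0/24
ip route get 10.1.0.50     # expected to be routed via nic2
ip route get 10.1.200.50   # outside the /24, expected to be routed via nic1
```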
Thanks for digging into this. As this is a bug in the embedded dnsmasq, I think this bug ticket already contains enough information which you can use for reporting the bug upstream on the official dnsmasq mailing list.

In case you need any further assistance or want to discuss something in addition, please feel free to re-open this ticket.
Thanks for the help and offer of a branch fix, much appreciated. Happy to raise the issue upstream and wait for the fixes to be pulled in due course.
@DL6ER - if you have a moment, would you be able to answer a couple of questions in the upstream discussion?
@jinnko I cannot promise to be able to say anything until Monday, when I return from a business trip. Feel free to reply yourself; you have already looked into the code and drawn some conclusions from it.
In raising this issue, I confirm the following (please check boxes, eg [X]). Failure to fill the template will close your issue:
How familiar are you with the codebase?: 1
[BUG] Expected Behaviour:

When performing a `dig` against the TCP endpoint I expect a NOERROR response.

[BUG] Actual Behaviour:
[BUG] Steps to reproduce:

- Installed Pi-hole via the `curl | bash` method
- `dig @10.3.2.7 pi.hole` (see the sketch below)
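A sketch of how the two cases might be exercised (the `+tcp` flag forces dig to use TCP; whether that exact flag was used isn't stated above):

```bash
dig @10.3.2.7 pi.hole         # UDP query: succeeds
dig +tcp @10.3.2.7 pi.hole    # TCP query: the failing case described in this issue
```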
Log file output [if available]

Nothing is logged in `/var/log/pihole*.log`.

Device specifics

Hardware Type: rPi Zero and rPi 1
OS: Raspbian Buster, fully updated
Two separate Docker containers I have running, both ARM and amd64, work as expected and are also v5.0: