Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP DNS requests fail with "communications error" / "end of file" #824

Closed
3 tasks done
jinnko opened this issue Jul 1, 2020 · 15 comments
Closed
3 tasks done

TCP DNS requests fail with "communications error" / "end of file" #824

jinnko opened this issue Jul 1, 2020 · 15 comments

Comments

@jinnko
Copy link

jinnko commented Jul 1, 2020

In raising this issue, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your issue:

  • I have read and understood the contributors guide.
  • The issue I am reporting can be replicated
  • The issue I am reporting isn't a duplicate

How familiar are you with the codebase?:

1


[BUG] Expected Behaviour:

When performing a dig against the TCP endpoint I expect a NOERROR response.

[BUG] Actual Behaviour:

# dig +tcp @10.3.2.7 pi.hole
;; communications error to 10.3.2.7#53: end of file
;; communications error to 10.3.2.7#53: end of file

[BUG] Steps to reproduce:

  • Prepare a new Raspbian environment on Raspberry Pi Zero or Raspberry Pi 1 (not tested on others)
  • Install using curl | bash method
  • Confirm UDP requests work with dig @10.3.2.7 pi.hole
  • Attempt TCP request as quoted in Actual Behaviour above and see it fail

Log file output [if available]

Nothing is logged in /var/log/pihole*.log.

Device specifics

Hardware Type: rPi Zero and rPi 1
OS: Raspbian Buster, fully updated

# pihole -v
  Pi-hole version is v5.0 (Latest: v5.0)
  AdminLTE version is v5.0 (Latest: v5.0)
  FTL version is v5.0 (Latest: v5.0)

Two separate docker containers I have running, both ARM and amd64, work as expected, and are also v5.0:

# docker exec -it pihole pihole -v   
  Pi-hole version is v5.0 (Latest: v5.0)
  AdminLTE version is v5.0 (Latest: v5.0)
  FTL version is v5.0 (Latest: v5.0)
@jinnko
Copy link
Author

jinnko commented Jul 1, 2020

I think I've narrowed this down to an issue with requests over a wireguard interface - still debugging, but closing anyway.

@jinnko jinnko closed this as completed Jul 1, 2020
@jinnko
Copy link
Author

jinnko commented Jul 1, 2020

I do not believe this has anything to do with wireguard. Other TCP connections are being established fine, the problem is only with pihole-FTL. I believe the configuration is correct, and can be confirmed at https://tricorder.pi-hole.net/5myi8yatj5.

Attached are pcaps of a working request from the LAN subnet directly on the ethernet interface (enx), and a failing request over the wireguard (wg0) interface. The enx interfaces has ip 10.3.2.7/24, while the wg0 interface has 10.3.2.7/32 with the peer at 10.49.128.1/29.

dns-tcp-enx.pcap.gz
dns-tcp-wg0.pcap.gz

Could the issue be in the "allow all origins" logic?

@jinnko jinnko reopened this Jul 1, 2020
@DL6ER
Copy link
Member

DL6ER commented Jul 1, 2020

Do you see the same happening when using both Listen on all interfaces and Listen on all interfaces, permit all origins ?

@jinnko
Copy link
Author

jinnko commented Jul 2, 2020

The behaviors are:

  • Listen on all interfaces, from 10.49.128.1 on the other side of wg0

    • dig @10.3.2.7 pihole - times out with ;; connection timed out; no servers could be reached
    • dig +tcp @10.3.2.7 pihole - fails with ;; communications error to 10.3.2.7#53: end of file
  • Listen on all interfaces, permit all origins

    • dig @10.3.2.7 pihole - succeeds immediately
    • dig +tcp @10.3.2.7 pihole - fails with ;; communications error to 10.3.2.7#53: end of file

I believe the docker-based deployments work correctly because they're behind DNAT rules.

@DL6ER
Copy link
Member

DL6ER commented Jul 3, 2020

This really looks like a firewall bug TBH. Receiving EOF on a socket means that it was closed. Getting into such a state, where connections are accepted but then immediately closed without receiving/sending any content, should be impossible. Can you double-check a firewall (iptables, ufw, maybe Docker-settings, something else)? Also, can you connect to port 80 (lighttpd) over the same connection?

@jinnko
Copy link
Author

jinnko commented Jul 3, 2020

The firewalls are entirely open, with a default ACCEPT policy:

# iptables-save
# Generated by xtables-save v1.8.2 on Fri Jul  3 08:22:15 2020
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT
# Completed on Fri Jul  3 08:22:15 2020

Connections to lighttpd work fine over the same link:

# curl 10.3.2.202                                                                        

    <html>
      <head>
        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>
        <link rel='stylesheet' href='pihole/blockingpage.css' type='text/css'/>
      </head>
      <body id='splashpage'>
        <img src='admin/img/logo.svg'/><br/>
        Pi-<b>hole</b>: Your black hole for Internet advertisements<br/>
        <a href='/admin'>Did you mean to go to the admin panel?</a>
      </body>
    </html>
    #             

Looking at the pcap, taken on the host running pihole and with the above open firewall rules, you can see the necessary packets arriving and having the correct structure. The ACK, FIN/ACK and RST/ACK packets are all sent to the client before any further packets are received, and based on the packet timings there's an approx 23ms delay before the client retries, which is consistent with the RTT of the link.

Using strace on the process this is what happens when making a UDP request that succeeds:

Process 499 attached
Jul 03 08:28:26 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLIN}])
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=145172}, NULL) = 0
Jul 03 08:28:26 recvmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(36577), sin_addr=inet_addr("10.49.128.1")}, msg_namelen=28->16, msg_iov=[{iov_base="\25\264\1 \0\1\0\0\0\0\0\1\2pi\4hole\0\0\1\0\1\0\0)\20\0\0\0"..., iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("wg0"), ipi_spec_dst=inet_addr("10.3.2.202"), ipi_addr=inet_addr("10.3.2.202")}}], msg_controllen=24, msg_flags=0}, 0) = 48
Jul 03 08:28:26 ioctl(4, SIOCGIFNAME, {ifr_index=6, ifr_name="wg0"}) = 0
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=150040}, NULL) = 0
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=150333}, NULL) = 0
Jul 03 08:28:26 sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(36577), sin_addr=inet_addr("10.49.128.1")}, msg_namelen=16, msg_iov=[{iov_base="\25\264\205\200\0\1\0\1\0\0\0\1\2pi\4hole\0\0\1\0\1\300\f\0\1\0\1\0"..., iov_len=52}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("10.3.2.202"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 52
Jul 03 08:28:26 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLIN}])
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=971473}, NULL) = 0
Jul 03 08:28:26 recvmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(39848), sin_addr=inet_addr("10.3.2.105")}, msg_namelen=28->16, msg_iov=[{iov_base="\227\233\1\0\0\1\0\0\0\0\0\0\10location\10services\7m"..., iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("eth0"), ipi_spec_dst=inet_addr("10.3.2.202"), ipi_addr=inet_addr("10.3.2.202")}}], msg_controllen=24, msg_flags=0}, 0) = 47
Jul 03 08:28:26 ioctl(4, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 0
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=973606}, NULL) = 0
Jul 03 08:28:26 gettimeofday({tv_sec=1593764906, tv_usec=973891}, NULL) = 0
Jul 03 08:28:26 openat(AT_FDCWD, "/etc/pihole/pihole-FTL.conf", O_RDONLY|O_LARGEFILE) = 18
Jul 03 08:28:26 fstat64(18, {st_mode=S_IFREG|0664, st_size=15, ...}) = 0

For a failing TCP request we can see the packets arrive, and the process seems to enumerate the available interfaces, and if I'm reading this correctly, proceeds to close the connection.

# strace -p 499 2>&1 | ts
Jul 03 08:36:40 strace: Process 499 attache
Jul 03 08:36:41 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=5, revents=POLLIN}]
Jul 03 08:36:41 gettimeofday({tv_sec=1593765401, tv_usec=184500}, NULL) = 
Jul 03 08:36:41 accept(5, NULL, NULL)                   = 1
Jul 03 08:36:41 getsockname(18, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.2.202")}, [28->16]) = 
Jul 03 08:36:41 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 1
Jul 03 08:36:41 sendto(3, {{len=20, type=0x16 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x300, seq=5631, pid=0}, "\x0a\x00\x00\x00"}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 2
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x80\x80\xfe\x01\x00\x00\x00\x14\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x40\x80\x00\x02\x00\x00\x00\x14\x00\x01\x00\x2a\x02\x80\x10\x60\xc5\x00\x00\x00\x10\x00\x03\x00\x02\x02\x02\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x40\x80\xfd\x02\x00\x00\x00\x14\x00\x01\x00\xfe\x80\x00\x00\x00\x00\x00\x00\x02\x1f\xf3\xff\xfe\xfc\x7f\xed\x14\x00\x06\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 21
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x80\x80\xfe\x01\x00\x00\x00\x14\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x40\x80\x00\x02\x00\x00\x00\x14\x00\x01\x00\x2a\x02\x80\x10\x60\xc5\x00\x00\x00\x10\x00\x03\x00\x02\x02\x02\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5631, pid=496}, "\x0a\x40\x80\xfd\x02\x00\x00\x00\x14\x00\x01\x00\xfe\x80\x00\x00\x00\x00\x00\x00\x02\x1f\xf3\xff\xfe\xfc\x7f\xed\x14\x00\x06\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 21
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="lo", ifr_mtu=65536}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5631, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5631, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
Jul 03 08:36:41 sendto(3, {{len=20, type=0x16 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x300, seq=5632, pid=0}, "\x02\x00\x00\x00"}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x08\x80\xfe\x01\x00\x00\x00\x08\x00\x01\x00\x7f\x00\x00\x01\x08\x00\x02\x00\x7f\x00\x00\x01\x07\x00\x03\x00\x6c\x6f\x00\x00"...}, {{len=88, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x18\x80\x00\x02\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x04\x00\x0a\x03\x02\xff"...}, {{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x20\x80\x00\x06\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x03\x00\x77\x67\x30\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 240
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x08\x80\xfe\x01\x00\x00\x00\x08\x00\x01\x00\x7f\x00\x00\x01\x08\x00\x02\x00\x7f\x00\x00\x01\x07\x00\x03\x00\x6c\x6f\x00\x00"...}, {{len=88, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x18\x80\x00\x02\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x04\x00\x0a\x03\x02\xff"...}, {{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5632, pid=496}, "\x02\x20\x80\x00\x06\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x03\x00\x77\x67\x30\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 240
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="lo", ifr_mtu=65536}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=6, ifr_name="wg0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="wg0", ifr_flags=IFF_UP|IFF_POINTOPOINT|IFF_RUNNING|IFF_NOARP}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="wg0", ifr_mtu=1420}) = 0
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5632, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5632, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
Jul 03 08:36:41 close(19)                               = 0
Jul 03 08:36:41 setsockopt(18, SOL_IP, IP_PKTINFO, [1], 4) = 0
Jul 03 08:36:41 getsockopt(18, SOL_IP, IP_PKTOPTIONS, "\30\0\0\0\0\0\0\0\10\0\0\0\6\0\0\0\n\3\2\312\n\3\2\312", [5131->24]) = 0
Jul 03 08:36:41 ioctl(5, SIOCGIFNAME, {ifr_index=6, ifr_name="wg0"}) = 
Jul 03 08:36:41 ioctl(5, SIOCGIFFLAGS, {ifr_name="wg0", ifr_flags=IFF_UP|IFF_POINTOPOINT|IFF_RUNNING|IFF_NOARP}) = 0
Jul 03 08:36:41 shutdown(18, SHUT_RDWR)                 = 
Jul 03 08:36:41 close(18)                               = 0
Jul 03 08:36:41 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=5, revents=POLLIN}])
Jul 03 08:36:41 gettimeofday({tv_sec=1593765401, tv_usec=548630}, NULL) = 0
Jul 03 08:36:41 accept(5, NULL, NULL)                   = 18
Jul 03 08:36:41 getsockname(18, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.2.202")}, [28->16]) = 0
Jul 03 08:36:41 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 19
Jul 03 08:36:41 sendto(3, {{len=20, type=0x16 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x300, seq=5633, pid=0}, "\x0a\x00\x00\x00"}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x80\x80\xfe\x01\x00\x00\x00\x14\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x40\x80\x00\x02\x00\x00\x00\x14\x00\x01\x00\x2a\x02\x80\x10\x60\xc5\x00\x00\x00\x10\x00\x03\x00\x02\x02\x02\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x40\x80\xfd\x02\x00\x00\x00\x14\x00\x01\x00\xfe\x80\x00\x00\x00\x00\x00\x00\x02\x1f\xf3\xff\xfe\xfc\x7f\xed\x14\x00\x06\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 216
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x80\x80\xfe\x01\x00\x00\x00\x14\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x40\x80\x00\x02\x00\x00\x00\x14\x00\x01\x00\x2a\x02\x80\x10\x60\xc5\x00\x00\x00\x10\x00\x03\x00\x02\x02\x02\x14\x00\x06\x00"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5633, pid=496}, "\x0a\x40\x80\xfd\x02\x00\x00\x00\x14\x00\x01\x00\xfe\x80\x00\x00\x00\x00\x00\x00\x02\x1f\xf3\xff\xfe\xfc\x7f\xed\x14\x00\x06\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 216
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="lo", ifr_mtu=65536}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 0
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5633, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5633, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
Jul 03 08:36:41 sendto(3, {{len=20, type=0x16 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_ACK|0x300, seq=5634, pid=0}, "\x02\x00\x00\x00"}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x08\x80\xfe\x01\x00\x00\x00\x08\x00\x01\x00\x7f\x00\x00\x01\x08\x00\x02\x00\x7f\x00\x00\x01\x07\x00\x03\x00\x6c\x6f\x00\x00"...}, {{len=88, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x18\x80\x00\x02\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x04\x00\x0a\x03\x02\xff"...}, {{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x20\x80\x00\x06\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x03\x00\x77\x67\x30\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 240
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x08\x80\xfe\x01\x00\x00\x00\x08\x00\x01\x00\x7f\x00\x00\x01\x08\x00\x02\x00\x7f\x00\x00\x01\x07\x00\x03\x00\x6c\x6f\x00\x00"...}, {{len=88, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x18\x80\x00\x02\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x04\x00\x0a\x03\x02\xff"...}, {{len=76, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=5634, pid=496}, "\x02\x20\x80\x00\x06\x00\x00\x00\x08\x00\x01\x00\x0a\x03\x02\xca\x08\x00\x02\x00\x0a\x03\x02\xca\x08\x00\x03\x00\x77\x67\x30\x00"...}], iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 240
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=1, ifr_name="lo"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="lo", ifr_mtu=65536}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=2, ifr_name="eth0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="eth0", ifr_mtu=1500}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFNAME, {ifr_index=6, ifr_name="wg0"}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFFLAGS, {ifr_name="wg0", ifr_flags=IFF_UP|IFF_POINTOPOINT|IFF_RUNNING|IFF_NOARP}) = 0
Jul 03 08:36:41 ioctl(19, SIOCGIFMTU, {ifr_name="wg0", ifr_mtu=1420}) = 0
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5634, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 20
Jul 03 08:36:41 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=5634, pid=496}, 0}, iov_len=240}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
Jul 03 08:36:41 close(19)                               = 0
Jul 03 08:36:41 setsockopt(18, SOL_IP, IP_PKTINFO, [1], 4) = 0
Jul 03 08:36:41 getsockopt(18, SOL_IP, IP_PKTOPTIONS, "\30\0\0\0\0\0\0\0\10\0\0\0\6\0\0\0\n\3\2\312\n\3\2\312", [5131->24]) = 0
Jul 03 08:36:41 ioctl(5, SIOCGIFNAME, {ifr_index=6, ifr_name="wg0"}) = 0
Jul 03 08:36:41 ioctl(5, SIOCGIFFLAGS, {ifr_name="wg0", ifr_flags=IFF_UP|IFF_POINTOPOINT|IFF_RUNNING|IFF_NOARP}) = 0
Jul 03 08:36:41 shutdown(18, SHUT_RDWR)                 = 0
Jul 03 08:36:41 close(18)                               = 0

@DL6ER
Copy link
Member

DL6ER commented Jul 6, 2020

I'm traveling and don't have WS installed locally, so I haven't had a chance to look at your pcaps, sorry. I think your deduction is correct, even when this does not make too much sense as dnsmasq doesn't allow TCP/UDP-specific configuration so it should also not behave differently for the two protocols. The one exception from this rule is that no further TCP connections are accepted when the maximum number of TCP workers is reached. However, I think that clients then have to wait (connections are not accepted) instead of being rejected by an immediate close.

Can you check if you have multiple (21 ?) FTL processes running?

pidof pihole-FTL | wc -w

If not, then the next guess is that you bound to the wildcard address and the check for an interface with the same address as the local address of the TCP connection failed. When dnsmasq concludes that this is a not allowed interface, it will exit early. Please try

pihole checkout ftl fix/jjnnko

This branch is based on the latest code that will be included in #827 plus a few extra debug lines for you. Check your /var/log/pihole-FTL.log for lines starting in TCP: to hopefully gain more insight into what is going on.

@jinnko
Copy link
Author

jinnko commented Jul 6, 2020

There was only a single pihole-FTL process running (pidof pihole-FTL | wc -w returned 1).

Thanks for adding the debug logging. After switching to the branch I confirmed the PID was changed, then did the request as before. This was the log output in /var/log/pihole-FTL.log

[2020-07-06 16:16:41.820 16777M] TCP: Enumerating interfaces...
[2020-07-06 16:16:41.821 16777M] TCP: ...done
[2020-07-06 16:16:41.821 16777M] TCP: No interface and no loopback exception, closing connection
[2020-07-06 16:16:41.822 16777M] TCP: Checking interface eth0 (10.3.2.202:53) vs 10.3.2.202:53
[2020-07-06 16:16:41.822 16777M] TCP: MATCH
[2020-07-06 16:16:41.869 16777M] TCP: Enumerating interfaces...
[2020-07-06 16:16:41.871 16777M] TCP: ...done
[2020-07-06 16:16:41.873 16777M] TCP: No interface and no loopback exception, closing connection
[2020-07-06 16:16:41.875 16777M] TCP: Checking interface eth0 (10.3.2.202:53) vs 10.3.2.202:53
[2020-07-06 16:16:41.877 16777M] TCP: MATCH

I then wondered whether the fact 10.3.2.202/24 on eth0 and 10.3.2.202/32 on wg0 (note the netmask difference) might have been causing an issue, so I removed the /24 IP from eth0 and restarted pihole-FTL, while still on the branch, then tried again and the request worked, with the following log output:

[2020-07-06 16:23:09.646 17467M] TCP: Enumerating interfaces...
[2020-07-06 16:23:09.649 17467M] TCP: ...done
[2020-07-06 16:23:09.650 17467M] TCP: Checking interface wg0 (10.3.2.202:53) vs 127.0.0.1:53
[2020-07-06 16:23:09.652 17467M] TCP: Checking interface eth0 (10.3.2.203:53) vs 127.0.0.1:53
[2020-07-06 16:23:09.654 17467M] TCP: Checking interface lo (127.0.0.1:53) vs 127.0.0.1:53
[2020-07-06 16:23:09.655 17467M] TCP: MATCH

So pihole-FTL is struggling with seeing the same IP, albeit with different netmasks, on different interfaces. I know this is an acceptable practice and have used it between host systems and their VPN uplinks forever.

What I don't understand in the log output is why it's finding a match against 127.0.0.1:53 when the request came in on wg0 destined for IP 10.3.2.202/32. So I checked again, and this time the logs differed, though nothing had been restarted.

[2020-07-06 16:29:34.689 17467M] TCP: Enumerating interfaces...
[2020-07-06 16:29:34.690 17467M] TCP: ...done
[2020-07-06 16:29:34.691 17467M] TCP: Checking interface wg0 (10.3.2.202:53) vs 10.3.2.202:53
[2020-07-06 16:29:34.691 17467M] TCP: MATCH

I then looked at the commit you made to add the logging and found loopback_exception() where I believe at least part of the issue lies. In the comments there's mention of odd behaviour in the kernel interface reporting, and I suspect that's where things are getting a bit messed up because this is a Wireguard interface. I'm no expert, but I'm sure the packets are coming in tunnelled over eth0, getting unpacked in the kernel, then being emitted decrypted on the wg0 interface. Between this and the overlapping IP (though differing netmask), something is upsetting the dnsmasq logic. I wish I could propose the fix as it looks like it's right there where I'm looking, but I don't see it.

Hope this helps.

@DL6ER
Copy link
Member

DL6ER commented Jul 7, 2020

I then wondered whether the fact 10.3.2.202/24 on eth0 and 10.3.2.202/32 on wg0 (note the netmask difference) might have been causing an issue, so I removed the /24 IP from eth0 and restarted pihole-FTL, while still on the branch, then tried again and the request worked

Wait, what?

So pihole-FTL is struggling with seeing the same IP, albeit with different netmasks, on different interfaces. I know this is an acceptable practice and have used it between host systems and their VPN uplinks forever.

But

10.3.2.202/24 on eth0

contains

10.3.2.202/32 on wg0

Checking back with @dschaper just in case I'm missing something obvious here.

@dschaper
Copy link
Member

dschaper commented Jul 7, 2020

Doesn't look like you're missing anything. The /24 contains the /32 host in it's range. Unless you set the routes up the right way then /24 traffic is going to go to eth0 and then the default gateway. Get the output from ip route when both ranges are in use and see where things go. ip route get will tell you where the TCP/IP stack is going to send traffic.

Put the wireguard interface on a completely different subnet.

@jinnko
Copy link
Author

jinnko commented Jul 7, 2020

It's entirely acceptable to do this, and in fact the TincVPN recommends it for a similar scenario. Using Wireguard like this is no different.

In the simplest form consider your default route being defined via a gateway. Your own IP is within that default route and you have no issues.

Or consider a regular router with multiple NICs. On one side it may have address 10.1.0.1/16 and thus know about and route to your wider 10.0.0.0/16 network. On another NIC it would have the same IP but more specific subnet, 10.1.0.1/24, so only routes traffic to 10.1.0.0/24 via the second NIC, and anything else for 10.0.0.0/16 via the first NIC. Essentially when it needs to make a routing decision, the most specific route in the tables wins, which is why you see the routes sorted by their netmask in a routing table.

In my case, i'm using the /32 IP on the Wireguard interface because Wireguard manages the routing table for it's known hosts, and as the TincVPN docs states - this "will make things a lot easier to remember and set up" - which is important for a mesh VPN created on top of Wireguard.

I appreciate that there's a possible workaround as @dschaper suggested, but I would still classify this as a bug and thus incorrect behaviour. I believe this is apparent by the fact that UDP request are succeeding, which proves that the routing is entirely correct. This is also supported by the pcaps.

I believe the issue may be in the loopback_exception() logic which appears to be forcing the connection to close.

@DL6ER
Copy link
Member

DL6ER commented Jul 19, 2020

Thanks for digging into this. As this is a bug in the embedded dnsmasq, I don't think there is much we can do here. The bug should be resolved upstream and we will pick the incoming bugfix and make it available for our users. If you need it more urgently in FTL (before the next regular release), we can always prepare a special branch for you which contains said fix inside FTL on short notice.

I think this bug ticket already contains enough information which you can use for reporting the bug on the official dnsmasq "bug ticket system" dnsmasq-discuss@lists.thekelleys.org.uk

In case you need any further assistance or want to discuss something in addition, please feel free to re-open this ticket.

@DL6ER DL6ER closed this as completed Jul 19, 2020
@jinnko
Copy link
Author

jinnko commented Jul 20, 2020

Thanks for the help and offer of a branch fix, much appreciated. Happy to raise the issue upstream and wait for the fixes to be pulled in due course.

@jinnko
Copy link
Author

jinnko commented Jul 22, 2020

@DL6ER - if you have a moment would you be able to answer a couple of questions in the upstream discussion?

@DL6ER
Copy link
Member

DL6ER commented Jul 22, 2020

@jinnko I cannot promise to be able to say anything until Monday when I return from business trip. Feel free to reply yourself, you have already looked into the code and draw some conclusion from it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants