net: go DNS resolver fails to connect to local DNS server #67925

Closed
danvolchek opened this issue Jun 10, 2024 · 26 comments
Labels
FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
@danvolchek

Go version

go version go1.22.4 linux/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/home/dan/.cache/go-build'
GOENV='/home/dan/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dan/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dan/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/dan/sdk/go1.22.4'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/dan/sdk/go1.22.4/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.22.4'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3858791460=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Using a local nameserver (192.168.0.1, provided by my router) and the go DNS resolver, call net.LookupHost.

/etc/resolv.conf:

# Generated by NetworkManager
search Home
nameserver 192.168.0.1
nameserver 205.171.3.25
nameserver 2001:428::1
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 2001:428::2

main.go:

package main

import (
	"fmt"
	"net"
)

func main() {
	addrs, err := net.LookupHost("ghcr.io")
	fmt.Println(addrs, err)
}

What did you see happen?

LookupHost alternates between failing and then succeeding, over and over:

$ GODEBUG=netdns=2 go run main.go
go package net: confVal.netCgo = false  netGo = false
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(ghcr.io) = files,dns
[] lookup ghcr.io on 192.168.0.1:53: no such host

$ GODEBUG=netdns=2 go run main.go
go package net: confVal.netCgo = false  netGo = false
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(ghcr.io) = files,dns
[140.82.116.34] <nil>

<pattern repeats, failure then success then failure then success, etc>

The pattern is always like this - it never fails or succeeds twice in a row. It happens without setting GODEBUG as well.

What did you expect to see?

I expect LookupHost to always succeed.

More context:

  • Narrowing down the root cause:
    • DNS lookups from other tools on the machine (host, dig) never fail when connecting to 192.168.0.1 - so I don't think it's a connectivity issue.
    • Using the native/cgo resolver always succeeds - so I don't think it's an OS issue:
      $ GODEBUG=netdns=cgo+2 go run main.go
      go package net: confVal.netCgo = true  netGo = false
      go package net: using cgo DNS resolver
      go package net: hostLookupOrder(ghcr.io) = cgo
      [140.82.116.33] <nil>
      
    • Commenting out nameserver 192.168.0.1 in my /etc/resolv.conf makes the go DNS resolver always succeed - so I think it is specific to that server.
    • The device providing the local DNS server is CenturyLink Zyxel C3000Z in case it matters.
  • Reproducibility:
    • I've reproduced this in go1.19.8 linux/arm64 and go1.21.11 linux/arm64 (both on the same machine as above), and go1.22.4 linux/amd64 (on a different machine in my network).
    • I cannot reproduce this in go1.18.7 linux/arm (on an older machine, RPI 3B+ with Raspbian 10) - the go resolver always succeeds.
  • Impact:

Let me know if any more info is needed/what I can do to debug further. Thanks!

@ianlancetaylor
Contributor

If you edit /etc/resolv.conf so that 192.168.0.1 is the only nameserver, does your program consistently fail?

Does it have anything to do with "ghcr.io", or does it fail for any host name?

We may need to see a tcpdump showing the packets for a failed DNS lookup.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 10, 2024
@danvolchek
Author

If you edit /etc/resolv.conf so that 192.168.0.1 is the only nameserver, does your program consistently fail?

No, it still alternates between success and failure.

Does it have anything to do with "ghcr.io", or does it fail for any host name?

Testing out:

  • google.com: Always succeeds, first invocation returns both an ipv6 address and an ipv4 address, subsequent invocations only return the ipv6 address.
  • codeberg.org: Same as google.com.
  • github.com: Same as ghcr.io.
  • mastodon.com: Same as ghcr.io, but returns three ipv4 addresses.
  • ipv6.vm3.test-ipv6.com: Always succeeds, returns one ipv6 address.

It may have something to do with ghcr.io - but I'm not sure what. Maybe ipv6 vs ipv4?

We may need to see a tcpdump showing the packets for a failed DNS lookup.

I'm happy to provide one if you provide commands/steps on how to get one.

@ianlancetaylor
Contributor

It should work to run sudo tcpdump port 53 and then run your program in a different terminal. Ideally the system should be as quiescent as possible, so that the capture contains only the DNS queries sent by your program.

@danvolchek
Author

danvolchek commented Jun 11, 2024

Thanks. When using tcpdump, the pattern changes slightly to success, success, failure, <repeats alternating success/failure>. The output from the first three runs (line breaks mine):

$ sudo tcpdump port 53
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:47:25.612532 IP 192.168.0.51.58864 > modem.Home.domain: 63256+ [1au] AAAA? ghcr.io. (36)
17:47:25.612551 IP 192.168.0.51.46510 > modem.Home.domain: 56146+ [1au] A? ghcr.io. (36)
17:47:25.619337 IP 192.168.0.51.42697 > modem.Home.domain: 36976+ PTR? 51.0.168.192.in-addr.arpa. (43)
17:47:25.620131 IP modem.Home.domain > 192.168.0.51.58864: 63256 0/1/1 (110)
17:47:25.620276 IP modem.Home.domain > 192.168.0.51.46510: 56146 1/0/1 A 140.82.116.34 (52)
17:47:25.626717 IP modem.Home.domain > 192.168.0.51.42697: 36976 NXDomain* 0/1/0 (138)
17:47:25.626837 IP 192.168.0.51.42907 > modem.Home.domain: 13907+ PTR? 1.0.168.192.in-addr.arpa. (42)
17:47:25.628333 IP modem.Home.domain > 192.168.0.51.42907: 13907- 1/0/0 PTR modem.Home. (66)

17:47:35.207135 IP 192.168.0.51.51905 > modem.Home.domain: 18081+ [1au] AAAA? ghcr.io. (36)
17:47:35.207271 IP 192.168.0.51.53423 > modem.Home.domain: 52191+ [1au] A? ghcr.io. (36)
17:47:35.215818 IP modem.Home.domain > 192.168.0.51.51905: 18081 0/1/1 (110)
17:47:35.216178 IP modem.Home.domain > 192.168.0.51.53423: 52191 1/0/1 A 140.82.116.34 (52)

17:47:42.132049 IP 192.168.0.51.49507 > modem.Home.domain: 19488+ [1au] AAAA? ghcr.io. (36)
17:47:42.132069 IP 192.168.0.51.47441 > modem.Home.domain: 49353+ [1au] A? ghcr.io. (36)
17:47:42.136263 IP modem.Home.domain > 192.168.0.51.47441: 49353- 1/0/1 OPT UDPsize=1232 (52)
17:47:42.138403 IP modem.Home.domain > 192.168.0.51.49507: 19488 0/1/1 (110)
17:47:42.138506 IP 192.168.0.51.45989 > modem.Home.domain: 35307+ [1au] A? ghcr.io.Home. (41)
17:47:42.138514 IP 192.168.0.51.45095 > modem.Home.domain: 46052+ [1au] AAAA? ghcr.io.Home. (41)
17:47:42.144408 IP modem.Home.domain > 192.168.0.51.45989: 35307 NXDomain 0/1/1 (116)
17:47:42.144520 IP modem.Home.domain > 192.168.0.51.45095: 46052 NXDomain 0/1/1 (116)

^C
20 packets captured
20 packets received by filter
0 packets dropped by kernel

@ianlancetaylor
Contributor

Thanks. The presence of ghcr.io.Home. in the last group suggests that this has something to do with the search Home command in your resolv.conf file. The search list will be applied to any domain with fewer than 1 dot, which would be why ipv6.vm3.test-ipv6.com always succeeds.
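
The order of queries in the capture is consistent with the resolv.conf(5) rule that a name containing at least ndots dots (1 by default) is tried as an absolute name first, with the search list used only when that attempt fails. A minimal sketch of that ordering follows; candidateNames is a hypothetical helper written for illustration, not the net package's own code.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames returns the fully qualified names a stub resolver would try
// for name, in order, given a resolv.conf search list and ndots value.
// Hypothetical helper: it follows the rule in resolv.conf(5) and is only a
// sketch of the idea.
func candidateNames(name string, search []string, ndots int) []string {
	// A name that already ends in a dot is absolute: the search list is skipped.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	absolute := name + "."
	hasNdots := strings.Count(name, ".") >= ndots

	var names []string
	if hasNdots {
		// Enough dots: try the name as-is before any search suffix.
		names = append(names, absolute)
	}
	for _, suffix := range search {
		names = append(names, absolute+strings.TrimSuffix(suffix, ".")+".")
	}
	if !hasNdots {
		// Too few dots: the name as-is is the last resort.
		names = append(names, absolute)
	}
	return names
}

func main() {
	fmt.Println(candidateNames("ghcr.io", []string{"Home"}, 1))
	fmt.Println(candidateNames("modem", []string{"Home"}, 1))
}
```

With search Home and the default ndots, ghcr.io is looked up as ghcr.io. first and as ghcr.io.Home. only afterwards, which matches the third group of packets.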

Thanks. The success case replies to the A address query with

17:47:25.620276 IP modem.Home.domain > 192.168.0.51.46510: 56146 1/0/1 A 140.82.116.34 (52)

The single failure case replies with

17:47:42.136263 IP modem.Home.domain > 192.168.0.51.47441: 49353- 1/0/1 OPT UDPsize=1232 (52)

I don't know why these would be different. I don't know why the second case is only listing the additional record and is not listing the actual answer. This seems like a problem with your DNS server, but I am not an expert.

It might be interesting to try temporarily reverting https://go.dev/cl/386016 to see if that makes any difference. I don't know why it would, but I also don't know what is going on here.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 11, 2024
@danvolchek
Author

danvolchek commented Jun 11, 2024

It might be interesting to try temporarily reverting https://go.dev/cl/386016 to see if that makes any difference. I don't know why it would, but I also don't know what is going on here.

Reverting that change does make it succeed! I don't get any failures for any of the previously failing domains.

@ruyi789

ruyi789 commented Jun 11, 2024

I'm running this with no problem, you can try

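		// Hand-encoded A query for ghcr.io (no EDNS0 OPT record).
		data := []byte{0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 103, 104, 99, 114, 2, 105, 111, 0, 0, 1, 0, 1}
		b := make([]byte, 1024)
		c, err := net.Dial("udp", "8.8.8.8:53")
		if err != nil {
			panic(err) // avoid using a nil connection below
		}
		defer c.Close()
		c.Write(data)
		n, err := c.Read(b)
		fmt.Println(n, err, b[:n])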
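
The byte slice above is a 25-byte A query for ghcr.io with no EDNS0 OPT record, whereas the queries in the tcpdump capture are 36 bytes and marked [1au] (one additional record). A minimal sketch of where the extra 11 bytes come from is below. buildQuery is a hypothetical helper, and 1232 is taken from the UDPsize value shown in the capture.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	typeA   = 1
	typeOPT = 41
	classIN = 1
)

// buildQuery encodes a DNS query for an A record. When udpSize is non-zero,
// an EDNS0 OPT pseudo-record advertising that UDP payload size is appended
// to the additional section. Hypothetical helper for illustration.
func buildQuery(id uint16, name string, udpSize uint16) []byte {
	// 12-byte header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
	msg := []byte{byte(id >> 8), byte(id), 0x01, 0x00, 0, 1, 0, 0, 0, 0, 0, 0}
	put16 := func(v uint16) { msg = append(msg, byte(v>>8), byte(v)) }

	// Question: length-prefixed labels, a zero byte, then type and class.
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)
	put16(typeA)
	put16(classIN)

	if udpSize != 0 {
		msg[11] = 1                   // ARCOUNT=1
		msg = append(msg, 0)          // owner name: root
		put16(typeOPT)                // TYPE
		put16(udpSize)                // CLASS field carries the UDP payload size
		msg = append(msg, 0, 0, 0, 0) // TTL field: extended RCODE, version, flags
		put16(0)                      // RDLENGTH: no options
	}
	return msg
}

func main() {
	plain := buildQuery(0, "ghcr.io", 0)
	edns := buildQuery(0, "ghcr.io", 1232)
	fmt.Println(len(plain), len(edns))
}
```

With ID 0 and udpSize 0 the result is byte-for-byte the slice in the snippet above, so a test against a resolver that is meant to reproduce this issue would need the EDNS0 variant.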

@mateusz834
Member

@danvolchek Can you run dig @modem.Home.domain ghcr.io +qr multiple times and see what happens?

@ianlancetaylor
Contributor

@gopherbot Please open backport issues.

The Go DNS resolver doesn't work with at least one DNS server that apparently does not handle EDNS0 additional headers correctly. The fix is to add a new GODEBUG setting. Requesting a backport to add the GODEBUG setting to earlier releases.

@gopherbot
Contributor

Change https://go.dev/cl/591995 mentions this issue: net: add GODEBUG=netedns0=0 to disable sending EDNS0 header

@gopherbot
Contributor

Backport issue(s) opened: #67933 (for 1.21), #67934 (for 1.22).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@ianlancetaylor
Contributor

I'm sending a change that adds a new GODEBUG=netedns0=0 setting that can be used to disable sending EDNS0 headers. That can be used to avoid this problem.
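
GODEBUG holds a comma-separated list of key=value settings, so netedns0=0 can be combined with the netdns=2 setting used earlier in this issue. A minimal sketch of merging it into an existing value is below. withSetting is a hypothetical helper, and the setting only has an effect on a Go release that includes the change mentioned above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// withSetting returns a GODEBUG value equal to current with key set to value.
// Any existing entry for key is replaced; other entries keep their order.
// Hypothetical helper for illustration.
func withSetting(current, key, value string) string {
	var kept []string
	for _, entry := range strings.Split(current, ",") {
		if entry == "" || strings.HasPrefix(entry, key+"=") {
			continue
		}
		kept = append(kept, entry)
	}
	return strings.Join(append(kept, key+"="+value), ",")
}

func main() {
	// Print the value to export before starting the affected program.
	fmt.Println("GODEBUG=" + withSetting(os.Getenv("GODEBUG"), "netedns0", "0"))
}
```

For the reproduction in this issue, that amounts to running the program with GODEBUG=netdns=2,netedns0=0.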

@mateusz834
Member

mateusz834 commented Jun 11, 2024

@ianlancetaylor Maybe we should support the edns0 option instead, this is probably the reason why the cgo resolver works. #13279 (comment)

@ianlancetaylor
Contributor

We do recognize and ignore the edns0 option in resolv.conf. The docs say that using that option enables EDNS0. Is there a way to use the option to disable EDNS0?

@mateusz834
Member

mateusz834 commented Jun 11, 2024

@ianlancetaylor the cgo resolver (glibc) does not send the EDNS0 header when the edns0 option is not present.

@mateusz834
Member

mateusz834 commented Jun 11, 2024

If we go with the GODEBUG approach then to disable EDNS0 properly you need to make sure that edns0 is removed from the resolv.conf and also that the GODEBUG is set for the go resolver.
We do not always need to advertise support for EDNS0. The primary reason for introducing EDNS0 in issue #51127 was that WSL2 was sending packets larger than 512 bytes, ignoring the DNS-defined limits. However, if we remove the EDNS0 header, WSL2 will still function properly because we will accept packets up to ~1200 bytes, regardless of the advertised EDNS0. So I believe it is fine to make EDNS0 opt-in with the edns0 option.

@ianlancetaylor
Contributor

My feeling is that EDNS0 is always better. We want to use it by default. And in fact this issue is the first problem reported with it.

The GODEBUG setting only needs to affect the Go resolver. Yes, the very few people who have this problem will need to make sure that they don't add edns0 to their resolv.conf. I think that is OK.

@danvolchek
Author

@mateusz834

Can you run dig @modem.Home.domain ghcr.io +qr multiple times and see what happens?

$ dig @modem.Home.domain ghcr.io +qr
dig: couldn't get address for 'modem.Home.domain': not found

$ dig @modem.Home.domain ghcr.io +qr
dig: couldn't get address for 'modem.Home.domain': not found

<repeats>

@mateusz834
Member

mateusz834 commented Jun 11, 2024

Oh, sorry, can you replace modem.Home.domain with the IP address of the broken DNS server (192.168.0.1)?

dig @192.168.0.1 ghcr.io +qr

@danvolchek
Author

@mateusz834

Maybe we should support the edns0 option instead, this is probably the reason why the cgo resolver works. #13279 (comment)

I can confirm that when I add options edns0 to my /etc/resolv.conf, the cgo resolver fails in a similar fashion to the native go resolver.

@danvolchek
Author

danvolchek commented Jun 11, 2024

Oh, sorry, can you replace modem.Home.domain with the IP address of the broken DNS server (192.168.0.1)?

dig @192.168.0.1 ghcr.io +qr

I see two different outputs:

$ dig @192.168.0.1 ghcr.io +qr
; <<>> DiG 9.18.24-1-Debian <<>> @192.168.0.1 ghcr.io +qr
; (1 server found)
;; global options: +cmd
;; Sending:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25076
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 666dcdcf8d70e2ae
;; QUESTION SECTION:
;ghcr.io.			IN	A

;; QUERY SIZE: 48

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25076
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 666dcdcf8d70e2ae010000006668b07f82295db2bc36c7b1 (good)
;; QUESTION SECTION:
;ghcr.io.			IN	A

;; ANSWER SECTION:
ghcr.io.		22	IN	A	140.82.116.33

;; Query time: 4 msec
;; SERVER: 192.168.0.1#53(192.168.0.1) (UDP)
;; WHEN: Tue Jun 11 13:15:59 PDT 2024
;; MSG SIZE  rcvd: 80

and

$ dig @192.168.0.1 ghcr.io +qr
; <<>> DiG 9.18.24-1-Debian <<>> @192.168.0.1 ghcr.io +qr
; (1 server found)
;; global options: +cmd
;; Sending:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49437
;; flags: rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: ab708726596b5ab4
;; QUESTION SECTION:
;ghcr.io.			IN	A

;; QUERY SIZE: 48

;; Warning: Message parser reports malformed message packet.
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49437
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;ghcr.io.			IN	A

;; ANSWER SECTION:
.			0	CLASS1232 OPT	10 8 q3CHJllrWrQ=

;; ADDITIONAL SECTION:
ghcr.io.		20	IN	A	140.82.116.33

;; Query time: 0 msec
;; SERVER: 192.168.0.1#53(192.168.0.1) (UDP)
;; WHEN: Tue Jun 11 13:16:00 PDT 2024
;; MSG SIZE  rcvd: 64
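
The second reply above is the one dig flags as malformed: the OPT pseudo-record appears in the ANSWER section and the A record in the ADDITIONAL section, the reverse of where they belong. A minimal sketch of a check for that condition on raw response bytes is below. optInAnswer, skipName and sampleResponse are hypothetical names, and the sample messages are hand-built for illustration, not bytes captured from the modem.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const typeOPT = 41 // EDNS0 pseudo-record type

var errShort = errors.New("dns: message too short")

// skipName returns the offset just past the domain name that starts at off.
func skipName(msg []byte, off int) (int, error) {
	for off < len(msg) {
		n := int(msg[off])
		switch {
		case n == 0: // the root label ends the name
			return off + 1, nil
		case n&0xC0 == 0xC0: // a two-byte compression pointer also ends it
			return off + 2, nil
		}
		off += 1 + n // ordinary label: length byte plus n bytes
	}
	return 0, errShort
}

// optInAnswer reports whether any record in the answer section has type OPT.
func optInAnswer(msg []byte) (bool, error) {
	if len(msg) < 12 {
		return false, errShort
	}
	qdcount := int(binary.BigEndian.Uint16(msg[4:]))
	ancount := int(binary.BigEndian.Uint16(msg[6:]))
	off := 12
	var err error
	for i := 0; i < qdcount; i++ {
		if off, err = skipName(msg, off); err != nil {
			return false, err
		}
		off += 4 // QTYPE and QCLASS
	}
	for i := 0; i < ancount; i++ {
		if off, err = skipName(msg, off); err != nil {
			return false, err
		}
		if off+10 > len(msg) { // TYPE, CLASS, TTL, RDLENGTH
			return false, errShort
		}
		if binary.BigEndian.Uint16(msg[off:]) == typeOPT {
			return true, nil
		}
		off += 10 + int(binary.BigEndian.Uint16(msg[off+8:]))
	}
	return false, nil
}

// sampleResponse builds a hand-made response to an A query for ghcr.io whose
// single answer record is either an OPT record or an ordinary A record.
func sampleResponse(optAsAnswer bool) []byte {
	msg := []byte{
		0, 0, 0x81, 0x80, // ID, flags
		0, 1, 0, 1, 0, 0, 0, 0, // QDCOUNT=1, ANCOUNT=1, NSCOUNT=0, ARCOUNT=0
		4, 'g', 'h', 'c', 'r', 2, 'i', 'o', 0, 0, 1, 0, 1, // question: ghcr.io A IN
	}
	if optAsAnswer {
		// root name, TYPE=OPT, CLASS=1232 (UDP size), TTL=0, RDLENGTH=0
		return append(msg, 0, 0, typeOPT, 0x04, 0xD0, 0, 0, 0, 0, 0, 0)
	}
	// pointer to the question name, TYPE=A, CLASS=IN, TTL=22, RDLENGTH=4, address
	return append(msg, 0xC0, 12, 0, 1, 0, 1, 0, 0, 0, 22, 0, 4, 140, 82, 116, 33)
}

func main() {
	for _, bad := range []bool{false, true} {
		got, err := optInAnswer(sampleResponse(bad))
		fmt.Println(got, err)
	}
}
```

A resolver that reads only the answer section for A records would find none in the malformed reply, which would fit the "no such host" result reported at the top of this issue.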

@danvolchek
Author

More context on my modem: it's running the latest firmware version, which is from 2020 as far as I can tell.

@ruyi789

ruyi789 commented Jun 12, 2024

Obviously his DNS servers are contaminated. You can't help him.

@prattmic prattmic added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 12, 2024
@prattmic prattmic added this to the Backlog milestone Jun 12, 2024
@dmitshur dmitshur modified the milestones: Backlog, Go1.23 Jun 12, 2024
@dmitshur dmitshur added the FixPending Issues that have a fix which has not yet been reviewed or submitted. label Jun 12, 2024
@gopherbot
Contributor

Change https://go.dev/cl/592235 mentions this issue: [release-branch.go1.21] net: add GODEBUG=netedns0=0 to disable sending EDNS0 header

@gopherbot
Contributor

Change https://go.dev/cl/592217 mentions this issue: [release-branch.go1.22] net: add GODEBUG=netedns0=0 to disable sending EDNS0 header

gopherbot pushed a commit that referenced this issue Jun 12, 2024
…g EDNS0 header

It reportedly breaks the DNS server on some modems.

For #6464
For #21160
For #44135
For #51127
For #51153
For #67925
Fixes #67933

Change-Id: I54a11906159f00246d08a54cc8be7327e9ebfd2c
Reviewed-on: https://go-review.googlesource.com/c/go/+/591995
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit ee4a42b)
Reviewed-on: https://go-review.googlesource.com/c/go/+/592235
Commit-Queue: Ian Lance Taylor <iant@google.com>
gopherbot pushed a commit that referenced this issue Jun 12, 2024
…g EDNS0 header

It reportedly breaks the DNS server on some modems.

For #6464
For #21160
For #44135
For #51127
For #51153
For #67925
Fixes #67934

Change-Id: I54a11906159f00246d08a54cc8be7327e9ebfd2c
Reviewed-on: https://go-review.googlesource.com/c/go/+/591995
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit ee4a42b)
Reviewed-on: https://go-review.googlesource.com/c/go/+/592217
TryBot-Bypass: Ian Lance Taylor <iant@golang.org>
Commit-Queue: Ian Lance Taylor <iant@google.com>
8 participants