net: netgo dns resolver latency / timeout issues #22406
Comments
CC @mdempsky
To clarify, you mean using tcpdump you see both the A and AAAA queries (or A and second A queries, in the case of your last experiment) sent around the same time, and then not seeing one or both responses for a while (if ever)? If tcpdump shows the queries sent okay, and no responses, I would think something's wrong outside of Go. If you're able to share a tcpdump capture, that would be helpful in confirming this hypothesis.
Yes, I can see the queries. Here's some captured data while playing with the query types line:

1st test: Type A and AAAA queries (file TypeAandAAAA.pcap). The example here is the A query for google.com: only the query from packet 14 got a response. The other one timed out after 5s, which delays the whole response for that lookup.

2nd test: double Type A queries (file doubleAType.pcap). The example here is the A query for akamai.com: only the query from packet 13 got a response. The other one again timed out after 5s.

3rd test: only Type A queries (file onlyTypeA.pcap). Left this running for about 10 minutes to show that there are no issues even over a longer period.

4th test: only Type AAAA queries (file onlyTypeAAAA.pcap). Also left this running for about 10 minutes, again with no issues.

I agree with your hypothesis, but I still wonder why tests 3 and 4 don't fail, and why I can't reproduce this on a VM with the cgo resolver.
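For anyone re-running these tests, here is one way to flag unanswered queries in such a capture programmatically. This is a sketch under assumptions: it uses github.com/google/gopacket (any pcap reader would do) and the TypeAandAAAA.pcap file named above, and it pairs packets by DNS transaction ID only.

```go
package main

// Sketch: pair DNS queries with their responses in a capture file by
// transaction ID and report the ones that never got an answer.

import (
	"fmt"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenOffline("TypeAandAAAA.pcap") // file name from the tests above
	if err != nil {
		panic(err)
	}
	defer handle.Close()

	pending := map[uint16]string{} // DNS transaction ID -> question
	source := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range source.Packets() {
		layer := packet.Layer(layers.LayerTypeDNS)
		if layer == nil {
			continue
		}
		dns := layer.(*layers.DNS)
		if dns.QR { // response: resolves the matching query
			delete(pending, dns.ID)
		} else if len(dns.Questions) > 0 { // query: remember it
			q := dns.Questions[0]
			pending[dns.ID] = fmt.Sprintf("%s %s", q.Name, q.Type)
		}
	}
	for id, q := range pending {
		fmt.Printf("unanswered query id=%d: %s\n", id, q)
	}
}
```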
Was testing a bit more and glanced through the glibc implementation that is used with the cgo resolver. It looks like glibc does no parallelism, and it was also a bit slower in our testing. Just to be sure, I eliminated the parallelism in the netgo approach, which didn't change my results. So I went back to the stdlib code and added a bit of delay before the second query by changing this snippet (src/net/dnsclient_unix.go, lines 481 to 482 at 93322a5) into:

```go
for i, qtype := range qtypes {
	if i > 0 {
		time.Sleep(50 * time.Millisecond) // stagger the second query
	}
	go func(qtype uint16, i int) {
		// ... unchanged query goroutine body ...
	}(qtype, i)
}
```

This didn't solve the problem completely but made the situation far better. I'll also share these findings with Google Support and will report back once I hear anything from them.
Got a reply from Google: they were able to reproduce this and opened an internal bug ticket.
What version of Go are you using (`go version`)?

go version go1.9.1 linux/amd64
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build620621764=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
What did you do?
We're experiencing latency issues with DNS queries in Google Cloud GKE and GCE.
I'm not sure if this is a GCE or Go issue, but I can only reproduce it on GKE and GCE when using a Go program that uses the netgo dns resolver.
About 20% of those DNS calls take 5s or longer.
Here's a program that we've used to reproduce the problem reliably:
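A minimal sketch of such a reproducer, assuming it simply times repeated lookups through the pure-Go resolver (this is illustrative, not the exact program from the report). PreferGo forces the netgo path even in a cgo-enabled build; running an unmodified binary with GODEBUG=netdns=go achieves the same.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumption: the original reproducer looked roughly like this.
	resolver := &net.Resolver{PreferGo: true} // force the pure-Go (netgo) resolver
	for i := 0; i < 100; i++ {
		start := time.Now()
		_, err := resolver.LookupHost(context.Background(), "google.com")
		// Report any lookup that is slow or fails outright.
		if elapsed := time.Since(start); err != nil || elapsed > 200*time.Millisecond {
			fmt.Printf("lookup %d: took %v, err=%v\n", i, elapsed, err)
		}
	}
}
```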
What did you expect to see?
All DNS responses within 200ms or less.
What did you see instead?
20% take longer than 5s.
I was able to trace this down a bit by adding some time measurements and print statements to the stdlib. I could verify that the long block happens here until this operation finally times out:

go/src/net/dnsclient_unix.go, line 57 at 93322a5

I verified with tcpdump that the requests were sent but the answers were never received, which points more in the direction of GCE.
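As a hedged aside, similar per-exchange timings can be collected without patching the stdlib by hooking net.Resolver.Dial, which is available since Go 1.9. timedConn below is an illustrative name, not a stdlib type.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// timedConn embeds *net.UDPConn so it still satisfies net.PacketConn;
// otherwise the resolver would fall back to TCP-style framing.
type timedConn struct {
	*net.UDPConn
	start time.Time
}

// Read times how long the resolver blocks waiting for the DNS answer.
func (c *timedConn) Read(b []byte) (int, error) {
	n, err := c.UDPConn.Read(b)
	fmt.Printf("DNS answer after %v (err=%v)\n", time.Since(c.start), err)
	return n, err
}

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			var d net.Dialer
			conn, err := d.DialContext(ctx, network, address)
			if err != nil {
				return nil, err
			}
			if u, ok := conn.(*net.UDPConn); ok {
				return &timedConn{UDPConn: u, start: time.Now()}, nil
			}
			return conn, nil // leave TCP fallback connections untimed
		},
	}
	if _, err := r.LookupHost(context.Background(), "google.com"); err != nil {
		fmt.Println("lookup error:", err)
	}
}
```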
But since a lot of those unanswered requests were AAAA queries, I changed this line so that only an A query is sent, which solved all latency and timeout issues. I was assuming some IPv6 DNS issue, and to countercheck I changed the line again so that two A queries are sent. I expected to have no issues, since that just does two A queries instead of an A and an AAAA. But interestingly the latency problems were back, and that's why I'm reporting it here, too.
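For reference, here is the query-type line in question together with the two experimental variants. This is reconstructed, so treat the exact spelling as approximate; dnsTypeA and dnsTypeAAAA are the stdlib's internal constants around 93322a5.

```go
// Presumed original in src/net/dnsclient_unix.go:
qtypes := [...]uint16{dnsTypeA, dnsTypeAAAA}

// Experiment 1: only an A query; all latency/timeout issues disappeared.
qtypes := [...]uint16{dnsTypeA}

// Experiment 2: two parallel A queries; the latency problems came back.
qtypes := [...]uint16{dnsTypeA, dnsTypeA}
```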
Since those two queries are made in parallel, could there be some race or lock further down the stack when doing the UDP sends and receives?
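A standalone way to probe that hypothesis outside the resolver is sketched below. Assumptions: the server address and query name are placeholders to adjust for the environment under test, and the hand-rolled DNS packet covers only the simple case of a single question with no EDNS.

```go
package main

// Sketch, not the stdlib code: send an A and an AAAA query for the same
// name at the same time over two UDP sockets, the way the netgo resolver
// does, and report how long each answer takes.

import (
	"encoding/binary"
	"fmt"
	"net"
	"strings"
	"sync"
	"time"
)

// buildQuery assembles a minimal DNS query packet (header + one question).
func buildQuery(id uint16, name string, qtype uint16) []byte {
	msg := make([]byte, 12)
	binary.BigEndian.PutUint16(msg[0:2], id)
	binary.BigEndian.PutUint16(msg[2:4], 0x0100) // RD bit: recursion desired
	binary.BigEndian.PutUint16(msg[4:6], 1)      // QDCOUNT: one question
	for _, label := range strings.Split(name, ".") {
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0) // root label terminates the name
	q := make([]byte, 4)
	binary.BigEndian.PutUint16(q[0:2], qtype)
	binary.BigEndian.PutUint16(q[2:4], 1) // QCLASS: IN
	return append(msg, q...)
}

func main() {
	const server = "8.8.8.8:53" // assumption: replace with the resolver from /etc/resolv.conf
	queries := map[string]uint16{"A": 1, "AAAA": 28}
	var wg sync.WaitGroup
	for qname, qtype := range queries {
		wg.Add(1)
		go func(qname string, qtype uint16) {
			defer wg.Done()
			conn, err := net.Dial("udp", server)
			if err != nil {
				fmt.Printf("%s dial: %v\n", qname, err)
				return
			}
			defer conn.Close()
			start := time.Now()
			conn.SetDeadline(start.Add(5 * time.Second))
			if _, err := conn.Write(buildQuery(1, "google.com", qtype)); err != nil {
				fmt.Printf("%s write: %v\n", qname, err)
				return
			}
			buf := make([]byte, 512)
			if _, err := conn.Read(buf); err != nil {
				fmt.Printf("%s read: %v\n", qname, err) // a dropped answer shows up as a timeout here
				return
			}
			fmt.Printf("%s answered in %v\n", qname, time.Since(start))
		}(qname, qtype)
	}
	wg.Wait()
}
```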