
net: netgo dns resolver latency / timeout issues #22406

Closed
gottwald opened this issue Oct 23, 2017 · 5 comments

@gottwald

commented Oct 23, 2017

What version of Go are you using (go version)?

go version go1.9.1 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build620621764=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

We're experiencing latency issues with DNS queries on Google Cloud GKE and GCE.
I'm not sure whether this is a GCE issue or a Go issue, but I can only reproduce it on GKE and GCE with a Go program that uses the netgo DNS resolver.
About 20% of those DNS calls take 5s or longer.

Here's a program that we've used to reproduce the problem reliably:

package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"os/signal"
	"sync"
	"time"
)

var hosts = []string{
	"google.de",
	"google.com",
	"facebook.com",
	"twitter.com",
	"github.com",
	"microsoft.com",
	"akamai.com",
	"apple.com",
	"reddit.com",
	"wired.com",
	"mozilla.org",
	"kernel.org",
	"coredns.io",
	"kubernetes.io",
	"docker.com",
	"cockroachlabs.com",
	"news.ycombinator.com",
	"holidaycheck.com",
	"holidaycheck.de",
	"holidaycheck.at",
	"holidaycheck.ch",
	"iocrunch.com",
}

const dnsCallTimeout = 10 * time.Second

type resolveStat struct {
	d        time.Duration
	timedOut bool
}

func probeLoop(ctx context.Context, resultsChan chan resolveStat) {
	var wg sync.WaitGroup
	for _, host := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			stat := resolveStat{}
			ctx, cancel := context.WithTimeout(ctx, dnsCallTimeout)
			defer cancel()
			start := time.Now()
			_, err := net.DefaultResolver.LookupHost(ctx, host)
			stat.d = time.Since(start)
			// A context timeout surfaces as a net.Error with Timeout() == true,
			// not as context.Canceled.
			if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
				stat.timedOut = true
			}
			resultsChan <- stat
		}(host)
	}
	wg.Wait()
}

func handleStats(ctx context.Context, rstat chan resolveStat) {
	var dnsCalls int
	var timeouts int
	var min, max time.Duration
	var below50ms, below500ms, below1s, above1s, above5s int
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()

	for {
		select {
		case stat := <-rstat:
			dnsCalls++
			if stat.timedOut {
				timeouts++
				continue
			}
			if stat.d > max {
				max = stat.d
			}
			if min == 0 || stat.d < min { // min starts at zero, so guard the first sample
				min = stat.d
			}
			switch {
			case stat.d < 50*time.Millisecond:
				below50ms++
			case stat.d < 500*time.Millisecond:
				below500ms++
			case stat.d < 1*time.Second:
				below1s++
			case stat.d >= 5*time.Second:
				above5s++
			case stat.d >= 1*time.Second:
				above1s++
			}
		case <-t.C:
			fmt.Printf("REPORT at %s\n", time.Now())
			fmt.Printf("REPORT: calls: '%d'; timeouts (at %s): '%d'\n", dnsCalls, dnsCallTimeout, timeouts)
			fmt.Printf("REPORT: below 50ms: '%d'; below 500ms: '%d', below 1s: '%d'; above 1s: '%d'; above 5s: '%d'\n",
				below50ms, below500ms, below1s, above1s, above5s)
			fmt.Printf("REPORT: latency min: '%s'; latency max: '%s'\n", min, max)
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	sigChan := make(chan os.Signal, 1) // signal.Notify requires a buffered channel
	signal.Notify(sigChan, os.Interrupt)
	ctx, cancel := context.WithCancel(context.Background())

	resultsChan := make(chan resolveStat)
	go handleStats(ctx, resultsChan)

	t := time.NewTicker(10 * time.Second)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			probeLoop(ctx, resultsChan)
		case <-sigChan:
			cancel()
			return
		}
	}
}

What did you expect to see?

All DNS lookups completing in about 200ms or less.

What did you see instead?

About 20% of the lookups take longer than 5s.

I was able to track this down a bit by adding some time measurements and print statements to the stdlib.
I could verify that the long block happens here, until the read finally times out:

n, err := c.Read(b)

I verified with tcpdump that the requests were sent but the answers were never received, which points more in the direction of GCE.
But since a lot of those unanswered requests were AAAA queries, I changed this line to

qtypes := [...]uint16{dnsTypeA}

which solved all latency and timeout issues.

I suspected some IPv6 DNS issue, so as a counter-check I changed the same line again, into

qtypes := [...]uint16{dnsTypeA, dnsTypeA}

I expected no issues, just two A queries instead of A and AAAA.
But interestingly the latency problems were back, which is why I'm reporting it here, too.
Since those two queries are made in parallel, could there be some race/lock down the stack when doing the udp send and receives?

@ianlancetaylor ianlancetaylor changed the title netgo dns resolver latency / timeout issues net: netgo dns resolver latency / timeout issues Oct 25, 2017


@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Oct 25, 2017

@mdempsky

Member

commented Oct 25, 2017

I verified with tcpdump that the requests were sent but the answers were never received, which points more in the direction of GCE.

To clarify, you mean using tcpdump you see both the A and AAAA queries (or A and second A queries, in the case of your last experiment) sent around the same time, and then not seeing one or both responses for a while (if ever)?

If tcpdump shows the queries sent okay, and no responses, I would think something's wrong outside of Go. If you're able to share a tcpdump capture, that would be helpful in confirming this hypothesis.

@gottwald

Author

commented Oct 25, 2017

Yes, I can see the queries.

Here's some captured data from playing with the query-types line:
tcpdumps.tar.gz
(sorry, I had to tar-gzip it because GitHub refused pcap files)

1st Test: Type A and AAAA queries (file TypeAandAAAA.pcap)
(The default netgo way)

Example here is the A query for google.com.
2 queries were sent:
Packet Nr. 8 Standard query 0x78ed AAAA google.com
Packet Nr. 14 Standard query 0x460d A google.com

Only the query from packet 14 got a response:
Packet Nr. 83 Standard query response 0x460d A google.com A 64.233.166.100 A 64.233.166.101 A 64.233.166.113 A 64.233.166.139 A 64.233.166.102 A 64.233.166.138

The other one timed out after 5s, which causes the whole net.DefaultResolver.LookupHost(ctx, host) call to take at least 5s.

2nd Test: Double Type A queries (file doubleAType.pcap):

Example here is the A query for akamai.com.
2 queries were sent:
Packet Nr. 10 Standard query 0xea06 A akamai.com
Packet Nr. 13 Standard query 0x0ea7 A akamai.com

Only the query from packet 13 got a response:
Packet Nr. 73 Standard query response 0x0ea7 A akamai.com A 23.64.19.38

The other one timed out after 5s again.

3rd Test: Only Type A queries (file onlyTypeA.pcap):

Left this running for about 10 minutes to prove that there are no issues even over a longer period.
All queries were answered. Slowest one was about 91ms. No timeouts.

4th Test: Only Type AAAA queries (file onlyTypeAAAA.pcap):

Also left this running for about 10 minutes.
All queries were answered. Slowest one was about 92ms. No timeouts.

I agree with your hypothesis, but I still wonder why tests 3 & 4 don't fail, and why I can't reproduce this on a VM with the cgo resolver.
btw: I also opened a Google support case for this: Case 14000650
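For anyone else comparing the two resolvers: the choice between the netgo and cgo paths can also be forced at run time via GODEBUG (documented in the net package), without rebuilding with the netgo build tag. Here ./dnsprobe is just a stand-in name for the repro binary above:

```shell
# Force the pure-Go resolver; the "+2" suffix additionally logs which
# resolver was chosen for each lookup.
GODEBUG=netdns=go+2 ./dnsprobe

# Force the cgo (glibc) resolver for comparison.
GODEBUG=netdns=cgo ./dnsprobe
```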

@gottwald

Author

commented Oct 25, 2017

I tested a bit more and glanced through the glibc implementation that is used with the cgo resolver. It looks like glibc does no parallelism, and it was also a bit slower in our tests.

Just to be sure, I eliminated the parallelism in the netgo approach as well, which didn't change my results.

So I went back to the stdlib code and added a bit of delay for the 2nd query by changing this:

for _, qtype := range qtypes {
go func(qtype uint16) {

into

for i, qtype := range qtypes {
	if i > 0 {
		time.Sleep(50 * time.Millisecond)
	}
	go func(qtype uint16, i int) {

This didn't solve the problem completely, but it made the situation far better:
before, about 20% of the requests had problems; now we're at 0.25%.

I'll also share these findings with Google Support and report back once I hear anything from them.

@gottwald

Author

commented Oct 27, 2017

Got a reply from Google: they were able to reproduce this and opened an internal bug ticket.
Closing this here now.

@gottwald gottwald closed this Oct 27, 2017

@golang golang locked and limited conversation to collaborators Oct 27, 2018
