
net: netgo dns resolver latency / timeout issues #22406

Closed
gottwald opened this issue Oct 23, 2017 · 5 comments

@gottwald

commented Oct 23, 2017

What version of Go are you using (go version)?

go version go1.9.1 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build620621764=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

We're experiencing latency issues with DNS queries on Google Cloud GKE and GCE.
I'm not sure whether this is a GCE issue or a Go issue, but I can only reproduce it on GKE and GCE with a Go program that uses the netgo DNS resolver.
About 20% of those DNS calls take 5s or longer.

Here's a program that we've used to reproduce the problem reliably:

package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"os/signal"
	"sync"
	"time"
)

var hosts = []string{
	"google.de",
	"google.com",
	"facebook.com",
	"twitter.com",
	"github.com",
	"microsoft.com",
	"akamai.com",
	"apple.com",
	"reddit.com",
	"wired.com",
	"mozilla.org",
	"kernel.org",
	"coredns.io",
	"kubernetes.io",
	"docker.com",
	"cockroachlabs.com",
	"news.ycombinator.com",
	"holidaycheck.com",
	"holidaycheck.de",
	"holidaycheck.at",
	"holidaycheck.ch",
	"iocrunch.com",
}

const dnsCallTimeout = 10 * time.Second

type resolveStat struct {
	d        time.Duration
	timedOut bool
}

func probeLoop(ctx context.Context, resultsChan chan resolveStat) {
	var wg sync.WaitGroup
	for _, host := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			stat := resolveStat{}
			ctx, cancel := context.WithTimeout(ctx, dnsCallTimeout)
			defer cancel()
			start := time.Now()
			_, err := net.DefaultResolver.LookupHost(ctx, host)
			stat.d = time.Since(start)
			// A context timeout surfaces as a net.Error with Timeout() == true,
			// not as context.Canceled.
			if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
				stat.timedOut = true
			}
			resultsChan <- stat
		}(host)
	}
	wg.Wait()
}

func handleStats(ctx context.Context, rstat chan resolveStat) {
	var dnsCalls int
	var timeouts int
	var min, max time.Duration
	var below50ms, below500ms, below1s, above1s, above5s int
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()

	for {
		select {
		case stat := <-rstat:
			dnsCalls++
			if stat.timedOut {
				timeouts++
				continue
			}
			if stat.d > max {
				max = stat.d
			}
			if min == 0 || stat.d < min { // min starts at zero, so guard the first sample
				min = stat.d
			}
			switch {
			case stat.d < 50*time.Millisecond:
				below50ms++
			case stat.d < 500*time.Millisecond:
				below500ms++
			case stat.d < 1*time.Second:
				below1s++
			case stat.d >= 5*time.Second:
				above5s++
			case stat.d >= 1*time.Second:
				above1s++
			}
		case <-t.C:
			fmt.Printf("REPORT at %s\n", time.Now())
			fmt.Printf("REPORT: calls: '%d'; timeouts (at %s): '%d'\n", dnsCalls, dnsCallTimeout, timeouts)
			fmt.Printf("REPORT: below 50ms: '%d'; below 500ms: '%d', below 1s: '%d'; above 1s: '%d'; above 5s: '%d'\n",
				below50ms, below500ms, below1s, above1s, above5s)
			fmt.Printf("REPORT: latency min: '%s'; latency max: '%s'\n", min, max)
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	sigChan := make(chan os.Signal, 1) // signal.Notify requires a buffered channel
	signal.Notify(sigChan, os.Interrupt)
	ctx, cancel := context.WithCancel(context.Background())

	resultsChan := make(chan resolveStat)
	go handleStats(ctx, resultsChan)

	t := time.NewTicker(10 * time.Second)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			probeLoop(ctx, resultsChan)
		case <-sigChan:
			cancel()
			return
		}
	}
}

What did you expect to see?

All DNS lookups completing in about 200ms or less.

What did you see instead?

About 20% of the lookups take longer than 5s.

I was able to track this down a bit by adding some time measurements and print statements to the stdlib.
I could verify that the long block happens here, until the read finally times out:

n, err := c.Read(b)

I verified with tcpdump that the requests were sent but the answers were never received, which points more in the direction of GCE.
But since a lot of those unanswered requests were AAAA queries, I changed this line to

qtypes := [...]uint16{dnsTypeA}

which solved all latency and timeout issues.

I suspected some IPv6 DNS issue, so as a counter-check I changed the same line again, into

qtypes := [...]uint16{dnsTypeA, dnsTypeA}

I expected no issues, just two A queries instead of A and AAAA.
But interestingly the latency problems were back, which is why I'm reporting it here, too.
Since those two queries are made in parallel, could there be some race/lock down the stack when doing the udp send and receives?

@ianlancetaylor ianlancetaylor changed the title netgo dns resolver latency / timeout issues net: netgo dns resolver latency / timeout issues Oct 25, 2017


@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Oct 25, 2017

@mdempsky

Member

commented Oct 25, 2017

I verified with tcpdump that the requests were sent but the answers were never received, which points more in the direction of GCE.

To clarify, you mean using tcpdump you see both the A and AAAA queries (or A and second A queries, in the case of your last experiment) sent around the same time, and then not seeing one or both responses for a while (if ever)?

If tcpdump shows the queries sent okay, and no responses, I would think something's wrong outside of Go. If you're able to share a tcpdump capture, that would be helpful in confirming this hypothesis.

@gottwald

Author

commented Oct 25, 2017

Yes, I can see the queries.

Here's some captured data from playing with the query-types line:
tcpdumps.tar.gz
(sorry, I had to tar-gzip it because GitHub refused pcap files)

1st Test: Type A and AAAA queries (file TypeAandAAAA.pcap)
(The default netgo way)

Example here is the A query for google.com.
2 queries were sent:
Packet Nr. 8 Standard query 0x78ed AAAA google.com
Packet Nr. 14 Standard query 0x460d A google.com

Only the query from packet 14 got a response:
Packet Nr. 83 Standard query response 0x460d A google.com A 64.233.166.100 A 64.233.166.101 A 64.233.166.113 A 64.233.166.139 A 64.233.166.102 A 64.233.166.138

The other one timed out after 5s, which causes the whole net.DefaultResolver.LookupHost(ctx, host) call to take at least 5s.

2nd Test: Double Type A queries (file doubleAType.pcap):

Example here is the A query for akamai.com.
2 queries were sent:
Packet Nr. 10 Standard query 0xea06 A akamai.com
Packet Nr. 13 Standard query 0x0ea7 A akamai.com

Only the query from packet 13 got a response:
Packet Nr. 73 Standard query response 0x0ea7 A akamai.com A 23.64.19.38

The other one timed out after 5s again.

3rd Test: Only Type A queries (file onlyTypeA.pcap):

Left this running for about 10 minutes to prove that there are no issues even over a longer period.
All queries were answered. Slowest one was about 91ms. No timeouts.

4th Test: Only Type AAAA queries (file onlyTypeAAAA.pcap):

Also left this running for about 10 minutes.
All queries were answered. Slowest one was about 92ms. No timeouts.

I agree with your hypothesis, but I still wonder why tests 3 & 4 don't fail, and why I can't reproduce this on a VM with the cgo resolver.
btw: I also opened a Google support case for this: Case 14000650
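For anyone else comparing the two resolvers: the choice between the netgo and cgo paths can also be forced at run time via GODEBUG (documented in the net package), without rebuilding with the netgo build tag. Here ./dnsprobe is just a stand-in name for the repro binary above:

```shell
# Force the pure-Go resolver; the "+2" suffix additionally logs which
# resolver was chosen for each lookup.
GODEBUG=netdns=go+2 ./dnsprobe

# Force the cgo (glibc) resolver for comparison.
GODEBUG=netdns=cgo ./dnsprobe
```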

@gottwald

Author

commented Oct 25, 2017

I tested a bit more and glanced through the glibc implementation that is used with the cgo resolver. It looks like glibc does no parallelism, and it was also a bit slower in our tests.

Just to be sure, I eliminated the parallelism in the netgo approach as well, which didn't change my results.

So I went back to the stdlib code and added a bit of delay for the 2nd query by changing this:

for _, qtype := range qtypes {
go func(qtype uint16) {

into

for i, qtype := range qtypes {
	if i > 0 {
		time.Sleep(50 * time.Millisecond)
	}
	go func(qtype uint16, i int) {

This didn't solve the problem completely, but it made the situation far better:
before, about 20% of the requests had problems; now we're at 0.25%.

I'll also share these findings with Google Support and report back once I hear anything from them.

@gottwald

Author

commented Oct 27, 2017

Got a reply from Google: they were able to reproduce this and opened an internal bug ticket.
Closing this here now.

@gottwald gottwald closed this Oct 27, 2017

@golang golang locked and limited conversation to collaborators Oct 27, 2018
