
net: race where Dialer.DialContext returns connections with altered write deadlines #16523

Closed
dsnet opened this issue Jul 28, 2016 · 6 comments

@dsnet
Member

commented Jul 28, 2016

Consider the following test, which starts a serial chain of reverse proxies and then slams the chain with many concurrent requests.

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httputil"
    "net/url"
    "os"
    "sync"
    "sync/atomic"
    "time"
)

const (
    // Adjust these two values to control how often the i/o error occurs.
    // numConcurrentRequests seems to have the greater effect on my system.
    numProxyChain         = 20
    numConcurrentRequests = 20

    port    = 8080
    topPort = port + numProxyChain
)

func main() {
    // Start a chain of reverse proxies.
    go http.ListenAndServe(fmt.Sprintf("0.0.0.0:%d", port), http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if _, err := w.Write([]byte("the quick brown fox jumped over the lazy dog")); err != nil {
            panic(err)
        }
    }))
    for i := port + 1; i <= topPort; i++ {
        url, _ := url.Parse(fmt.Sprintf("http://localhost:%d", i-1))
        rp := httputil.NewSingleHostReverseProxy(url)
        rp.Transport = &http.Transport{
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // This is crucial. Don't know why!
        }
        go http.ListenAndServe(fmt.Sprintf("0.0.0.0:%d", i), rp)
    }

    var fail, total uint64
    go func() {
        // Stall until servers are ready.
        for {
            if _, err := http.Get(fmt.Sprintf("http://localhost:%d", topPort)); err == nil {
                break
            }
        }

        // Slam the servers!
        for {
            var wg sync.WaitGroup
            for i := 0; i < numConcurrentRequests; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    atomic.AddUint64(&total, 1)
                    now := time.Now()
                    resp, err := http.Get(fmt.Sprintf("http://localhost:%d", topPort))
                    d := time.Since(now)
                    if err != nil {
                        atomic.AddUint64(&fail, 1)
                        fmt.Println(err, d)
                        return
                    }
                    if resp.Status != "200 OK" {
                        atomic.AddUint64(&fail, 1)
                        fmt.Println(resp.Status, d)
                    }
                    resp.Body.Close()
                }()
            }
            wg.Wait()
        }
    }()

    time.Sleep(5 * time.Second)
    nf, nt := atomic.LoadUint64(&fail), atomic.LoadUint64(&total)
    fmt.Printf("Failure rate %0.2f%% (%d of %d)\n", float64(nf)/float64(nt)*100, nf, nt)
    if nf > 0 {
        os.Exit(1)
    }
}

Running this test on go1.7rc3 produces several failed GET requests:

...
2016/07/27 22:41:32 http: proxy error: write tcp 127.0.0.1:54671->127.0.0.1:8080: i/o timeout
502 Bad Gateway 41.756424ms
2016/07/27 22:41:33 http: proxy error: write tcp 127.0.0.1:58984->127.0.0.1:8081: i/o timeout
502 Bad Gateway 29.08107ms
2016/07/27 22:41:33 http: proxy error: write tcp 127.0.0.1:59777->127.0.0.1:8080: i/o timeout
502 Bad Gateway 35.419027ms
Failure rate 0.70% (19 of 2720)

While running this test on go1.6.2 produces no failed GET requests:

Failure rate 0.00% (0 of 2920)

I don't believe it is proper behavior to fail with an i/o timeout error after only ~40ms of real time. Something is causing the connection to time out too early. Git bisect points to 1518d43 as the source of this issue.

This is a regression from go1.6.2.

/CC @bradfitz @broady @adg

@dsnet dsnet added this to the Go1.7Maybe milestone Jul 28, 2016

@bradfitz bradfitz self-assigned this Jul 28, 2016

@bradfitz

Member

commented Jul 28, 2016

@dsnet, can you try https://go-review.googlesource.com/25330? This is pretty impossible for me to reproduce on my Mac laptop. I have seen the failure sometimes, but its reproducibility comes and goes from build to build, seemingly for no reason.

I think that's the fix, though. I'll try on Linux later and write some more specific tests.

/cc @pmarks-net who touched that code last and might be amused. (Not your fault! You fixed my original buggy code. It was just too subtle.)

@gopherbot


commented Jul 28, 2016

CL https://golang.org/cl/25330 mentions this issue.

@bradfitz

Member

commented Aug 1, 2016

I spent more time debugging this.

There is a bad interaction between net/http.Transport's "socket late binding" implementation (https://insouciant.org/tech/connection-management-in-chromium/), net/http.Server's per-Request Contexts (which are canceled at the end of ServeHTTP), and net.Dialer.DialContext, whose dial is being canceled via the net/http.Server's per-Request context cancelFunc. (I locally modified the context package to track the stack of whoever canceled a context.)

A "return nil" at the top of net/http/transport.go:getIdleConnCh "fixes" it, but that's not a real fix. I need to understand what's actually happening first.

/cc @dpiddy (thanks for the 5ms sleep debugging hint)

@bradfitz

Member

commented Aug 1, 2016

@pmarks-net

Contributor

commented Aug 1, 2016

Your latest patch looks similar to my suggestion "2) Before returning success, connect() must verify that cancelation has been aborted. If cancelation won the race, it should close the connection and return a failure instead."

But do I understand correctly that this does not actually "fix" any connections, just makes the error messages more accurate?

@bradfitz

Member

commented Aug 1, 2016

@pmarks-net, yes, it doesn't fix any connections because there's no runtime netpoller API to get the old write timeout value, so it's not possible to restore it. Instead, this just returns an error (instead of returning nil) on connect, so the otherwise-poisoned connection isn't used by the caller. (In this case, the http.Transport, which was getting write errors writing the HTTP request on the connection with its write deadline set way in the past.)

@gopherbot gopherbot closed this in 28ee179 Aug 2, 2016

@bradfitz bradfitz changed the title net/http: overly aggresive timeout errors on heavily loaded server net: race where Dialer.DialContext returns connections with altered write deadlines Aug 2, 2016

@golang golang locked and limited conversation to collaborators Aug 2, 2017
