Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: io.Copy to timed-out net.Conn hangs? #4403

Closed
bradfitz opened this issue Nov 16, 2012 · 20 comments

Comments

Projects
None yet
9 participants
@bradfitz
Copy link
Member

commented Nov 16, 2012

on OS X, patching in un-mailed http://golang.org/cl/6851061 which adds this
net/http test:

func TestTransportDustinRace(t *testing.T) {
        mux := NewServeMux()
        mux.HandleFunc("/get", func(w ResponseWriter, r *Request) {
                io.Copy(w, neverEnding('a'))
        })
        mux.HandleFunc("/put", func(w ResponseWriter, r *Request) {
                io.Copy(ioutil.Discard, r.Body)
        })
        ts := httptest.NewServer(mux)
        defer ts.Close()

        client := &Client{
                Transport: &Transport{
                        Dial: func(n, addr string) (net.Conn, error) {
                                conn, err := net.Dial(n, addr)
                                if err != nil {
                                        return nil, err
                                }
                                conn.SetDeadline(time.Now().Add(50 * time.Millisecond))
                                return conn, nil
                        },
                        DisableKeepAlives: true,
                },
        }

        for i := 0; i < 50; i++ {
                println("iter", i)
                sres, err := client.Get(ts.URL + "/get")
                if err != nil {
                        t.Logf("Error issuing GET: %v", err)
                        return
                }
                req, _ := NewRequest("PUT", ts.URL+"/put", sres.Body)
                res, err := client.Do(req)
                println(fmt.Sprintf("iteration %d; res=%v, err=%v", i, res, err))
                t.Logf("iteration %d; res=%v, err=%v", i, res, err)
        }
        println("done")
}

the test hangs after "done", before the testserver deferred Close() call,
because the /get handler's io.Copy (from neverEnding('a') to the ResponseWriter) never
completes, always waiting for writability:

goroutine 219 [chan receive]:
net.(*pollServer).WaitWrite(0xc200088de0, 0xc2002b4cc0, 0xc200088c90, 0x23,
0xffffffffffffff01, ...)
    /Users/bradfitz/go/src/pkg/net/fd_unix.go:251 +0x63
net.(*netFD).Write(0xc2002b4cc0, 0xc200512000, 0x1000, 0x1000, 0x0, ...)
    /Users/bradfitz/go/src/pkg/net/fd_unix.go:536 +0x391
net.(*conn).Write(0xc200473c38, 0xc200512000, 0x1000, 0x1000, 0xffffffff, ...)
    /Users/bradfitz/go/src/pkg/net/net.go:129 +0xd3
bufio.(*Writer).Flush(0xc20050d200, 0xff8, 0xff8, 0xc200514000)
    /Users/bradfitz/go/src/pkg/bufio/bufio.go:456 +0xb9
bufio.(*Writer).Write(0xc20050d200, 0xc200514000, 0x8000, 0x8000, 0x0, ...)
    /Users/bradfitz/go/src/pkg/bufio/bufio.go:492 +0x1ef
net/http.(*response).Write(0xc20050d280, 0xc200514000, 0x8000, 0x8000, 0x8000, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:501 +0x461
net/http.(*writerOnly).Write(0xc20050e270, 0xc200514000, 0x8000, 0x8000, 0x8000, ...)
    /Users/bradfitz/go/src/pkg/net/http/chunked.go:0 +0x65
io.Copy(0xc2000db1e0, 0xc20050e270, 0xc2000db0f0, 0x19961, 0x8c8000, ...)
    /Users/bradfitz/go/src/pkg/io/io.go:367 +0x236
net/http.(*response).ReadFrom(0xc20050d280, 0xc2000db0f0, 0x19961, 0xc2000db120, 0x0,
...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:168 +0x1ae
io.Copy(0xc2000db0c0, 0xc20050d280, 0xc2000db0f0, 0x19961, 0x0, ...)
    /Users/bradfitz/go/src/pkg/io/io.go:357 +0xa2
net/http_test.func·104(0xc2000ea880, 0xc20050d280)
    /Users/bradfitz/go/src/pkg/net/http/transport_test.go:907 +0x9c
net/http.HandlerFunc.ServeHTTP(0x96ac0, 0xc2000ea880, 0xc20050d280, 0xc2001bf410,
0x310420, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:733 +0x3b
net/http.(*ServeMux).ServeHTTP(0xc200088cf0, 0xc2000ea880, 0xc20050d280, 0xc2001bf410,
0x3de6d, ...)
    /Users/bradfitz/go/src/pkg/net/http/server.go:992 +0x61
net/http/httptest.(*waitGroupHandler).ServeHTTP(0xc2000cf320, 0xc2000ea880,
0xc20050d280, 0xc2001bf410, 0x0, ...)
    /Users/bradfitz/go/src/pkg/net/http/httptest/server.go:180 +0x8f
net/http.(*conn).serve(0xc200332f00, 0x0)
    /Users/bradfitz/go/src/pkg/net/http/server.go:696 +0x55d
created by net/http.(*Server).Serve
    /Users/bradfitz/go/src/pkg/net/http/server.go:1120 +0x36d

why?

Keeping this assigned to me, but help would be appreciated if anybody is bored.
@remyoudompheng

This comment has been minimized.

Copy link
Contributor

commented Nov 17, 2012

Comment 1:

Because persistConn.roudTrip doesn't check for bufio.Writer.Flush error.
http://golang.org/src/pkg/net/http/transport.go#L678
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 17, 2012

Comment 2:

That's not the right code.  It's changed notably since then.
It's currently:
func (pc *persistConn) writeLoop() {
        for {
                select {
                case wr := <-pc.writech:
                        if pc.isBroken() {
                                wr.ch <- errors.New("http: can't write HTTP request on broken connection")
                                continue
                        }
                        err := wr.req.Request.write(pc.bw, pc.isProxy, wr.req.extra)
                        if err == nil {
                                err = pc.bw.Flush()
            }
                        if err != nil {
                            pc.markBroken()
                        }
                        wr.ch <- err
        case <-pc.closech:
                        return
                }
        }
}
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 21, 2012

Comment 3:

Okay, I spent too much time debugging this today.  Here's what I now know:
* OS X vs Linux doesn't matter
* the /get goroutine never finishes because the io.Copy never finishes because the
net.Conn never becomes writable.
* when the client disconnects, the server never closes its connection, since it's
blocked on the write.  This bug explains the file descriptor leak in Dustin's stress
test from issue #4191
* after reading the GET request's headers, we never do another Read on the net.Conn, so
we never know the client disconnected.
* keeping a goroutine always reading from the net.Conn, preemptively reading the
upcoming request would be one way to fix this (related to pipeling and CloseNotifier in
issue #2510), but I also wondering whether the pollserver should be telling us, if we're
waiting for writability, that the client is now gone.
I'll be looking at the poll server now and testing some behavior before I do a
concurrent reader fix for this.
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 4:

Okay, I'm getting more and more confused.
Any help debugging this would be awesome.
Here's my latest confusion:  http://golang.org/cl/6851061
Start with this mystery:
$ go test -v -run=Dustin.*1 net/http
Notice that it never times out.  (Look for the "XXX" comments in transport_test.go)
But uncomment the "return conn, nil" line before the logging wrapper return line below
and notice that it successfully times out.
Why is the logging wrapper affecting the wrapped net.Conn's timeout?  Is this some bug
in the pollserver and scheduler differences tickle it?
I'm so confused.
@alberts

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2012

Comment 5:

Any chance that https://golang.org/issue/4384?c=12 is showing up the
same underlying pollserver issue?
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 6:

I would believe anything at this point, but I'm looking for a smoking gun.
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 7:

Uploaded a new version of http://golang.org/cl/6851061 with better logging. 
(the server conns weren't being labeled as "server-1" and "server-2")
Why does DustinRace1 loop forever without timeouts if the Transport dialer
(transport_test.go:914) returns the NewLoggingConn wrapper rather than returning conn
directly?  What about the wrapper makes the underlying deadlines no longer work?  But
the wrapper DOES work with the deadlines if the deadlines are smaller (or negative), so
somehow the wrapper only masks 50-100ms and longer deadlines??  Is something wrongly
renewing the deadlines?
(at least right now on Linux, but I seem to recall this also failing on OS X)
@dvyukov

This comment has been minimized.

Copy link
Member

commented Nov 22, 2012

Comment 8:

Does ts.Close() calls Close() on netFD associated with the HTTP connection?
If yes, than that Close() must unblock netFD.Write().
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 9:

Dmitry, you're looking at the patch and not looking at the problem.  :)
If you run the patch, you'll see that ts.Close isn't the problem: it's correctly waiting
on a WaitGroup, waiting for all the test handlers to complete.
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 10:

New information:
If I replace the *loggingConn's log statements with sleeps, I find that 500 nanosecond
sleeps makes the test hang (timeouts don't work), but 200 nanoseconds makes the test end
successfully (the timeouts still work).
Looking like an race in the poll server, deep in epoll or pollster land.  Fun.
I'll check what OS X does later. (I brought two laptops with me, just for this fun)
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2012

Comment 11:

The latest clue: the net package doesn't do timeouts reliably:
http://golang.org/cl/6851096/
(unrelated to HTTP)
@dvyukov

This comment has been minimized.

Copy link
Member

commented Nov 23, 2012

Comment 12:

> Dmitry, you're looking at the patch and not looking at the problem.  :)
> If you run the patch, you'll see that ts.Close isn't the problem:
> it's correctly waiting on a WaitGroup, waiting for all the test handlers to complete.
But does it call netFD.Close() on the connections before waiting?
What is the exact problem? I am lost. Earlier you said that server connections do not
finish after close, and then that client connection do not finish after timeout. Are
there >=2 problems?
@mikioh

This comment has been minimized.

Copy link
Contributor

commented Nov 23, 2012

Comment 13:

I added a bit debug info to net/http/server.go to see what happens in net/httptest
server.
https://golang.org/cl/6843108/
As mentioned in golang-dev, on freebsd, D1 works well whether new net/net_test.go exists.
And D2 doesn't.
@davecheney

This comment has been minimized.

Copy link
Contributor

commented Nov 23, 2012

Comment 15:

oh dear.
os.file.close() is blocking .. indefinitely
goroutine 5 [runnable]:
syscall.Syscall()
        /home/dfc/go/src/pkg/syscall/asm_linux_amd64.s:36 +0x64
syscall.Close(0x3, 0x0, 0x0, 0x30)
        /home/dfc/go/src/pkg/syscall/zerrors_linux_amd64.go:1761 +0x52
os.(*file).close(0xc200058750, 0x1, 0xc200058810, 0x43801b)
        /home/dfc/go/src/pkg/os/file_unix.go:108 +0x78
os.(*File).Close(0xc200072260, 0x7f52e936de40, 0xc2000b1000, 0x4399a8)
        /home/dfc/go/src/pkg/os/file_unix.go:99 +0x28
net.(*netFD).decref(0xc200081000, 0x5813e8)
        /home/dfc/go/src/pkg/net/fd_unix.go:375 +0x5b
net.(*netFD).accept(0xc200081000, 0x44e630, 0x0, 0xc200058810,
0xc20007a600, ...)
        /home/dfc/go/src/pkg/net/fd_unix.go:603 +0x78
net.(*TCPListener).AcceptTCP(0xc200072268, 0xc200072410, 0x0, 0x0, 0x10, ...)
        /home/dfc/go/src/pkg/net/tcpsock_posix.go:237 +0x62
net.(*TCPListener).Accept(0xc200072268, 0x0, 0x0, 0x0, 0x0, ...)
        /home/dfc/go/src/pkg/net/tcpsock_posix.go:247 +0x49
net.func�038(0xc200070d30, 0xc200072278, 0xc200072220, 0xc200072280, 0x0, ...)
        /home/dfc/go/src/pkg/net/net_test.go:261 +0x47
created by net.testVariousDeadlines
        /home/dfc/go/src/pkg/net/net_test.go:276 +0x221
@dvyukov

This comment has been minimized.

Copy link
Member

commented Nov 23, 2012

Comment 16:

It can be because the thread can't reenter Go rutime and blocked in exitsyscall(). Try
running with GOTRACEBACK=2
@davecheney

This comment has been minimized.

Copy link
Contributor

commented Nov 23, 2012

Comment 17:

I'm sorry, that previous report was wrong.
I am looking at
https://code.google.com/p/go/source/browse/src/pkg/net/fd_unix.go#426
wondering if this line never returned EAGAIN, will the timeout logic
implemented in WaitRead ever fire ? Especially copying over the
loopback interface there may always be data in the buffer.
@alexbrainman

This comment has been minimized.

Copy link
Member

commented Nov 23, 2012

Comment 18:

I agree with dave, the test is sending so much data that net read always succeeds, and
it never even get a chance to check for timeouts. Unless you move timeout check logic
before the read, like I did in windows CL.
I cannot check this theory on Linux because the new timeout test passes on my Linux pc.
Alex
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 24, 2012

Comment 19:

Dave, Alex, thanks so much!
Fix and new minimal tests at http://golang.org/cl/6851096
@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 24, 2012

Comment 20:

Labels changed: removed priority-triage.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Nov 24, 2012

Comment 21:

This issue was closed by revision 5fa3aeb.

Status changed to Fixed.

@bradfitz bradfitz added fixed labels Nov 24, 2012

@bradfitz bradfitz self-assigned this Nov 24, 2012

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015

@rsc rsc removed the go1.1 label Apr 14, 2015

@golang golang locked and limited conversation to collaborators Jun 24, 2016

This issue was closed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.