Description
What we observed
We encountered this goroutine leak in our own project (https://github.com/coreos/etcd) when using a helper that checks for goroutines still running after a test finishes:
z_last_test.go:97: Test appears to have leaked a Transport:
net/http.(*persistConn).readLoop(0xc20858c0b0)
/usr/local/go/src/net/http/transport.go:858 +0x4eb
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:660 +0x10d4
testing.RunTests(0x6bb550, 0x838980, 0x1a, 0x1a, 0x11800)
/usr/local/go/src/testing/testing.go:556 +0xdc3
testing.(*M).Run(0xc20802e140, 0x843160)
/usr/local/go/src/testing/testing.go:485 +0xe1
main.main()
github.com/coreos/etcd/integration/_test/_testmain.go:102 +0x28d
This error occurs under Go 1.4.2.
In this case, writeLoop has already exited, but readLoop is still alive and hangs on rc := <-pc.reqch forever.
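For context, the check that produced the output above is roughly the following sketch (checkAfterTest and the exact substring matched are placeholders of our own; the real helper in etcd differs in its details): after each test it dumps all goroutine stacks and fails the test if a Transport readLoop is still running.

package integration

import (
    "runtime"
    "strings"
    "testing"
)

// checkAfterTest is a hypothetical sketch of the post-test leak check:
// it dumps the stacks of all goroutines and fails the test if an
// http.Transport readLoop goroutine is still alive.
func checkAfterTest(t *testing.T) {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true) // true: include every goroutine, not just the caller
    stacks := string(buf[:n])
    if strings.Contains(stacks, "net/http.(*persistConn).readLoop") {
        t.Errorf("Test appears to have leaked a Transport:\n%s", stacks)
    }
}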
What we have found
After digging into it, @xiang90 found the following, which we believe is the root cause of the leak.
(All code below is excerpted from "pkg/net/http/transport.go" in go1.4.)
Bug case
1. dialConn initializes a persistConn and starts go readLoop().
2. roundTrip is called. It adds 1 to pc.numExpectedResponses and sends a requestAndChan to reqch; the loop in readLoop receives and handles it.
3. roundTrip has not yet received the response and is blocked in the WaitResponse block.
4. readLoop reads the response from pc.br without error. The response has no body, so readLoop pushes it into the channel and then waits on pc.br.Peek(1) again.
5. The connection behind pc.br is closed by the remote target, so pc.br.Peek(1) in readLoop returns io.EOF.
6. The loop in readLoop finds that pc.numExpectedResponses == 0 is false (roundTrip has not decremented it yet), so it proceeds and waits on reqch.
7. roundTrip receives the response, decrements pc.numExpectedResponses, and exits.
8. No further roundTrip comes, and the persistConn is closed. So nothing will ever push a requestAndChan to reqch, and reqch will never be closed.
9. The readLoop goroutine waits on reqch forever.
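The end state is simply a goroutine blocked on a channel receive that can never complete. As a minimal stand-alone illustration of that final state (this is not the transport code itself, just the blocking pattern readLoop is left in):

package main

import (
    "fmt"
    "runtime"
    "time"
)

// Illustration only: a goroutine blocked receiving from a channel that no one
// will ever send on or close, which is the state readLoop ends up in once the
// last roundTrip has returned and the persistConn is closed.
func main() {
    reqch := make(chan struct{}, 1) // stands in for pc.reqch

    go func() {
        <-reqch // blocks forever; the only would-be sender is gone and the channel is never closed
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // the blocked receiver is still counted here
}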
Init
func (t *Transport) dialConn(cm connectMethod) (*persistConn, error) {
    pconn := &persistConn{
        t:          t,
        cacheKey:   cm.key(),
        reqch:      make(chan requestAndChan, 1),
        writech:    make(chan writeRequest, 1),
        closech:    make(chan struct{}),
        writeErrCh: make(chan error, 1),
    }
    ...
    pconn.br = bufio.NewReader(noteEOFReader{pconn.conn, &pconn.sawEOF})
    ...
    go pconn.readLoop()
persistConn.readLoop
func (pc *persistConn) readLoop() {
    alive := true
    for alive {
        pb, err := pc.br.Peek(1)

        pc.lk.Lock()
        if pc.numExpectedResponses == 0 {
            if !pc.closed {
                pc.closeLocked()
                if len(pb) > 0 {
                    log.Printf("Unsolicited response received on idle HTTP channel starting with %q; err=%v",
                        string(pb), err)
                }
            }
            pc.lk.Unlock()
            return
        }
        pc.lk.Unlock()

        rc := <-pc.reqch

        var resp *Response
        if err == nil {
            resp, err = ReadResponse(pc.br, rc.req)
            if err == nil && resp.StatusCode == 100 {
                // Skip any 100-continue for now.
                // TODO(bradfitz): if rc.req had "Expect: 100-continue",
                // actually block the request body write and signal the
                // writeLoop now to begin sending it. (Issue 2184) For now we
                // eat it, since we're never expecting one.
                resp, err = ReadResponse(pc.br, rc.req)
            }
        }
        ...
    }
}
persistConn.roundTrip
func (pc *persistConn) roundTrip(req *transportRequest) (resp *Response, err error) {
    pc.t.setReqCanceler(req.Request, pc.cancelRequest)
    pc.lk.Lock()
    pc.numExpectedResponses++
    headerFn := pc.mutateHeaderFunc
    pc.lk.Unlock()
    ...
    resc := make(chan responseAndError, 1)
    pc.reqch <- requestAndChan{req.Request, resc, requestedGzip}
    ...
WaitResponse:
    ...
    pc.lk.Lock()
    pc.numExpectedResponses--
    pc.lk.Unlock()
    ...
    return re.res, re.err
}
persistConn.close
func (pc *persistConn) close() {
    pc.lk.Lock()
    defer pc.lk.Unlock()
    pc.closeLocked()
}
Can we reproduce it?
I can get the leak roughly once in every five runs of the etcd integration tests when they are run with the race detector enabled.
I have done my best (2-3 full days) to reproduce it in a simple Go program, but I failed to make it happen, because it is really hard to force the code to run in the exact order described in the bug case above.
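For reference, the stand-alone attempt looked roughly like the sketch below (the server setup and the final goroutine check are ours, and it does not reliably trigger the leak, since steps 4-6 of the bug case must interleave in exactly the right order):

package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "runtime"
    "time"
)

// Rough sketch of the reproduction attempt, not a reliable reproducer: the
// server returns a body-less response and then closes the connection, hoping
// that readLoop sees io.EOF on pc.br.Peek(1) before roundTrip has decremented
// pc.numExpectedResponses.
func main() {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusNoContent) // response with no body
    }))

    tr := &http.Transport{}
    client := &http.Client{Transport: tr}

    resp, err := client.Get(srv.URL)
    if err == nil {
        resp.Body.Close()
    }

    srv.CloseClientConnections() // make the remote side close the connection
    srv.Close()

    time.Sleep(200 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // a leaked readLoop would keep this count elevated
}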
I hope you can read over the bug case, and I can help with testing whether a fix resolves it.