net: race where Dialer.DialContext returns connections with altered write deadlines #16523
Comments
@dsnet, can you try https://go-review.googlesource.com/25330? This is nearly impossible for me to reproduce reliably on my Mac laptop: I have seen the failure sometimes, but its reproducibility comes and goes from build to build, seemingly for no reason. I think that's the fix, though. I'll try on Linux later and write some more specific tests. /cc @pmarks-net who touched that code last and might be amused. (Not your fault! You fixed my original buggy code. It was just too subtle.)
CL https://golang.org/cl/25330 mentions this issue.
I spent more time debugging this. There is a bad interaction among three things: net/http.Transport's "socket late binding" implementation (https://insouciant.org/tech/connection-management-in-chromium/), net/http.Server's per-Request Contexts (which get canceled at the end of ServeHTTP), and net.DialContext, which is getting canceled via the net/http.Server's per-Request context cancelFunc. (I locally modified the context package to track the stack of who canceled a context.) A "return nil" at the top of net/http/transport.go:getIdleConnCh "fixes" it, but that's not a real fix. I need to understand what's actually happening first. /cc @dpiddy (thanks for the 5ms sleep debugging hint)
Sent an updated https://go-review.googlesource.com/25330. PTAL.
Your latest patch looks similar to my suggestion "2) Before returning success, connect() must verify that cancelation has been aborted. If cancelation won the race, it should close the connection and return a failure instead." But do I understand correctly that this does not actually "fix" any connections, just makes the error messages more accurate?
@pmarks-net, yes, it doesn't fix any connections because there's no runtime netpoller API to get the old write timeout value, so it's not possible to restore it. Instead, this just returns an error (instead of returning nil) on connect, so the otherwise-poisoned connection isn't used by the caller. (In this case, the caller is the http.Transport, which was getting write errors when writing the HTTP request on the connection because its write deadline was set way in the past.)
Consider the following test which starts a serial chain of reverse proxies and then slams the chain with many concurrent requests.
Running this test on go1.7rc3 produces several failed GET requests, while running it on go1.6.2 produces none.
I don't believe it is proper behavior to fail with an i/o timeout error after only ~40ms of real time. Something is causing the connection to time out too early. Git bisect identifies 1518d43 as the source of this issue.
This is a regression from go1.6.2.
/CC @bradfitz @broady @adg