
net/http: (*Transport).getConn traces through stale contexts #21597

Open
bcmills opened this Issue Aug 24, 2017 · 6 comments

@bcmills (Member) commented Aug 24, 2017

(*http.Transport).getConn currently starts a dialConn call in a background goroutine:

go func() {
	pc, err := t.dialConn(ctx, cm)
	dialc <- dialRes{pc, err}
}()

That records traces to the provided Context and eventually invokes t.DialContext with it:

trace := httptrace.ContextClientTrace(ctx)

conn, err := t.dial(ctx, "tcp", cm.addr())

This is pretty much a textbook illustration of the problem described in #19643 (Context API for continuing work). If (*Transport).getConn returns early (due to cancellation or to availability of an idle connection), the caller may have already written out the corresponding traces, and dialConn (and/or the user-provided DialContext callback) will unexpectedly access a Context that the caller believes to be unreachable.

httptrace.ClientTrace says, "Functions may be called concurrently from different goroutines and some may be called after the request has completed or failed." However, that is not true of Context instances in general: if the http package wants to save a trace after a call has returned, it should call Value ahead of time and save only the ClientTrace pointer. If dialConn calls a user-provided DialContext function, then getConn should cancel the Context passed to it and wait for DialContext to return before itself returning.


See also #20617 (Context race in http.Transport).

@tombergan (Contributor) commented Aug 24, 2017

Yep, that's technically a bug, but it seems basically unfixable given the current context and httptrace APIs. Do you have any suggestions? I'm all ears.

When we made httptrace, I guess we had implicitly assumed that each trace object was single-use, which made it "safe" to access the trace object outside of the scope of the context. I say "safe" because I don't think we actually thought this problem through completely.

Aside: #20617 might be related to #19643 but is unrelated to this bug.

@bcmills (Member) commented Aug 24, 2017

Do you have any suggestions?

Per above, unpack the ClientTrace early and save it, and wait for DialContext to return before allowing getConn to return. The change may be ugly, but it seems straightforward.

@tombergan (Contributor) commented Aug 24, 2017

wait for DialContext to return before allowing getConn to return. The change may be ugly, but it seems straightforward.

We can't do that; it would be a performance regression. But you're right: one way to "fix" the problem is to move all background tasks onto the critical path. I just don't think that's an acceptable fix.

@bcmills (Member) commented Aug 24, 2017

We can't do that, it will be a performance regression.

If DialContext is responsive to context cancellation, then it will introduce a negligible delay if an idle connection is found. If DialContext is not responsive to cancellation, then it will fix a potential OOM condition if too many dials are in flight.

Beyond that, as far as I can tell, it will add one allocation (for the child context), and then only if the DialContext function is non-nil. I would be shocked if that has a noticeable impact relative to setting up an HTTP connection.

@tombergan (Contributor) commented Aug 24, 2017

Sorry, I misunderstood your suggestion. You're suggesting that we cancel the pending dial if an idle conn becomes available before the dial finishes (which we currently do not do ... this is the part I missed). If the pending dial is canceled, then I agree, waiting for a canceled dial to finish does not seem like a problem.

That is still a semantics change that could affect performance. Currently, we let the pending dial finish and add it to the pool (up to Transport.MaxIdleConns and Transport.MaxIdleConnsPerHost). If we cancel the pending dial, that means a follow-up request will need to wait for a new dial or for a prior request to finish. This is potentially slower than the current implementation, where the follow-up request might use the pending dial. I am sure some usage pattern of http.Transport would be harmed by this change, and eventually we'd get a bug report.

I consider it a bug that we dial using the request's ctx. Ideally we'd dial using a background ctx, but that depends on #19643, which is unlikely to be fixed any time soon.

I think I have a good (Google-internal) project that could act as a benchmark for your suggestion. If there's no performance impact for that project, then your suggestion is likely a good solution.

@rsc (Contributor) commented Nov 22, 2017

Moving to Go 1.11.
