
net/http: (*Transport).getConn traces through stale contexts #21597

Open
bcmills opened this Issue Aug 24, 2017 · 6 comments

@bcmills (Member) commented Aug 24, 2017

(*http.Transport).getConn currently starts a dialConn call in a background goroutine:

go func() {
	pc, err := t.dialConn(ctx, cm)
	dialc <- dialRes{pc, err}
}()

That records traces to the provided Context and eventually invokes t.DialContext with it:

trace := httptrace.ContextClientTrace(ctx)

conn, err := t.dial(ctx, "tcp", cm.addr())

This is pretty much a textbook illustration of the problem described in #19643 (Context API for continuing work). If (*Transport).getConn returns early (due to cancellation or to availability of an idle connection), the caller may have already written out the corresponding traces, and dialConn (and/or the user-provided DialContext callback) will unexpectedly access a Context that the caller believes to be unreachable.

httptrace.ClientTrace says, "Functions may be called concurrently from different goroutines and some may be called after the request has completed or failed." However, that is not true of Context instances in general: if the http package wants to save a trace after a call has returned, it should call Value ahead of time and save only the ClientTrace pointer. If dialConn calls a user-provided DialContext function, then getConn should cancel the Context passed to it and wait for DialContext to return before itself returning.


See also #20617 (Context race in http.Transport).

@tombergan (Contributor) commented Aug 24, 2017

Yep, that's technically a bug, but it seems basically unfixable given the current context and httptrace APIs. Do you have any suggestions? I'm all ears.

When we made httptrace, I guess we had implicitly assumed that each trace object was single-use, which made it "safe" to access the trace object outside of the scope of the context. I say "safe" because I don't think we actually thought this problem through completely.

Aside: #20617 might be related to #19643 but is unrelated to this bug.

@bcmills (Member) commented Aug 24, 2017

Do you have any suggestions?

Per above, unpack the ClientTrace early and save it, and wait for DialContext to return before allowing getConn to return. The change may be ugly, but it seems straightforward.

@tombergan (Contributor) commented Aug 24, 2017

wait for DialContext to return before allowing getConn to return. The change may be ugly, but it seems straightforward.

We can't do that; it would be a performance regression. But you're right: one way to "fix" the problem is to move all background tasks onto the critical path. I just don't think that's an acceptable fix.

@bcmills (Member) commented Aug 24, 2017

We can't do that, it will be a performance regression.

If DialContext is responsive to context cancellation, then it will introduce a negligible delay if an idle connection is found. If DialContext is not responsive to cancellation, then it will fix a potential OOM condition if too many dials are in flight.

Beyond that, as far as I can tell, it will add one allocation (for the child context), and then only if the DialContext function is non-nil. I would be shocked if that has a noticeable impact relative to setting up an HTTP connection.

@tombergan (Contributor) commented Aug 24, 2017

Sorry, I misunderstood your suggestion. You're suggesting that we cancel the pending dial if an idle conn becomes available before the dial finishes (which we currently do not do ... this is the part I missed). If the pending dial is canceled, then I agree, waiting for a canceled dial to finish does not seem like a problem.

That is still a semantics change that could affect performance. Currently, we let the pending dial finish and add it to the pool (up to Transport.MaxIdleConns and Transport.MaxIdleConnsPerHost). If we cancel the pending dial, that means a follow-up request will need to wait for a new dial or for a prior request to finish. This is potentially slower than the current implementation, where the follow-up request might use the pending dial. I am sure some usage pattern of http.Transport would be harmed by this change, and eventually we'd get a bug report.

I consider it a bug that we dial using the request's ctx. Ideally we'd dial using a background ctx, but that depends on #19643, which is unlikely to be fixed any time soon.

I think I have a good (Google-internal) project that could act as a benchmark for your suggestion. If there's no performance impact for that project, then your suggestion is likely a good solution.

@rsc (Contributor) commented Nov 22, 2017

Moving to Go 1.11.
