New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: DialContext can use a stale context via happy eyeballs #21600

Open
tombergan opened this Issue Aug 24, 2017 · 9 comments

Comments

Projects
None yet
3 participants
@tombergan
Contributor

tombergan commented Aug 24, 2017

When happy eyeballs is enabled, DialContext makes parallel calls to dialSingle via dialParallel:

func dialParallel(ctx context.Context, dp *dialParam, primaries, fallbacks addrList) (Conn, error) {

Each dialSingle call may call ctx.Value, such as here:

trace, _ := ctx.Value(nettrace.TraceKey{}).(*nettrace.Trace)

dialParallel doesn't wait for all dialSingle calls to complete. This means, in theory, dialSingle can call ctx.Value on a ctx that has gone out of scope. This is effectively the same problem as #21597.

/cc @bcmills

@tombergan

This comment has been minimized.

Contributor

tombergan commented Aug 24, 2017

@bcmills, I'm reconsidering whether this and #21597 are actually bugs. The documentation for httptrace explicitly says that "Functions may be called concurrently from different goroutines and some may be called after the request has completed or failed."
https://golang.org/pkg/net/http/httptrace/#ClientTrace

I don't see any documentation in context which states that the above is illegal, or that it's illegal to call Value on a context that has gone out-of-scope. The notion of a context being out-of-scope is not even defined in the API documentation:
https://golang.org/pkg/context/

Both this issue and #21597 arise from an internal Google package that implements a context type, where the Value method of that type panics if called after the context has gone out-of-scope (where "out-of-scope" has a well-defined meaning for that specific package). In terms of the vanilla context package, I'm not convinced this is actually a bug. Thoughts?

@bcmills

This comment has been minimized.

Member

bcmills commented Aug 25, 2017

In terms of the vanilla context package, I'm not convinced this is actually a bug. Thoughts?

Over-retaining a context.Context argument is a bug regardless of whether the particular values accessed are safe to use. The Context can carry references to arbitrarily large data structures (e.g. logs for large requests or entire sessions of requests) limited by semaphores or other means, so unexpectedly over-retaining the Context in background operations — particularly on unusual paths (such as reestablishing connections, which can happen in bursts when a network outage or slowdown resolves) — can lead to large, unpredictable spikes in memory usage.

This is more of an issue with Context than with other data structures in general, because the Context can vary in size from a tiny cancellation channel to a very large tree of data.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Aug 25, 2017

I don't think there is any such thing as a "stale context." A better way to put this might be that the current happy eyeballs code can keep a context value live unnecessarily. But I don't think this is true. The happy eyeballs code calls context.WithCancel to get a cancelable context, and as soon as one serial call succeeds, it cancels the context, which will cancel the pending serial calls.

So I don't see a bug here.

I haven't looked at the code for #21597, but from reading the issue my impression is that that code does not currently use a cancelable context to cancel pending actions. If that is correct, then the issues do not seem similar.

@bcmills

This comment has been minimized.

Member

bcmills commented Aug 25, 2017

as soon as one serial call succeeds, it cancels the context, which will cancel the pending serial calls.

That's true, and makes the over-retention here much less severe in practice than the one in #21597, but because dialParallel doesn't wait for the second (canceled) call to return, it can still lead to arbitrary increases in memory usage under CPU pressure, where there can be an arbitrarily large scheduling delay between cancelling the Context and actually returning from the background function call.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Aug 25, 2017

Agreed, but my guess is that on balance the current behavior is preferable. It doesn't penalize the common case.

@bcmills

This comment has been minimized.

Member

bcmills commented Aug 25, 2017

on balance the current behavior is preferable. It doesn't penalize the common case.

In the common case (where we are not overloaded on CPU), it shouldn't make much difference one way or the other. If dialSingle really is responsive to cancellation, the amount of latency added by waiting for the goroutine to return should be negligible compared to the latency of dialing a network connection.

That is: the difference only matters when dialSingle is not responsive, and that is precisely the condition under which the over-retention is likely to matter too.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Aug 25, 2017

The common case is that a Context value is tiny.

@tombergan

This comment has been minimized.

Contributor

tombergan commented Aug 28, 2017

AFAIK there are no actual cases where someone suffered an OOM due to net or net/http retaining httptrace structs in contexts. This issue and #21597 arose because of a Google-internal package that does the following:

func Go(ctx context, f func(context)) {
  c := newCtx(ctx)
  go func() {
    defer close(c.closed)
    f(c)
  }()
}

func (c ctx) Value(key interface{}) interface{} {
  select{
  case <-c.closed:
    panic("Value called after ctx done")
  default:
  }
  ...
}

The open question is whether that panic should exist. There are two ways to view the question:

  1. Any function that creates goroutines should wait for all of those goroutines to complete before returning (in the absence of a detach feature like #19643). I believe this is @bcmills' position. net.dialParallel cancels its goroutines but does not wait for them to complete, which means the above panic can be triggered if net.dialParallel is called by f(ctx). Instead, net.dialParallel should wait for all goroutines to complete before returning.

  2. The above Value implementation should not panic. The context API does not make it illegal to hold onto a context in a background goroutine, therefore, it's not reasonable for Context.Value to panic as in the above implementation.

/cc @zombiezen @Sajmani

@bcmills

This comment has been minimized.

Member

bcmills commented Jan 5, 2018

The common case is that a Context value is tiny.

That's not at all obvious to me.

Yes, an individual Context value is tiny: but the set of memory that it pins is arbitrarily large. (For example, it might hold a pointer to an arbitrarily large log entry or journal to be flushed at the end of the call to which the Context was passed.)

@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone Mar 30, 2018

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Unplanned Jun 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment