Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Go 2: bake a cooperative goroutine cancellation mechanism into the language #27982

Closed
faiface opened this issue Oct 3, 2018 · 25 comments
Closed

Comments

@faiface
Copy link

@faiface faiface commented Oct 3, 2018

Problem: the "context" package

More than a year ago, I wrote a blog post titled Context Should Go Away For Go 2 which received a fair amount of support and response. In the said blog post, I described reasons why the "context" package is a bad idea because it's too infectious.

As explained in the blog post, the reason why "context" spreads so much and in such an unhealthy fashion is that it solves the problem of cancellation of long-running procedures.

I promised to follow the blog post (which only complained about the problem) with a solution. Considering the recent progress around Go 2, I decided it's the right time to do the follow up now. So, here it is!

Solution: bake cancellation into Go 2

My proposed solution is to bake cancellation into the language and thus avoiding the need to pass the context around just to be able to cancel long-running procedures. The "context" package could still be kept for the purpose of goroutine-local data, however, this purpose does not cause it to spread, so that's fine.

In the following sections, I'll explain how exactly the baked-in cancellation would work.

One quick point before we start: this proposal does not make it possible to "kill" a goroutine - the cancellation is always cooperative.

Examples to get the idea

I'll explain the proposal in a series of short, very contrived examples.

We start a goroutine:

go longRunningThing()

In Go 1, the go keyword is used to start a goroutine but doesn't return anything. I propose it should return a function which when called, cancels the spawned goroutine.

cancel := go longRunningThing()
cancel()

We started a goroutine and then canceled it immediately.

Now, as I've said, cancellation must be a cooperative operation. The longRunningThing function needs to realize its own cancellation on request. How could it look like?

func longRunningThing() {
    select {
    case <-time.After(5 * time.Second):
        fmt.Println("finished")
    }
}

This longRunningThing function does not cooperate. It takes 5 seconds no matter what. That's the first takeaway: cancellation is optional - if a goroutine does not support cancellation, it remains unaffected by it. Here's how we add the support:

func longRunningThing() {
    select {
    case <-time.After(5 * time.Second):
        fmt.Println("finished")
    canceling:
        fmt.Println("canceled")
    }
}

I propose the select statement gets an additional branch called canceling (a new keyword) which gets triggered whenever the goroutine is scheduled for cancellation, i.e. when the function returned from the go statement gets called.

The above program would therefore print:

canceled

What if the long-running thing spawns some goroutines itself? Does it have to handle their cancellation explicitly? No, it doesn't. All goroutines spawned inside a canceled goroutine get canceled first and the originally canceled goroutine starts its cancellation only after all its 'child' goroutines finish.

For example:

func longRunningThing() {
    go anotherLongRunningThing()
    select {
    case <-time.After(5 * time.Second):
        fmt.Println("finished")
    canceling:
        fmt.Println("canceled")
    }
}

func anotherLongRunningThing() {
    select {
    case <-time.After(3 * time.Second):
        fmt.Println("child finished")
    canceling:
        fmt.Println("child canceled")
    }
}

This time, running:

cancel := go longRunningThing()
cancel()

prints out:

child canceled
canceled

This feature is here because the child goroutines usually communicate with the parent goroutine. It's good for the parent goroutine to stay fully intact until the child goroutines finish.

Now, let's say, that instead of in another goroutine, longRunningThing needs to execute anotherLongRunningThing three times sequentially, like this (anotherLongRunningThing remains unchanged):

func longRunningThing() {
    anotherLongRunningThing()
    anotherLongRunningThing()
    anotherLongRunningThing()
}

This time, longRunningThing doesn't even handle the cancellation at all. But, cancellation propagates to all nested calls. canceling this longRunningThing would print:

child canceled
child canceled
child canceled

All anotherLongRunningThing calls got canceled one by one.

What if anotherLongRunningThing can fail, or just wants to signal it was canceled instead of finishing successfully? We can make it return an error:

func anotherLongRunningThing() error {
    select {
    case <-time.After(3 * time.Second):
        return nil
    canceling:
        return errors.New("canceled")
    }
}

Now we update the longRunningThing to handle the error (using the new error handling proposal):

func longRunningThing() error {
    check anotherLongRunningThing()
    check anotherLongRunningThing()
    check anotherLongRunningThing()
    return nil
}

In this version, longRunningThing returns the first error it encounters while executing anotherLongRunningThing three times sequentially. But how do we receive the error? We spawned the function in a goroutine and there's no way to get the return value of a goroutine in Go 1.

Here comes the last thing I propose. I propose that the function returned from the go statement has the same return values as the function that was set to run in the goroutine. So, in our case, the cancel function has type func() error:

cancel := go longRunningThing()
err := cancel()
fmt.Println(err)

This prints:

canceled

However, if we waited 10 seconds before canceling the goroutine (longRunningThing takes 9 seconds), we'd get no error, because the function finished successfully:

cancel := go longRunningThing()
time.Sleep(10 * time.Second)
err := cancel()
fmt.Println(err)

Prints out:

<nil>

And lastly, say we have a function called getGoods which contacts some service, gets some goods back and sends them on a channel. We only want to wait for the goods for 5 seconds, no more. Here's how we implement a timeout:

goods := make(chan Good)
cancel := go getGoods(goods)

select {
case x := <-goods:
    // enjoy the goods
case <-time.After(5 * time.Second):
    err := cancel()
    return errors.Wrap(err, "timed out")
}

And that is the end of this series of short examples. I've shown all of the proposed features. In the next section, I'll describe the features more carefully and explain precisely how they'd work.

Details

I propose to extend the go statement to return a function, which takes no arguments and its return type is the same as the return type of the function called in the go statement, including multiple return values. Secondly, I propose to extend the select statement with an optional canceling branch.

For example:

var (
    f1 func() float64           = go math.Sqrt(100)
    f2 func() (*os.File, error) = go os.Open("file.txt")
    f3 func() int               = go rand.Intn(20)
)

Calling the function returned from the go statement suspends until the spawned goroutine returns, then it returns exactly what the spawned function returned. Calling the returned function multiple times causes nothing additional and always returns the same results.

Calling the functions assigned above results in this:

fmt.Println(f1()) // 10
fmt.Println(f2()) // &{0xc4200920f0} <nil>
fmt.Println(f3()) // 17

// we can call them as many times as we want
fmt.Println(f3(), f3(), f3()) // 17 17 17

Furthermore, calling the returned function causes the spawned goroutine to start a cancellation process. The cancellation process has two stages:

  1. canceling all child goroutines (goroutines spawned inside the goroutine that is being canceled). This stage finishes when all child goroutines finish cancellation.
  2. Switching into the canceling mode. In this mode, all select statements always select the canceling branch if present. If not present, they function normally.

Eventually, the goroutine returns. The call to the function returned from the go statement unsuspends and returns these values.

Other uses of the mechanism

The mechanism can also be used for other purposes. One that comes to my mind is to use the functions returned from the go statement as futures. Indeed, this is a common pattern in Go:

ch := make(chan T)
go func() {
    ch <- function()
}()
// some other code
x := <-ch

This whole boilerplate is here just to execute function concurrently and use its return value later. With my proposal, we could simplify that code like this:

future := go function()
// some other code
x := future()

Of course, this would only work if function wouldn't support cancellation, but most functions shouldn't support it, and those that do should document it.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 3, 2018

Some programs will want the ability of a cancelable goroutine G1 to start a new goroutine G2 that should keep running even if G1 is canceled. For example, this will happen if there is a shared background goroutine that is started the first time it is needed. People will want a mechanism to cancel G1 without having to wait for G2 to exit.

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 3, 2018

Yes, I am completely open to the possibility of not propagating the cancellation to the child goroutines. Maybe that would even be better. In most cases, it would be possible to structure the program in such a way as to do this:

defer (go function())()

and that would automatically cancel the child goroutine on return.

Loading

@EVODelavega
Copy link

@EVODelavega EVODelavega commented Oct 3, 2018

You're missing a very, very important reason why the context package is so prolific. Suppose I'm entering the main routine, connect to some kind of DB, and some other services. When one of these core resources fail, or the app receives a term or kill signal, I want to cleanly close all connections at once. If there's any routines running (e.g. pub/sub consumers), you'd have to have access to all cancel functions and call them one by one. I, meanwhile, just cancel my context and all routines receive the ctx.Done(), and can return.

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 3, 2018

@EVODelavega first of all, nothing prevents you from having those cancel functions in a slice and call them using a for loop. Second, if they were all spawned together from one other goroutine, you can just cancel that one and (considering my original proposal) the cancellation will propage to the children.

Loading

@jmcclell
Copy link

@jmcclell jmcclell commented Oct 3, 2018

This feels like exception handling in reverse. Forward propagation of cancellation events . that must (should?) be handled, with no knowledge at the call-site whether or not they will be.

Loading

@nomad-software
Copy link

@nomad-software nomad-software commented Oct 3, 2018

One problem with this change is that every goroutine would require a select in order to handle a cancellation, even if the select is not used for anything else.

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 3, 2018

@nomad-software That's exactly what you currently have to do with context too.

Loading

@thejerf
Copy link

@thejerf thejerf commented Oct 3, 2018

If we're going to bake a new thread-scoped value into a goroutine's environment that can be accessed via a new keyword (or any other mechanism), I'd much rather it was a full context object, in which case the cancellation would just be a special case.

Loading

@ghost
Copy link

@ghost ghost commented Oct 3, 2018

Yes, I am completely open to the possibility of not propagating the cancellation to the child goroutines.

Overriding could happen by functions explicitly handling cancellation on the sub-goroutines:

cancel := go func() {
    go func() {                         // Propegated
        ...
    }
    cancel := go func() {         // Not propegated
        ...
    }
}

The issue I see with auto-propegation is that it's implied behavior that could easily result in unexpected results -- e.g., in a library. You'd have to "infect" the documentation for all functions in the calling chain, per:

most functions shouldn't support it, and those that do should document it.

I'm imagining a rats-nest of "This function doesn't support cancellation, but it does call a function in a library that says it supports cancellation."

I think explicit propegation is a better option, even if it results in more code, and especially if the default expectation is non-handling.

Loading

@deanveloper
Copy link

@deanveloper deanveloper commented Oct 3, 2018

I also dislike "context", and instead made my own, renamed version ("cancel"). After using it for a while and talking with coworkers about it, I've made a proposal here which talks about my experience with a renamed package with very similar functionality.

I honestly forgot I was going to make a proposal about my experience with my "cancel" package until I saw this.

I think this is a nice approach to the same problem, though. My only real issue is that you don't really know how effective the cancel() function is. I really like that this removes the ctx context.Context repetition, having two versions of a function that do the same thing, and removes the unintuitiveness of the name "context".

Loading

@sigmonsays
Copy link

@sigmonsays sigmonsays commented Oct 3, 2018

👎
this feature only complicates the language for very little value add. Just use channels as designed.

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 3, 2018

@sigmonsays using channels as designed has led us to the context package, which is the main motivation for this proposal. Everything wrong with the context package is described in the linked blog post

Loading

@freman
Copy link

@freman freman commented Oct 3, 2018

but what if you use context for... context, you know, values related to the context?

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 3, 2018

@freman

The "context" package could still be kept for the purpose of goroutine-local data, however, this purpose does not cause it to spread, so that's fine.

Loading

@mfhholmes
Copy link

@mfhholmes mfhholmes commented Oct 4, 2018

In your proposal, what would happen if you wrote: if canceling { return nil }? Is it a magic variable with a bool value, or is it a language construct only for use in select statements?

As someone has mentioned above, this feels like exception handling, it's try...catch but in reverse. I think there'd be more value in either wrapping this into the error handling, or extending this to cover exception handling.

Loading

@tmathews
Copy link

@tmathews tmathews commented Oct 4, 2018

The creativity here is cool, but it doesn't feel fully fleshed out and this may be why you are seeing a lot of disagreement here. It doesn't seem like this proposal covers every use case that needs to be appropriately handled.

It also seems like this proposal is adding syntactical sugar on top of something we already have in order to satisfy a minor annoyance. What this proposal is doing is suggesting we add more ways to do the same thing but with more limitations. This will unfortunately cause writers and readers to know another aspect of the language that is unnecessary. It is hard to design something simple yet complex and keeping the core values that we already have is important. I would rather go the straight forward route, may it be more verbose, in favor that it keeps the simplicity and promotes fast sharing of knowledge and code.

This is not to shoot down, but to promote rethinking of the current strategy proposed.

Loading

@ghost
Copy link

@ghost ghost commented Oct 4, 2018

Nicely said, @tmathews. I'd disagree only with the characterization of this as a "minor annoyance." Contexts remind me a lot of IO in Haskell, where a developer needs to pass around a value that they have no interest in, nor use. This proposal may not be the right solution, but I do tend to agree with describing Contexts -- used for flow control -- as "infecting" code.

Loading

@deanveloper
Copy link

@deanveloper deanveloper commented Oct 4, 2018

It also seems like this proposal is adding syntactical sugar on top of something we already have in order to satisfy a minor annoyance.

I like this because it removes

  1. context poisoning, every function has (and "should have") an alternate FuncContext variant
  2. ctx context.Context, which mirrors Foo foo = new Foo() an aspect of many verbose languages which Go wanted to fix.
  3. The readability issue of context.Context. The name Context has nothing to do with cancellation, despite being its primary purpose.

What this proposal is doing is suggesting we add more ways to do the same thing but with more limitations

This proposal is aiming to remove the "cancellation" aspect of context, and instead bake it into the language. context can then still be used for goroutine-local data.

This will unfortunately cause writers and readers to know another aspect of the language that is unnecessary.

They already do, with context if they encounter a function and don't know what it is. Not to mention that because of context's unintuitive name, it makes it hard to understand what it does.

I would rather go the straight forward route, may it be more verbose, in favor that it keeps the simplicity and promotes fast sharing of knowledge and code.

I totally agree. Honestly I'd much rather just have a parallel of how context's cancellation works but with goroutines, rather than this complex mess of "this propagates, but this doesn't..." and the fact that cancelling becomes a blocking operation... I really think that this problem has a much more simple solution.

Loading

@EVODelavega
Copy link

@EVODelavega EVODelavega commented Oct 4, 2018

I've commented earlier, but perhaps I didn't express my take on this proposal clearly. I'm going to try and reiterate my main concern, and highlight a couple more issues that nobody seems to have answered definitively.

Cancel the application context is more efficient

Having a ctx, cfunc := context.WithCancel(context.Background()) being passed down from the main routine allows for a clean, simple way to terminate all routines (and release resources) when your application receives a TERM or KILL signal. This simple mechanism is surely preferable over the either looping over a bunch of cancel functions. In case of a TERM/KILL signal, there needs to be a mechanism to propagate the cancel functions back up to the main routine. This is probably going to be a channel, which by definition is not atomic, it is not possible to buffer the channel correctly, and when you start iterating over the cancel functions, more routines could be starting up, so closing the channel is impossible. We need a channel that has a dynamic buffer, which is ridiculous. What about cancel functions in the channel (or ones that were read from the channel, but have since been called already)? The call should be a NOOP, but they're unable to be GC'ed, and just mean the final loop to explicitly cancel running routines will take longer. While we're cancelling routines sequentially, they might be performing tasks that really don't want them to do.

The alternative as suggested was implicitly propagating the cancellation, but I'll cover the issues I see there later.

Contexts are wrapped

Because contexts are wrapped (e.g., a call taking in the cancel context from main can pass it on after wrapping it using something like wrapped := context.WtihValue(ctx, someKey, someValue). In doing so, the cancel call will still propagate, but routine-specific values can be passed through to routines that need contextual values.

Context allows routines to be tied to request context

A default background context (with cancellation from main) can be used, but in specific calls (e.g. handling a request) certain routines can be tied to the request context. Having channels to a channel that requires an explicit call to cancel(), does provide the same granularity of control. I can start a DB transaction while a gRPC stream is open, but I want to cancel any transactions relevant to that request by passing in the context. This proposal relies on a blocking <-requestCtx.Done() to then manually call one or more cancel() functions. This adds clutter and is error-prone. It does not remove the need of context. It simply relies on the code being bug-free to avoid sleeping routines, or DB transactions from cancelling when they should. What's more: looping over cancel calls is not atomic. Race conditions can become an issue, and surely that's something we ought to avoid.

Opting out of child routine cancellation is dangerous

If the main context is cancelled, it means the application is going to terminate. There should be no opt-out. The proposal to allow child routines to opt-out of parent cancellation is a bad idea.
The biggest oversight here, IMO, is that there seems to be no definitive answer to what would happen to the second child routine in this particular case:

cancel := go func() {
    go func() {                         // Propegated
        ...
    }()
    defer log.Print("parent routine returns")
    cancel := go func() {         // Not propegated
        ...
    }()
   defer cancel() // cancel child on return
}
// ...
cancel() // cancels the parent, and the first child routine

We're highly likely to end up with no way to cancel the second child routine. If the main routine contains a defer cancel(), will it be invoked? If it does, the cancel call essentially propagates to the children. Clearly, this defer should not be invoked. Should the defer log.Print() call be executed? It's unclear. What about constructs like this:

cancel := go func() {
    cancel := go func(){}() // non-propagating cancel for child
    defer func() {
        log.Println("routine returns") // <-- this is fine
        cancel() // cancel child routine // <-- this should be skipped??
    }()
}()

essentially, defer becomes a "suggestion" not a guarantee. The use of defer, then, would become an anti-pattern, bad practice, and dangerous.

More on propagating cancellation

I'm not a contributor to the core packages or compiler (yet). What I do know having read through quite a lot of the docs and code, is that implementing this proposal will require substantial work on the compiler side of things (the propagation of cancel will be a compile time thing). It'll also add significant complexity to the runtime. A simple example: if I cancel a parent routine, in what order should the cancellation take effect? First cancel the parent, then the children? No. That will cause the problem I highlighted WRT channels closed by the parent, leaving the child free to write to it. Child to parent is safer, but this creates a new problem:

cancel := go func() {
    ch := startSomething() // this call internally starts a new routine We don't know whether it's going to be cancelled or not ==> bad
    for v := range ch {
       // do something. This blocks the parent until the child routine closes the channel
    }
}()

If the child gets cancelled first, the parent routine will return before it gets cancelled. Depending on what we'd do with defer calls, this might be an issue.

Another issue: What if I want to have the caller decide whether or not the child routine should be cancelled when the parent is or not, I can do something like this:

func doSomething() (<-chan struct{}, cancelFunc) {
}

// then:
cancel := go func() {
    ch1, _ := doSomething() // cancel propagates
    ch2, cancel := doSomething() // cancel propagates
}()

Not only is this ugly and unsafe, I would argue this is overloading the ignoring _ token. Is an error being ignored, or not? This is not helping you to write self-documenting code. It's a maintenance nightmare. Far more than a slight over-use of the context package is.

Conclusion:

The benefits of the context package are plenty:

  • It's established, well understood, and trivial to use
  • It allows for granular control (tie routines to a request context, passing in values to certain routines, contexts with timeouts, ...) most of these things need to be manually implemented if this proposal becomes the new norm
  • cancelling a context is a single fire-and-forget call. Manually having to cancel all routines on application termination requires either passing up all cancel calls to the main routine somehow (channel?), a special cancel case, or simply not caring and hoping for the best. There is a huge problem with looping over the cancel functions in the main routine as explained above.

The issues with this proposal - other than directly negating the added value of a context are many:

  • Any defer call in a routine should be examined, making sure it's not a defer cancel(). deferred closures are even more of a mess.
  • It's practically impossible (without changing the chan type) to propagate the cancel funcs back to the main routine.
  • As demonstrated, this approach is guaranteed to lead to race conditions.
  • On race conditions: how can the race detector be expected to make sense of this?
  • routine leaks are almost certainly going to be a big problem.
  • When you say routines, you're probably thinking channels. If child routines can opt-out of cancelling the context, what happens to a channel created by the parent routine? The parent routine might be consuming the channel. If cancelled, I'd expect the parent routine to have a defer close(ch) somehwere. The child routine that isn't cancelled might be pushing stuff on that channel. This leads to a runtime panic.

I could spend quite some time listing more problems, but I hope it's clear from this list of highlights that this proposal will

  1. Not eliminate the context package (request context, values in context, timeout contextx, etc..)
  2. Introduce a lot of boilerplate code to try and manage the cancellation of routines somewhat
  3. It'll try to manage the cancellation properly, but the problem is one without a solution. It's almost certain to be impossible to do so.
    4, most likely compromise the defer mechanism
  4. Compromise the race detector
  5. Compromise data integrity in real world applications
  6. Dramatically increase the possibility of runtime panics owing to parent routines closing channels that are used by child routines that aren't cancelled when the parent is.
  7. It's not clear what the status of child routines that are initialised through a synchronous call will be.

I could go on for quite some time pointing out oversights, issues, risks, problems, whatevers, and what-ifs. I'm going to leave it here. I hope I've provided enough reasons why this proposal is, IMHO, something that should not make it into the language. To sum it up in a one-liner, though:

The context package solves a myriad of problems. This proposal might solve 1 or 2 problems, whilst introducing at least 10 new ones.

Loading

@faiface
Copy link
Author

@faiface faiface commented Oct 4, 2018

@EVODelavega I'll get back to you tomorrow and explain why, but in short: I'm sorry, but you must've misunderstood the proposal in significant ways because the majority of the supposed problems you pointed out simply don't exist. Literally, every problem listed at the end of your post isn't real. If it's not clear, I'll explain in detail tomorrow.

Loading

@tv42
Copy link

@tv42 tv42 commented Oct 5, 2018

This:

goods := make(chan Good)
cancel := go getGoods(goods)

select {
case x := <-goods:
    // enjoy the goods
case <-time.After(5 * time.Second):
    err := cancel()
    return errors.Wrap(err, "timed out")
}

is a lot code for doing this:

ctx, cancel := context.WithTimeout(ctx, 5 * time.Second)
defer cancel()
goods, err := getGoods()

It also forces a creation of an extra channel goods and a goroutine, which aren't there in the context-based version.

Loading

@turtleDev
Copy link

@turtleDev turtleDev commented Oct 5, 2018

I'm not sold on this proposal, mostly because it's trying to solve a problem where there already exists a widely used solution baked into the standard library.

Don't fix what isn't broken. Sure context basically becomes a central part of interactions between components, but that is about it. It doesn't make code hard to understand or unreliable. It doesn't make it hard to reason about the semantics.

Loading

@shankaryoga21
Copy link

@shankaryoga21 shankaryoga21 commented Oct 5, 2018

I am not understand . how to go please give idea

Loading

@wsc1
Copy link

@wsc1 wsc1 commented Oct 18, 2018

@freman

The "context" package could still be kept for the purpose of goroutine-local data, however, this purpose does not cause it to spread, so that's fine.

@faiface your observation that context spreads and is contagious is quite motivating to rethink the problem. The fact that the solution is a contagious thing called context is by itself an indication that there is a design problem somewhere that go2 could solve.

I wrote a dedicated cancellation mechanism about the same time context's started getting into the std lib. it supports a wider asynchronous interface to long running tasks.

That said, proliferation of things called contexts, as fundamentally distasteful as that combination is, are widely used and relied upon. So any solution will likely need to work seamlessly with them, if only for arriving at consensus.

Loading

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 23, 2018

This proposal seems to be a bit of syntactic sugar around what one can do today using context.Context. Syntactic sugar can be OK, but there are several other things we could do if we change the go statement to return a value. It's not clear that this is the best choice here, as it doesn't seem to add much to what we can do already.

Closing this issue in favor of a more general context discussion in #28342.

Loading

@golang golang locked and limited conversation to collaborators Oct 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet