Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Go 2: goroutine collector or closer to prevent goroutine leaks #41414

Open
Splizard opened this issue Sep 16, 2020 · 27 comments
Open

proposal: Go 2: goroutine collector or closer to prevent goroutine leaks #41414

Splizard opened this issue Sep 16, 2020 · 27 comments

Comments

@Splizard
Copy link

@Splizard Splizard commented Sep 16, 2020

Memory management in Go is implicit and this is considered to be a good thing. I think this needs to be kept in mind when discussing language features that are implicit vs explicit.

Memory leaks are one of the issues that an implicit garbage collector is able to prevent. All memory is accounted for in Go.

However, goroutines leaks have replaced memory leaks and just like manual memory-managment, goroutine-management is a pain. Just like in C where developers must think about the lifetime of any memory they allocate. In Go, developers need to think about the lifetime of any goroutine they create. In both cases, this is done explicitly. In C with free and in Go with context.Context. Both are optional and can result in leaks.

@networkimprov notes this about goroutine leaks.

for stalled I/O syscalls, the leak encompasses an OS thread, which is typically a lot more memory than a goroutine.

The question would be then, if Go has an implicit garbage collector, why doesn't it have an implicit goroutine collector.

My proposal includes a small language change so that goroutines are referenced and therefore implicitly collectable.

package main

import "fmt"

var leak chan struct{}

func DoSomething() {
    <- leak
}

func main() {
    //goroutine is a new builtin 'non-comparable reference type' that is returned by go calls.
    var job goroutine = go DoSomething()

    //when a value of the goroutine type is *collected* by the garbage collector, 
    // 1. blocking IO operations inside that goroutine return an "end of goroutine" error.
    // 2. subsequent blocking IO operations inside that goroutine panic with "end of goroutine".
    // 3. blocking channel operations panic with "end of goroutine".
    job = nil

    //Prevent unused variable error.
    fmt.Println(job) //Output: nil
}

To extend the lifetime of a goroutine, assign it to a global variable or appropriately scoped value (ie a http.Request).
To cancel a goroutine, remove any references to it that are held by your program.

This proposal would also reduce the need of context.Context being added to every function signature and addresses some of the issues of #28342

If backwards compatabillity is desired, go calls made as a statement should not have their goroutines collected. As for future compatabillity, the goroutine type is flexible and other functionality for dealing with goroutines (ie return values, recovering from panics, and/or priority) could be added in the future without breaking compatabillity.

Granted, this wouldn't prevent all goroutines leaks ie (infinite loops without blocking operations or error handling) but I believe it would catch most real-world cases.

This proposal is similar to #27982 however it doesn't have the same objections as that proposal.
Namely, there is no parent-child relationship between goroutines and this is not exactly syntactic sugar for contexts.

Closable goroutines

An alternative to having a goroutine collector, is to have closable goroutines which are closed with the builtin close() function.
When closed they act as if they were collected by a goroutine collector as described above.
See comment below.

var job = go DoSomething()

//when a value of the goroutine type is closed.
// 1. blocking IO operations inside that goroutine return an "goroutine closed" error.
// 2. subsequent blocking IO operations inside that goroutine panic with "goroutine closed".
// 3. blocking channel operations panic with "goroutine closed".
close(job)

FAQ

What blocking operations return an error or panic?
The only blocking IO operations that raise an error or panic are channel operations and system-call backed functions that can return an error such as the methods on os.File or net.Conn, all other blocking operations block as usual.

@gopherbot gopherbot added this to the Proposal milestone Sep 16, 2020
@gopherbot gopherbot added the Proposal label Sep 16, 2020
@davecheney
Copy link
Contributor

@davecheney davecheney commented Sep 16, 2020

The question would be then, if Go has an implicit garbage collector, why doesn't it have an implicit goroutine collector.

I don’t understand this statement, go does have such a facility — when a goroutine exits any memory references that are now unreachable are freed. This includes the goroutines stack and it’s associated runtime data structures.

What I think you’re asking for is explicit goroutine collection, that is, a way to cause a goroutine to exit this freeing its resources. Is that what you are asking for?

@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

go does have such a facility — when a goroutine exits any memory references that are now unreachable are freed.

@davecheney that is garbage collection, unreachable memory is automatically freed by the runtime. I am referering to cleaning up 'zombie' goroutines that are no longer completing any useful function because whatever created them has dropped every reference to them presumably because the work is complete, cancelled or there was a panic.

I suppose it's also addressing #5308 in some form.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 16, 2020

Interesting idea.

If a collected goroutine panics due to a blocking channel operation, that will most likely terminate the program. Is that what we want?

What if the goroutine is holding a lock when it is collected, then it is quite possible that the lock will not be released, and the program will deadlock. Is that what we want? It seems hard to debug.

There are quite a few ways that a goroutine can block. It can be executing a blocking system call, or it can be blocked in cgo code. It can be blocking trying to acquire a mutex. It can be sleeping due to time.Sleep. We would have to identify each possibility and figure out how to handle it. That is a lot of library interaction for a language feature.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Sep 16, 2020

Thanks for confirming. I have two additional questions;

  1. What are your criteria for identifying zombie goroutines?
  2. What is preventing those zombie goroutines from exiting?
@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

@ianlancetaylor

If a collected goroutine panics due to a blocking channel operation, that will most likely terminate the program. Is that what we want?

That's not ideal, I think this type of panic could be automatically recovered by the goroutine before exiting.

What if the goroutine is holding a lock when it is collected, then it is quite possible that the lock will not be released, and the program will deadlock. Is that what we want? It seems hard to debug.

You would expect locks to be released with defer, on errors, or after recovering from a panic.

There are quite a few ways that a goroutine can block.

The proposal only applies to channel operations and system IO that can return an error such as methods on os.File and net.Conn. Any other blocking operations do not panic and will continue to block as usual.

@davecheney

What are your criteria for identifying zombie goroutines?

This proposal adds goroutine reference types that are returned by go invocations, a goroutine is considered to be a 'zombie' when no references to it exist. If I want a goroutine to exist, I keep a reference to it somewhere. If I have no references to the goroutine then it might as well not exist to me

What is preventing those zombie goroutines from exiting?

Some 'zombie' goroutines may exit just fine. Others may be blocked on a IO operations that are taking a very long time or are waiting to recieve on a channel that has no senders anymore. In other words, they think they still have work to do.
Collecting a goroutine doesn't terminate it, the runtime just moves it along so to speak because what it's doing is no longer important. It does this by preventing IO & channel operations so that the goroutine eventually 'gives up' and exits.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Sep 16, 2020

This proposal adds goroutine reference types that are returned by go invocations, a goroutine is considered to be a 'zombie' when no references to it exist. If I want a goroutine to exist, I keep a reference to it somewhere. If I have no references to the goroutine then it might as well not exist to me

So this is the flow of logic

ref := go something()
ref = nil // kills goroutine

Yea? Something like this?

Others may be blocked on a IO operations that are taking a very long time or are waiting to recieve on a channel that has no senders anymore. In other words, they think they still have work to do.

This doesn't sound like the reference to a goroutine model you described earlier. It sounds different.

Collecting a goroutine doesn't terminate it, the runtime just moves it along so to speak because what it's doing is no longer important.

Could you be more precise about "just moves it along". Do you mean send some kind of signal too it?

@networkimprov
Copy link

@networkimprov networkimprov commented Sep 16, 2020

To leverage this, we'd probably keep a map of active goroutines, and therefore name them. Terminating a goroutine then looks like delete(grMap, "process1"). Do you agree, or envision a different scheme?

Wouldn't it be essential in some cases to discover what "process1" is blocked on before deleting it?

The proposal doesn't reference any of the recent discussion re interrupting blocking file ops and/or syscalls (e.g. #41054), but seems to incorporate that goal. Is that accurate?

runtime just moves it along ... by preventing IO & channel operations so that the goroutine eventually 'gives up' and exits.

That could become a CPU-intensive loop if any code is retrying on error. And you can't continue past a mutex that didn't lock.

@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

@davecheney

So this is the flow of logic

Yes, this is what I was illustrating with the example:

//goroutine is a new builtin 'non-comparable reference type' that is returned by go calls.
var job goroutine = go DoSomething()

//when a value of the goroutine type is *collected* by the garbage collector, 
// 1. blocking IO operations inside that goroutine return an "end of goroutine" error.
// 2. subsequent blocking IO operations inside that goroutine panic with "end of goroutine".
// 3. blocking channel operations panic with "end of goroutine".
job = nil

This doesn't sound like the reference to a goroutine model you described earlier. It sounds different.

I gave some examples of why a goroutine wouldn't exit on its own. Whether or not a goroutine will exit is a halting problem and does not have anything to do with being a 'zombie' in this context.
The goroutine is classified as a 'zombie' or 'collectable' strictly when there are zero references to it.

Could you be more precise about "just moves it along". Do you mean send some kind of signal too it?

Not a signal. Once a goroutine has been 'collected', closed or marked as a 'zombie' under this proposal, it is not allowed to perform IO operations such as read/write on files or connections. It is also not allowed to block during a send or recieve channel operation.

Any IO operations that the goroutine makes will return an error such as "end of goroutine".
Any blocking channel operations (read or write) that the goroutine makes will panic with a message such as "end of goroutine".

The rationale here is that the goroutine is able to handle these errors/panics by cleaning up and exiting.
Whoever has the reference to the goroutine effectively has control over its lifetime and when references are freed, goroutines will be able to stop without any extra coordination or channels having to be passed throughout the call stack.
It also addresses the common kinds of goroutine leaks that developers make when working with goroutines.

Here is an example from https://www.ardanlabs.com/blog/2018/11/goroutine-leaks-the-forgotten-sender.html

// [Adapted from Listing 1 in the article]
// leak would normally be a buggy function. It launches a goroutine that
// blocks receiving from a channel. Nothing will ever be
// sent on that channel and the channel is never closed so
// that goroutine would be blocked forever.
func leak() {
	ch := make(chan int)

	//The returned reference is not stored anywhere so the goroutine will be collected.
	//The channel operation will panic and the goroutine will exit, preventing the goroutine leak.
	_ := go func() {
		defer recover() //Recover is added to prevent the panic from bringing down the entire program.
		val := <-ch
		fmt.Println("We received a value:", val)
	}()
}

@networkimprov

To leverage this, we'd probably keep a map of active goroutines, and therefore name them. Terminating a goroutine then looks like delete(grMap, "process1"). Do you agree, or envision a different scheme?

Yes for globally scoped routines that could make sense, or a map to a slice of goroutine values so that you can associate a job with all the goroutines that are related to that job. If you wanted to cancel the job, you delete it from the map and all the goroutines would get collected and if written correctly would cleanup and exit.

Wouldn't it be essential in some cases to discover what "process1" is blocked on before deleting it?

I think you need to provide an example of such a case.

The proposal doesn't reference any of the recent discussion re interrupting blocking file ops and/or syscalls (e.g. #41054), but seems to incorporate that goal. Is that accurate?

Yes, this proposal can address situations where a goroutine is blocked for a long time on a resource without any SetTimeout methods, set a timeout in the map of active jobs for deletion and the goroutine will return an error on the blocking resource and the goroutine should exit.

That could become a CPU-intensive loop if any code is retrying on error. And you can't continue past a mutex that didn't lock.

Well, if the goroutine repeats a blocking IO operation after an error was returned, then it should panic. A "end of goroutine" needs to be handled correctly the first time. Edit: I have added this to the proposal.

@networkimprov
Copy link

@networkimprov networkimprov commented Sep 16, 2020

Wouldn't it be essential in some cases to discover what "process1" is blocked on before deleting it?

For example, a process that alternately waits on two resources, one which blocks indefinitely, the other which blocks for short times, but can become stalled due to external failure. You only want to "collect" this goroutine in the latter case.

The app could keep a status variable alongside the goroutine reference, and update it before & after every blocking op, or the runtime could let you inspect what any goroutine is currently blocked on.

@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

For example, a process that alternately waits on two resources, one which blocks indefinitely, the other which blocks for short times, but can become stalled due to external failure. You only want to "collect" this goroutine in the latter case.

Process the resource that could stall inside of a timeout and pass a buffered status channel so that you can detect if the goroutine was ended due to timeout.

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"time"
)

//UnreliableResource represents a function that could block for a long time.
func UnreliableResource(status chan error) {
	f, err := os.Open("/path/to/unreliable/resource")
	if err != nil {
		status <- err
		return
	}

	b, err := ioutil.ReadAll(f)
	if err != nil {
		status <- err
		return
	}

	status <- err
	return
}

//timeout ends the given goroutine after the given duration
//timeout must be given the only reference to g, ie go timeout(time.Second, go DoSomething())
//otherwise this function does nothing.
func timeout(duration time.Duration, g goroutine) {
	time.Sleep(duration)
}

func main() {
	var status = make(chan error, 1)
	go timeout(30 * time.Second, go UnreliableResource(status))

	//if there is a timeout, error will be "end of goroutine"	
	if err := <-status; err != nil {
		fmt.Println(err)
	}
}
@bcmills
Copy link
Member

@bcmills bcmills commented Sep 16, 2020

Compare #19702.

@bcmills
Copy link
Member

@bcmills bcmills commented Sep 16, 2020

I don't think this proposal is semantically appropriate. The purpose of garbage collection is to “[simulate] a computer with an infinite amount of memory.” On such a system, the blocked goroutines would never be collected, and therefore would never panic or return an error.

We violate that abstraction today using finalizers, but finalizers introduce a lot of other problems (such as #9442, #41362, #7546). I think we should seek to minimize their use, not double-down on them as a mechanism for cancellation.

From that standpoint, #19702 was a stronger proposal: it sought to make the behavior of the program more like a computer with infinite memory, not less so.

@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

@bcmills
I understand the concern, it's related to the disccusion in #38057

To avoid losing the abstraction, goroutines could be explicitly closed with close (then the developer is free to catch leaking goroutines with runtime.SetFinalizer) but this starts to look like a different proposal.

var job = go DoSomething()

//when a value of the goroutine type is closed.
// 1. blocking IO operations inside that goroutine return an "goroutine closed" error.
// 2. subsequent blocking IO operations inside that goroutine panic with "goroutine closed".
// 3. blocking channel operations panic with "goroutine closed".
close(job)

This reads nicely and a timeout function is easier to reason about.

//timeout closes the given goroutine after the given duration
func timeout(duration time.Duration, job goroutine) {
	time.Sleep(duration)
 	close(job)
}
go timeout(time.Second, go UnreliableWork())

Then there is no need for a special goroutine type, the go invocation could instead return a 'returning channel' that will return the result of the function. Closing this channel would close the goroutine. Although you would need multi-value channels for multiple return values which is a little odd.

func DoSomething() (*Result, error) {
 	return nil, errors.New("failed to do anything")
}

var status chan(*Result, error) = go DoSomething()

result, err <- status

This is more like a builtin context.Context.
Should we continue this disccusion here, in #28342 or is it worth creating a new proposal?

@cristaloleg
Copy link

@cristaloleg cristaloleg commented Sep 16, 2020

Why not to use context ?

ctx, cancel := context.WithCancel(context.Background())
go DoSomething(ctx)

time.Sleep(timeout)
cancel()

(btw, DoSomething can be easily found in context package documentation, coincidence? https://godoc.org/context)

@Splizard
Copy link
Author

@Splizard Splizard commented Sep 16, 2020

Why not to use context?

See #28342, there are a number of challenges with context, for example:

  • It creates two classes of functions, ones that know about context and ones that don't.
  • Every function that needs to be cancellable needs to have a ctx context.Context as an argument.
  • Unlike with errors, it has no builtin support for its usecase.
  • ctx context.Context stutters.
  • The name can be confusing and doesn't imply that any kind of cancellation is involved.
  • Is easy to abuse as a defacto untyped goroutine local-storage.
  • Needs to be handled and used correctly otherwise there will be goroutine leaks.

Inexerienced Go developers can easily leak goroutines. If DoSomething is leaky as below, context will not prevent the leak:

func DoSomething(ctx context.Context) {
    var leak chan struct{}
    <- leak
}

Closing or collecting the goroutine as described in this proposal will.

@Splizard Splizard changed the title proposal: Go 2: goroutine collector to prevent goroutine leaks proposal: Go 2: goroutine collector or closer to prevent goroutine leaks Sep 16, 2020
@egonelbre
Copy link
Contributor

@egonelbre egonelbre commented Sep 17, 2020

Is the rough goal to ensure that you don't get goroutine leaks?

If that's the case, then you could have approaches to solving it. For example https://github.com/egonelbre/antifreeze implements a runtime checker for goroutines that were stuck waiting for too long.

In a similar fashion you could have something that monitors goroutines to finish in time, e.g. https://play.golang.org/p/SbBEi4OYOH7.

And a more complicated version that also prints the stack frame https://play.golang.org/p/L1xiObvrT4a

@networkimprov
Copy link

@networkimprov networkimprov commented Sep 17, 2020

Wouldn't it be essential in some cases to discover what "process1" is blocked on before deleting it?
For example, a process that alternately waits on two resources, one which blocks indefinitely, the other which blocks for short times, but can become stalled due to external failure. You only want to "collect" this goroutine in the latter case.

Process the resource that could stall inside of a timeout and pass a buffered status channel so that you can detect if the goroutine was ended due to timeout.

I expect synchronous code like we'd write today. Besides that, before you deliberately terminate a goroutine, you may want to log a stack trace. You could also use that info to decide whether to terminate at all.

grMap["process1"] = go func(a, b tResource) {
   for {
      v, err := waitForever(a)
      if err != nil { break }
      err = waitaMoment(v)
      if err != nil { return }
      v, err = waitForever(b)
      ...
      err = waitaMoment(v)
      ...
   }
}(x, y)

t := time.NewTicker(d)
for {
   <-t.C
   for key, proc := range grMap {
      blkr := proc.Blocker()
      if blkr.Time > max && blkr.Top != waitForever { // .Top is function at top of stack
         delete(grMap, key)
      }
      proc = nil
   }
}
@bcmills
Copy link
Member

@bcmills bcmills commented Sep 17, 2020

@egonelbre, note that a runtime check for permanently-blocked goroutines was proposed (and, IMO unfortunately, rejected) in #37681.

@RLH
Copy link
Contributor

@RLH RLH commented Sep 17, 2020

@networkimprov
Copy link

@networkimprov networkimprov commented Sep 20, 2020

@Splizard you could also note in the issue that for stalled I/O syscalls, the leak encompasses an OS thread, which is typically a lot more memory than a goroutine.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 29, 2020

The proposal here seems too subtle to me.

It seems odd to have a different behavior based on whether the result of the go statement is assigned to a variable. Presumably if the result is not assigned to a variable it is as though it is assigned to a variable that is never changed, rather than having the value discarded (and therefore garbage collected). That seems like the opposite of the natural default.

It seems confusing to have the behavior change specifically for channel operations and for I/O syscalls. The latter isn't all that well defined in any case.

There may well be ways to improve goroutine cancellation and leaks. But this specific proposal seems to me to be tweaking the language to support a particular use case, rather than providing a general and orthogonal mechanism.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 6, 2020

Based on the discussion above, this is a likely decline. Leaving open for four weeks for final comments.

@Splizard
Copy link
Author

@Splizard Splizard commented Oct 7, 2020

It seems odd to have a different behavior based on whether the result of the go statement is assigned to a variable.

This is addressed by explicitly closing goroutines instead of implicitly collecting them, so if the collector option of this proposal is a likely decline, should I create a new proposal about explicitly closeable goroutines?

var job = go DoSomething()

//when a value of the goroutine type is closed.
// 1. blocking Read/Write operations inside that goroutine return an "goroutine closed" error.
// 2. subsequent blocking Read/Write operations inside that goroutine panic with "goroutine closed".
// 3. blocking channel operations panic with "goroutine closed".
close(job)

It seems confusing to have the behavior change specifically for channel operations and for I/O syscalls.

The rationale is that a closed goroutine no longer has permission to block and it should terminate as soon as possible.
Readers and Writers should return an error if they are running on a closed goroutine. Channel operations should panic.
This is the bare minumum that is required to be able to prevent the most common goroutine leaks.

EDIT:
I want to clarify here that IO operations strictly mean methods in the io package such as io.Reader, io.Writer, etc.
Not any blocking system-call. The requirement is that these methods should return an error when the goroutine is closed, if they are able to detect this.

to support a particular use case

What is the particular use case?

@davecheney
Copy link
Contributor

@davecheney davecheney commented Oct 7, 2020

The rationale is that a closed goroutine no longer has permission to block and it should terminate as soon as possible.
Readers and Writers should return an error if they are running on a closed goroutine. Channel operations should panic.
This is the bare minumum that is required to be able to prevent the most common goroutine leaks.

What should happen if a goroutine is blocked waiting for a lock? What should happen if a goroutine currently holds this lock, but the author has chosen not to use defer to free that lock?

@Splizard
Copy link
Author

@Splizard Splizard commented Oct 7, 2020

What should happen if a goroutine is blocked waiting for a lock?

It remains blocked. Unless the lock implementation uses channels, or blocking Read/Write calls in which case a panic or error should be raised and the lock will not be aquired.

What should happen if a goroutine currently holds this lock, but the author has chosen not to use defer to free that lock?

If they are reading or writing to an unbuffered channel while they are holding a lock, then the goroutine panics and the lock is leaked. Much like present Go behaviour, if a goroutine writes to a closed channel whilst holding a lock, it panics and the lock is leaked. The proposal extends this risk to reading from unbuffered channels too.

If you are holding a lock and you perform blocking channel operations, you must defer the release of the lock or catch any potential panics so that the lock doesn't leak.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Oct 7, 2020

Thank you for explaining. The restrictions on locks appear to undermine the utility of this proposal — goroutines waiting for a lock will continue to “leak”.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 7, 2020

This is addressed by explicitly closing goroutines instead of implicitly collecting them, so if the collector option of this proposal is a likely decline, should I create a new proposal about explicitly closeable goroutines?

Let me first say clearly that I think that any proposal around closeable goroutines is unlikely to be accepted.

That said, if you want to pursue this, you should also address the other concerns, not just this one.

The rationale is that a closed goroutine no longer has permission to block and it should terminate as soon as possible.
Readers and Writers should return an error if they are running on a closed goroutine. Channel operations should panic.
This is the bare minumum that is required to be able to prevent the most common goroutine leaks.

Thanks, I think I understand that. The problem is that it is very specific and is not orthogonal. Certain cases work, other cases fail, and the distinction is based on the implementation rather than on any fundamental aspect of the language. There is no deep reason why we say a goroutine will return an error from a Reader but not from, say, a sync.Once that is waiting for the function to complete. I believe the effect will be confusing for users of the language, who will have to remember specific circumstances rather than a simple rule.

What is the particular use case?

Goroutines that only block for I/O or channel operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.