Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: runtime: add WaitIdle function #65336

Closed
gh2o opened this issue Jan 29, 2024 · 18 comments
Closed

proposal: runtime: add WaitIdle function #65336

gh2o opened this issue Jan 29, 2024 · 18 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal
Milestone

Comments

@gh2o
Copy link

gh2o commented Jan 29, 2024

Proposal Details

I propose adding a new function WaitIdle to runtime.

The function will have the following function signature:

func WaitIdle() <-chan struct{}

WaitIdle will return a channel that will be closed at the next moment that no goroutines are runnable, i.e. all P's are idle. This channel would be woken up when pidleput() is called for the last active P.

The main use case of the WaitIdle function is for tests, where a user or library may need to wait for all downstream effects of a given wakeup to complete before proceeding with the test case.

For example, a hypothetical mock time implementation may be used by hypothetical user code in a test harness as follows:

ticker := mocktime.Ticker(time.Second)
for {
  <-ticker
  expensiveComputationHere()
  t := mocktime.Now()
  fmt.Println(t)
}

In order for the call to mocktime.Now() to be deterministic, the implementation must wait until the effects of waking up the ticker channel have completed, e.g. waiting until the goroutine with the user code is once again waiting on ticker or some other channel. Some mock time implementations use a stand-in method such as time.Sleep(1 * time.Millisecond) to wait for user goroutines to settle; however, this is unreliable for CPU-heavy goroutines, and slow for goroutines that execute minimal code upon wakeup. As long as the user code does not call the real time.* functions or other I/O functions, using WaitIdle() here would provide both determinism and a performance benefit.

Implementation of proposal here.

@gh2o gh2o added the Proposal label Jan 29, 2024
@gopherbot gopherbot added this to the Proposal milestone Jan 29, 2024
@mauri870 mauri870 added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 29, 2024
@bcmills
Copy link
Contributor

bcmills commented Jan 29, 2024

For reference, there is currently a similar function (but with a narrower scope) in the main repo:
https://cs.opensource.google/go/go/+/master:src/runtime/pprof/pprof_test.go;l=1023-1026;drc=9cdcb01320d9a866e46a2daedb9bde16e0d51278.

This also demonstrates that it is currently possible to implement this function as a third-party library, albeit a somewhat fragile one — it depends on the exact format of runtime.Stack goroutine dumps.

@seankhliao
Copy link
Member

would this be the same as #44026 ?

@gh2o
Copy link
Author

gh2o commented Jan 29, 2024

@bcmills I'm currently using a similar implementation, but created this proposal to see if there is interest in adding first-class support for it to the runtime, seeing that it could be a common use case.

@seankhliao This is different from a deadlock detector, which detects the case where all goroutines are in the waiting state, there are no active timers or netpoll sockets, and really only works when cgo is disabled. This only considers the case where all goroutines are in the waiting state, which can be useful for multiple real-world uses:

  • For tests, advancing mock time once all timers have fired, and all effects of firing such timers have settled
  • For tests, ensuring that there are no spinning / busy-looping background goroutines after ending the test case
  • Running idle work when the runtime has quiesced and there are spare CPU cycles

@gh2o
Copy link
Author

gh2o commented Mar 6, 2024

This proposal was mostly inspired by the this function in the Rust Tokio library, which implements a feature that I found immensely helpful for writing deterministic unit tests: it waits until the runtime has no work to do, then auto-advances the simulated runtime timer to the the next wakeup time.

Adding WaitIdle as proposed would allow for reliably implementing this feature for mock time implementations. Test code that purely utilizes the mock time implementation without touching time.* functions would be able to run deterministically.

@gh2o
Copy link
Author

gh2o commented May 15, 2024

Hi all, just wanted to bump this proposal, and please let me know if there's anything that needs to be done on my end :)

@rsc
Copy link
Contributor

rsc commented May 15, 2024

@neild has been investigating similar issues. There's something useful to do here, but I'm not sure this exact API is it.

@neild
Copy link
Contributor

neild commented May 15, 2024

WaitIdle as proposed here is global. I think that anything we provide should be scoped to a specific set of goroutines, so that tests can be isolated from one another.

You can implement something like this today using runtime.Stack and some horribly fragile parsing. See https://go-review.googlesource.com/c/net/+/584895/2/http2/sync_test.go for an example of this approach. It's fragile and fiddly to implement, but extremely useful in testing concurrent code.

Based on my experience with that approach, I think my ideal API might be something like:

// A Group is a collection of cooperating goroutines.
type Group struct {}

// Go runs f in a new goroutine, and returns a Group containing the new goroutine and its descendents.
func Go(f func()) *Group {}

// Wait blocks until every goroutine in f is idle.
// A goroutine is idle if it is blocked on a channel receive,
// mutex acquisition,
// or select statement with no cases.
// A goroutine is not idle if it is blocked on a timer or I/O operation.
func (*Group) Wait() {}

I don't really like the "Group" name, but I don't have a better one.

Idle-wait operates on a tree of goroutines (a goroutine, and all goroutines transitively created by it). This seems like the right approach for testing concurrent functions.

Perhaps more controversially, Wait does not consider a goroutine which is sleeping or blocked on IO to be idle. A goroutine which is sleeping will eventually wake: It is not idle. A goroutine reading from a network connection may or may not wake in the future, but the runtime cannot say that it will not. In particular, if we write to a local network socket and then immediately wait, a goroutine reading from the other end of the socket might appear to be briefly idle until the data traverses the kernel.

Limiting Wait to cases when a goroutine is deterministically and permanently blocked should make it a bit harder to produce flaky tests with it, and could allow for telling the race detector that it creates a happens-before relationship.

@rsc rsc moved this from Incoming to Active in Proposals May 15, 2024
@rsc
Copy link
Contributor

rsc commented May 15, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@ChrisHines
Copy link
Contributor

A goroutine is idle if it is blocked on a channel receive,

A goroutine which is sleeping will eventually wake: It is not idle.

These two seem to be in conflict for a goroutine that does <-time.After(...)? That looks like a channel read that is considered idle, but is semantically the same as time.Sleep(...) which would not. Is it intended to treat timer driven channels differently?

@neild
Copy link
Contributor

neild commented May 15, 2024

These two seem to be in conflict for a goroutine that does <-time.After(...)?

In my ideal world, we'd treat timer channels differently. Whether that's practically possible, I don't know.

@neild
Copy link
Contributor

neild commented May 16, 2024

Filed #67434 with an alternate idea, covering both idle detection and fake timers.

@gh2o
Copy link
Author

gh2o commented May 18, 2024

WaitIdle as proposed here is global. I think that anything we provide should be scoped to a specific set of goroutines, so that tests can be isolated from one another.

For the uses cases defined here, I do think it's acceptable for this to be a global construct because restricting the behavior over a set of goroutines effectively gives each goroutine an "identity", which the Go runtime ideologically tries really hard to avoid, e.g. the library intentionally avoids exposing a goroutine ID for this reason.

In the case where WaitIdle() is called by the mock time implementation from multiple unit tests running in parallel, each unit test would effectively proceed in lockstep: the closure of the channel returned by WaitIdle() would wake up multiple goroutines. In theory, this is slightly more pessimistic than a per-group WaitIdle(), but in practice I expect the performance impact of doing so to be negligible.

You can implement something like this today using runtime.Stack and some horribly fragile parsing. See https://go-review.googlesource.com/c/net/+/584895/2/http2/sync_test.go for an example of this approach. It's fragile and fiddly to implement, but extremely useful in testing concurrent code.

I am currently using this approach with my mock timer implementation, but runtime.Stack(b, true) is a very expensive call due to having to Stop The World. Adding WaitIdle() to the runtime would substantially reduce the overhead while reducing the fragility.

@rsc
Copy link
Contributor

rsc commented May 23, 2024

It seems like closing this as a duplicate of #67434 is reasonable.

@gh2o
Copy link
Author

gh2o commented May 28, 2024

@rsc

Though the general idea of this proposal and #67434 are similar, I believe it is worth keeping this proposal open and discussing both options in parallel due to some major differences:

  • Goroutine groups vs global wait
  • Hooking into standard time package vs not touching standard library
  • Definition of "idle": all P's idle vs goroutines waiting on channel/mutex/select

@ianlancetaylor
Copy link
Member

Those are differences, but are they advantages? What kind of tests would be easier to write with this proposal than with #67434?

@rsc
Copy link
Contributor

rsc commented May 29, 2024

I agree that the two proposals have differences, but it seems clear we are only going to do one. The issues you list seem like downsides of this proposal to me:

  • Global wait is harder to scope.
  • Hooking into time makes sure that production code won't use this functionality.
  • All P's idle sounds less robust than all goroutines waiting, but I might misunderstand that.

If you feel strongly that one of these details is wrong in #67434, please comment there.

@rsc
Copy link
Contributor

rsc commented May 30, 2024

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals May 30, 2024
@rsc
Copy link
Contributor

rsc commented Jun 5, 2024

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc moved this from Likely Decline to Declined in Proposals Jun 5, 2024
@rsc rsc closed this as completed Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal
Projects
Status: Declined
Development

No branches or pull requests

9 participants