x/sync/errgroup: propagate panics and Goexits through Wait #53757
Comments
Change https://go.dev/cl/416555 mentions this issue: |
(CC @neild @rogpeppe @kevinburke from CL 134395.) |
Want to try writing out the doc update for |
See https://go.dev/cl/416555 for a draft. 😁 |
I'm not sure about the compatibility claim. If the program was exiting due to unrecovered panic before, now it's not, and it might be in a more broken state as it keeps running. I'd feel better if this was only about panic and not Goexit. Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow? |
This proposal has been added to the active column of the proposals project |
That's true, although in general the program won't continue running for long anyway: if the group has an associated I suppose that one alternative would be to propagate the panic to |
Not that it is, but that it isn't — But today |
It seems plausible, although it also seems plausible that it might break things. Does anyone object to this change? |
Based on the discussion above, this proposal seems like a likely accept. |
There were a couple objections on the original CL 134395, mostly relating to the complexity of the "double defer sandwich" technique it used, which appears to still be used in the new CL. Have those objections been withdrawn? It is not clear to me how this new proposal differs from the old draft CL. |
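The "double defer sandwich" mentioned here is a pattern that nests two deferred functions so that a normal return, a panic, and a runtime.Goexit can be told apart. Below is a minimal sketch using only the standard library. The names classify, outcome, and exitGoroutine are hypothetical and chosen for illustration; this is not the code from either CL.

```go
package main

import (
	"fmt"
	"runtime"
)

// outcome describes how a function call ended (hypothetical type).
type outcome string

const (
	returned outcome = "returned"
	panicked outcome = "panicked"
	goexited outcome = "goexit"
)

// exitGoroutine ends the calling goroutine via runtime.Goexit.
func exitGoroutine() { runtime.Goexit() }

// classify runs f on a new goroutine and reports how f terminated,
// along with the recovered value if f panicked.
func classify(f func()) (outcome, any) {
	type result struct {
		how outcome
		val any
	}
	done := make(chan result, 1)
	go func() {
		normalReturn := false
		recovered := false
		var val any
		// Outer defer: runs last and reports the outcome.
		defer func() {
			switch {
			case normalReturn:
				done <- result{returned, nil}
			case recovered:
				done <- result{panicked, val}
			default:
				// Neither a return nor a recovered panic: Goexit.
				done <- result{goexited, nil}
			}
		}()
		func() {
			// Inner defer: stops a panic, if there is one.
			defer func() {
				if !normalReturn {
					val = recover()
				}
			}()
			f()
			normalReturn = true
		}()
		// Reached only if the inner function returned, which happens
		// after a normal return or a recovered panic. A Goexit keeps
		// unwinding and never reaches this line.
		if !normalReturn {
			recovered = true
		}
	}()
	r := <-done
	return r.how, r.val
}

func main() {
	fmt.Println(classify(func() {}))
	fmt.Println(classify(func() { panic("boom") }))
	fmt.Println(classify(exitGoroutine))
}
```

The cost of this approach is the readability concern raised in the review: the correctness of the outer defer depends on which statements are skipped during unwinding, which is easy to get wrong when the code is modified later.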
This depends on the notion of "broken" we use. It is certainly correct if we use "broken" to mean "crashing". But changing a crashing program into a non-crashing (but still broken) one is often worse. It essentially changes a crash-failure model into a byzantine-failure model, which is harder to reason about. I don't know how often this will be relevant in practice. But as a staunch defender of "panic should crash the program", I feel a bit queasy. I would not be happy if I ever came across code that I intended to crash but that instead continues to run in a broken state.
I disagree with this. If the errgroup is used in an |
@magical, the comments on the original CL are code-review comments. This is a proposal, and part of the point of the proposal process is to resolve the sorts of design questions raised on that CL. |
@Merovius, I can't reasonably consider a Go program that exits by unhandled-panic to be anything other than “broken”. That said, I agree that it's important for Go users to be able to diagnose broken programs, because we all can and do make mistakes. And you're correct that propagating panics in this way would convert some (unknown) fraction of programs that currently fail by unhandled-panic to instead fail by deadlock, which I agree is often harder to diagnose.
Under this proposal, the Some subset of programs that would have terminated by panic would now recover from that panic (due to a However, it's worth pointing out that all of those subsets are contained within the broader set of “programs affected by unexpected-panic bugs”, and — especially now that we have native fuzzing as of Go 1.18! — those unexpected-panic bugs are themselves tractable to uncover and fix through testing. It's also worth noting that the only thing that could hold up the @rsc and I discussed the implementation a bit last week, and we think it's possible to structure the panic-recovery such that the panicking goroutine remains alive (but blocked) until the panic has propagated to |
While I agree that this change is quite desirable, I'm also not convinced by the backward compatibility of such a change. Nowhere in the errgroup docs we say that One could argue that not calling
This usage is perfectly legitimate but would delay the panic by a potentially unbounded amount of time (when the channel is closed, that could happen e.g. when the worker is shutting down gracefully at process exit). We could say that not just As I wrote, I am in almost complete agreement this would be desirable - maybe in a v2 of the library? or as an opt-in behavior in the current version? - but in v1 I would argue changing the default behavior may silently break valid assumptions made by users in potentially very dangerous ways. |
@bcmills To me, while both "concurrently running goroutines will continue to execute while a panic unwinds the stack" and "the program will continue to run when something upstack from I'm not vehemently opposed to this change. I think the cases where it would lead to real, practical problems are very rare. But I don't think the change in behavior can just be waved away as "eh, it's broken anyways". Moreover, I'm personally not convinced that this is a good change in and of itself.
I would argue that this is simply observing that calling This change would make it not-a-bug in some cases, but we certainly can't make it not-a-bug in all cases. So the restriction would have to stay in the docs and people will continue to have to be aware of it. Arguably, to someone who is aware of it,
I don't understand the scenario of concern.
In general, ISTM that there are two different mindsets regarding I find that frustrating. And to me, this change muddies the waters even further. But, as I said, I'm not vehemently opposed. If it happens, it happens. Without a consistent decision about which mindset to apply, the waters will stay muddy either way; this change on its own won't really move the situation much in either direction. |
This is an interesting point. I will leave this as 'likely accept' for another week to allow the discussion to see if we can come to an agreement about how much this matters. Thanks. |
I support the addition of the "recover" behavior to the errgroup as an option.
// Requires imports "fmt" and "runtime".
func recoveredFn(f func() error) func() error {
	return func() (err error) {
		defer func() {
			// Convert a panic in f into an ordinary error that carries a stack trace.
			if r := recover(); r != nil {
				buf := make([]byte, 64<<10)
				buf = buf[:runtime.Stack(buf, false)]
				err = fmt.Errorf("errgroup: panic recovered: %v\n%s", r, buf)
			}
		}()
		return f()
	}
} |
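The wrapper in this comment can be exercised without errgroup itself. The sketch below repeats it in a self-contained program and calls the wrapped functions directly; the helper firstLine is hypothetical and exists only to trim the stack trace from the message.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// recoveredFn wraps f so that a panic in f is returned as an error
// instead of crashing the program.
func recoveredFn(f func() error) func() error {
	return func() (err error) {
		defer func() {
			if r := recover(); r != nil {
				buf := make([]byte, 64<<10)
				buf = buf[:runtime.Stack(buf, false)]
				err = fmt.Errorf("errgroup: panic recovered: %v\n%s", r, buf)
			}
		}()
		return f()
	}
}

// firstLine returns the error message without the stack trace
// (hypothetical helper for this example).
func firstLine(err error) string {
	return strings.SplitN(err.Error(), "\n", 2)[0]
}

func main() {
	ok := recoveredFn(func() error { return nil })
	bad := recoveredFn(func() error { panic("boom") })

	fmt.Println(ok())
	fmt.Println(firstLine(bad()))
}
```

Note that this differs from the proposal in two ways: the panic becomes an ordinary error returned from the wrapped function rather than a panic raised by Wait, and a runtime.Goexit passes through the wrapper untouched because recover returns nil in that case.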
It seems to me that if the behavior is optional, there won't be many users who enable it. What would the option look like? A field on the errgroup? A context value? A global environment variable? Something else? A field on the errgroup requires the user to add code at the errgroup call site. If you need to do that, You could put a value into context to control the behavior. This would work across library boundaries. However, Global environment variable would control the behavior for a single run of the program and would work across library |
Moving back to active. I think the Wait issue needs more discussion and an explicit decision. |
This proposal has been added to the active column of the proposals project |
That's true, but until very recently (https://go.dev/cl/405174 / #27837) the only point to using an
That's true, although I think the I'd be more worried about the cases that use the zero |
If this proposal is accepted, then I intend to file a separate issue to clarify the
The main motivating scenario is: a library accepts a callback and executes it one or more times sequentially. The library is changed to also do some other concurrent work. The concurrency properties of the callback are otherwise unchanged: all of the happens-before relationships that used to exist continue to do so. Why should the callback need to dictate whether it is called on the “main” goroutine or some other goroutine?
I agree. However, I would argue that that conflict leads to a clear conclusion for library authors: a library cannot presume either behavior. A library must not panic in normal operation (because the panic might crash the program), but also must not assume that a panic from (or through) the library will not be recovered (so, for example, it must The purpose of this proposal is to make it easier to use |
Keying off the context doesn't seem like the right way to define semantics here. Given that x/sync/errgroup is in x and can be rolled back easily, it seems like maybe we should try the change, with the "breakage", and see if anyone runs into problems. If they do, we'll know that we can't move forward with this change. And otherwise, we'll have done it. Thoughts? |
Based on the discussion above, this proposal seems like a likely accept. |
To at least partially address the "not calling This won't help if the GC is disabled, or with some of the other issues raised (my worker loop example, or its extension to the zero errgroup), but it would at least help in more ordinary cases. Alternatively, we could document that Wait must be called, or the panic will be silently swallowed, and also that the panic will be delayed until Wait returns, so newly submitted work may start executing even if a work item has already panicked.
For the zero errgroup especially I can't say what would be the least surprising behavior, but I would argue that not propagating the panic would be quite surprising. |
Let's not drag finalizers in. If this turns out to be a real problem, we should probably think about undoing the change. @aclements points out that we could also do a vet check for errgroups that are declared but Wait is never called. |
No change in consensus, so accepted. 🎉 |
What's the status of this? |
The proposal has been accepted and there is a work-in-progress change at https://go.dev/cl/416555. |
use simplified handling while waiting for golang/go#53757 Signed-off-by: Artsiom Koltun <artsiom.koltun@intel.com>
While the Go team is not in a hurry, I created a panic-safe drop-in replacement for the errgroup package. See https://github.com/kucherenkovova/safegroup |
Any date for the release of this? |
Two years later, why has this still not been released? |
Background
The handling of panics and calls to runtime.Goexit in x/sync/errgroup has come up several times in its history:
- […] panicgroup API to propagate or handle panics
- […] t.Fatal and/or t.Skip within a Group in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.
- […] runtime.Goexit calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)
Proposal
I propose that:
1. The (*Group).Wait method should continue to wait for all goroutines in the group to exit. However, once that condition is met, if any of the goroutines in the group terminated with an unrecovered panic, Wait should panic with a value wrapping the first panic-value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited via runtime.Goexit, Wait should invoke runtime.Goexit on its own goroutine.
   - […] panic by Wait should include a best-effort stack dump for the goroutine that initiated the panic.
   - […] recover for error-handling (despite our advice to the contrary), if the recovered value implements the error interface, the value passed to panic by Wait should also implement the error interface, and should wrap the recovered error (so that it can be retrieved by errors.Unwrap).
2. The Context value returned by errgroup.WithContext should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits via runtime.Goexit.
   - […] Wait has an abnormal status to report, and thus should shut down all work associated with the Group so that the abnormal status can be reported quickly.
Specifically, if Wait panics, the panic-value would have either type PanicValue or type PanicError, defined as follows:
Compatibility
Any program that today initiates an unrecovered panic within a Go or TryGo callback terminates due to that unrecovered panic, so recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.
A valid program could in theory call runtime.Goexit from within a Go callback today. However, the vast majority of calls to runtime.Goexit are via testing.T methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented errgroup.Group API today in a way that would cause Wait to always deadlock if runtime.Goexit were called, so any caller relying on the existing runtime.Goexit behavior is assuming an implementation detail that is not guaranteed.
In light of the above, I believe that the proposed changes are backward-compatible.