Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: add fields for managing termination signals and pipes #50436

Open
bcmills opened this issue Jan 4, 2022 · 40 comments
Open

os/exec: add fields for managing termination signals and pipes #50436

bcmills opened this issue Jan 4, 2022 · 40 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Jan 4, 2022

Background

#23019 (accepted but not yet implemented; CC @ianlancetaylor @bradfitz) proposed to change exec.Cmd.Wait to stop the goroutines that are copying I/O to and from a completed exec.Cmd; see that proposal for further background on the problem it aims to address. However, as noted in #23019 (comment) and #23019 (comment), any feasible implementation of the proposal requires the use of an arbitrary timeout, and the proposal does not include a mechanism to adjust that timeout. (Given our history with the Go project's builders, I am extremely skeptical that any particular hard-coded timeout can strike an appropriate balance between robustness and latency.)

#31774, #22757, and #21135 proposed to allow users of exec.CommandContext to customize the signal sent to the command when the context is canceled. They were all declined due to lack of concrete demand for the feature (#21135 (comment), #22757 (comment), #31774 (comment)). We have since accrued a number of copies of functions that work around the feature's absence. In the Go project alone, we have:

I'm attempting to add yet another variation (in CL 373005) in order to help diagnose #50014. However, for this variation (prompted by discussions with @aclements and @prattmic) I have tried to make this variation a minimally-invasive change on top of the exec.Cmd API.

I believe I have achieved that goal: the API requires the addition of only 2–3 new fields and no new methods or top-level functions. You can view (and try out) a prototype as github.com/bcmills/more/os/moreexec, which provides a drop-in replacement for a subset of the exec.Cmd API.

Proposal

I propose the addition of the following fields to the exec.Cmd struct, along with their corresponding implementation:

	// Context is the context that controls the lifetime of the command
	// (typically the one passed to CommandContext).
	Context context.Context

	// If Interrupt is non-nil, Context must also be non-nil and Interrupt will be
	// sent to the child process when Context is done.
	//
	// If the command exits with a success code after the Interrupt signal has
	// been sent, Wait and similar methods will return Context.Err()
	// instead of nil.
	//
	// If the Interrupt signal is not supported on the current platform
	// (for example, if it is os.Interrupt on Windows), Start may fail
	// (and return a non-nil error).
	Interrupt os.Signal

	// If WaitDelay is non-zero, the command's I/O pipes will be closed after
	// WaitDelay has elapsed after either the command's process has exited or
	// (if Context is non-nil) Context is done, whichever occurs first.
	// If the command's process is still running after WaitDelay has elapsed,
	// it will be terminated with os.Kill before the pipes are closed.
	//
	// If the command exits with a success code after pipes are closed due to
	// WaitDelay and no Interrupt signal has been sent, Wait and similar methods
	// will return ErrWaitDelay instead of nil.
	//
	// If WaitDelay is zero (the default), I/O pipes will be read until EOF,
	// which might not occur until orphaned subprocesses of the command have
	// also closed their descriptors for the pipes.
	WaitDelay time.Duration

The new Context field is exported only in order to simplify the documentation for the Interrupt and WaitDelay fields. (It was requested and rejected in #46699, but the objection there was my own — due to concerns about the interactions with the API in this proposal. It could be excised from this proposal without damaging anything but documentation clarity.)

The new Interrupt field sets the signal to be sent when the Context is done. exec.CommandContext explicitly sets it to os.Kill in order to maintain the existing behavior of exec.CommandContext, but I expect many users on Unix platforms will want to set it to os.Interrupt or syscall.SIGQUIT instead.

The new WaitDelay field sets the interval to wait for input and output after process termination or an interrupt signal. That interval turns out to be important for many testing applications (such as the Go Playground implementation and the cmd/go test suite). It also generalizes nicely to the use-cases in #23019: setting WaitDelay without Context provides bounded I/O wait times without sending a preceding signal.

Compatibility

I believe that this proposal is entirely backward-compatible (in contrast with #23019). The zero-values for the new fields provide exactly the same behavior as a Cmd returned by exec.Command or exec.CommandContext today.

Caveats

This proposal does not address graceful shutdown on Windows (#22757 (comment); CC @mvdan). However, it may be possible to extend it to do so by providing special-case Windows behavior when the Interrupt field is set to os.Interrupt, or by adding an InterruptFunc func(*Cmd) callback that would also be invoked when Context is done.

The proposed API also does not provide a mechanism to send an Interrupt signal followed by os.Kill after a delay but still wait for subprocesses to close all I/O pipes. I believe the use-cases for that scenario are sufficiently niche to be provided only by third-party libraries: sending SIGKILL to the parent process makes it likely that subprocesses will not know to shut down, so in the vast majority of cases users should either not send SIGKILL at all (WaitDelay == 0), forcibly terminate the pipes to try to kill the subprocesses with SIGPIPE (WaitDelay > 0), or do something platform-specific to try to forcibly shut down an entire process group (outside the scope of this proposal).

Alternatives considered

In #31774 (comment), @bradfitz suggested a field Kill func(*os.Process), which would presumably be added instead of the Interrupt field in this proposal. However, I believe that such a field would be simultaneously too complex and not powerful enough:

  • The Kill field would be too complex for most Unix applications, which overwhelmingly only need to send one of SIGTERM, SIGINT, SIGQUIT, or SIGKILL — why pass a whole callback when you really just want to say which signal you need?

  • A *os.Process callback would still not be powerful enough for Windows applications. If I understand the discussion in #6720 correctly (CC @alexbrainman), CTRL_BREAK_EVENT is sent to an entire process group, not a single *os.Process, so Windows users would also need a mechanism for creating (or determining) such a group, or some completely separate out-of-band way to request that the process terminate (such as by sending it a particular input or IPC message).

Given the above, the Interrupt field seems more ergonomic: it gives the right behavior for Unix users, and if Windows users want to do something more complex they can set Interrupt to nil and start a separate goroutine in between the calls to (*Cmd).Start and (*Cmd).Wait to implement whatever custom logic they want.

@gopherbot gopherbot added this to the Proposal milestone Jan 4, 2022
@gopherbot
Copy link

@gopherbot gopherbot commented Jan 4, 2022

Change https://golang.org/cl/373005 mentions this issue: internal/moreexec: add a utility package for manipulating os/exec.Cmd

@seebs
Copy link
Contributor

@seebs seebs commented Jan 4, 2022

// If the command's process is still running after WaitDelay has elapsed,
// it will be terminated with os.Kill before the pipes are closed.

I think this may be innately race-prone in that theoretically, the process can terminate and be replaced by a new process with that PID between the check and the kill. I have vague recollections of someone having told me there's a fancy new feature in some Linuxes that could bypass this, and it's admittedly a very small window for a very large amount of stuff to happen.

@seebs
Copy link
Contributor

@seebs seebs commented Jan 4, 2022

I'm a bit unclear on the meaning of the exported fields. Are they required to be set prior to launching a command, or can they be changed during the lifespan of the command? What happens if I change c.Interrupt after starting a process? What happens if I write a new value into c.Context while the process is running, and then after doing that close the previous context? etc.

I'm not sure what the answers should be, I'm pretty sure that any reasonably plausible answer will be fine, but I think it should be explicit. I think I'd favor "Context is a read-only field once the command is started, Interrupt and WaitDelay are both writeable".

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jan 4, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 4, 2022

the process can terminate and be replaced by a new process with that PID between the check and the kill.

That is #13987, which seems orthogonal.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 4, 2022

I'm a bit unclear on the meaning of the exported fields. Are they required to be set prior to launching a command, or can they be changed during the lifespan of the command?

They are like the other (existing) exported fields of exec.Cmd: they must be set prior to Start (or equilavent) and then not modified until Wait (or equivalent) returns.

@seebs
Copy link
Contributor

@seebs seebs commented Jan 4, 2022

It is the same issue as 13987, but it's a new circumstance in which we hit that which might be non-obvious to a user, who is assuming that if we are killing "the" process, we'll pick the right one.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 4, 2022

That's true, but applies equally to all of the (numerous) workarounds that build the same behavior on top of os/exec.Cmd today. (I don't think this proposal makes that race any worse for anybody than it already is.)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 4, 2022

theoretically, the process can terminate and be replaced by a new process with that PID between the check and the kill. I have vague recollections of someone having told me there's a fancy new feature in some Linuxes that could bypass this, and it's admittedly a very small window for a very large amount of stuff to happen.

The Go standard library already uses that feature (the waitid or wait6 system call with an option of WNOWAIT) on Linux, DragonFly, and FreeBSD. (For a while we used waitid on macOS but stopped doing so due to #19314.)

@mvdan
Copy link
Member

@mvdan mvdan commented Jan 4, 2022

However, as noted in #23019 (comment) and #23019 (comment), any feasible implementation of the proposal requires the use of an arbitrary timeout, and the proposal does not include a mechanism to adjust that timeout. (Given our history with the Go project's builders, I am extremely skeptical that any particular hard-coded timeout can strike an appropriate balance between robustness and latency.)

I share this concern enthusiastically. Depending on what program you're executing, and in what environment and context, you may need very different timeouts. I also agree that we should let the user decide. As much as I would love for Go to just "do the right thing" without having to extend the API, I don't think we can do that without introducing footguns.

I believe that this proposal is entirely backward-compatible (in contrast with #23019). The zero-values for the new fields provide exactly the same behavior as a Cmd returned by exec.Command or exec.CommandContext today.

I had originally given #23019 a thumbs up, but I've moved my thumbs up here now instead :) I even think #23019 might qualify as a breaking change, given how it may break some unlucky programs where the timeout may lead to losing some output.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 4, 2022

Thinking about this some more, there is one other interesting caveat: error behavior.

If we stop reading the program's output before EOF, it may be useful for the caller to know that. Stopping output before EOF seems orthogonal to the exit status of the program, which is what the Wait and similar methods currently report.

Should setting WaitDelay cause Wait and similar methods to return a different error if end up forcibly closing I/O pipes?

@henvic
Copy link
Contributor

@henvic henvic commented Jan 4, 2022

The Interrupt field seems super useful to me.

However, IMHO adding the Context field to simplify the documentation might end up hurting more than helping as now I've to understand why there are two ways to pass context. Especially after reading this:

// If Interrupt is non-nil, Context must also be non-nil and Interrupt will be
// sent to the child process when Context is done.

@henvic
Copy link
Contributor

@henvic henvic commented Jan 4, 2022

Someone might ask themselves: is it only if I pass Context as a field or also when using CommandContext?

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 5, 2022

IMHO adding the Context field to simplify the documentation might end up hurting more than helping as now I've to understand why there are two ways to pass context.

Note that without the Context field, we would still have to explain that Interrupt can only be used (or only has an effect) with Cmd instances returned by CommandContext. So I'm not sure that removing it would help with that confusion much. 😅

Someone might ask themselves: is it only if I pass Context as a field or also when using CommandContext?

Probably we could update the CommandContext doc comment to explicitly state that it sets the Context and Interrupt fields?

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 5, 2022

Should setting WaitDelay cause Wait and similar methods to return a different error if end up forcibly closing I/O pipes?

I gave this some thought and arrived at:

  • Setting WaitDelay should not change the exit code if the command fails. The fact that the command failed is almost always enough for the caller to know that its output may be suspecct.
  • However, if the command exits with a successful status (that is, if Wait would otherwise return nil) but we had to forcibly close the pipes, we should return a distinguished error (perhaps a new exec.ErrWaitDelay?) so that the caller knows the input and/or output may have been truncated.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 5, 2022

I thought some more about the appropriate Windows behaviors. I added an “Alternatives considered” section and removed the future option for CTRL_BREAK_EVENT, since per #6720 it seems not to be feasible.

Instead, on Windows the Start method should fail with an error wrapping syscall.EWINDOWS if the specified Interrupt signal is not supported (that is, if it is not either nil or os.Kill).

@rsc rsc moved this from Incoming to Active in Proposals Jan 5, 2022
@rsc
Copy link
Contributor

@rsc rsc commented Jan 5, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

@rsc rsc commented Jan 19, 2022

It seems like the docs need to distinguish what we call cmd.Stdin from cmd.Stdout/cmd.Stderr. cmd.Stdin is already closed when we finish writing whatever was standard input, independent of exiting/waiting/cancellation.

I'm also confused by the WaitDelay description mixing cancellation and ordinary exits.

Maybe it would help to see the diff that would implement this? I looked at CL 373005 but it's a whole package, not a diff.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 19, 2022

It seems like the docs need to distinguish what we call cmd.Stdin from cmd.Stdout/cmd.Stderr. cmd.Stdin is already closed when we finish writing whatever was standard input, independent of exiting/waiting/cancellation.

A hang due to stdin copying analogous to the one involving stdout and stderr can occur if the command spawns its own subprocess with cmd.Stdin = os.Stdin and then exits without either process reading stdin to completion. (The copying goroutine is started in (*Cmd).stdin, and it only closes the pipe once copying is complete — which may be arbitrarily long if the writes don't fit in whatever buffer the kernel provides.)

So, at least in that case, we really do need to treat the I/O pipes symmetrically.

@rsc
Copy link
Contributor

@rsc rsc commented Jan 26, 2022

Thanks for the clarification about stdin.

I still feel like it would help me to see the diff in the package, as opposed to a fork as in CL 373005. I'm still not 100% sure I understand the proposal.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 2, 2022

Found another “run or kill” function in the main Go repo, and this one doesn't even bother using an appropriate signal (it skips straight to SIGKILL):
https://cs.opensource.google/go/go/+/master:test/run.go;l=834-854;drc=f4aa021985e9ae4a9a395f8fbe32ad08d2bfda3b

That manifests as a less-than-useful failure mode in #50973.

@rsc
Copy link
Contributor

@rsc rsc commented Feb 2, 2022

I'm still a little confused about the implications here. I'd like to see a concrete CL to understand it better. Will leave this open for this week, but let me know if you'd rather we put it on hold for now.

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 23, 2022

Change https://go.dev/cl/401934 mentions this issue: os/exec: use a TestMain to avoid hijacking stdout for helper commands

@bcmills bcmills moved this from Hold to Active in Proposals Apr 23, 2022
gopherbot pushed a commit that referenced this issue Apr 26, 2022
This makes clearer that skipStdinCopyError is always defined and never
overridden in tests.

Secondarily, it may also help reduce init-time work and allow the
linker and/or inliner to better optimize this package.

(Noticed while prototyping #50436.)

Change-Id: I4f3c1bc146384a98136a4039f82165ed106c14b8
Reviewed-on: https://go-review.googlesource.com/c/go/+/401897
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
gopherbot pushed a commit that referenced this issue Apr 26, 2022
The previous implementation of helperCommand relied on running a
well-known Test function which implemented all known commands.

That not only added Skip noise in the test's output, but also (and
more importantly) meant that the commands could not write directly to
stdout in the usual way, since the testing package hijacks os.Stdout
for its own use.

The new implementation addresses the above issues, and also ensures
that all registered commands are actually used, reducing the risk of
an unused command sticking around after refactoring.

It also sets the subprocess environment variable directly in the test
process, instead of on each individual helper command's Env field,
allowing helper commands to be used without an explicit Env.

Updates #50599.
(Also for #50436.)

Change-Id: I189c7bed9a07cfe47a084b657b88575b1ee370b9
Reviewed-on: https://go-review.googlesource.com/c/go/+/401934
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@gopherbot
Copy link

@gopherbot gopherbot commented May 2, 2022

Change https://go.dev/cl/403534 mentions this issue: cmd/godoc: skip TestWeb if waitForServerReady fails

gopherbot pushed a commit to golang/tools that referenced this issue May 2, 2022
This test fails frequently, with a failure mode that is difficult to
diagnose. (golang/go#50436 may help with that eventually.)

For now, skip the test to reduce noise on the build dashboard.

For golang/go#50014.

Change-Id: I182be5c705846631c983bd5b6c51ab90b71a216a
Reviewed-on: https://go-review.googlesource.com/c/tools/+/403534
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
gopls-CI: kokoro <noreply+kokoro@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@rsc
Copy link
Contributor

@rsc rsc commented May 4, 2022

Do we need to / want to export the 'Context context.Context' field? We typically don't have mutable context in structs. Why do we need it here?

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 4, 2022

The exported Context field isn't strictly necessary.

It does make some call sites a bit cleaner (you can use the same helper function to poke in a Context along with the corresponding signal and delay), and it makes the documentation for the other fields simpler (they can refer to the Context field instead of using phrases like “the context with which the Cmd was created”), but if folks feel strongly about it it could be dropped from this proposal without damaging the rest of the API.

gopherbot pushed a commit that referenced this issue May 6, 2022
This provides clearer synchronization invariants: if it occurs at all,
the call to c.Process.Kill always occurs before Wait returns. It also
allows any unexpected errors from the goroutine to be propagated back
to Wait.

For #50436.

Change-Id: I7ddadc73e6e67399596e35393f5845646f6111ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/401896
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc
Copy link
Contributor

@rsc rsc commented May 18, 2022

It does seem like we should split the Context part out, at least for now.

@rsc
Copy link
Contributor

@rsc rsc commented May 18, 2022

The new API in exec.Cmd is:

	// If Interrupt is non-nil, Context must also be non-nil and Interrupt will be
	// sent to the child process when Context is done.
	//
	// If the command exits with a success code after the Interrupt signal has
	// been sent, Wait and similar methods will return Context.Err()
	// instead of nil.
	//
	// If the Interrupt signal is not supported on the current platform
	// (for example, if it is os.Interrupt on Windows), Start may fail
	// (and return a non-nil error).
	Interrupt os.Signal

	// If WaitDelay is non-zero, the command's I/O pipes will be closed after
	// WaitDelay has elapsed after either the command's process has exited or
	// (if Context is non-nil) Context is done, whichever occurs first.
	// If the command's process is still running after WaitDelay has elapsed,
	// it will be terminated with os.Kill before the pipes are closed.
	//
	// If the command exits with a success code after pipes are closed due to
	// WaitDelay and no Interrupt signal has been sent, Wait and similar methods
	// will return ErrWaitDelay instead of nil.
	//
	// If WaitDelay is zero (the default), I/O pipes will be read until EOF,
	// which might not occur until orphaned subprocesses of the command have
	// also closed their descriptors for the pipes.
	WaitDelay time.Duration

Does anyone object to this new API?

@henvic
Copy link
Contributor

@henvic henvic commented May 18, 2022

It looks fine to me, but I wonder if there is a better name for the Interrupt field (thinking that someone might confuse thinking it's specifically about SIGINT somehow). Not sure if anything such as StopSignal, KillSignal, or ExitSignal is any better, though...

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 18, 2022

I chose Interrupt because the signal is intended to interrupt the process in some way, and it is indeed intended to suggest os.Interrupt as a reasonable choice (at least on non-Windows platforms).

@rsc
Copy link
Contributor

@rsc rsc commented May 25, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals May 25, 2022
@bcmills bcmills assigned bcmills and unassigned bcmills May 27, 2022
@rsc rsc moved this from Likely Accept to Accepted in Proposals Jun 1, 2022
@rsc
Copy link
Contributor

@rsc rsc commented Jun 1, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: os/exec: add fields for managing termination signals and pipes os/exec: add fields for managing termination signals and pipes Jun 1, 2022
@rsc rsc removed this from the Proposal milestone Jun 1, 2022
@rsc rsc added this to the Backlog milestone Jun 1, 2022
@bcmills bcmills self-assigned this Jun 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Proposals
Accepted
Development

No branches or pull requests

8 participants
@rsc @seebs @henvic @ianlancetaylor @mvdan @bcmills @gopherbot and others