New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: consider changing Wait to stop copying goroutines rather than waiting for them #23019

Open
ianlancetaylor opened this Issue Dec 6, 2017 · 21 comments

Comments

Projects
None yet
@ianlancetaylor
Contributor

ianlancetaylor commented Dec 6, 2017

When the Stdin, Stdout, or Stderr fields of os/exec.Cmd are set to anything other than nil or a *os.File (a common case is *bytes.Buffer), we call os.Pipe to get a pipe and create goroutines to copy data in or out. The (*Cmd).Wait method first waits for the subprocess to exit, then waits for those goroutines to finish copying data.

If the subprocess C1 itself starts a subsubprocess C2, and if C1 passes any of its stdin/stdout/stderr descriptors to C2, and if C1 exits without waiting for C2 to exit, then C2 will hold an open end of the pipes created by the os/exec package. The (*Cmd).Wait method will wait until the goroutines complete, which means waiting until those pipes are closed, which in practice means waiting until C2 exits. This is confusing, because the user sees that C1 is done, and doesn't understand why their program is still waiting for it.

This confusion has been filed as an issue multiple times, at least #7378, #18874, #20730, #21922, #22485.

It doesn't have to work this way. Although the current goroutines call io.Copy, we could change them to use a loop. After every Read, the loop could check whether the process has been waited for. Then Wait could wait for the child, tell the goroutines to stop, give them a chance for a final write, and then return. The stdout/stderr goroutines would close their end of the pipe. There wouldn't be any race and there wouldn't be any unexpected waits. But in cases where there is a C2 process, not all of the standard output and standard error output that we currently collect would be available.

To be clear, programmers can already handle these cases however they like by calling os.Pipe themselves and using the pipe in the Cmd struct. That is true today and it would be true if change how it works.

The questions I want to raise are: would making this change be less confusing for people? Can we make this change without breaking the Go 1 guarantee?

CC @bradfitz

@karkason

This comment has been minimized.

karkason commented Feb 24, 2018

This change may break existing code, but it does fix unspecified behavior in (*Cmd).Wait that is also confusing and unexpected and will definitely make the use of os.Cmd less confusing.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Feb 28, 2018

Another case: #24050.

@eliasnaur

This comment has been minimized.

Contributor

eliasnaur commented Feb 28, 2018

FWIW, it took me quite a while to figure out that this issue was root cause for mobile tests to sometimes hang indefinitely. https://go-review.googlesource.com/#/c/42271/ is the workaround for the Android test harness.

@slrz

This comment has been minimized.

slrz commented Mar 2, 2018

I actually prefer the semantics we have today. That way, it's at least apparent that something is wrong instead of silently leaving long-running processes behind.

The underlying issue of held-open pipe FDs is explained in any basic OS course and can be debugged using readily available Unix tools. I think the alternative will result in more complex failure cases.

@tsl0922

This comment has been minimized.

tsl0922 commented Mar 3, 2018

@slrz It makes many people confusing that Wait do not return after the process exited. Although the document has been improved, but we still don't have a way to cancel the io.Copy in the internal goroutines unless using io.Pipe for the os.Cmd struct which will make code complex.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Apr 25, 2018

In order to handle the case where C2 doesn't output anything, we would have to put a deadline on reading from the pipe, so that while waiting for output we periodically check whether C1 has exited.

@tv42

This comment has been minimized.

tv42 commented May 31, 2018

But in cases where there is a C2 process, not all of the standard output and standard error output that we currently collect would be available.

This sounds like a bug to me, and very much unexpected. I wouldn't ever expect Wait to be guaranteed to return until after I have consumed all the output; in this scenario, the output has not ended yet.

Also, I'm not sure if I understood the "give them a chance for a final write" part, but that sounds like there's a race that could also lose C1 output if C1 exits immediately after filling the pipe buffer; that's not guaranteed to come through in a single Read.

smola added a commit to smola/ci-tricks that referenced this issue Jun 1, 2018

upgrade to Go 1.10
A workaround for golang/go#23019
has been implemented to be able to execute tests on Go 1.10.
@kgadams

This comment has been minimized.

kgadams commented Jun 4, 2018

Would it be possible to get back the go1.9 behavior with a flag?
I have a use case where I need to cancel C1 now and where I really don't care about C2 output and where C2 can be left hanging. I have no control over C2. That used to work with go 1.9, but with go 1.10 it insists on waiting for C2 (which I can't control).

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 5, 2018

@kgadams There was no intentional change in this area in 1.10 This issue is about a proposed change to os/exec; the change has not been implemented. If you are finding an unexpected change that you think is a bug, please open a new issue. Thanks.

@tv42

This comment has been minimized.

tv42 commented Jun 6, 2018

@kgadams But you are interested in C1 output, and there is no way to tell which is which.

@kgadams

This comment has been minimized.

kgadams commented Jun 7, 2018

My scenario is like this: we are writing an interpreter for a specific content. Other people write the content (I have no control over that). Part of that content is shell script snippets.
Now, during execution (which happens behind a UI), the end user can Cancel us at any time.

If at the point of Cancelation by the end user I am executing the shell (C1), and the shell is executing a long running command (C2), I do not care about C1 output nor about C2 output. All I care about is that the whole enchilada is killed and that the function executing C1 returns without waiting for C2.
I even can live with C2 continuing to run in the background.

Cancelation works by canceling the Context that is passed to exec.Cmd.

I'll provide an example later of a test that succeeded with go-1.9 but fails with go-1.10.

@kgadams

This comment has been minimized.

kgadams commented Jun 7, 2018

Hi,
this is the example that demonstrates the problem.
We read from the StdoutPipe.
Context is canceled after one second.
With go 1.9 Cmd.Wait() returns immediately.
With go 1.10 it waits till C2 (sleep 5) is done.

If you can tell me a way to get the go 1.9 behavior with go 1.10, my problem would be solved.

package cmd_test

import (
	"bufio"
	"context"
	"os/exec"
	"strings"
	"syscall"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestCmd(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	assert.NotNil(t, cmd)

	stdoutPipe, err := cmd.StdoutPipe()
	assert.Nil(t, err)
	assert.NotNil(t, stdoutPipe)

	start := time.Now()

	err = cmd.Start()
	assert.Nil(t, err)

	var stdout string
	go func() {
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				return
			}
		}
	}()

	err = cmd.Wait()
	d := time.Since(start)

	if err != nil {
		exiterr, ok := err.(*exec.ExitError)
		require.True(t, ok)
		status, ok := exiterr.Sys().(syscall.WaitStatus)
		require.True(t, ok)
		assert.NotEqual(t, 0, status.ExitStatus())
	}
	assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)

	assert.True(t, d.Seconds() < 3, "Duration was %v", d)
}
@davecheney

This comment has been minimized.

Contributor

davecheney commented Jun 8, 2018

@kgadams

This comment has been minimized.

kgadams commented Jun 8, 2018

Ok, thanks for pointing this out.
For me go test -race did not detect the problem. However moving stdout and the assert into the goroutine fixes the race (which was not germane to my problem: Wait() not returning in go 1.10, but in 1.9).

The go routine now looks like this:

go func() {
        var stdout string
        buf := bufio.NewReader(stdoutPipe)
        for {
            line, err := buf.ReadString('\n')
            if len(line) > 0 {
                stdout = stdout + line + "\n"
            }
            if err != nil {
                assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)
                return
            }
        }
    }()
@kgadams

This comment has been minimized.

kgadams commented Jun 8, 2018

Sorry, I did not yet figure out how to correctly insert code blocks here. Any pointers?

@ysmolsky

This comment has been minimized.

@crvv

This comment has been minimized.

Contributor

crvv commented Jun 8, 2018

I guess you are talking about #24050, which is a change of go test.
Can you reproduce it without go test?

@kgadams

This comment has been minimized.

kgadams commented Jun 8, 2018

Here is the reproduction as a main function with correct markdown :-)
THX @ysmolsky!

Result does not change: works with go 1.9, breaks with go 1.10.

package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

func fatal(format string, args ...interface{}) {
	fmt.Println(fmt.Sprintf(format, args...))
	os.Exit(1)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	stdoutPipe, err := cmd.StdoutPipe()
	if err != nil {
		fatal("no pipe: %v", err)
	}
	start := time.Now()

	if err = cmd.Start(); err != nil {
		fatal("start failed: %v", err)
	}

	go func() {
		var stdout string
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				if !strings.HasPrefix(stdout, "hello world") {
					fatal("wrong output: %q", stdout)
				}
				return
			}
		}
	}()

	err = cmd.Wait()
	d := time.Since(start)

	if err != nil {
		exiterr := err.(*exec.ExitError)
		status := exiterr.Sys().(syscall.WaitStatus)
		if status.ExitStatus() == 0 {
			fatal("wrong exit status: %v", status.ExitStatus())
		}
	}

	if d.Seconds() >= 3 {
		fatal("Cancelation took too long: %v", d)
	}
	fmt.Println("Success!")
}
@fraenkel

This comment has been minimized.

Contributor

fraenkel commented Jun 8, 2018

I get Success! with 1.10.2 and tip on my Ubuntu laptop.

@rsc rsc changed the title from os/exec: consider changing Wait to stop copying goroutines rather than waiting for them to proposal: os/exec: consider changing Wait to stop copying goroutines rather than waiting for them Jun 11, 2018

@gopherbot gopherbot added the Proposal label Jun 11, 2018

@rsc

This comment has been minimized.

Contributor

rsc commented Jun 11, 2018

Moved the 1.9 -> 1.10 change to #25835 for more investigation.

Based on discussion with proposal review, it sounds like it would be worth trying, very early in the Go 1.12 cycle, setting a timeout on the pipes after syscall.Wait returns indicating that the process has exited. That should guarantee the kernel will deliver anything the child process wrote, but then return promptly instead of waiting for grandchildren.

@rsc rsc modified the milestones: Proposal, Go1.12 Jun 11, 2018

@rsc rsc changed the title from proposal: os/exec: consider changing Wait to stop copying goroutines rather than waiting for them to os/exec: consider changing Wait to stop copying goroutines rather than waiting for them Jun 11, 2018

@rsc

This comment has been minimized.

Contributor

rsc commented Jun 11, 2018

Also this may not work on Plan 9 and that's OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment