testing: test with child process sometimes hangs on 1.10; -timeout not respected #24050

Open
psanford opened this issue Feb 22, 2018 · 15 comments
@psanford psanford commented Feb 22, 2018

What version of Go are you using (go version)?

go version go1.10 linux/amd64

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/psanford/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/psanford/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build953969108=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After upgrading to 1.10 we had one test that started to hang intermittently. The test in question starts a child process which it kills by canceling a context object at the end of the test method. It does not do an explicit cmd.Wait().

Here is a minimal test case that demonstrates the problem:

https://play.golang.org/p/8rq41A5Khsm

I can get this to hang consistently by running it in a bash while loop:

$ while true; do go test -timeout 5s -v -count 1 .; sleep 0.1; done
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS

<hangs here indefinitely>

If I explicitly call cmd.Wait() the test does not hang. If I don't attach the child process' Stdout and Stderr to os.Std{out,err} the test does not hang.

On 1.9.4 the test does not hang.

It's also interesting that even though I specified -timeout 5s, the test runner hangs forever.

Contributor

@crvv crvv commented Feb 23, 2018

> the test runner hangs forever.

It only hangs 60 seconds on my machine.

go test hangs at

done <- cmd.Wait()

And cmd.Wait() is waiting at

_, err := io.Copy(w, pr)

This io.Copy() is reading a pipe reader. It won't return until the pipe writer is closed.
The pipe writer will be closed when the sleep 60 exits. So it hangs 60 seconds.

func TestOSExecNoWait(t *testing.T) {
...
	cancel()
}

After the cancel() returns, the main function will also return.
The child process may or may not be killed, so it hangs intermittently.

Author

@psanford psanford commented Feb 23, 2018

> It only hangs 60 seconds on my machine.

Yes, in the example provided it only hangs for 60 seconds, because the child process exits after that.

In the test where I first found this issue, the child process never exits, so it hangs forever.

Contributor

@crvv crvv commented Feb 24, 2018

I don't think this issue is a regression introduced in Go 1.10.
I can reproduce it with Go 1.9, but under a different condition.

This difference was introduced in bd95f88.
bd95f88#diff-acaf53a9cd478507ebbcf85037940b4dL1080

If the go command needs a bytes.Buffer to save the output, os/exec will open a pipe.
And go will hang until the pipe is closed.

The timeout doesn't work because it isn't handled by go.
There is a testKillTimeout in go, but it also hangs at cmd.Wait().

@gopherbot gopherbot commented Feb 28, 2018

Change https://golang.org/cl/97497 mentions this issue: cmd/go/internal/test: don't wait for pending I/O if child process has gone

Contributor

@ianlancetaylor ianlancetaylor commented Feb 28, 2018

See also #23019.

Member

@FiloSottile FiloSottile commented Apr 24, 2018

@gopherbot please open backport tracking issues. This might be a 1.10 regression, or also a 1.9 issue.

@gopherbot gopherbot commented Apr 24, 2018

Backport issue(s) opened: #25042 (for 1.10), #25043 (for 1.9).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

arvados-bot pushed a commit to arvados/arvados that referenced this issue Apr 26, 2018
In go 1.10.1, these seem to make "go test" hang sometimes.

golang/go#24050

No issue #

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tclegg@veritasgenetics.com>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 29, 2018
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
distorhead added a commit to werf/werf that referenced this issue Dec 19, 2019
… hangs on windows

Recursive call to `ginkgo -r` should not interact with utils package.

Liveexec now uses pkg/testing/utils/gexec which is a patched version of github.com/onsi/gomega/gexec to handle the problem with an output handler hanging under windows: golang/go#24050.
Member

@mvdan mvdan commented Sep 9, 2020

I've encountered the same problem; I was even thinking that our CI was completely broken, as I didn't understand how go test ./..., with a default timeout of 10m, could possibly run for six hours.

I have a way to reproduce this reliably on Linux:

https://play.golang.org/p/nvycpKfjzkL

$ go test -timeout=1s
panic: test timed out after 1s
[...]
[exits]
$ go test -timeout=1s .
[hangs for a long time]

It seems to me like the elements are:

  • Running go test with package arguments, so that the output is buffered for caching etc
  • The test executes a program which hangs or runs for a long time
  • The executed program's output is wired to the test's own output

So it seems to be exactly the same bug found previously in this issue. At a conceptual level, this seems fairly straightforward: if the test executable timeout is hit, all child processes should be killed, and any goroutines waiting for their output should be cancelled or stopped. But I assume the actual fix might be a bit tricky.

@bcmills @jayconrod @matloob as owners of cmd/go, any thoughts?

Member

@mvdan mvdan commented Sep 9, 2020

Also, before anyone spends significant time on a fix, I think we could just fix #23019 instead. It would save far more time in the long run, because it's a change we likely want to make anyway and it should fix other cascading bugs. But it's also a bit more controversial.

Member

@bcmills bcmills commented Sep 9, 2020

I'm really not sure about #23019: the copying is a symptom (of child processes left running with open file handles), not the root cause here.

I think the main problem in this case is that the child process is not being terminated when the test process is. I think go test should make a best effort to terminate all subprocesses of the test process, unless they were explicitly started in a way that would suggest otherwise (such as by setting Setpgid in the SysProcAttr field).

However, I'm not sure what the most appropriate way to achieve that would be. I could imagine:

  • Hooking the os package from the testing package to have it set cmd.SysProcAttr.Pdeathsig implicitly where available (but I think “where available” is only Linux)?
  • Hooking the os package so that the testing.M.startAlarm callback can enumerate and signal all subprocesses started from the test?
  • Changing cmd/go to start each test in its own process group, forward signals to that process group explicitly, and signal the whole group on timeout?
  • Changing testing and/or cmd/go to enumerate child processes using procfs or similar?
  • Something else that I've missed?

And I'm not even sure where to start on Windows, since I assume there is no procfs there.

Member

@mvdan mvdan commented Sep 9, 2020

I think you're right, we should terminate all child processes. It's a bit alarming that this is not the default behavior, and more so that there doesn't seem to be a portable way to do it.

You could blame this on the written test, since technically I could wrap exec calls with a timeout from testing.T.Deadline. Still, you'd be relying on the tests being written properly that way, which is not very realistic, particularly since that Deadline method is very new.

Shouldn't os/exec have some sort of option to apply this "kill all child processes" logic in the most portable way possible? It seems like such a common source of problems, and I've personally had to implement non-portable hacks like Pdeathsig before.

Member

@bcmills bcmills commented Sep 9, 2020

I don't think the go test behavior should rely on folks using os/exec in any particular way. (I would rather we find a more transparent fix in cmd/go.)

Member

@bcmills bcmills commented Sep 9, 2020

But yes: I think it would be handy for the os package to have some straightforward way to express “send this signal to this process and all of its children”. Maybe that's the process-group solution?

Contributor

@ianlancetaylor ianlancetaylor commented Sep 9, 2020

The problem with a general "signal process and its children" mechanism on Unix systems is that the only options are "send signal to process" and "send signal to thread" and "send signal to process group". And we definitely don't want to start every new process in a different process group, as that will have surprising effects on the use of ^Z from the terminal. And of course we have no idea what processes a child process may start.

That said we could in principle invoke the ps program or (on GNU/Linux) read the /proc file system to look for all processes, build a process tree, and use that to identify the children of our child process. But that seems like a rather complex mechanism to implement a standard library function.

For testing specifically, it might be somewhat acceptable for go test to start the test binary in a separate process group, in which case it would be straightforward for the go command to kill the entire test binary process group.

Member

@mvdan mvdan commented Sep 9, 2020

Interesting, I wasn't thinking of the implications of doing this in general. I agree that the use case for testing is pretty narrow, especially because I don't think we should ever leak processes or run in the background.

9 participants