Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: TestTerminalSignal failures on Darwin builders #37329

Closed
bcmills opened this issue Feb 20, 2020 · 15 comments
Closed

os/signal: TestTerminalSignal failures on Darwin builders #37329

bcmills opened this issue Feb 20, 2020 · 15 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Feb 20, 2020

2020-02-19T20:37:54-b15fd6b/darwin-amd64-10_11
2019-10-31T21:47:08-ef03c44/darwin-amd64-race

--- FAIL: TestTerminalSignal (10.01s)
    signal_cgo_test.go:145: "PS1='prompt> '\r\n"
    signal_cgo_test.go:145: "bash-3.2$ PS1='prompt> '\r\n"
    signal_cgo_test.go:145: "prompt> GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w00\r<GNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w000                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b0gn/T/workdir-host-darwin\r<3rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b10_11/tmp/go-build2962210\r<gn/T/workdir-host-darwin-10_11/tmp/go-build29622109                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b3/b143/signal.test -test.\r<0_11/tmp/go-build296221093/b143/signal.test -test.r                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bun=TestTerminalSignal\r\n"
    signal_cgo_test.go:145: "test program entering read\r\n"
    signal_cgo_test.go:145: "^Z\r\n"
    signal_cgo_test.go:145: "[1]+  Stopped                 GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-10_11/tmp/go-build296221093/b143/signal.test -test.run=TestTerminalSignal\r\n"
    signal_cgo_test.go:145: "prompt> fg\r\n"
    signal_cgo_test.go:145: "GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-10_11/tmp/go-build296221093/b143/signal.test -test.run=TestTerminalSignal\r\n"
    signal_cgo_test.go:145: "\r\n"
    signal_cgo_test.go:145: "[1]+  Stopped                 GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-10_11/tmp/go-build296221093/b143/signal.test -test.run=TestTerminalSignal\r\n"
    signal_cgo_test.go:145: "prompt> \r\n"
    signal_cgo_test.go:145: "prompt> exit $?\r\n"
    signal_cgo_test.go:145: "exit\r\n"
    signal_cgo_test.go:145: "There are stopped jobs.\r\n"
    signal_cgo_test.go:128: "prompt> "
    signal_cgo_test.go:237: subprogram failed: signal: killed
FAIL
FAIL	os/signal	15.203s

CC @ianlancetaylor @cherrymui @bradfitz

@bcmills bcmills added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 20, 2020
@bcmills bcmills added this to the Backlog milestone Feb 20, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Feb 24, 2020

@bcmills
Copy link
Contributor Author

bcmills commented Mar 13, 2020

The test is also flaky on darwin-amd64-10_12, but with a different failure mode (a test timeout rather than signal: killed):

2020-03-13T00:19:08-85e87f9/darwin-amd64-10_12
2019-10-25T23:25:32-8c58615/darwin-amd64-10_12

@bcmills bcmills changed the title os/signal: TestTerminalSignal failure on Darwin builder os/signal: TestTerminalSignal failures on Darwin builders Apr 20, 2020
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@prattmic
Copy link
Member

This flake has been occurring again on the new darwin-amd64-*-aws builders. We will see if https://go.dev/cl/435735 stops the failures.

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 28, 2022
@ZekeLu
Copy link
Contributor

ZekeLu commented Sep 29, 2022

I'm afraid that CL 435735 does not help.

I have confirmed that, on my local MacOS, when the test is run with go tool dist test go_test:os/signal, the same pattern of the \b characters are added. But the test did not fail. Then I think maybe the full path of the signal.test file matters. So I created the same path, copied the signal.test file into it, and then changed the test to send this path to bash, the test still did not fail. The output is attached to the end of this comment. The highlighted lines are totally the same as that in the first comment of this issue. And I have tried to run go tool dist test go_test:os/signal in a loop on the code without CL 43735 (with some modifications to run TestTerminalSignal only). I can not reproduce the issue in 5000+ runs.

 ##### Test execution environment.
 # GOARCH: amd64
 # CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
 # GOOS: darwin
 # OS Version: Darwin 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64 x86_64
 
 ##### Testing packages.
 === RUN   TestTerminalSignal
 === PAUSE TestTerminalSignal
 === CONT  TestTerminalSignal
     signal_cgo_test.go:145: "PS1='prompt> '\r\n"
     signal_cgo_test.go:145: "\r\n"
     signal_cgo_test.go:145: "The default interactive shell is now zsh.\r\n"
     signal_cgo_test.go:145: "To update your account to use zsh, please run `chsh -s /bin/zsh`.\r\n"
     signal_cgo_test.go:145: "For more details, please visit https://support.apple.com/kb/HT208050.\r\n"
+    signal_cgo_test.go:145: "bash-3.2$ PS1='prompt> '\r\n"
+    signal_cgo_test.go:145: "prompt> GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w00\r<GNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w000                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b0gn/T/workdir-host-darwin\r<3rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b10_11/tmp/go-build2962210\r<gn/T/workdir-host-darwin-10_11/tmp/go-build29622109                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b3/b143/signal.test -test.\r<0_11/tmp/go-build296221093/b143/signal.test -test.r                         \b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bun=TestTerminalSignal\r\n"
+    signal_cgo_test.go:145: "test program entering read\r\n"
+    signal_cgo_test.go:145: "^Z\r\n"
+    signal_cgo_test.go:145: "[1]+  Stopped                 GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-10_11/tmp/go-build296221093/b143/signal.test -test.run=TestTerminalSignal\r\n"
+    signal_cgo_test.go:145: "prompt> fg\r\n"
+    signal_cgo_test.go:145: "GO_TEST_TERMINAL_SIGNALS=1 /var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir-host-darwin-10_11/tmp/go-build296221093/b143/signal.test -test.run=TestTerminalSignal\r\n"
+    signal_cgo_test.go:145: "\r\n"
     signal_cgo_test.go:145: "read newline\r\n"
     signal_cgo_test.go:145: "prompt> exit $?\r\n"
     signal_cgo_test.go:145: "exit\r\n"
 --- PASS: TestTerminalSignal (0.13s)
 PASS
 ok      os/signal       0.499s

@prattmic
Copy link
Member

That's too bad, but thanks for taking a look.

@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 29, 2022
@ZekeLu
Copy link
Contributor

ZekeLu commented Sep 29, 2022

I will keep the test running for the next few days. If I can reproduce the issue with n runs while can't reproduce it after the CL is applied, then we can say that the CL at least reduce the chance of this failure. I will update in a few days.

@prattmic
Copy link
Member

prattmic commented Oct 4, 2022

I can reproduce this on a gomote (darwin-amd64-11-aws, specifically) simply by running the test many times: gomote run $(gomote-instance) ./go/bin/go test -run=TestTerminalSignal -v -count=100 -failfast os/signal.

IIUC, we are failing the race described here, and need to make that wait more robust.

@bcmills
Copy link
Contributor Author

bcmills commented Oct 4, 2022

Why does TestTerminalSignal need bash, anyway? (Can't we make the same system calls from Go that bash is making under the hood?)

Hmm... #22838 (comment) is not hopeful:

The fix took 10 seconds, the test case took five hours.

😅

@prattmic
Copy link
Member

prattmic commented Oct 4, 2022

Why does TestTerminalSignal need bash, anyway? (Can't we make the same system calls from Go that bash is making under the hood?)

Fundamentally, we probably could, though it might take a while to make sure we properly trigger the original interrupt that is under test.

Interestingly, the race I mentioned earlier appears to not be what we are hitting. I added a heartbeat to the child so we could definitively wait for it to continue, but the failure persists.

Instead, the issue seems to be sending "fg" too early. A short sleep prior sending "fg" makes the issue disappear. It seems that despite bash printing the child command line it does not successfully continue it if "fg" is sent too soon.

@prattmic
Copy link
Member

prattmic commented Oct 4, 2022

Why does TestTerminalSignal need bash, anyway? (Can't we make the same system calls from Go that bash is making under the hood?)

Fundamentally, we probably could, though it might take a while to make sure we properly trigger the original interrupt that is under test.

Brief testing shows that SIGSTOP causes EINTR from a pty, but not from a normal pipe. That is unfortunate for test simplicity, but writing a test using PTYs directly without involving bash may still be simpler than dealing with bash's oddities.

@prattmic
Copy link
Member

prattmic commented Oct 5, 2022

I finally managed to recreate this test without bash. It should be more reliable now, though I'm not sure how valuable this test is now that we universally retry on EINTR rather than just on Darwin.

For posterity, a read from a PTY returns EINTR on Darwin if:

  1. The PTY foreground process group is in read of the PTY when it is stopped by write of ^Z to the PTY (SIGTSTP).
  2. The parent process group takes over as foreground process group of the PTY.
  3. The parent process group makes the child foreground again prior to continuing the child.
  4. The child is continued with SIGCONT.

i.e., the foreground process group of the PTY must be changed during the read. I didn't actually double check if stopping the child is necessary, or if changing the foreground process group during read without a stop is sufficient.

@prattmic
Copy link
Member

prattmic commented Oct 5, 2022

Now that we know where to look:

  • Changing foreground process group triggers a wakeup of readers
  • Which I believe gets us here. Note that we isbackground is never true except when the process is stopped, but I suspect that the stop itself doesn't wake the reader, and this check runs before a check for stopped state (which is probably in a caller). Hence this still runs in stopped state.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/440220 mentions this issue: os/signal: rewrite TestTerminalSignal without bash

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/443066 mentions this issue: os/signal: add missing newlines to TestTerminalSignal

gopherbot pushed a commit that referenced this issue Oct 14, 2022
For #37329.
For #56233.

Change-Id: Iafcddaddafd2d27fa5d535b57aaefec387f0b3f0
Reviewed-on: https://go-review.googlesource.com/c/go/+/443066
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
The existing version of this test contains several races it tries to
control with sleeps. Unfortunately, it is still flaky on darwin because
writing `fg` in bash too early can apparently result in failure to
actually continue the stopped child.

Rather than continuing to get perfect timing with bash, rewrite this to
eliminate bash and instead perform the same PTY operations that bash
would do.

This test is still quite complex because psuedo-terminals are
interminably complicated, but I believe it is no longer racy.
Technically there are still two races (waiting for child to enter read()
and waiting for the darwin kernel to wake the read after TIOCSPGRP), but
loss of either of these races should only mean we fail to test the
desired darwin EINTR case, not failure.

This test is skipped on DragonflyBSD, as it tickles a Wait hang bug
(golang#56132).

Updates golang#56132.
Fixes golang#37329.

Change-Id: I0ceaf5aa89f6be0f1bf68b2140f47db673cedb33
Reviewed-on: https://go-review.googlesource.com/c/go/+/440220
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
For golang#37329.
For golang#56233.

Change-Id: Iafcddaddafd2d27fa5d535b57aaefec387f0b3f0
Reviewed-on: https://go-review.googlesource.com/c/go/+/443066
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Oct 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Projects
None yet
Development

No branches or pull requests

4 participants