
os/signal: TestNohup flaky #33174

Open
bcmills opened this issue Jul 18, 2019 · 21 comments

@bcmills bcmills commented Jul 18, 2019

Seen on the solaris-amd64-oraclerel builder in https://build.golang.org/log/d5185d101eddc0f5a05103f11b74caf24421bf46:

--- FAIL: TestNohup (1.25s)
    signal_test.go:329: ran test with -send_uncaught_sighup=1 and it succeeded: expected failure.
        Output:
        PASS
FAIL
FAIL	os/signal	3.575s

See previously #8682, #5526.

CC @ianlancetaylor

@bcmills bcmills added this to the Unplanned milestone Jul 18, 2019
@bcmills bcmills commented Sep 11, 2019

@bcmills bcmills changed the title from "os/signal: TestNohup flake on solaris-amd64-oraclerel" to "os/signal: TestNohup flaky" Sep 11, 2019
@ianlancetaylor ianlancetaylor commented Sep 12, 2019

Note that the case in #33174 (comment) is not the same symptom, but rather the reverse:

--- FAIL: TestNohup (1.62s)
    signal_test.go:354: ran test with -send_uncaught_sighup=1 under nohup and it failed: expected success.
        Error: signal: user defined signal 1
        Output:
@ianlancetaylor ianlancetaylor commented Sep 12, 2019

I think all these cases could be explained if there is an unusually long time, more than 100 milliseconds, between calling syscall.Kill and the time the program actually receives the signal.
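A minimal sketch of the race being described, assuming a hypothetical test program that sends itself SIGHUP and then waits a fixed 100ms (illustrative only, not the actual signal_test.go code):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP)

	// Send ourselves SIGHUP, as the test does via syscall.Kill.
	syscall.Kill(syscall.Getpid(), syscall.SIGHUP)

	// A single fixed sleep races with delivery: if the kernel takes more
	// than 100ms to deliver the signal, the select misses it and the
	// test observes the wrong outcome.
	select {
	case s := <-c:
		fmt.Println("received", s)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("timed out waiting for signal")
	}
}
```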

@gopherbot gopherbot commented Sep 12, 2019

Change https://golang.org/cl/194879 mentions this issue: os/signal: split up sleeps waiting for signal

@bcmills bcmills commented Feb 20, 2020

@bcmills bcmills reopened this Feb 20, 2020
@bcmills bcmills commented Mar 27, 2020

@bcmills bcmills modified the milestones: Unplanned, Backlog Mar 27, 2020
@bcmills bcmills commented Mar 27, 2020

Both of those failures are with -send_uncaught_sighup=2:

--- FAIL: TestNohup (1.77s)
    signal_test.go:377: ran test with -send_uncaught_sighup=2 and it succeeded: expected failure.
        Output:
        PASS
FAIL
@gopherbot gopherbot commented Mar 27, 2020

Change https://golang.org/cl/226138 mentions this issue: os/signal: rework test timeouts and concurrency

@bcmills bcmills self-assigned this Mar 27, 2020
@bcmills bcmills modified the milestones: Backlog, Go1.15 Mar 27, 2020
gopherbot pushed a commit that referenced this issue Mar 30, 2020
Use a uniform function (named “quiesce”) to wait for possible signals
in a way that gives the kernel many opportunities to deliver them.

Simplify channel usage and concurrency in stress tests.

Use (*testing.T).Deadline instead of parsing the deadline in TestMain.

In TestStop, sleep forever in a loop if we expect the test to die from
a signal. That should reduce the flakiness of TestNohup, since
TestStop will no longer spuriously pass when run as a subprocess of
TestNohup.

Since independent signals should not interfere, run the different
signals in TestStop in parallel when testing in short mode.

Since TestNohup runs TestStop as a subprocess, and TestStop needs to
wait many times for signals to quiesce, run its test subprocesses
concurrently and in short mode — reducing the latency of that test by
more than a factor of 2.

The above two changes reduce the running time of TestNohup on my
workstation to ~345ms, making it possible to run much larger counts of
the test in the same amount of wall time. If the test remains flaky
after this CL, we can spend all or part of that latency improvement on
a longer settle time.

Updates #33174

Change-Id: I09206f213d8c1888b50bf974f965221a5d482419
Reviewed-on: https://go-review.googlesource.com/c/go/+/226138
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
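For reference, a rough sketch of the "quiesce" idea described above (the shape is assumed; the code that actually landed in signal_test.go may differ): sleep in short slices rather than one long sleep, so the kernel gets repeated opportunities to deliver a pending signal.

```go
package signaltest

import "time"

// settleTime is the total window in which a pending signal is expected
// to arrive; the value used by the real test has been retuned several
// times over the course of this issue.
const settleTime = 100 * time.Millisecond

// quiesce waits for settleTime, but in short slices, giving the kernel
// many chances to deliver any signal that is still in flight.
func quiesce() {
	start := time.Now()
	for time.Since(start) < settleTime {
		time.Sleep(settleTime / 10)
	}
}
```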
@gopherbot gopherbot commented Mar 30, 2020

Change https://golang.org/cl/226458 mentions this issue: os/signal: in TestStop, skip the final "unexpected signal" check for SIGUSR1 on Android

@gopherbot gopherbot commented Mar 30, 2020

Change https://golang.org/cl/226461 mentions this issue: os/signal: make TestStop resilient to initially-blocked signals

gopherbot pushed a commit that referenced this issue Mar 30, 2020
os/signal: in TestStop, skip the final "unexpected signal" check for SIGUSR1 on Android

In CL 226138, I updated TestStop to have more uniform behavior for its signals.
However, that test seems to always fail for SIGUSR1 on the Android ARM builders.

I'm not sure what's special about Android for this particular case,
but let's skip the test to unbreak the builders while I investigate.

For #38165
Updates #33174

Change-Id: I35a70346cd9757a92acd505a020bf95e6871405c
Reviewed-on: https://go-review.googlesource.com/c/go/+/226458
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Mar 31, 2020
For reasons unknown, SIGUSR1 appears to be blocked at process start
for tests on the android-arm-corellium and android-arm64-corellium
builders. (This has been observed before, too: see CL 203957.)
Make the test resilient to blocked signals by always calling Notify
and waiting for potential signal delivery after sending any signal
that is not known to be unblocked.

Also remove the initial SIGWINCH signal from testCancel. The behavior
of an unhandled SIGWINCH is already tested in TestStop, so we don't
need to re-test that same case: waiting for an unhandled signal takes
a comparatively long time (because we necessarily don't know when it
has been delivered), so this redundancy makes the overall test binary
needlessly slow, especially since it is called from both TestReset and
TestIgnore.

Since each signal is always unblocked while we have a notification
channel registered for it, we don't need to modify any other tests:
TestStop and testCancel are the only functions that send signals
without a registered channel.

Fixes #38165
Updates #33174
Updates #15661

Change-Id: I215880894e954b62166024085050d34323431b63
Reviewed-on: https://go-review.googlesource.com/c/go/+/226461
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
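A hedged sketch of the Notify-before-send pattern the commit describes (not the test's exact code): register a channel with Notify before sending a signal that may be blocked at process start, since Notify unblocks it, then wait to confirm delivery so the signal cannot remain pending.

```go
package signaltest

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// notifyThenSend sends sig to the current process, but only after
// registering for it with Notify so that the runtime unblocks it.
// It reports whether the signal was observed within the timeout.
func notifyThenSend(sig syscall.Signal, timeout time.Duration) bool {
	c := make(chan os.Signal, 1)
	signal.Notify(c, sig)
	defer signal.Stop(c)

	syscall.Kill(syscall.Getpid(), sig)

	select {
	case <-c:
		return true // delivered; no longer pending
	case <-time.After(timeout):
		return false // delivery is slow, or the signal was lost
	}
}
```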
@gopherbot gopherbot commented Apr 8, 2020

Change https://golang.org/cl/227649 mentions this issue: os/signal: increase settle time in tests

gopherbot pushed a commit that referenced this issue Apr 8, 2020
I noticed a timeout in TestIgnore in
https://build.golang.org/log/52d83a72f3a5ea9a16eb5d670c729694144f9624,
which suggests that the settle time is currently set too low.

I've also added a check for the same GO_TEST_TIMEOUT_SCALE used in
TestTerminalSignal, so that if this builder remains too slow we can
increase the builder's scale factor rather than the test's baseline
running time.

Updates #33174

Change-Id: I18b10eaa3bb5ae2f604300aedaaf6f79ee7ad567
Reviewed-on: https://go-review.googlesource.com/c/go/+/227649
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
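The GO_TEST_TIMEOUT_SCALE check presumably amounts to something like the sketch below (assumed shape, not the code that landed): multiply the baseline settle time by the builder's configured scale factor.

```go
package signaltest

import (
	"os"
	"strconv"
	"time"
)

// scaledSettleTime returns the baseline settle window, scaled by the
// GO_TEST_TIMEOUT_SCALE factor that slow builders set.
func scaledSettleTime() time.Duration {
	d := 100 * time.Millisecond
	if s := os.Getenv("GO_TEST_TIMEOUT_SCALE"); s != "" {
		if scale, err := strconv.Atoi(s); err == nil {
			d *= time.Duration(scale)
		}
	}
	return d
}
```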
@bcmills bcmills commented Apr 14, 2020

Rats, still flaky even with a longer settle time:
https://build.golang.org/log/dd62939f6d3b512fe3e6147074a9c6db1144113f

I'm tempted to dramatically overinflate the settle time on this builder just to see if it's actually dropping the signal.

@gopherbot gopherbot commented Apr 15, 2020

Change https://golang.org/cl/228266 mentions this issue: os/signal: special-case test settle time on the solaris-amd64-oraclerel builder

@bcmills bcmills commented Apr 15, 2020

@rorth, are there any known signal-dropping bugs on Oracle Solaris that may account for this flakiness?

@rorth rorth commented Apr 15, 2020

gopherbot pushed a commit that referenced this issue Apr 15, 2020
os/signal: special-case test settle time on the solaris-amd64-oraclerel builder

This is an attempt to distinguish between a dropped signal and
general builder slowness.

The previous attempt (increasing the settle time to 250ms) still
resulted in a timeout:
https://build.golang.org/log/dd62939f6d3b512fe3e6147074a9c6db1144113f

For #33174

Change-Id: I79027e91ba651f9f889985975f38c7b01d82f634
Reviewed-on: https://go-review.googlesource.com/c/go/+/228266
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
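The special case presumably looks roughly like the sketch below, keyed off the GO_BUILDER_NAME environment variable that the build infrastructure sets; the duration here is a placeholder, not the value that landed.

```go
package signaltest

import (
	"os"
	"time"
)

// settleTime starts from the usual baseline but is deliberately
// overinflated on the one builder under suspicion, so that a failure
// there would point at a dropped signal rather than general slowness.
var settleTime = 100 * time.Millisecond

func init() {
	if os.Getenv("GO_BUILDER_NAME") == "solaris-amd64-oraclerel" {
		settleTime = 5 * time.Second // placeholder value, far longer than any plausible delivery delay
	}
}
```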
@bcmills bcmills commented May 5, 2020

No timeouts on solaris-amd64-oraclerel since CL 228266, so I think that one may just be a slow builder:
2020-04-13T20:22:26-6f3a951/solaris-amd64-oraclerel
2020-04-08T18:04:59-94d22d1/solaris-amd64-oraclerel
2019-09-09T23:04:30-a3a1bdf/solaris-amd64-oraclerel

Unfortunately, it seems that 100ms may be too short on the other builders:
2020-05-04T22:52:07-b5f7ff4/linux-ppc64-buildlet
2020-04-22T21:08:58-0ee4b13/darwin-amd64-10_14

@gopherbot gopherbot commented May 5, 2020

Change https://golang.org/cl/232378 mentions this issue: os/signal: retune the settleTime again

@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.16 Jun 15, 2020
@ianlancetaylor ianlancetaylor commented Jun 15, 2020

Changing milestone to 1.16.
