Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: timeouts in TestNotifyContextNotifications on solaris-amd64-oraclerel #51054

Open
bcmills opened this issue Feb 7, 2022 · 10 comments
Open
Labels
NeedsInvestigation OS-Solaris
Milestone

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 7, 2022

goroutine 73 [syscall]:
runtime.syscall_wait4(0x4317, 0xc00022247c, 0x0, 0xc000260990)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/syscall_solaris.go:312 +0x86 fp=0xc0000bbd38 sp=0xc0000bbce0 pc=0x462026
syscall.Wait4(0x203000?, 0xc000204750?, 0xc000229608?, 0x40c945?)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/syscall/syscall_solaris.go:247 +0x2c fp=0xc0000bbd80 sp=0xc0000bbd38 pc=0x46e2cc
os.(*Process).wait(0xc000248330)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec_unix.go:43 +0x85 fp=0xc0000bbdd8 sp=0xc0000bbd80 pc=0x4832a5
os.(*Process).Wait(...)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc00025edc0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:507 +0x54 fp=0xc0000bbe50 sp=0xc0000bbdd8 pc=0x4c7a94
os/exec.(*Cmd).Run(0x5?)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:341 +0x39 fp=0xc0000bbe70 sp=0xc0000bbe50 pc=0x4c6bf9
os/exec.(*Cmd).CombinedOutput(0xc00025edc0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:567 +0x96 fp=0xc0000bbe98 sp=0xc0000bbe70 pc=0x4c7db6
os/signal.TestNotifyContextNotifications.func2(0xc000278ea0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/signal/signal_test.go:749 +0x191 fp=0xc0000bbf70 sp=0xc0000bbe98 pc=0x4cf871
testing.tRunner(0xc000278ea0, 0xc0002184f0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1440 +0x102 fp=0xc0000bbfc0 sp=0xc0000bbf70 pc=0x4bb6c2
testing.(*T).Run.func1()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1487 +0x2a fp=0xc0000bbfe0 sp=0xc0000bbfc0 pc=0x4bc56a
runtime.goexit()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0000bbfe8 sp=0xc0000bbfe0 pc=0x4653a1
created by testing.(*T).Run
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1487 +0x35f

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2021-01-01

2022-02-04T20:44:18-02224c8/solaris-amd64-oraclerel
2021-11-29T19:45:58-f598e29/solaris-amd64-oraclerel
2021-11-17T04:31:45-633d8c1/solaris-amd64-oraclerel
2021-10-31T15:36:50-243c5ae/solaris-amd64-oraclerel
2021-10-01T18:18:46-0d65c27/solaris-amd64-oraclerel
2021-10-01T17:18:34-dab16c1/solaris-amd64-oraclerel
2021-10-01T16:17:08-f19b2d5/solaris-amd64-oraclerel
2021-10-01T14:06:36-96d3ba8/solaris-amd64-oraclerel
2021-09-17T20:26:51-974b016/solaris-amd64-oraclerel
2021-08-16T23:32:14-8ff16c1/solaris-amd64-oraclerel
2021-08-16T14:26:45-b0fba64/solaris-amd64-oraclerel
2021-04-28T19:52:58-07e006d/solaris-amd64-oraclerel
2021-04-26T18:54:39-0d1280c/solaris-amd64-oraclerel
2021-04-23T21:42:57-c0e1301/solaris-amd64-oraclerel
2021-04-22T16:15:44-d3853fb/solaris-amd64-oraclerel
2021-04-16T16:36:57-2f0e5bf/solaris-amd64-oraclerel
2021-04-16T16:11:09-04e1176/solaris-amd64-oraclerel
2021-04-15T12:38:11-dba2eab/solaris-amd64-oraclerel
2021-04-06T04:48:09-a25c584/solaris-amd64-oraclerel
2021-03-18T14:09:57-db4adb1/solaris-amd64-oraclerel

This is the same test that was observed to be flaky in #41561 (CC @henvic, @ianlancetaylor, @cherrymui).

@bcmills bcmills added the NeedsInvestigation label Feb 7, 2022
@bcmills bcmills added this to the Go1.19 milestone Feb 7, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 7, 2022

It's interesting that the test is blocked on this subprocess in particular, because we explicitly plumb in a value for the -test.timeout flag that should be short enough to complete before the test finishes:
https://cs.opensource.google/go/go/+/master:src/os/signal/signal_test.go;l=746-748;drc=e3eaedb5cf623d0836533573db4140749da42768

@bcmills bcmills changed the title os/signal: timeout in TestNotifyContextNotifications on solaris-amd64-oraclerel os/signal: timeouts in TestNotifyContextNotifications on solaris-amd64-oraclerel Feb 7, 2022
@henvic
Copy link
Contributor

@henvic henvic commented Feb 8, 2022

Thanks for pinging me, @bcmills.
I'll try to reproduce it on a Solaris VM during this weekend.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Mar 8, 2022

One more this month:

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2022-02-08

2022-03-07T18:19:44-7dc6c5e/solaris-amd64-oraclerel

I notice that we're not actually calling t.Parallel() in the subtests. I wonder if the test is actually timing out in the first subtest, and then starting the second subtest without actually having enough time left to clean up.

It occurs to me that the os/exec changes proposed in #50436 would probably help here too: if either subtest times out, we ought to terminate the subprocess with SIGQUIT to figure out what it's stuck on.

@henvic
Copy link
Contributor

@henvic henvic commented Mar 13, 2022

* t.Parallel() is called in TestNotifyContextNotifications immediately after the checkNotifyContext conditional.

I was able to reproduce this (building checked out @ commit 0261fa6) when running all.bash on a SunOS solaris 5.11 11.4.0.15.0 i86pc i386 i86pc virtual machine. In my case, it's happening about once every ~20 runs or so. I wasn't able to identify cause. Weirdly, increasing the memory of the virtual machine seems to have reduced the number of times this happened.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Mar 30, 2022

I wasn't able to identify cause. Weirdly, increasing the memory of the virtual machine seems to have reduced the number of times this happened.

Since solaris/amd64 is not a first class port, if we can't identify the cause of the failure it would be ok to skip the test on that platform.

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2022-03-08

2022-03-28T19:46:07-1a9972e/solaris-amd64-oraclerel

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 5, 2022

Change https://go.dev/cl/398454 mentions this issue: os/signal: run TestNotifyContextnotifications subtests in parallel

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 5, 2022

Change https://go.dev/cl/398454 mentions this issue: os/signal: run TestNotifyContextNotifications subtests in parallel

gopherbot pushed a commit that referenced this issue Apr 5, 2022
If we run out of time on the first subtest, we don't want to start the
second one with essentially no time remaining. (Moreover, there is no
compelling reason not to run these tests in parallel, since they send
signals to separate processes.)

For #51054.

Change-Id: I0424e08c3a9d2db986568d5a5c004859b52f7c51
Reviewed-on: https://go-review.googlesource.com/c/go/+/398454
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Henrique Vicente de Oliveira Pinto <henriquevicente@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 5, 2022

CL 398454 might fix this. I'll keep an eye on it.

@bcmills bcmills added the WaitingForInfo label Apr 5, 2022
@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 19, 2022

The thrilling conclusion, from https://build.golang.org/log/efc839b64c4b5ef269ecf21fa28685fbeb600e18:

--- FAIL: TestNotifyContextNotifications (0.00s)
    --- FAIL: TestNotifyContextNotifications/once (0.00s)
        signal_test.go:741: starting test with less than 22s remaining
    --- FAIL: TestNotifyContextNotifications/multiple (0.00s)
        signal_test.go:741: starting test with less than 22s remaining
FAIL
FAIL	os/signal	158.438s

The large settleTime that we had to set on solaris is just too slow. I guess for the interim we can try scaling it back, but longer term we should rework the tests not to send signals to the main test process to begin with. 🤔

@bcmills bcmills removed the WaitingForInfo label Apr 19, 2022
@gopherbot
Copy link

@gopherbot gopherbot commented Apr 29, 2022

Change https://go.dev/cl/403199 mentions this issue: os/signal: scale back the solaris-amd64-oraclerel settle time

gopherbot pushed a commit that referenced this issue Apr 30, 2022
The settleTime is arbitrary. Ideally we should refactor the test to
avoid it (using subprocesses instead of sleeps to isolate tests from
each others' delayed signals), but as a shorter-term workaround let's
try scaling it back to match linux/ppc64 (the other builder that
empirically requires a longer settleTime).

For #51054.
Updates #33174.

Change-Id: I574fffaadd74c52c13d63974e87f20b6d3cf3c4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/403199
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation OS-Solaris
Projects
None yet
Development

No branches or pull requests

3 participants