Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: timeouts in TestNotifyContextNotifications on solaris-amd64-oraclerel #51054

Closed
bcmills opened this issue Feb 7, 2022 · 14 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Solaris
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Feb 7, 2022

#!watchflakes
post <- pkg == "os/signal" && test == "TestNotifyContextNotifications"
goroutine 73 [syscall]:
runtime.syscall_wait4(0x4317, 0xc00022247c, 0x0, 0xc000260990)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/syscall_solaris.go:312 +0x86 fp=0xc0000bbd38 sp=0xc0000bbce0 pc=0x462026
syscall.Wait4(0x203000?, 0xc000204750?, 0xc000229608?, 0x40c945?)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/syscall/syscall_solaris.go:247 +0x2c fp=0xc0000bbd80 sp=0xc0000bbd38 pc=0x46e2cc
os.(*Process).wait(0xc000248330)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec_unix.go:43 +0x85 fp=0xc0000bbdd8 sp=0xc0000bbd80 pc=0x4832a5
os.(*Process).Wait(...)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc00025edc0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:507 +0x54 fp=0xc0000bbe50 sp=0xc0000bbdd8 pc=0x4c7a94
os/exec.(*Cmd).Run(0x5?)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:341 +0x39 fp=0xc0000bbe70 sp=0xc0000bbe50 pc=0x4c6bf9
os/exec.(*Cmd).CombinedOutput(0xc00025edc0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/exec/exec.go:567 +0x96 fp=0xc0000bbe98 sp=0xc0000bbe70 pc=0x4c7db6
os/signal.TestNotifyContextNotifications.func2(0xc000278ea0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/signal/signal_test.go:749 +0x191 fp=0xc0000bbf70 sp=0xc0000bbe98 pc=0x4cf871
testing.tRunner(0xc000278ea0, 0xc0002184f0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1440 +0x102 fp=0xc0000bbfc0 sp=0xc0000bbf70 pc=0x4bb6c2
testing.(*T).Run.func1()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1487 +0x2a fp=0xc0000bbfe0 sp=0xc0000bbfc0 pc=0x4bc56a
runtime.goexit()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0000bbfe8 sp=0xc0000bbfe0 pc=0x4653a1
created by testing.(*T).Run
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1487 +0x35f

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2021-01-01

2022-02-04T20:44:18-02224c8/solaris-amd64-oraclerel
2021-11-29T19:45:58-f598e29/solaris-amd64-oraclerel
2021-11-17T04:31:45-633d8c1/solaris-amd64-oraclerel
2021-10-31T15:36:50-243c5ae/solaris-amd64-oraclerel
2021-10-01T18:18:46-0d65c27/solaris-amd64-oraclerel
2021-10-01T17:18:34-dab16c1/solaris-amd64-oraclerel
2021-10-01T16:17:08-f19b2d5/solaris-amd64-oraclerel
2021-10-01T14:06:36-96d3ba8/solaris-amd64-oraclerel
2021-09-17T20:26:51-974b016/solaris-amd64-oraclerel
2021-08-16T23:32:14-8ff16c1/solaris-amd64-oraclerel
2021-08-16T14:26:45-b0fba64/solaris-amd64-oraclerel
2021-04-28T19:52:58-07e006d/solaris-amd64-oraclerel
2021-04-26T18:54:39-0d1280c/solaris-amd64-oraclerel
2021-04-23T21:42:57-c0e1301/solaris-amd64-oraclerel
2021-04-22T16:15:44-d3853fb/solaris-amd64-oraclerel
2021-04-16T16:36:57-2f0e5bf/solaris-amd64-oraclerel
2021-04-16T16:11:09-04e1176/solaris-amd64-oraclerel
2021-04-15T12:38:11-dba2eab/solaris-amd64-oraclerel
2021-04-06T04:48:09-a25c584/solaris-amd64-oraclerel
2021-03-18T14:09:57-db4adb1/solaris-amd64-oraclerel

This is the same test that was observed to be flaky in #41561 (CC @henvic, @ianlancetaylor, @cherrymui).

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 7, 2022
@bcmills bcmills added this to the Go1.19 milestone Feb 7, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Feb 7, 2022

It's interesting that the test is blocked on this subprocess in particular, because we explicitly plumb in a value for the -test.timeout flag that should be short enough to complete before the test finishes:
https://cs.opensource.google/go/go/+/master:src/os/signal/signal_test.go;l=746-748;drc=e3eaedb5cf623d0836533573db4140749da42768

@bcmills bcmills changed the title os/signal: timeout in TestNotifyContextNotifications on solaris-amd64-oraclerel os/signal: timeouts in TestNotifyContextNotifications on solaris-amd64-oraclerel Feb 7, 2022
@henvic
Copy link
Contributor

henvic commented Feb 8, 2022

Thanks for pinging me, @bcmills.
I'll try to reproduce it on a Solaris VM during this weekend.

@bcmills
Copy link
Contributor Author

bcmills commented Mar 8, 2022

One more this month:

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2022-02-08

2022-03-07T18:19:44-7dc6c5e/solaris-amd64-oraclerel

I notice that we're not actually calling t.Parallel() in the subtests. I wonder if the test is actually timing out in the first subtest, and then starting the second subtest without actually having enough time left to clean up.

It occurs to me that the os/exec changes proposed in #50436 would probably help here too: if either subtest times out, we ought to terminate the subprocess with SIGQUIT to figure out what it's stuck on.

@henvic
Copy link
Contributor

henvic commented Mar 13, 2022

* t.Parallel() is called in TestNotifyContextNotifications immediately after the checkNotifyContext conditional.

I was able to reproduce this (building checked out @ commit 0261fa6) when running all.bash on a SunOS solaris 5.11 11.4.0.15.0 i86pc i386 i86pc virtual machine. In my case, it's happening about once every ~20 runs or so. I wasn't able to identify cause. Weirdly, increasing the memory of the virtual machine seems to have reduced the number of times this happened.

@bcmills
Copy link
Contributor Author

bcmills commented Mar 30, 2022

I wasn't able to identify cause. Weirdly, increasing the memory of the virtual machine seems to have reduced the number of times this happened.

Since solaris/amd64 is not a first class port, if we can't identify the cause of the failure it would be ok to skip the test on that platform.

greplogs --dashboard -md -l -e 'panic: test timed out .*\n(?:\ngoroutine \d+ .*:\n(.+\n)*)*^os/signal\.TestNotifyContextNotifications' --since=2022-03-08

2022-03-28T19:46:07-1a9972e/solaris-amd64-oraclerel

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/398454 mentions this issue: os/signal: run TestNotifyContextnotifications subtests in parallel

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/398454 mentions this issue: os/signal: run TestNotifyContextNotifications subtests in parallel

gopherbot pushed a commit that referenced this issue Apr 5, 2022
If we run out of time on the first subtest, we don't want to start the
second one with essentially no time remaining. (Moreover, there is no
compelling reason not to run these tests in parallel, since they send
signals to separate processes.)

For #51054.

Change-Id: I0424e08c3a9d2db986568d5a5c004859b52f7c51
Reviewed-on: https://go-review.googlesource.com/c/go/+/398454
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Henrique Vicente de Oliveira Pinto <henriquevicente@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Apr 5, 2022

CL 398454 might fix this. I'll keep an eye on it.

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 5, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Apr 19, 2022

The thrilling conclusion, from https://build.golang.org/log/efc839b64c4b5ef269ecf21fa28685fbeb600e18:

--- FAIL: TestNotifyContextNotifications (0.00s)
    --- FAIL: TestNotifyContextNotifications/once (0.00s)
        signal_test.go:741: starting test with less than 22s remaining
    --- FAIL: TestNotifyContextNotifications/multiple (0.00s)
        signal_test.go:741: starting test with less than 22s remaining
FAIL
FAIL	os/signal	158.438s

The large settleTime that we had to set on solaris is just too slow. I guess for the interim we can try scaling it back, but longer term we should rework the tests not to send signals to the main test process to begin with. 🤔

@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 19, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/403199 mentions this issue: os/signal: scale back the solaris-amd64-oraclerel settle time

gopherbot pushed a commit that referenced this issue Apr 30, 2022
The settleTime is arbitrary. Ideally we should refactor the test to
avoid it (using subprocesses instead of sleeps to isolate tests from
each others' delayed signals), but as a shorter-term workaround let's
try scaling it back to match linux/ppc64 (the other builder that
empirically requires a longer settleTime).

For #51054.
Updates #33174.

Change-Id: I574fffaadd74c52c13d63974e87f20b6d3cf3c4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/403199
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@ianlancetaylor
Copy link
Contributor

Rolling forward to 1.20. Please comment if you disagree. Thanks.

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "os/signal" && test == "TestNotifyContextNotifications"
2023-01-09 17:35 solaris-amd64-oraclerel go@87105e5b os/signal.TestNotifyContextNotifications (log)
panic: test timed out after 3m0s

runtime.gopark(0xc00011bd68?, 0xc00011bdd0?, 0xe5?, 0xc8?, 0xc00011bdf8?)
	/opt/golang/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/proc.go:361 +0xd6 fp=0xc00011bd50 sp=0xc00011bd30 pc=0x436d56
runtime.chanrecv(0xc0000b0690, 0xc00011be4f, 0x1)
	/opt/golang/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/chan.go:577 +0x56c fp=0xc00011bde0 sp=0xc00011bd50 pc=0x405f2c
runtime.chanrecv1(0xc000098030?, 0x1?)
	/opt/golang/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/chan.go:440 +0x18 fp=0xc00011be08 sp=0xc00011bde0 pc=0x405958
testing.(*T).Run(0xc0000036c0, {0x5385d4?, 0x2?}, 0xc000092400)
	/opt/golang/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1487 +0x37a fp=0xc00011bea8 sp=0xc00011be08 pc=0x4bc85a
os/signal.TestNotifyContextNotifications(0xc0001e4340?)
	/opt/golang/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/signal/signal_test.go:733 +0x1d1 fp=0xc00011bf70 sp=0xc00011bea8 pc=0x4cfb11
testing.tRunner(0xc0000036c0, 0x543928)

watchflakes

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 17, 2023
@gopherbot
Copy link
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@bcmills bcmills reopened this Feb 17, 2023
@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 17, 2023
@bcmills bcmills reopened this Feb 17, 2023
@bcmills
Copy link
Contributor Author

bcmills commented May 12, 2023

Duplicate of #60152

@bcmills bcmills marked this as a duplicate of #60152 May 12, 2023
@bcmills bcmills closed this as not planned Won't fix, can't repro, duplicate, stale May 12, 2023
@golang golang locked and limited conversation to collaborators May 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Solaris
Projects
Status: Done
Development

No branches or pull requests

4 participants