runtime: TestNetpollBreak failures with "did not interrupt netpoll" on plan9 builders #39437

bcmills · 2020-06-06T05:16:45Z

2020-06-05T05:00:28-4489f4b/plan9-arm
2020-06-03T22:07:42-429d2c5/plan9-386-0intro
2020-06-03T14:51:34-5aaeda1/plan9-arm
2020-06-03T05:33:54-9b90491/plan9-arm
2020-06-02T18:36:30-e05695e/plan9-386-0intro

This batch of failures seems to coincide with CL 235820 (CC @0intro @millerresearch).

9pi · 2020-06-06T11:06:21Z

CL 235820 is just a coincidence. I can provoke the same failure on the release branch:

term% go version
go version go1.14.2 plan9/arm
term% go test -count 1000 -run TestNetpollBreak runtime
--- FAIL: TestNetpollBreak (5.80s)
    proc_test.go:1031: netpollBreak did not interrupt netpoll: slept for: 5.769816333s
FAIL

There's no network poller for Plan 9, just runtime/netpoll_stub.go which presumably exists to pretend to pass the tests. In essence this test is just one goroutine doing a notetsleep for 10 seconds, and another repeatedly doing a notewakeup with a 100 microsecond Usleep each time around the loop. How this can result in a >5 second delay is mysterious to me. I can see it's not swapping.

millerresearch · 2020-06-06T12:44:02Z

> I can see it's not swapping.

Also, it's not a garbage collection delay: I tried with GOGC=off and still see failures.

gopherbot · 2020-06-13T18:28:49Z

Change https://golang.org/cl/237698 mentions this issue: runtime: avoid lock starvation in TestNetpollBreak on Plan 9

millerresearch · 2020-06-13T18:30:22Z

> There's no network poller for Plan 9, just runtime/netpoll_stub.go which presumably exists to pretend to pass the tests.

My presumption was wrong: the stub "implementation" of netpoll is also used (currently on Plan 9 only) to support runtime timers. So the 10 second netpoll calls done by the test are contending with 10 minute netpoll calls which come from the overall go test timeout. The problem is that the runtime.lock which mediates this contention is unfair. When the 10 minute netpoll call is interrupted by netpollBreak it can be restarted and seize the lock before the 10 second netpoll call gets a chance. Repeated enough times, this starves the 10 second call sufficiently to time out the test.

CL 237698 inserts an osyield call to give the two callers a more even chance. It won't guarantee fairness, but a few hours of running the test suggest that it does well enough.
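The idea of yielding between releasing a lock and trying to take it again can be illustrated in user-space Go. This is a sketch under loud assumptions, not the code from CL 237698: `sync.Mutex` stands in for the runtime's internal lock (and, unlike that lock, has its own starvation avoidance), `runtime.Gosched` stands in for osyield, and the function name `contend` is hypothetical. It shows only where the yield is placed.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// contend starts a "greedy" goroutine that releases mu and re-takes it in a
// loop, while the caller acquires mu want times. It returns how many times
// the greedy goroutine acquired the lock in the meantime.
// If yield is true, the greedy goroutine yields after each release so that
// the other caller gets a more even chance at the lock.
func contend(want int, yield bool) int64 {
	var mu sync.Mutex
	var greedy int64
	var stop int32
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for atomic.LoadInt32(&stop) == 0 {
			mu.Lock()
			atomic.AddInt64(&greedy, 1)
			mu.Unlock()
			if yield {
				// Assumption: Gosched plays the role of osyield here.
				runtime.Gosched()
			}
		}
	}()

	for i := 0; i < want; i++ {
		mu.Lock()
		mu.Unlock()
	}
	atomic.StoreInt32(&stop, 1)
	wg.Wait()
	return atomic.LoadInt64(&greedy)
}

func main() {
	fmt.Println("greedy acquisitions without yield:", contend(1000, false))
	fmt.Println("greedy acquisitions with yield:   ", contend(1000, true))
}
```

The counts vary from run to run and between machines. As with the real fix, the yield improves the odds for the other caller but does not guarantee fairness.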

bcmills added the OS-Plan9 and NeedsInvestigation labels Jun 6, 2020

bcmills added this to the Unplanned milestone Jun 6, 2020

gopherbot closed this as completed in 9340bd6 Jun 14, 2020

golang locked and limited conversation to collaborators Jun 14, 2021

gopherbot added the FrozenDueToAge label Jun 14, 2021
