Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: TestZeroTimer failures #66006

Closed
gopherbot opened this issue Feb 28, 2024 · 19 comments
Closed

time: TestZeroTimer failures #66006

gopherbot opened this issue Feb 28, 2024 · 19 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@gopherbot
Copy link

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestZeroTimer (120.77s)
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.009812517s

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 28, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-02-28 16:44 linux-amd64-longtest go@d1e8dc25 time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (120.77s)
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.009812517s
2024-02-28 16:44 windows-amd64-longtest go@1df6db8e time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (362.37s)
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0098644s
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0108545s
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0010985s

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-02-28 16:44 linux-amd64-longtest-race go@d1e8dc25 time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (243.44s)
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.004081005s
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.00057097s

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-02-28 16:44 gotip-linux-386-longtest go@d1e8dc25 time.TestZeroTimer (log)
=== RUN   TestZeroTimer
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.010132448s
--- FAIL: TestZeroTimer (122.94s)
2024-02-28 16:44 gotip-linux-amd64-longtest-race go@1df6db8e time.TestZeroTimer (log)
=== RUN   TestZeroTimer
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.009149723s
--- FAIL: TestZeroTimer (127.47s)

watchflakes

@bcmills bcmills added release-blocker compiler/runtime Issues related to the Go compiler and/or runtime. labels Feb 29, 2024
@bcmills bcmills added this to the Go1.23 milestone Feb 29, 2024
@rsc
Copy link
Contributor

rsc commented Feb 29, 2024

I looked into these. I can reproduce on my laptop using go test -count=50 -run=Zero -timeout=3m in go/src/time. The failure rate is about 1 per 10 runs, so count=50 gets about 5 failures. Each failure takes 2m, so the 3m timeout speeds up the check. Without a failure the command takes 0.5m.

The failure is reliably introduced by one of my CLs (2fb5ef8) but then reliably fixed by a CL a few commits later (c6888d9) [transcript below].

The problem is that when I fixed modtimer, I corrected a deadlock (as noted in the CL 564125 description) by unlocking t before locking pp.timersLock. But that created a window when the timer was in state timerWaiting without actually being in a heap. Something about that state caused the timer not to be noticed as things idled, even though it got inserted into the heap shortly afterward. Moving the unlock back down after the doaddtimer reintroduces the hypothetical deadlock but also closes the window and fixes the test failure. Of course, we don't want to do that.

In CL 564127 I closed that window by setting t.pp before unlocking t, so that a future modification of t would at least set t.pp.timerModifiedEarliest (minNextWhen). That avoids the deadlock but still closes the window and fixes the test failure.

This window where t was in state timerWaiting but t.pp == nil is also the root cause of the ppc64le builder failure that I worried might be bad atomics. Phew.

In pending CL 564977, I reintroduce the window where the timer is not yet in the heap, but now the timer is not marked as being in a heap, so it's not a window of inconsistency like in the failures. I also wrote a func-based TestZeroTimer (since the channel-based one will be handled by the new channel special case in pending CL 512355), verified that it failed in the same failure window, and verified that it still passes with CL 546977. I will mail that out as a new CL and mark it as fixing this GitHub issue, but the flake is already gone.

/cc @aclements

% git foreach ed07e16ffd..7649c9b212 go test -count=50 -run=Zero -timeout=5m
f8654408d6 go/build: add check for allowed vendor packages

0728e2b139 runtime: simplify, speed up adjusttimers

77f0bd01fb runtime: delete clearDeletedTimers

8a493a6672 runtime: move all timer-locking code into time.go

4a3c3ec996 runtime: delete addtimer

b2bd7ebafc runtime: merge timerModifiedEarlier and timerModifierLater

09bfea95cf runtime: merge timerRemoving into timerModifying

3be36e9b20 runtime: merge timerMoving into timerModifying

ed0ad591d3 runtime: merge timerNoStatus into timerRemoved

db8c6c8c7a runtime: merge timerDeleted into timerModified

51df232b12 runtime: use timer.lock in deltimer

b5a99eaa94 runtime: add timer lock to lock ranking

2fb5ef889b runtime: use timer.lock in modtimer
FAIL
58911599e8 runtime: use timer.lock in cleantimers
FAIL
d1e8dc25ff runtime: use timer.lock in runtimer
FAIL
1df6db8e4f runtime: use timer.lock in adjusttimers
FAIL
c6888d9264 runtime: use timer.lock in moveTimers

a155a2f8a1 runtime: change timer.status to timer.state

370c8e983b runtime: use methods for timer

b6753baaed cmd/compile: compile len(ch) as call to runtime.chanlen

022d16dece runtime: move per-P timers state into its own struct

ea15ebfb38 time: garbage collect unstopped Tickers and Timers

98981caf4e runtime: rename timers fields for clarity

dc5b02899d runtime: update timers.len with Store instead of Add

63a5dfc55c runtime: fix timers.wakeTime inaccuracy race

7649c9b212 make.bat: find bootstrap toolchain on LUCI gomotes

@gopherbot
Copy link
Author

Change https://go.dev/cl/568255 mentions this issue: time: make sure tests avoid the special-case channel code

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-02-28 16:44 gotip-darwin-amd64-longtest go@58911599 time.TestZeroTimer (log)
=== RUN   TestZeroTimer
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.061488161s
--- FAIL: TestZeroTimer (122.37s)

watchflakes

@rsc
Copy link
Contributor

rsc commented Feb 29, 2024

Not sure why this only just appeared, but 5891159 is one of the commits in the known bad range.

@mknyszek mknyszek closed this as completed Mar 6, 2024
gopherbot pushed a commit that referenced this issue Mar 8, 2024
Many of the tests in package time are about proper manipulation
of the timer heap. But now NewTimer bypasses the timer heap
except when something is blocked on the associated channel.
Make the tests test the heap again by using AfterFunc instead of
NewTimer.

In particular, adds a non-chan version of TestZeroTimer, which
was flaky-broken and then fixed by CLs in the cleanup stack.
This new tests makes sure we notice if it breaks again.

Fixes #66006.

Change-Id: Ib59fc1b8b85ef5a21e72fe418c627c9b8b8a083a
Reviewed-on: https://go-review.googlesource.com/c/go/+/568255
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-03-08 21:09 windows-amd64-longtest go@85bbb121 time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (122.22s)
    sleep_test.go:578: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0054632s

watchflakes

@gopherbot gopherbot reopened this Mar 8, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-03-11 19:54 windows-amd64-longtest go@a18aa0e3 time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (245.81s)
    --- FAIL: TestZeroTimer/impl=chan (122.13s)
        sleep_test.go:636: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0028434s
    --- FAIL: TestZeroTimer/impl=func (123.68s)
        sleep_test.go:636: Expected time to get value from Timer channel in less than 2 sec, took 2m0.0058075s

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-03-12 16:49 linux-arm64-longtest go@f83102cf time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (122.61s)
    --- FAIL: TestZeroTimer/impl=func (121.43s)
        sleep_test.go:636: Expected time to get value from Timer channel in less than 2 sec, took 2m0.007207148s

watchflakes

@aclements
Copy link
Member

@rsc, it looks like this is still happening.

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

🤔

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

Unlike the last time, I can't reproduce this with go test time -count=50 -run=Zero on my laptop. The timing here doesn't look like it is still happening so much as it recurred after a week of being fixed. The first new failure being on the day I submitted more time CLs probably indicates it is one of these:

2024-03-08 85bbb12 runtime: fix timers.wakeTime inaccuracy race
2024-03-08 3ad5cd4 runtime: update timers.len with Store instead of Add
2024-03-08 2171e62 runtime: rename timers fields for clarity
2024-03-08 e38be31 time: gracefully handle ts.zombies underflow

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

Reproduced under stress; seems to be 85bbb12, ironically. Does not reproduce in my local client, though. I will bisect it later today to see if it is really fixed.

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-03-13 13:58 linux-arm64-longtest go@6e5398ba time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (122.57s)
    --- FAIL: TestZeroTimer/impl=chan (121.18s)
        sleep_test.go:636: Expected time to get value from Timer channel in less than 2 sec, took 2m0.005818595s

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "time" && test == "TestZeroTimer"
2024-03-13 18:22 linux-amd64-longtest go@38723f2b time.TestZeroTimer (log)
--- FAIL: TestZeroTimer (121.77s)
    --- FAIL: TestZeroTimer/impl=func (121.02s)
        sleep_test.go:636: Expected time to get value from Timer channel in less than 2 sec, took 2m0.009710965s

watchflakes

@rsc
Copy link
Contributor

rsc commented Mar 13, 2024

Lots more reproduction. stress ./time.test -test.run=Zero -test.timeout=3m -test.count=20 consistently fails at least a few runs out of a hundred, even with all my pending CLs applied. Definitely started at 85bbb12 although I don't see why. Maybe my fault, maybe a latent bug I uncovered.

@rsc
Copy link
Contributor

rsc commented Mar 14, 2024

The bug is clear from the diff of 85bbb12 if you read it carefully enough. Oops. Fix mailed.

@gopherbot
Copy link
Author

Change https://go.dev/cl/571196 mentions this issue: runtime: fix lost sleep causing TestZeroTimer flakes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
Archived in project
Development

No branches or pull requests

5 participants