
time: TestAfterQueuing fails on windows builders #10680

Closed
alexbrainman opened this issue May 4, 2015 · 11 comments

@alexbrainman (Member) commented May 4, 2015

TestAfterQueuing has been failing on the Windows builders recently:

http://build.golang.org/log/b9cb3c3357a4a8d874bf0669d6ee52f39f0be277
http://build.golang.org/log/62b0ebbe08b4b4b099b301a426e793286c2ed155
http://build.golang.org/log/bb3bbce19d8e09e3377629c491ca584faecc8ed0

with

--- FAIL: TestAfterQueuing (2.46s)
    sleep_test.go:235: attempt 0 failed: After(160ms) arrived at 532.1925ms, expected [150ms,360ms]
    sleep_test.go:235: attempt 1 failed: after slot 3, expected 1
    sleep_test.go:235: attempt 2 failed: after slot 3, expected 1
    sleep_test.go:239: after slot 3, expected 1
FAIL
FAIL    time    5.842s

The test does not fail for me here, but I suspect the builders are CPU-busy building Go and running tests - we run both windows-386 and windows-amd64 in parallel. I added code similar to cpuHogger from runtime/pprof to TestAfterQueuing, and with it I can make the test fail in a VM here (though not on real hardware). I suggest we increase delta to 200 * Millisecond (it is 100 * Millisecond at the moment). That fixes the failure for me and might make the builders more reliable. How does that sound?

Alex

@minux (Member) commented May 4, 2015 (comment minimized)

@bradfitz (Member) commented May 4, 2015

Correct. Each Windows build is isolated.

@alexbrainman (Member, Author) commented May 5, 2015

Yes, builders are running in separate VMs now. Thanks for correcting me @minux and @bradfitz. I am not sure why this test fails more often in VMs.

As I mentioned before, I can reliably break this test by applying this:

diff --git a/src/time/sleep.go b/src/time/sleep.go
index e7a2ee2..673d433 100644
--- a/src/time/sleep.go
+++ b/src/time/sleep.go
@@ -97,8 +97,10 @@ func sendTime(c interface{}, seq uintptr) {
    // Used in NewTicker, dropping sends on the floor is
    // the desired behavior when the reader gets behind,
    // because the sends are periodic.
+   t := Now()
+   print("sendTime: t=", t.String(), "\n")
    select {
-   case c.(chan Time) <- Now():
+   case c.(chan Time) <- t:
    default:
    }
 }
diff --git a/src/time/sleep_test.go b/src/time/sleep_test.go
index 2fce753..508f689 100644
--- a/src/time/sleep_test.go
+++ b/src/time/sleep_test.go
@@ -220,10 +220,60 @@ func TestAfterStop(t *testing.T) {
    }
 }

+var (
+   salt1 = 0
+   salt2 = 0
+)
+
+func cpuHog1() {
+   foo := salt1
+   for i := 0; i < 1e5; i++ {
+       if foo > 0 {
+           foo *= foo
+       } else {
+           foo *= foo + 1
+       }
+   }
+   salt1 = foo
+}
+
+func cpuHog2() {
+   foo := salt2
+   for i := 0; i < 1e5; i++ {
+       if foo > 0 {
+           foo *= foo
+       } else {
+           foo *= foo + 2
+       }
+   }
+   salt2 = foo
+}
+
+func cpuHogger(f func()) {
+   // We only need to get one 100 Hz clock tick, so we've got
+   // a 25x safety buffer.
+   // But do at least 500 iterations (which should take about 100ms),
+   // otherwise TestCPUProfileMultithreaded can fail if only one
+   // thread is scheduled during the 250ms period.
+   t0 := Now()
+   for i := 0; i < 500 || Since(t0) < 5000*Millisecond; i++ {
+       f()
+   }
+}
+
 func TestAfterQueuing(t *testing.T) {
+   go func() {
+       c := make(chan int)
+       go func() {
+           cpuHogger(cpuHog1)
+           c <- 1
+       }()
+       cpuHogger(cpuHog2)
+       <-c
+   }()
    // This test flakes out on some systems,
    // so we'll try it a few times before declaring it a failure.
-   const attempts = 3
+   const attempts = 1
    err := errors.New("!=nil")
    for i := 0; i < attempts && err != nil; i++ {
        delta := 100 * Millisecond
@@ -247,7 +297,9 @@ type afterResult struct {
 }

 func await(slot int, result chan<- afterResult, ac <-chan Time) {
-   result <- afterResult{slot, <-ac}
+   t := <-ac
+   print("await: t=", t.String(), "\n")
+   result <- afterResult{slot, t}
 }

 func testAfterQueuing(t *testing.T, delta Duration) error {

The output:

C:\>u:\test -test.short -test.v -test.run=TestAfterQueuing
=== RUN TestAfterQueuing
sendTime: t=2015-05-04 22:44:37.4667308 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4667308 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4677073 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4686838 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4696603 -0700 PDT
sendTime: t=2015-05-04 22:44:37.4696603 -0700 PDT
await: t=2015-05-04 22:44:37.4667308 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4667308 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4686838 -0700 PDT
await: t=2015-05-04 22:44:37.4696603 -0700 PDT
await: t=2015-05-04 22:44:37.4677073 -0700 PDT
await: t=2015-05-04 22:44:37.4696603 -0700 PDT
--- FAIL: TestAfterQueuing (0.23s)
        sleep_test.go:285: attempt 0 failed: After calls returned out of order: [0 5 3 6 6 6 1 1 2 7 9 4 8]
        sleep_test.go:289: After calls returned out of order: [0 5 3 6 6 6 1 1 2 7 9 4 8]
FAIL

The timers themselves fire fine - I verified this by adding debug prints to runtime/time.go (not shown). But when sendTime puts the timer events into the channels returned by time.After, the events are received (in await) out of order - see the output above. I think it is just how our scheduler behaves when there is other runnable code or, perhaps, when running in a VM. I don't know why this happens only on Windows; perhaps it is related to the Windows timer granularity of about 15ms.

The only thing I can think of is to increase delta further, from 100ms to 200ms or even more. That makes the adjusted test PASS.

Looking for suggestions. Thank you.

Alex

@dvyukov @randall77

@mdempsky (Member) commented May 5, 2015

OpenBSD is still flaky too, even after bumping delta up to 100ms (#9903, d12b532). It probably makes sense that for a second retry (i.e., the third attempt) we try an even longer delta, like 500ms or a full second. Since the tests are only occasionally flaky right now, an extra-conservative delta shouldn't affect our average test times much.

Overall, the test as written is inherently flaky. There is no guarantee the await goroutines will execute in the desired order: they might be scheduled across multiple CPUs, or simply scheduled out of order on a single CPU.

A non-flaky alternative would be to select over all of the time.After channels at once (probably using reflect.Select) and verify that whenever the runtime chooses to receive from time.After(N*delta), we have either already received from all of the shorter-duration timeouts or can receive from them now without blocking.

@alexbrainman (Member, Author) commented May 5, 2015

Thank you for the input, @mdempsky. I agree with your conclusion that it is reasonable to increase delta up to 0.5s or 1s. These flakes are the exception rather than the rule, so this shouldn't increase test times much. I will see what others say and then send the change.

Alex

@randall77 (Contributor) commented May 5, 2015

It looks like general heavy load can throw this test off. Is there any way to run this test in isolation? Maybe somewhere under test/... instead of as a Test* function (which runs in parallel with lots of other things)?

@minux (Member) commented May 5, 2015 (comment minimized)

@alexbrainman (Member, Author) commented May 5, 2015

Can we use testing.TestMain somehow?

Alex

@minux (Member) commented May 5, 2015 (comment minimized)

@bradfitz (Member) commented May 5, 2015

The problem is that on multi-core machines, go test -short std will build and run multiple tests in parallel.

I'm not sure that's happening on our builders, but regardless the test shouldn't care.

Let's focus on making the test robust rather than on running it in isolation.

@gopherbot commented May 6, 2015

CL https://golang.org/cl/9795 mentions this issue.

@bradfitz closed this in b4bc7b4 on May 6, 2015

@golang locked and limited conversation to collaborators on Jun 25, 2016
