runtime: TestStackGrowth "finalizer did not run" flakes #19381

Open
bradfitz opened this Issue Mar 3, 2017 · 21 comments

@bradfitz
Member

bradfitz commented Mar 3, 2017

@broady noticed TestStackGrowth "finalizer did not run" on freebsd-386 in #18598 (comment)

I just cleared all the Solaris builds from the dashboard and rebuilt them with 4x the CPU and they're now happy, except for:

https://build.golang.org/log/84bc7b4f864592e44e7964d8892ef2d9752f2754

--- FAIL: TestStackGrowth (49.50s)
	stack_test.go:114: finalizer did not run
FAIL
FAIL	runtime	102.054s
@aclements

Member

aclements commented Jun 7, 2017

Seems to have migrated to freebsd/arm. It's happening quite often:

$ greplogs -dashboard -md -E "finalizer did not run" -l | tail -n 10
2017-06-05T19:55:31-6a42568/freebsd-arm-paulzhol
2017-06-05T19:56:34-51711d1/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/linux-arm-arm5spacemonkey
2017-06-05T19:59:49-b1af539/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/linux-mips64
2017-06-06T15:48:54-3a27f28/freebsd-arm-paulzhol
2017-06-06T19:50:00-3c1914f/freebsd-arm-paulzhol
2017-06-06T20:20:41-a838191/freebsd-arm-paulzhol

It happens with some regularity on linux-arm-arm5spacemonkey, which has a gomote.

I haven't been able to reproduce the failure yet, though I have noticed that TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

@aclements

Member

aclements commented Jun 7, 2017

TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

@aclements

Member

aclements commented Jun 7, 2017

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

I believe this is a red herring. The 8MB comes from numerators in vlop_arm_test.go, but it's all non-scannable anyway. Adding // +build ignore to this file doesn't speed up TestStackGrowth.
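
For reference, a minimal sketch of the file header that experiment uses (the old-style constraint must appear before the package clause and be followed by a blank line; the package name below is illustrative):

// +build ignore

// With the constraint above, the file is excluded from "go build" and
// "go test" unless the "ignore" tag is explicitly requested.
package runtime_test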

@aclements

Member

aclements commented Jun 8, 2017

There's some useful discussion of this problem at #19805 (which I'm closing as a dup). @eliasnaur summarized it as:

Increasing the timeout in stack_test.go:114 to 30 seconds or checking out 44ed88a (the parent of CL 37520) both make the problem go away.

CL 37520 made runtime.GC concurrent. It's odd that this would matter, since the test is effectively single-threaded (spread across a few synchronized goroutines). It's possible this is just slightly slower than STW GC and the ~770 GCs done between triggering the finalizer and passing the test take too long on slow hardware. Or it's possible this somehow perturbed finalizers, so the single forced GC cycle that's supposed to start the finalizer doesn't.

@gopherbot

gopherbot commented Jun 8, 2017

CL https://golang.org/cl/45138 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 8, 2017

runtime: more diagnostics for TestStackGrowth
This adds diagnostics so we can tell if the finalizer has started, in
addition to whether or not it has finished.

Updates #19381.

Change-Id: Icb7b1b0380c9ad1128b17074828945511a6cca5d
Reviewed-on: https://go-review.googlesource.com/45138
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@aclements

Member

aclements commented Jun 9, 2017

Looking at the few builds on freebsd/arm since 45138 went in, all but one of them only say "finalizer did not run", which means the finalizer did start and it's just taking a long time to run.

@aclements

Member

aclements commented Jun 9, 2017

@eliasnaur, since you can reproduce this, could I get you to do a test for me on android/arm? Could you check out 44ed88a and run the runtime test with -v to get the run time of TestStackGrowth? I'd like to know how long TestStackGrowth took before CL 37520 to get a sense for whether it was just below the threshold before or if there was a more serious regression.

@eliasnaur

Contributor

eliasnaur commented Jun 10, 2017

Of course. I ran GOOS=android GOARCH=arm ../bin/go test -short -v runtime and got (among others):

--- PASS: TestStackGrowth (3.43s)

If there's anything else you need, please ping me again.

@aclements

Member

aclements commented Jul 10, 2017

@eliasnaur, can you reproduce this if you just comment out the t.Parallel() at the top of TestStackGrowth?
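
For context, a minimal sketch of the pattern in question (illustrative only, not the actual stack_test.go source):

package runtime_test

import "testing"

func TestStackGrowth(t *testing.T) {
	t.Parallel() // commenting this call out makes the test serial again:
	// a test that calls t.Parallel is paused until the serial tests have
	// finished, then runs concurrently with every other parallel test,
	// sharing the machine's CPU with them.

	// ... body of the test elided ...
}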

@aclements

Member

aclements commented Jul 10, 2017

For my own record, this is what I've been using to try (unsuccessfully) to reproduce this:

VM=$(gomote create android-arm-wiko-fever)
gomote push $VM
gomote run $VM /bin/sh -c '
PATH=$PATH:$PWD/go/bin && 
cd go/src &&
./make.bash &&
cd runtime &&
go test -c &&
adb shell mkdir -p /data/local/tmp/goroot &&
adb push runtime.test /data/local/tmp/goroot/runtime.test'
gomote run $VM /bin/sh -c 'adb shell /data/local/tmp/goroot/runtime.test -test.short -test.cpu=1,2,4 -test.v' |& tee /tmp/x

I did notice once that on -cpu 4, TestStackGrowth took ~25 seconds (but did not time out). The builder has been really uncooperative, though, so I've only managed to run that once.

@eliasnaur

Contributor

eliasnaur commented Jul 11, 2017

Initially I couldn't reproduce the problem on tip or on the version from my #19805 (comment). Then I remembered that the Android builder had a problem around the same time where stuck Go test processes would slowly clog up the device's resources, slowing it down and tripping timing-sensitive tests such as TestStackGrowth.

So, in summary: I cannot reproduce the problem on a freshly rebooted device, but I can reproduce it if I run CPU-heavy processes on the device in parallel with the Go tests.

To answer your question, commenting out t.Parallel does help, even on a loaded device: six consecutive successful runs without t.Parallel, and with t.Parallel the test failed on the first run thereafter.

I have since modified the Android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

@aclements

Member

aclements commented Jul 11, 2017

To answer your question, commenting out t.Parallel does help, even on a loaded device: six consecutive successful runs without t.Parallel, and with t.Parallel the test failed on the first run thereafter.

Thanks!

I think this all makes sense. The test started failing when I made runtime.GC() concurrent because it's a parallel test: before I made that change, all of the expensive work in the test was STW, so it wasn't really running in parallel; after the change it was fighting with all of the other parallel tests for CPU. Since concurrent GC claims 25% of the CPU, it probably started taking ~4 times longer to run.

I have since modified the Android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

Not at all. It's still failing pretty regularly on the dashboard and it sounds like we now have a solution.

@gopherbot

gopherbot commented Jul 11, 2017

CL https://golang.org/cl/48110 mentions this issue.

gopherbot pushed a commit that referenced this issue Jul 11, 2017

runtime: make TestStackGrowth a serial test
TestStackGrowth is currently a parallel test. However, it depends on a
20 second timeout, which is already dubious in a parallel test, and
became really problematic on slow builders when runtime.GC switched to
triggering concurrent GC instead of STW GC. Before that change, the
test spent much of its time in STW GC, so it wasn't *really* parallel.
After that change, it was competing with all of the other parallel
tests and GC likely started taking ~4 times longer. On most builders
the whole test runs in well under a second, but on the slow builders
that was enough to push it over the 20 second timeout.

Fix this by making the test serial.

Updates #19381 (probably fixes it, but we'll have to wait and see).

Change-Id: I21af7cf543ab07f1ec1c930bfcb355b0df75672d
Reviewed-on: https://go-review.googlesource.com/48110
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Elias Naur <elias.naur@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements

Member

aclements commented Jul 17, 2017

This dropped from happening on a third of the commits to not happening at all with 286599f.

Thanks for the help @eliasnaur!

First observed d3f5e36 07 Mar 03:26 2017 (1482 commits ago)
Last observed  58ae050 11 Jul 14:01 2017 (53 commits ago)
0.00% chance failure is still happening
36% failure probability (515 of 1430 commits)

@aclements aclements closed this Jul 17, 2017

@bradfitz bradfitz reopened this Dec 1, 2017

@bradfitz bradfitz modified the milestones: Go1.9, Go1.10 Dec 1, 2017

@bradfitz

Member

bradfitz commented Dec 1, 2017

@aclements, this is back.

Here's a NetBSD one: https://build.golang.org/log/7b248c375b767afdaa5d716c075734fed740af6d

Did any recent change to runtime.GC() affect this test?

        // in finalizer
        wg.Add(1)
        go func() {
                defer wg.Done()
                done := make(chan bool)
                var started uint32
                go func() {
                        s := new(string)
                        SetFinalizer(s, func(ss *string) {
                                atomic.StoreUint32(&started, 1)
                                growStack()
                                done <- true
                        })
                        s = nil
                        done <- true
                }()
                <-done
                GC()
                select {
                case <-done:
                case <-time.After(20 * time.Second):
                        if atomic.LoadUint32(&started) == 0 {
                                t.Log("finalizer did not start")
                        }
                        t.Error("finalizer did not run")
                        return
                }
        }()
        wg.Wait()

If GC blocks until finalizers are all done, should the done channel be buffered?

@aclements

Member

aclements commented Dec 1, 2017

Finalizers are still run asynchronously. runtime.GC doesn't block for them.

I will note that the failure indicates that the finalizer did start; it's just not done yet. Maybe we should add some debugging that prints how long the finalizer has been running and how much progress it's made, so we can tell whether it's just taking a long time or whether it started really late.
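
A rough sketch of the kind of diagnostics meant here, assuming the finalizer records a start timestamp and an iteration counter (names and structure are illustrative, not the actual stack_test.go change):

package runtime_test

import (
	"runtime"
	"sync/atomic"
	"testing"
	"time"
)

// growInFinalizer mirrors the shape of the snippet quoted above: set a
// finalizer, force a GC, and wait for the finalizer with a timeout. On
// timeout, report how long the finalizer has been running and how far it got.
func growInFinalizer(t *testing.T, growStack func()) {
	done := make(chan bool)
	var startedNano int64 // unix nanoseconds; 0 until the finalizer starts
	var iters uint32      // growStack iterations completed by the finalizer

	go func() {
		s := new(string)
		runtime.SetFinalizer(s, func(ss *string) {
			atomic.StoreInt64(&startedNano, time.Now().UnixNano())
			for i := 0; i < 10; i++ {
				growStack()
				atomic.AddUint32(&iters, 1)
			}
			done <- true
		})
		s = nil
		done <- true
	}()
	<-done
	runtime.GC()
	select {
	case <-done:
	case <-time.After(20 * time.Second):
		if ns := atomic.LoadInt64(&startedNano); ns == 0 {
			t.Log("finalizer did not start")
		} else {
			t.Logf("finalizer started %v ago and finished %d iterations",
				time.Since(time.Unix(0, ns)), atomic.LoadUint32(&iters))
		}
		t.Error("finalizer did not run")
	}
}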

Here are the recent failures:

2017-09-22T20:07:52-f2ab41b/linux-386-clang
2017-09-22T20:08:03-eb32dbc/linux-386-clang
2017-09-22T20:08:16-0c01487/linux-386-clang
2017-09-27T13:48:21-90fdc45/freebsd-arm-paulzhol
2017-10-06T06:49:40-07e36af/linux-mips64
2017-10-10T01:02:27-bb0bfd0/linux-mips
2017-11-21T19:13:52-2951f90/freebsd-arm-paulzhol
2017-11-23T16:45:24-b72678f/solaris-amd64-oraclerel
2017-12-01T01:16:45-7e394a2/netbsd-amd64-8branch
2017-12-01T07:09:54-f22cf71/netbsd-amd64-8branch

Are these builders just slow/overloaded?

@gopherbot

gopherbot commented Dec 1, 2017

Change https://golang.org/cl/81556 mentions this issue: runtime: even more TestStackGrowth timeout debugging

@bradfitz

Member

bradfitz commented Dec 1, 2017

@aclements, the arm and mips builders might be slow, but shouldn't be overloaded. No clue about solaris. But netbsd-amd64 is a fresh isolated VM per build, and linux-386-clang is a fresh kubernetes pod per build. It shouldn't be oversubscribed or anything.

gopherbot pushed a commit that referenced this issue Dec 1, 2017

runtime: even more TestStackGrowth timeout debugging
This adds logging for the expected duration of a growStack, plus
progress information on the growStack that timed out.

Updates #19381.

Change-Id: Ic358f8350f499ff22dd213b658aece7d1aa62675
Reviewed-on: https://go-review.googlesource.com/81556
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

@ianlancetaylor

Contributor

ianlancetaylor commented Jul 9, 2018

Spot-checking a few of these failures, I see things like:

    stack_test.go:135: finalizer started 21.053487s ago and finished 210 iterations
    stack_test.go:137: first growStack took 25.316309s
    stack_test.go:138: finalizer did not run
--- FAIL: TestStackGrowth (58.21s)
    stack_test.go:135: finalizer started 22.339328s ago and finished 242 iterations
    stack_test.go:137: first growStack took 14.320874s
    stack_test.go:138: finalizer did not run

The "first growStack" numbers suggest that 20 seconds is simply not enough given ordinary variation, at least not on an emulator.

But I don't know why the test no longer fails on ordinary systems. It seems to have stopped failing there after CL 81556, even though that change only added logging.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Unplanned Jul 9, 2018

@gopherbot

gopherbot commented Jul 9, 2018

Change https://golang.org/cl/122587 mentions this issue: runtime: scale timeout in TestStackGrowth

gopherbot pushed a commit that referenced this issue Jul 9, 2018

runtime: scale timeout in TestStackGrowth
Updates #19381

Change-Id: I62b8b0cd7170941af77281eb3aada3802623ec27
Reviewed-on: https://go-review.googlesource.com/122587
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
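
A minimal sketch of what such scaling could look like, assuming the multiplier comes from an environment variable like GO_TEST_TIMEOUT_SCALE (the variable name and shape here are assumptions, not necessarily what the CL above does):

package runtime_test

import (
	"os"
	"strconv"
	"time"
)

// scaledTimeout multiplies a base timeout by an optional factor supplied by
// slow builders, so the 20-second wait in TestStackGrowth can stretch on
// emulated or heavily loaded machines. Hypothetical helper for illustration.
func scaledTimeout(base time.Duration) time.Duration {
	if s := os.Getenv("GO_TEST_TIMEOUT_SCALE"); s != "" {
		if n, err := strconv.Atoi(s); err == nil && n > 0 {
			return base * time.Duration(n)
		}
	}
	return base
}

With something like this, the select in the test would wait on time.After(scaledTimeout(20*time.Second)) instead of a hard-coded 20 seconds.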