Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: NetBSD Sleep hang #22981

Closed
bradfitz opened this issue Dec 4, 2017 · 7 comments

Comments

Projects
None yet
4 participants
@bradfitz
Copy link
Member

commented Dec 4, 2017

Here's a recent failure from netbsd/amd64 that is after @zoulasc's 66fcf45 (which itself had passed at https://build.golang.org/):

https://build.golang.org/log/a692b2772f336f248b150c4d8df6b77f1f5a3d90

panic: test timed out after 3m0s

goroutine 43 [running]:
testing.(*M).startAlarm.func1()
	/tmp/workdir/go/src/testing/testing.go:1195 +0xe8
created by time.goFunc
	/tmp/workdir/go/src/time/sleep.go:172 +0x44

goroutine 1 [chan receive, 3 minutes]:
testing.(*T).Run(0xc4200de1e0, 0x5e59ea, 0x15, 0x5ef7d0, 0x473a01)
	/tmp/workdir/go/src/testing/testing.go:822 +0x2eb
testing.runTests.func1(0xc4200de000)
	/tmp/workdir/go/src/testing/testing.go:1048 +0x64
testing.tRunner(0xc4200de000, 0xc420041df8)
	/tmp/workdir/go/src/testing/testing.go:774 +0xc4
testing.runTests(0xc4200903a0, 0x6e8aa0, 0x15, 0x15, 0x40f9c9)
	/tmp/workdir/go/src/testing/testing.go:1046 +0x2bb
testing.(*M).Run(0xc4200d5f80, 0x0)
	/tmp/workdir/go/src/testing/testing.go:963 +0x15c
main.main()
	_testmain.go:82 +0x151

goroutine 198 [chan receive, 3 minutes]:
testing.(*T).Parallel(0xc4200de0f0)
	/tmp/workdir/go/src/testing/testing.go:694 +0x1e3
runtime/pprof.TestEmptyCallStack(0xc4200de0f0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:828 +0x40
testing.tRunner(0xc4200de0f0, 0x5ef868)
	/tmp/workdir/go/src/testing/testing.go:774 +0xc4
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:821 +0x2ca

goroutine 40 [sleep, 3 minutes]:
time.Sleep(0x2faf080)
	/tmp/workdir/go/src/runtime/time.go:101 +0x170
runtime/pprof.TestAtomicLoadStore64(0xc4200de1e0)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:900 +0x21a
testing.tRunner(0xc4200de1e0, 0x5ef7d0)
	/tmp/workdir/go/src/testing/testing.go:774 +0xc4
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:821 +0x2ca

goroutine 41 [sleep, 3 minutes]:
time.Sleep(0x5f5e100)
	/tmp/workdir/go/src/runtime/time.go:101 +0x170
runtime/pprof.profileWriter(0x608060, 0xc42000c0a0)
	/tmp/workdir/go/src/runtime/pprof/pprof.go:748 +0x6d
created by runtime/pprof.StartCPUProfile
	/tmp/workdir/go/src/runtime/pprof/pprof.go:733 +0x10e

goroutine 42 [runnable]:
runtime.Gosched()
	/tmp/workdir/go/src/runtime/proc.go:263 +0x1e
runtime/pprof.TestAtomicLoadStore64.func1(0xc4202036f0, 0xc4200bc000)
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:896 +0x24
created by runtime/pprof.TestAtomicLoadStore64
	/tmp/workdir/go/src/runtime/pprof/pprof_test.go:894 +0x20d
FAIL	runtime/pprof	207.526s

Note that goroutine 40 has been in a sleep for 3 minutes, but that sleep is supposed to be only 50 milliseconds:

// Check that there is no deadlock when the program receives SIGPROF while in                                                                             
// 64bit atomics' critical section. Used to happen on mips{,le}. See #20146.                                                                              
func TestAtomicLoadStore64(t *testing.T) {
        f, err := ioutil.TempFile("", "profatomic")
        if err != nil {
                t.Fatalf("TempFile: %v", err)
        }
        defer os.Remove(f.Name())
        defer f.Close()
        
        if err := StartCPUProfile(f); err != nil {
                t.Fatal(err)
        }
        defer StopCPUProfile()

        var flag uint64
        done := make(chan bool, 1)
        
        go func() {
                for atomic.LoadUint64(&flag) == 0 {
                        runtime.Gosched()
                }
                done <- true
        }()
        time.Sleep(50 * time.Millisecond)
        atomic.StoreUint64(&flag, 1)
        <-done
}

The GCE VM that runs each build & its tests has 4 cores, FWIW.

I can reproduce with gomote. The test will succeed for 3 or 7 or 20 times before hanging.

bradfitz@gdev:~/go/src/runtime$ gomote run user-bradfitz-netbsd-amd64-8branch-0 go/bin/go test -short -v -run=TestAtomicLoadStore64 -count=5000 runtime/pprof
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.33s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.29s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
^C
bradfitz@gdev:~/go/src/runtime$ gomote run user-bradfitz-netbsd-amd64-8branch-0 go/bin/go test -short -v -run=TestAtomicLoadStore64 -count=5000 runtime/pprof
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.27s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.27s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.27s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.29s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.27s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.33s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.27s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.30s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.28s)
=== RUN   TestAtomicLoadStore64
--- PASS: TestAtomicLoadStore64 (0.34s)
=== RUN   TestAtomicLoadStore64
^C

/cc @bsiegert

@bradfitz bradfitz added this to the Go1.10 milestone Dec 4, 2017

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Dec 4, 2017

@aclements, any clues on places to look?

@krytarowski

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2017

This looks like fixed in a patch committed to HEAD (NetBSD-current) and pending on https://releng.netbsd.org/cgi-bin/req-8.cgi as ticket 417.

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Dec 4, 2017

@krytarowski, I saw that patch but it didn't seem related to this issue. At least, the stack traces above don't obviously imply it. Which system call are we getting stuck in above?

Also, how often does the NetBSD releng team approve pullups? I've been waiting for 413 for 5 days now.

/cc @zoulasc

@krytarowski

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2017

We don't have deadlines for pullups, as currently every person in releng is a volunteer in his spare time. The 413 cannot be processed by Martin, so we are waiting for Soren. Today Soren processed OpenSSH update.. so next should be virtio.

@krytarowski

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2017

Module Name:	src
Committed By:	snj
Date:		Mon Dec  4 19:47:05 UTC 2017

Modified Files:
	src/sys/arch/i386/conf [netbsd-8]: GENERIC

Log Message:
Pull up following revision(s) (requested by martin in ticket #413):
	sys/arch/i386/conf/GENERIC: revision 1.1171
PR port-i386/52774: vioscsi missing in i386 kernel


To generate a diff of this commit:
cvs rdiff -u -r1.1156.2.6 -r1.1156.2.7 src/sys/arch/i386/conf/GENERIC
@zoulasc

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2017

This is new and different :-) Here's what's going on. I believe that the test expects to timeout at some point, but the profiling timer keeps interrupting _lwp_park and thus restarting it from scratch... Only to be interrupted again by the profiling timer before it times out... Here's a fix with an explanation...
http://www.netbsd.org/~christos/lwp-park-eintr.diff

@gopherbot

This comment has been minimized.

Copy link

commented Dec 4, 2017

Change https://golang.org/cl/81855 mentions this issue: runtime: fix NetBSD CPU spin in lwp_park when CPU profiling is active

@gopherbot gopherbot closed this in 2ff2eab Dec 5, 2017

@golang golang locked and limited conversation to collaborators Dec 5, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.