
runtime/pprof: frequent test timeouts on builders since mid-August #41014

Closed
bcmills opened this issue Aug 24, 2020 · 16 comments
Labels
FrozenDueToAge · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) · release-blocker
Milestone

Comments

@bcmills
Member

bcmills commented Aug 24, 2020

2020-08-22T03:46:17-13e41bc/android-386-emu
2020-08-19T03:49:06-bd519d0/android-386-emu

Also https://storage.googleapis.com/go-build-log/1be3e58d/openbsd-amd64-64_e5acea9b.log.

CC @hyangah

@bcmills bcmills added the Testing (An issue that has been verified to require only test changes, not just a test failure) and NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one) labels Aug 24, 2020
@bcmills bcmills added this to the Backlog milestone Aug 24, 2020
@bcmills bcmills changed the title from "runtime/pprof: test timing out on slow builders" to "runtime/pprof: test occasionally timing out on slow builders" Aug 24, 2020
@bcmills
Member Author

bcmills commented Aug 24, 2020

2020-08-24T20:22:16-5c76382/linux-386-sid

Given the rash of failures in August, this is looking like a 1.16 regression, so marking as release-blocker.

@bcmills bcmills modified the milestones: Backlog, Go1.16 Aug 24, 2020
@bcmills bcmills removed the Testing (An issue that has been verified to require only test changes, not just a test failure) label Aug 24, 2020
@bcmills bcmills changed the title from "runtime/pprof: test occasionally timing out on slow builders" to "runtime/pprof: frequent test timeouts on builders since mid-August" Aug 24, 2020
@bcmills
Member Author

bcmills commented Sep 1, 2020

https://storage.googleapis.com/go-build-log/8921b4e8/android-amd64-emu_69e10eb0.log (a TryBot run)

@golang/osp-team, could we bisect and revert the CL that introduced this apparent regression?

@bcmills
Member Author

bcmills commented Sep 9, 2020

@bcmills
Member Author

bcmills commented Sep 9, 2020

Output from @aclements' findflakes tool (github.com/aclements/go-misc/findflakes):

~$ findflakes --grep='(?ms)Test killed.*: ran too long.*FAIL\s+(runtime|net/http)/pprof'
First observed 797124f 17 Aug 19:45 2020 (209 commits ago)
Last observed  bdb480f 09 Sep 07:50 2020 (3 commits ago)
60% chance failure is still happening
16% failure probability (33 of 207 commits)
Likely culprits:
   16% 797124f cmd/go/internal/test: keep looking for go command flags after ambiguous test flag
   13% 1b86bdb cmd/test2json: do not emit a final Action if the result is not known
   11% f30044a cmd/go/internal: remove some users of par.Work
    9% 2ac4bf3 cmd/go: add span for modload.LoadBuildList
    8% ebccba7 cmd/go: process -debug-trace flag for cmd/test and cmd/vet
    7% 38fea3a cmd/go: add tracing instrumentation to load.TestPackagesFor
    6% 15b98e5 cmd/go: mark trace flows between actions
    5% a26d687 cmd/go: propagate context into Action.Func calls
    4% 023d497 cmd/go: add trace events for each action
    3% 49003da cmd/go/internal/trace: add function to distinguish goroutines
Past failures:
  796786c 14 May 16:20 2020 (isolated failure)
  80fb18e 05 Oct 22:17 2019 to 20bf6a4 26 Nov 12:03 2019
    0.53% failure probability (5 of 750 commits)

None of those “likely culprits” seem all that likely to me given the failure mode. (Perhaps it's more subtle than I expect, or perhaps the actual culprit didn't make the list.)

@hyangah
Contributor

hyangah commented Sep 9, 2020

Interestingly, the stack trace looks quite broken. :-(
From https://storage.googleapis.com/go-build-log/39e65426/linux-amd64-race_d8c01c0c.log:

SIGQUIT: quit
PC=0x4a33e1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x8048a8, 0x80, 0x0, 0x0, 0x7ffc00000000, 0x4a2fbc, 0x22b, 0x36bca9af, 0x7ffcb85e5718, 0x43dd3f, ...)
	/workdir/go/src/runtime/sys_linux_amd64.s:585 +0x21
runtime.futexsleep(0x8048a8, 0x0, 0xffffffffffffffff)
	/workdir/go/src/runtime/os_linux.go:44 +0x46
runtime.notesleep(0x8048a8)
	/workdir/go/src/runtime/lock_futex.go:159 +0x9f
runtime.stopm()
	/workdir/go/src/runtime/proc.go:1910 +0xc5
runtime.findrunnable(0xc000023000, 0x0)
	/workdir/go/src/runtime/proc.go:2471 +0xa7f
runtime.schedule()
	/workdir/go/src/runtime/proc.go:2674 +0x2d7
runtime.park_m(0xc000582300)
	/workdir/go/src/runtime/proc.go:2842 +0x9d
runtime.mcall(0x0)
	/workdir/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [chan receive, 3 minutes]:
testing.(*T).Run(0xc000001680, 0x6acb97, 0x16, 0x6b8820, 0x1)
	/workdir/go/src/testing/testing.go:1203 +0x5f4
testing.runTests.func1(0xc000001680)
	/workdir/go/src/testing/testing.go:1475 +0xa7
testing.tRunner(0xc000001680, 0xc000073ce0)
	/workdir/go/src/testing/testing.go:1157 +0x203
testing.runTests(0xc00000e3e0, 0x800a00, 0x1d, 0x1d, 0xbfce2a3db123de41, 0x29e90a1585, 0x804280, 0xc00006c148)
	/workdir/go/src/testing/testing.go:1473 +0x625
testing.(*M).Run(0xc000124600, 0x0)
	/workdir/go/src/testing/testing.go:1381 +0x3b4
main.main()
	_testmain.go:99 +0x237

goroutine 13 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1202 +0x5bc

goroutine 14 [running]:
	goroutine running on other thread; stack unavailable
created by runtime/pprof.StartCPUProfile
	/workdir/go/src/runtime/pprof/pprof.go:784 +0x18b

rax    0xca
rbx    0x804760
rcx    0x4a33e3
rdx    0x0
rdi    0x8048a8
rsi    0x80
rbp    0x7ffcb85e56e0
rsp    0x7ffcb85e5698
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x3
r13    0x8045e0
r14    0xc000582300
r15    0x0
rip    0x4a33e1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
*** Test killed with quit: ran too long (4m0s).

@ianlancetaylor
Contributor

These two recent cases have an interesting stack trace:

2020-09-02T23:39:37-0e48c67/linux-arm64-packet
2020-09-01T21:23:24-6fc329b/linux-amd64-clang

goroutine 35140 [running]:
runtime.futex(0x6e5740, 0x200000080, 0x0, 0x0, 0xc000000000, 0x50, 0x48, 0x5b5ea0, 0xc0000a0eb0, 0x40bb8d, ...)
	/workdir/go/src/runtime/sys_linux_amd64.s:585 +0x21 fp=0xc0000a0e28 sp=0xc0000a0e20 pc=0x46e4e1
runtime.futexsleep(0x6e5740, 0x2, 0xffffffffffffffff)
	/workdir/go/src/runtime/os_linux.go:44 +0x46 fp=0xc0000a0e78 sp=0xc0000a0e28 pc=0x432506
runtime.lock2(0x6e5740)
	/workdir/go/src/runtime/lock_futex.go:106 +0x14d fp=0xc0000a0ec0 sp=0xc0000a0e78 pc=0x40bb8d
runtime.lockWithRank(...)
	/workdir/go/src/runtime/lockrank_off.go:22
runtime.lock(...)
	/workdir/go/src/runtime/lock_futex.go:47
runtime/pprof.readProfile(0x5f5e100, 0xc000308db0, 0xc000354b00, 0x1418, 0xc0005877a8, 0x4061cb, 0xc00034c240)
	/workdir/go/src/runtime/cpuprof.go:205 +0x45 fp=0xc0000a0f58 sp=0xc0000a0ec0 pc=0x466a65
runtime/pprof.profileWriter(0x5ff7b8, 0xc000308db0)
	/workdir/go/src/runtime/pprof/pprof.go:800 +0x6e fp=0xc0000a0fd0 sp=0xc0000a0f58 pc=0x55a5ae

They are trying to acquire cpuprof.lock. No other goroutine shows up as holding that lock. The fact that the code is calling runtime.futex tells us that the lock is either held or is in contention. Neither should be the case. The only code that calls readProfile is runtime/pprof.profileWriter, and it only calls it once every 100 milliseconds. There can only be one instance of profileWriter running at a time. Puzzling.

@ianlancetaylor
Contributor

OK, how about this.

runtime/pprof.StopCPUProfile calls runtime.SetCPUProfileRate(0), which acquires cpuprof.lock and calls setcpuprofilerate(0), which calls setThreadCPUProfiler(0) (which does nothing interesting), then grabs prof.signalLock, and then calls setProcessCPUProfiler(0), which calls setitimer to turn off profiling signals. BUT what if we get a SIGPROF signal in between grabbing prof.signalLock and calling setProcessCPUProfiler?

In that case the signal handler will call sigprof which will see that prof.hz is not (yet) 0, and will call cpuprof.add, which will grab prof.signalLock. BUT prof.signalLock is currently held. So the grab will loop trying to get it. So the signal handler will never return. So setcpuprofilerate will never return and StopCPUProfile will never return and the test will hang.

There is a comment in setcpuprofilerate that says

	// Stop profiler on this thread so that it is safe to lock prof.
	// if a profiling signal came in while we had prof locked,
	// it would deadlock.

but I don't see anything there that actually stops the profiler on this thread. So I think we have the very deadlock that that comment warns about. Perhaps it used to work but somehow changed (in which case it is probably my fault).

@ianlancetaylor
Contributor

Yes, this was my fault. Caused by https://golang.org/cl/240003.

@gopherbot

Change https://golang.org/cl/253758 mentions this issue: runtime: ignore SIGPROF if profiling disable for thread

@bcmills
Member Author

bcmills commented Sep 9, 2020

Thanks for tracking it down!

@golang golang locked and limited conversation to collaborators Sep 9, 2021
@erifan
Contributor

erifan commented Apr 11, 2022

So the grab will loop trying to get it. So the signal handler will never return. So setcpuprofilerate will never return and StopCPUProfile will never return and the test will hang.

@ianlancetaylor Sorry for replying on an issue that has been closed for a long time. I don't understand why setcpuprofilerate will never return: it only calls setProcessCPUProfiler(0) and then setitimer(_ITIMER_PROF, &itimerval{}, nil) before releasing prof.signalLock. Will these two functions block?

@ianlancetaylor
Contributor

@erifan The problem would occur if a SIGPROF signal was received while the lock was held. That could lead to a deadlock as the signal handler tried to acquire the lock. The lock would never be released because the thread holding the lock was interrupted by the signal.

@erifan
Contributor

erifan commented Apr 12, 2022

The lock would never be released because the thread holding the lock was interrupted by the signal.

Oh I got it, thanks @ianlancetaylor
