Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestStackWrapperStackPanic failure on windows-amd64-2016 #35773

Closed
bcmills opened this issue Nov 22, 2019 · 17 comments
Closed

runtime: TestStackWrapperStackPanic failure on windows-amd64-2016 #35773

bcmills opened this issue Nov 22, 2019 · 17 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Nov 22, 2019

https://storage.googleapis.com/go-build-log/07ca05df/windows-amd64-2016_8338c7a2.log

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
--- FAIL: TestStackWrapperStackPanic (0.00s)
    --- FAIL: TestStackWrapperStackPanic/sigpanic (0.00s)
        --- FAIL: TestStackWrapperStackPanic/sigpanic/CallersFrames (0.00s)
            stack_test.go:676: runtime.Callers
            stack_test.go:676: runtime_test.testStackWrapperPanic.func1.1
            stack_test.go:676: runtime.gopanic
            stack_test.go:676: runtime.panicmem
            stack_test.go:676: runtime.sigpanic
            stack_test.go:676: runtime.asyncPreempt
            stack_test.go:676: runtime_test.TestStackWrapperStackPanic.func1.1
            stack_test.go:676: runtime_test.testStackWrapperPanic.func1
            stack_test.go:676: testing.tRunner
            stack_test.go:676: runtime.goexit
            stack_test.go:684: panicking wrapper runtime_test.I.M missing from stack trace
FAIL
FAIL	runtime	27.302s

CC @aclements @cherrymui @mknyszek @danscales

@bcmills bcmills added this to the Go1.14 milestone Nov 22, 2019
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 22, 2019

Marking as release-blocker since runtime.asyncPreempt is involved. (We can re-evaluate that once we better understand the root cause.)

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 22, 2019

2019-11-21T14:55:12-8bbfc51/windows-amd64-2008
is the only occurrence in the dashboard.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 5, 2019

The failure is always when running with -test.cpu=1,2,4.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 5, 2019

The failure is always TestStackWrapperStackPanic/sigpanic.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 5, 2019

The test makes a call to I.M(nil) which is expected to call nil and therefore panic. It looks like if a preemption signal occurs as we make the call that a call to asyncPreempt is inserted and that call immediately panics. The stack trace is then not what we expect. I'm not sure why.

This doesn't seem all that crucial, though it does mean that in some failure cases we will print the wrong stack trace.

CC @aclements @cherrymui

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 6, 2019

If we preempt the thread right at the point of the exception, we may inject a asyncPreempt call, and then the exception handler runs (before asyncPreempt runs), which injects a sigpanic call. Then we could see sigpanic right on top of asyncPreempt. In traceback, normally we would not elide wrappers if it is the immediate caller of sigpanic, but we will if the immediate caller is asyncPreempt. If this is the case, we could see the stack trace like that.

I'm not familiar about the ordering of Windows' SuspendThread/ResumeThread and exception handling, so not sure if this could actually happen.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 20, 2019

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 2, 2020

This hasn't happened on the builders since 2019-12-09. The findflakes program reports only a 0.14% chance that the failure is still happening.

That said I have no idea what would have changed to fix this.

@aclements
Copy link
Member

@aclements aclements commented Jan 8, 2020

I can easily reproduce this at current master (817afe8) and at f6774bc with the following patch and script:

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4469,6 +4469,7 @@ func sysmon() {
 		if delay > 10*1000 { // up to 10ms
 			delay = 10 * 1000
 		}
+		delay = 10
 		usleep(delay)
 		now := nanotime()
 		next := timeSleepUntil()
@@ -4563,7 +4564,8 @@ type sysmontick struct {
 
 // forcePreemptNS is the time slice given to a G before it is
 // preempted.
-const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+//const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+const forcePreemptNS = 10 * 1000 // 10us
 
 func retake(now int64) uint32 {
 	n := 0
cd $GOROOT/src/runtime
VM=$(gomote create windows-amd64-2016)
GOOS=windows go test -c && \
gomote put $VM runtime.test.exe && \
gomote run $VM ./runtime.test.exe -test.run TestStackWrapperStackPanic/sigpanic/CallersFrames -test.count 1000000

Running with ./runtime.test -test.run TestStackWrapperStackPanic/sigpanic/CallersFrames -test.v, we can see the traceback when the test passes:

runtime.Callers
runtime_test.testStackWrapperPanic.func1.1
runtime.gopanic
runtime.panicmem
runtime.sigpanic
runtime_test.I.M
runtime_test.TestStackWrapperStackPanic.func1.1
runtime_test.testStackWrapperPanic.func1
testing.tRunner
runtime.goexit

(normally it stops when it reaches runtime_test.I.M, but I tweaked it.)

What I don't fully understand is how we go from asyncPreempt to sigpanic in the bad traces. This is a nil panic, so the exception is synchronous. We should be injecting the sigpanic call at the exact PC/SP where the exception happens, so I don't see how asyncPreempt slips in there.

@aclements
Copy link
Member

@aclements aclements commented Jan 8, 2020

If we tweak the test to print the PC in the stack trace, we see that the runtime.asyncPreempt frame is at the very first instruction of asyncPreempt. This suggests that we injected that call and then immediately injected the sigpanic. Maybe it's possible in Windows for SuspendThread to stop a thread between when it causes an exception and when the VEH runs? If that's the case, preemptM would see the thread at the faulting instruction, tweak the frame to inject the asyncPreempt call, then resume the thread, which would run the VEH and tweak the frame again to inject the sigpanic call.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jan 8, 2020

Maybe it's possible in Windows for SuspendThread to stop a thread between when it causes an exception and when the VEH runs?

This is also my previous guess. One possibility would be printing the wrapper function if the previous frame is asyncPreempt. This will make the wrapper included in the stack trace. But the stack trace will look like

sigpanic
asyncPreempt
wrapper
...

which looks like asyncPreempt itself panics. I think this can happen with nil pointer dereference panics in general.

Another possibility is that when injecting a sigpanic, if the PC is the entry PC of asyncPreempt, just overwrite the PC instead of pushing another frame. This will make the stack trace correct. Losing preemption is probably not a problem, as sigpanic enters the runtime, which will probably preempt it synchronously at some point.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 8, 2020

Change https://golang.org/cl/213879 mentions this issue: runtime: overwrite asyncPreempt PC when injecting sigpanic on Windows

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 8, 2020

Change https://golang.org/cl/213957 mentions this issue: runtime: print wrapper function in stack trace if following asyncPreempt

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jan 8, 2020

I implemented both approaches. Using @aclements's repro, both fix this.

Personally, I prefer option 2.

@aclements
Copy link
Member

@aclements aclements commented Jan 9, 2020

Thanks!

I assume by option 2 you mean overwriting the asyncPreempt frame. I certainly like that option better, as it's much more localized to the cause of the issue. Changing the traceback logic, on the other hand, is several dominoes away from the actual cause.

@gopherbot gopherbot closed this in 17e9732 Jan 9, 2020
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jan 9, 2020

Yes, by option 2 I meant overwriting the asyncPreempt frame. Thanks.

@golang golang locked and limited conversation to collaborators Jan 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.