-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: failure in TestRaiseException
on windows-amd64-2012
#49681
Comments
One more this week. Looking like a regression of some sort.
|
Hmmm, this might have been flaky for a while. I didn't realize it then, as I thought it was related to the crashdump change and it "went away" after a sync, but it was probably a flakiness I saw. My guess is that the recent regression is caused by changed timing due to the scavenger changes, but that's just a guess. The flakiness should be fixed anyway, I'll put it on the list. |
I spent some time staring at this.
I also run Go runtime calls I suspect in this failure Alex. |
I was suspecting that something was spooky here |
I don't understand what you are saying.
I guess, your testing confirms that we do have new problem here. How do I reproduce your "flaky" testing? Thank you. Alex |
You're totally right and I'm clearly wrong. I think you didn't understand what I was saying because what I was saying made no sense :) I was specifically throwing EXCEPTION_ACCESS_VIOLATION in my test program, why my theory was potentially a little less out there back then...
I just used a small program calling RaiseException through cgo:
It became much less probable to succeed in dumping a stack trace during 1.18 development, but as you can see, it already needed a sleep to work to begin with, so it wasn't water-proof from the start. However, after trying to reproduce now, I think it's a red herring, as after @randall77 's https://go-review.googlesource.com/c/go/+/364174, the program above consistently works, even without the sleep. In fact, after that change, none of my sometimes flaky tests programs trying to create exceptions in different ways ever fail to stack dump. That makes no sense to me, looking at the change itself. And also, it's completely opposite to what shows up in the builders, it rather started happening in the builders after that change, or a few days after. So now I have no idea, not even a bad one.
Thanks! Patrik |
Alex |
Yea, probably not. I'll see if I can make it happen on a gomote with the testcase itself, to see if there's any way to reproduce. |
I managed to get some repros with some logging. Not conclusive in any way, but, as you mostly had concluded already, it does reach exceptionhandler (in signal_windows.go), which does return _EXCEPTION_CONTINUE_SEARCH, but then neither firstcontinuehandler nor lastcontinuehandler are called, as far as I can see they are not crashing, they are not called at all. |
I don't know if it's really relevant, but it appears the change that I mentioned above, 364174, also is makes this error much more probable. I've not been able to reproduce it once before that change, but by spamming the gomotes, I can reproduce within a couple of minutes after that change. I still have no clue why that change would affect exception handling, of course. It might just change the timing or something. |
It seems that in a successful run, |
Yes. I don't know if the trampoline functions in asm are called though. There is some code there beyond my comprehension in sys_windows_amd64:sigtramp (no surprise there :)), also using tls, which was very vaguely related to the change 364174. It does do some stuff to make sure it's running on g0 stack, I don't know if that could be where the flakiness comes from? |
I think sigtramp is also not called (for Yeah, I also cannot reproduce before CL 364174 or at tip just reverting CL 364174. Weird. That code runs very early at startup. I wonder how it could cause nondeterministic failure. It also seems to fail more likely if I set GOGC=1. I tried disabling async preemption but it doesn't seem to have any effect. |
Ah, interesting. I tried (somewhat arbitrarily) to re-register the exception handlers before returning _EXCEPTION_CONTINUE_SERACH from exceptionhandler, that did not affect what happened at all (not sure what I expected though). It almost seems like the return value seen by windows exception handling mechanism is not seeing _EXCEPTION_CONTINUE_SEARCH, but something else? |
It seems if I increase the |
Heh, there's something going on there, definitely. If I make that specific change, the cgo program that started emitting stack dumps after 364174 stopped dumping stack dumps again. The change on that line clearly affects exception handlers, but I have no idea in which way or why. Can it just be some arbitrary memory corruption going on? The stack being slightly different just makes one thing more probable than another, but in fact, something else is broken? I'm completely confused by the switching between ABIs in sigtramp, and I don't really get what ABI is used to call the Go code. Is it called using ABI0? What makes it not being called using ABIInternal? Wrappers? I might need an ABI for dummies lesson :) |
|
I added some print and it doesn't seem we clobbered anything. For both success run and failed run the word right below SP at entry of rt0_go is 0, and in both case we stored the same value there (this store https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=155), and the first a few words above that are never changed and the same for both cases. So it is not that we clobbered something on that stack. Still weird. |
Thanks for checking and explaining the ABI and the wrapper! At least now I understand sigtramp, which seems to do what's expected. This is indeed a weird one... |
I also tried to change sigtramp to always return 0 (which is not always correct but is all what we need for this test) and it can still fail. So it is not the return value. |
Yea, I printed out AX at the end of sigtramp and it showed the same thing. So I can't see anyway we're returning the wrong value. Unfortunately... I don't know what's up here. One thing is that the ones that are not run are "ContinueHandlers" and the one that's run is an "ExceptionHandlers". They are just run in sequence according to the docs I could find, but there might be some criteria for not running the continuehandlers that we hit. I couldn't find anything and the docs are sparse. MS docs I found basically say exceptionhandlers run, then if the last one returns EXCEPTION_CONTINUE_SEARCH, continuehandlers are run. But maybe there's some other criteria, like the stack has to look in a certain way (which would be kind of strange as they allegedly get their own stack frame). Maybe someone with access to the source could shed some light? @jstarks - do you by any chance have time to look at if there's some criteria that will skip continuehandlers? Also I've had problems reproducing with the latest dev version lately, there's absolutely zero stuff that should affect this checked in lately as I can see. I will keep digging. |
I think that's reasonable, but if we go that route should we add a Better still, if they test only happened to work before is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.) |
Sounds reasonable.
I can't come up with a way to increase the probability either, unfortunately. Maybe someone else has some idea? |
Change https://golang.org/cl/378254 mentions this issue: |
@bufflig, @bcmills , yes, i agree this is not a new problem, I think it may be caused by the fact that the specific version of windows is not friendly enough to the go stack. So I think it is not stable to throw the exception of c in lastcontinuehandler, maybe we should throw the exception and print the stack immediately when we find the exception of c in the exceptionhandler, no longer rely on VCH, do you agree? func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
if !isgoexception(info, r) {
sp := unsafe.Pointer(uintptr(r.rsp))
for true {
vsp := uint64(*((*uintptr)(sp)))
if vsp < getStackBase() && vsp >= getStackLimit() {
r.rip = *((*uint64)(unsafe.Pointer(uintptr(sp) - 8)))
r.rsp = *((*uint64)(unsafe.Pointer(sp)))
}
delta := uintptr(sys.StackAlign)
sp = add(sp, delta)
if uint64(uintptr(sp)) > uint64(uintptr(add(unsafe.Pointer(r.sp()), 8*50))) ||
uint64(uintptr(sp)) >= getStackBase() {
sp = add(sp, -delta)
break
}
}
return _EXCEPTION_CONTINUE_SEARCH
}
...
|
This is in relation to #49681 Change-Id: I32ad8b506cf8fb0a94b15c3cc8b1eaf5af728c59 Reviewed-on: https://go-review.googlesource.com/c/go/+/378254 Run-TryBot: Patrik Nyblom <pnyb@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Trust: Patrik Nyblom <pnyb@google.com>
@hanchaoqun Yea, I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exceptionhandlers and continuehandlers, but I assume there's a reason. Anyone has any knowledge of the history behind that? @bcmills - I did add skipping the test case as flaky, should we remove the release blocker tag? |
Do you mean printing stack trace and exiting in exceptionhandler function? I don't see how that is possible. exceptionhandler receive ALL process exceptions. Some of these exceptions are raised and handled in external to Go code. For example, DLL raises some exception, and then DLL exception handling code handles this exception and stops passing that exception further up the exception chain. Currently exceptions like that are ignored by Go runtime. Do you propose we crash the Go process instead? I hope that explains well enough. Alex |
Ah, right - I see how my idea/thought would not work. Thanks! |
This is in relation to golang#49681 Change-Id: I32ad8b506cf8fb0a94b15c3cc8b1eaf5af728c59 Reviewed-on: https://go-review.googlesource.com/c/go/+/378254 Run-TryBot: Patrik Nyblom <pnyb@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Trust: Patrik Nyblom <pnyb@google.com>
Change https://go.dev/cl/444616 mentions this issue: |
windows-amd-2012 builder seems to have some problems handling exception thrown in external C code which is affecting TestVectoredHandlerExceptionInNonGoThread. The issue is known and discussed in #49681. This Cl skips the offending test on windows-amd-2012. Change-Id: I7ca4353c9e531f0d75ac6a8dbd809acfa1f15bf9 Reviewed-on: https://go-review.googlesource.com/c/go/+/444616 Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Run-TryBot: Quim Muntal <quimmuntal@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org>
windows-amd-2012 builder seems to have some problems handling exception thrown in external C code which is affecting TestVectoredHandlerExceptionInNonGoThread. The issue is known and discussed in golang#49681. This Cl skips the offending test on windows-amd-2012. Change-Id: I7ca4353c9e531f0d75ac6a8dbd809acfa1f15bf9 Reviewed-on: https://go-review.googlesource.com/c/go/+/444616 Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Run-TryBot: Quim Muntal <quimmuntal@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/451294 mentions this issue: |
Modify skip rule for TestRaiseException to trigger on both the base builder (windows-amd64-2012) and the newcc canary builder (windows-amd64-2012-newcc). Updates #49681. Change-Id: I132f9ddd102666b68ad04cc661fdcc2cd841051a Reviewed-on: https://go-review.googlesource.com/c/go/+/451294 Auto-Submit: Than McIntosh <thanm@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Than McIntosh <thanm@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>
Change https://go.dev/cl/452436 mentions this issue: |
…d64-2012-* Modify skip rule for TestVectoredHandlerExceptionInNonGoThread to trigger on both the base builder (windows-amd64-2012) and the newcc canary builder (windows-amd64-2012-newcc). Updates #49681. Change-Id: I58109fc2e861b943cb66be0feec348671be84ab3 Reviewed-on: https://go-review.googlesource.com/c/go/+/452436 Run-TryBot: Than McIntosh <thanm@google.com> Auto-Submit: Than McIntosh <thanm@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>
Modify skip rule for TestRaiseException to trigger on both the base builder (windows-amd64-2012) and the newcc canary builder (windows-amd64-2012-newcc). Updates golang#49681. Change-Id: I132f9ddd102666b68ad04cc661fdcc2cd841051a Reviewed-on: https://go-review.googlesource.com/c/go/+/451294 Auto-Submit: Than McIntosh <thanm@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Than McIntosh <thanm@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>
…d64-2012-* Modify skip rule for TestVectoredHandlerExceptionInNonGoThread to trigger on both the base builder (windows-amd64-2012) and the newcc canary builder (windows-amd64-2012-newcc). Updates golang#49681. Change-Id: I58109fc2e861b943cb66be0feec348671be84ab3 Reviewed-on: https://go-review.googlesource.com/c/go/+/452436 Run-TryBot: Than McIntosh <thanm@google.com> Auto-Submit: Than McIntosh <thanm@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>
FYI, this looks like it is happening on the 1.19 release branch as well. Should we back-port 452436? |
Yes, absolutely, we won't be able to cut a release otherwise. |
@gopherbot please consider for 1.19 backport |
Backport issue(s) opened: #56983 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/454035 mentions this issue: |
OK, backport CL is ready to go. |
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException'
2021-11-18T02:19:50-f1cc529/windows-amd64-2012
The test dates from 2014. I can find only one such faillure in the logs, and it is very recent — marking as release-blocker until we can determine whether this is a regression (CC @bufflig).
The text was updated successfully, but these errors were encountered: