runtime: scheduler change causes Delve's function call injection to fail intermittently #61732
I'm familiar with the AFAICT, here's what the code guarantees:
If this sounds right to you, then there's a bug I'm not seeing. If you expect AFAICT, RE: the change you linked, it's possible with the
We only observe the program after it calls INT 3, so it would be the
I do expect that and it's possible that this is the problem (although I'm also seeing different failures that I can't explain like this, so maybe there are multiple problems? weird). If making everything run on the same thread is complicated I'm ok with signalling the starting goroutine explicitly in one of the ways I was describing in https://go.dev/cl/229299.
Thanks for confirming.
Nah, I don't think it'd be too complicated. I think this just needs to prevent preemption until we get to riiiight before the dispatch. The dispatch call is nosplit, so we just need to get there without calling into the scheduler. I'll send a patch to try out shortly.
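A minimal sketch of what "the dispatch call is nosplit" buys you, assuming a made-up function name (dispatchStub); this is not the runtime's actual debugcall code:

```go
// Hedged illustration only; this is not the runtime's debugcall code, and
// dispatchStub is a hypothetical stand-in for the real dispatch function.
package main

import "fmt"

// The //go:nosplit directive makes the compiler omit the usual stack-growth
// check for this function, so entering it can never trigger a stack growth
// (and therefore never a detour through the scheduler). The trade-off is
// that the function must keep its stack usage very small.
//
//go:nosplit
func dispatchStub(x int) int {
	return x + 1
}

func main() {
	fmt.Println(dispatchStub(41)) // prints 42
}
```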
Oh, also, is this relatively easy to reproduce for you? Is it as simple as just running that Delve test in a loop a bunch of times?
Here's an attempt to close off the possible preemption points: https://go.dev/cl/515637
Yes. Just
I'll test it tomorrow morning.
Change https://go.dev/cl/515637 mentions this issue:
I can reproduce, trying out the change...
I think it might fix it? I'm having a hard time telling because I'm also seeing a failure involving
EDIT: I can make that
Oh wow, I completely missed that. My apologies. I'll take another look. Thanks.
@aarzilli I updated the CL to
@mknyszek I haven't tried patchset 3 yet but if it's equivalent to patchset 2 with
Patchset 3 just changes the commit message, so yay, I think that means we have a fix. :)
Right now debuggers like Delve rely on the new goroutine created to run a debugcall function to run on the same thread it started on, up until it hits itself with a SIGINT as part of the debugcall protocol. That's all well and good, except debugCallWrap1 isn't particularly careful about not growing the stack. For example, if the new goroutine happens to have a stale preempt flag, then it's possible a stack growth will cause a roundtrip into the scheduler, possibly causing the goroutine to switch to another thread.

Previous attempts to just be more careful around debugCallWrap1 were helpful, but insufficient. This change takes everything a step further and always locks the debug call goroutine and the new goroutine it creates to the OS thread.

For #61732.

Change-Id: I038f3a4df30072833e27e6a5a1ec01806a32891f
Reviewed-on: https://go-review.googlesource.com/c/go/+/515637
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alessandro Arzilli <alessandro.arzilli@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
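For context on what "locks ... to the OS thread" means, here is a hedged user-space sketch, assuming Linux (syscall.Gettid) and using the exported runtime.LockOSThread rather than the runtime-internal mechanism the CL itself uses:

```go
// Hedged user-space sketch, not the CL's runtime-internal code.
// runtime.LockOSThread pins the calling goroutine to its current OS thread,
// so even if it re-enters the scheduler it cannot migrate to another thread.
// syscall.Gettid is Linux-only and is used here only to observe the thread ID.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"syscall"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()

		start := syscall.Gettid()
		runtime.Gosched() // deliberately pass through the scheduler
		fmt.Println("still on starting thread:", start == syscall.Gettid())
	}()
	wg.Wait()
}
```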
@aarzilli @cherrymui Do you think this is worth backporting? I think it meets the bar since it's low-risk (it just impacts a specific part of the debugging experience) and the change is fairly small.
In the past the entire debugCallV1 implementation for arm64 was backported, IIRC.
Ha, OK. Backport it is.

@gopherbot Please open a backport issue for Go 1.21. This issue causes significant debugger friction due to a problem that was latent in earlier releases but only really started cropping up in Go 1.21 due to a scheduler change in that release. The fix is very low risk, since it only impacts the debugger function call protocol, which should have no impact on production applications. The fix is also fairly small and straightforward.
Backport issue(s) opened: #62509 (for 1.21). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/526576 mentions this issue:
Right now debuggers like Delve rely on the new goroutine created to run a debugcall function to run on the same thread it started on, up until it hits itself with a SIGINT as part of the debugcall protocol. That's all well and good, except debugCallWrap1 isn't particularly careful about not growing the stack. For example, if the new goroutine happens to have a stale preempt flag, then it's possible a stack growth will cause a roundtrip into the scheduler, possibly causing the goroutine to switch to another thread.

Previous attempts to just be more careful around debugCallWrap1 were helpful, but insufficient. This change takes everything a step further and always locks the debug call goroutine and the new goroutine it creates to the OS thread.

For #61732.
Fixes #62509.

Change-Id: I038f3a4df30072833e27e6a5a1ec01806a32891f
Reviewed-on: https://go-review.googlesource.com/c/go/+/515637
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alessandro Arzilli <alessandro.arzilli@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
(cherry picked from commit d9a4b24)
Reviewed-on: https://go-review.googlesource.com/c/go/+/526576
Starting with change ad94306 (https://go.dev/cl/501976) I see frequent (but intermittent) failures of TestCallFunction of github.com/go-delve/delve/pkg/proc. The failure has a few different presentations; a frequent one is this one:

(Produced passing -v -log=fncall to TestCallFunction.)

Delve keeps a map from goroutine IDs to call injection states; whenever it sees a SIGINT in one of the debugCall functions (runtime.debugCallV2, debugCall32, &c) it reads the goroutine ID and searches the map for the corresponding call injection state. If none is found, it looks for a call injection state that has just been started on the same thread.
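For concreteness, here is a hedged sketch of that two-step lookup; this is not Delve's actual code, and the type, field, and function names are invented:

```go
// Invented names for illustration; not Delve's real data structures.
package main

import "fmt"

type callInjection struct {
	startThreadID int  // OS thread the injection was started from
	justStarted   bool // injection started, new goroutine not yet observed
}

// findInjection first looks up the stop by goroutine ID; if nothing is
// registered for that goroutine yet, it falls back to an injection that was
// just started on the same thread, relying on the runtime keeping the new
// goroutine on its starting thread until the first stop.
func findInjection(byGoroutine map[int64]*callInjection, gid int64, tid int) *callInjection {
	if ci, ok := byGoroutine[gid]; ok {
		return ci
	}
	for _, ci := range byGoroutine {
		if ci.justStarted && ci.startThreadID == tid {
			return ci
		}
	}
	return nil
}

func main() {
	m := map[int64]*callInjection{
		1: {startThreadID: 550064, justStarted: true}, // injection started from goroutine 1
	}
	// A stop on a brand-new goroutine (ID 59) can only be matched through the
	// thread it started on, since the map has no entry for goroutine 59 yet.
	fmt.Println(findInjection(m, 59, 550064) != nil) // true: matched by thread
	fmt.Println(findInjection(m, 59, 550069) != nil) // false: the reported failure mode
}
```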
The assumption here is that when the runtime spawns a new goroutine to handle the call injection it will use the same thread to run it, initially. This was agreed upon in https://go-review.googlesource.com/c/go/+/229299:
The logs above suggest that this invariant is being violated: the injection starts on thread 550064, goroutine 1, but the first stop is on thread 550069, goroutine 59.
I'm not 100% sure that the bug was introduced by this commit; I think I saw a similar failure in CI before, but I could never reproduce it locally (or see it frequently enough in CI). It could be that the bug already existed and this change made it orders of magnitude more likely.
cc @prattmic @mknyszek @aclements