
runtime: scheduler change causes Delve's function call injection to fail intermittently #61732

Closed
aarzilli opened this issue Aug 3, 2023 · 20 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@aarzilli
Contributor

aarzilli commented Aug 3, 2023

Starting with change ad94306 (https://go.dev/cl/501976) I see frequent (but intermittent) failures of TestCallFunction in github.com/go-delve/delve/pkg/proc. The failure has a few different presentations; a frequent one is this:

    variables_test.go:1353: call "getAStruct(3).VRcvr(1)"
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc function call initiated <nil> frame size 65535 goroutine 1 (thread 550064)
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc findCallInjectionStateForThread thread=550069 goroutine=59
    variables_test.go:1367: call "getAStruct(3).VRcvr(1)": error "could not recover call injection state for goroutine 59 (thread 550069)"
--- FAIL: TestCallFunction (0.53s)

Produced by passing -v -log=fncall to TestCallFunction.

Delve keeps a map from goroutine IDs to call injection states; whenever it sees a SIGINT in one of the debugCall functions (runtime.debugCallV2, debugCall32, &c) it reads the goroutine ID and searches the map for the corresponding call injection state. If none is found, it looks for a call injection state that has just been started on the same thread.

The assumption here is that when the runtime spawns a new goroutine to handle the call injection it will use the same thread to run it, initially. This was agreed upon in https://go-review.googlesource.com/c/go/+/229299:

More generally, though, it's guaranteed to keep the whole call injection protocol on one OS thread and it won't let anything else happen on that thread during the protocol, so if you know what G you're on when you start the injection and associate that with the OS thread, you can check which G it's on when it reaches debugCallNNNN and you know they're tied together.

The logs above suggest that this invariant is being violated: the injection starts on thread 550064, goroutine 1, but the first stop is on thread 550069, goroutine 59.

I'm not 100% sure that the bug was introduced by this commit, I think I saw a similar failure in CI before but I could never reproduce it locally (or see it frequently enough in CI). It could be that the bug already existed and this change made it orders of magnitude more likely.

cc @prattmic @mknyszek @aclements

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 3, 2023
@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 3, 2023
@dr2chase
Contributor

dr2chase commented Aug 3, 2023

@mpratt @mknyszek

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

I'm familiar with the debugCallWrap path but not familiar with exactly the guarantees needed. What is the point until which the new goroutine needs to be running on the same thread?

AFAICT, here's what the code guarantees:

  • debugCallV2 through debugCallWrap is very careful to stay on the same thread.
  • Once debugCallWrap calls execute, starting the new goroutine to dispatch the function, all bets are off (mainly, debugCallWrap1 just isn't very careful). There could technically be a preemption point in between the goroutine starting to run and the debugCall32 or whatever being dispatched.

If this sounds right to you, then there's a bug I'm not seeing. If you expect debugCallWrap1 to be on the same M up until it calls debugCall32 (or whichever one) and hits the SIGINT, then that's probably where something is going wrong.

AFAICT, debugCallWrap1 could grow the stack, and if it happens to have a preemption flag set, it might call into the scheduler before dispatching the function. I'm not 100% sure how the preemption flag would be set, but we don't seem to clear it if a goroutine happens to die while the flag is set (which I think is harmless in general). If the g struct gets reused I suppose I could see the preemption flag still set? The stack bound is not going to be poisoned, though. It would have to try to grow the stack explicitly.

RE: the change you linked, it's possible with the wakep it's just more likely to end up on a different thread after a round-trip through the scheduler. In other words, the failure was rare before because most of the time it would just end up back on the same thread.

@aarzilli
Contributor Author

aarzilli commented Aug 3, 2023

I'm familiar with the debugCallWrap path but not familiar with exactly the guarantees needed. What is the point until which the new goroutine needs to be running on the same thread?

We only observe the program after it executes INT 3, so it would be the BYTE $0xcc instruction in debugCall.

If you expect debugCallWrap1 to be on the same M up until it calls debugCall32 (or whichever one) and hits the SIGINT, then that's probably where something is going wrong.

I do expect that and it's possible that this is the problem (although I'm also seeing different failures that I can't explain like this, so maybe there are multiple problems? weird). If making everything run on the same thread is complicated I'm ok with signalling the starting goroutine explicitly in one of the ways I was describing in https://go.dev/cl/229299.

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

Thanks for confirming.

If making everything run on the same thread is complicated

Nah, I don't think it'd be too complicated. I think this just needs to prevent preemption until we get to riiiight before the dispatch. The dispatch call is nosplit, so we just need to get there without calling into the scheduler. I'll send a patch to try out shortly.

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

Oh, also, is this relatively easy to reproduce for you? Is it as simple as just running that Delve test in a loop a bunch of times?

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

Here's an attempt to close off the possible preemption points: https://go.dev/cl/515637

@aarzilli
Contributor Author

aarzilli commented Aug 3, 2023

Oh, also, is this relatively easy to reproduce for you? Is it as simple as just running that Delve test in a loop a bunch of times?

Yes. Just go test -run TestCallFunction github.com/go-delve/delve/pkg/proc in a loop. It usually fails within 20 iterations.

Here's an attempt to close off the possible preemption points: https://go.dev/cl/515637

I'll test it tomorrow morning.

@gopherbot
Contributor

Change https://go.dev/cl/515637 mentions this issue: runtime: stay non-preemptible until dispatch in debugcall

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

I can reproduce, trying out the change...

@mknyszek
Contributor

mknyszek commented Aug 3, 2023

I think it might fix it? I'm having a hard time telling because I'm also seeing a failure involving main.regabistacktest and I'm trying to confirm I didn't introduce that with the change.

EDIT: I can make that main.regabistacktest failure happen without the change, so I don't think it's new. I haven't yet reproduced the "could not recover call injection state" issue with the change applied.

@mknyszek mknyszek self-assigned this Aug 9, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Aug 9, 2023
@cherrymui
Member

@mknyszek any updates on this? This is a release blocker. As @aarzilli commented on the CL, it seems not completely fixed. Thanks.

@mknyszek
Contributor

mknyszek commented Sep 6, 2023

Oh wow, I completely missed that. My apologies. I'll take another look. Thanks.

@mknyszek
Contributor

mknyszek commented Sep 6, 2023

@aarzilli I updated the CL to lockOSThread. Please give it a shot. I tried to break the test you pointed me at locally with stress2 and haven't been able to.

@aarzilli
Contributor Author

aarzilli commented Sep 6, 2023

@mknyszek I haven't tried patchset 3 yet but if it's equivalent to patchset 2 with lockOSThread(); defer unlockOSThread() added at the start of debugCallWrap then it works. I'll try it tomorrow.

@mknyszek
Contributor

mknyszek commented Sep 6, 2023

Patchset 3 just changes the commit message, so yay I think that means we have a fix. :)

gopherbot pushed a commit that referenced this issue Sep 7, 2023
Right now debuggers like Delve rely on the new goroutine created to run
a debugcall function to run on the same thread it started on, up until
it hits itself with a SIGINT as part of the debugcall protocol.

That's all well and good, except debugCallWrap1 isn't particularly
careful about not growing the stack. For example, if the new goroutine
happens to have a stale preempt flag, then it's possible a stack growth
will cause a roundtrip into the scheduler, possibly causing the
goroutine to switch to another thread.

Previous attempts to just be more careful around debugCallWrap1 were
helpful, but insufficient. This change takes everything a step further
and always locks the debug call goroutine and the new goroutine it
creates to the OS thread.

For #61732.

Change-Id: I038f3a4df30072833e27e6a5a1ec01806a32891f
Reviewed-on: https://go-review.googlesource.com/c/go/+/515637
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alessandro Arzilli <alessandro.arzilli@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@mknyszek
Contributor

mknyszek commented Sep 7, 2023

@aarzilli @cherrymui Do you think this is worth backporting? I think it meets the bar since it's low-risk (it just impacts a specific part of the debugging experience) and the change is fairly small.

@mknyszek mknyszek closed this as completed Sep 7, 2023
@aarzilli
Contributor Author

aarzilli commented Sep 7, 2023

In the past the entire debugCallV1 implementation for arm64 was backported, IIRC.

@mknyszek
Contributor

mknyszek commented Sep 7, 2023

Ha, OK. Backport it is.

@gopherbot Please open a backport issue for Go 1.21. This issue causes significant debugger friction due to an issue that was latent in earlier releases but only really started cropping up in Go 1.21 due to a scheduler change in that release. The fix is very low risk, since it only impacts the debugger function call protocol, which should have no impact on production applications. The fix is also fairly small and straightforward.

@gopherbot
Contributor

Backport issue(s) opened: #62509 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Contributor

Change https://go.dev/cl/526576 mentions this issue: [release-branch.go1.21] runtime: always lock OS thread in debugcall

gopherbot pushed a commit that referenced this issue Sep 25, 2023
(commit message identical to the commit pushed above)

For #61732.
Fixes #62509.

Change-Id: I038f3a4df30072833e27e6a5a1ec01806a32891f
Reviewed-on: https://go-review.googlesource.com/c/go/+/515637
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alessandro Arzilli <alessandro.arzilli@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
(cherry picked from commit d9a4b24)
Reviewed-on: https://go-review.googlesource.com/c/go/+/526576
bradfitz pushed a commit to tailscale/go that referenced this issue Sep 25, 2023
(cherry-pick of the same commit; message and trailers identical to the commit above, with the issue references written as golang#61732 and golang#62509)