runtime: scheduler change causes Delve's function call injection to fail intermittently #61732

Description

@aarzilli

Starting with change ad94306 (https://go.dev/cl/501976) I see frequent (but intermittent) failures of TestCallFunction in github.com/go-delve/delve/pkg/proc. The failure has a few different presentations; a frequent one is:

    variables_test.go:1353: call "getAStruct(3).VRcvr(1)"
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc function call initiated <nil> frame size 65535 goroutine 1 (thread 550064)
2023-08-03T10:27:16+02:00 info kind=fncall,layer=proc findCallInjectionStateForThread thread=550069 goroutine=59
    variables_test.go:1367: call "getAStruct(3).VRcvr(1)": error "could not recover call injection state for goroutine 59 (thread 550069)"
--- FAIL: TestCallFunction (0.53s)

Produced by passing -v -log=fncall to TestCallFunction.
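
For reference, the log above came from an invocation along these lines, run against a github.com/go-delve/delve checkout (the exact flag plumbing is whatever the test package defines, so take this as an approximation rather than the verbatim command):

    go test -run TestCallFunction -v -log=fncall github.com/go-delve/delve/pkg/proc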

Delve keeps a map from goroutine IDs to call injection states. Whenever it stops at a breakpoint inside one of the debugCall functions (runtime.debugCallV2, debugCall32, &c.) it reads the goroutine ID and searches the map for the corresponding call injection state. If none is found, it looks for a call injection state that has just been started on the same thread, as sketched below.
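
As an illustration only, here is a minimal sketch of that two-step lookup; the names (callInjection, debugger, and the exact signature of findCallInjectionStateForThread) are hypothetical and this is not Delve's actual code:

```go
package main

import "fmt"

// callInjection is a hypothetical stand-in for Delve's per-call injection state.
type callInjection struct {
	startThreadID int  // OS thread the injection was started on
	started       bool // protocol initiated, but no debugCall stop seen yet
}

// debugger holds injection states keyed by goroutine ID.
type debugger struct {
	injections map[int64]*callInjection
}

// findCallInjectionStateForThread mirrors the two-step search described above:
// first by the goroutine ID observed at the stop, then, as a fallback, by the
// OS thread the stop happened on.
func (d *debugger) findCallInjectionStateForThread(goroutineID int64, threadID int) (*callInjection, error) {
	if ci, ok := d.injections[goroutineID]; ok {
		return ci, nil
	}
	// Fallback: an injection that has just been started on this same thread is
	// assumed to belong to the goroutine the runtime spawned for the call.
	for _, ci := range d.injections {
		if ci.started && ci.startThreadID == threadID {
			return ci, nil
		}
	}
	return nil, fmt.Errorf("could not recover call injection state for goroutine %d (thread %d)", goroutineID, threadID)
}

func main() {
	d := &debugger{injections: map[int64]*callInjection{
		// Injection initiated by goroutine 1 on thread 550064.
		1: {startThreadID: 550064, started: true},
	}}
	// Expected case: the runtime runs the injected call on a new goroutine (59)
	// but on the same thread, so the thread fallback recovers the state.
	ci, err := d.findCallInjectionStateForThread(59, 550064)
	fmt.Println(ci != nil, err) // true <nil>
}
```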

The assumption here is that when the runtime spawns a new goroutine to handle the call injection, it will initially run it on the same thread. This was agreed upon in https://go-review.googlesource.com/c/go/+/229299:

More generally, though, it's guaranteed to keep the whole call injection protocol on one OS thread and it won't let anything else happen on that thread during the protocol, so if you know what G you're on when you start the injection and associate that with the OS thread, you can check which G it's on when it reaches debugCallNNNN and you know they're tied together.

The logs above suggest that this invariant is being violated: the injection starts on thread 550064 (goroutine 1), but the first stop is on thread 550069 (goroutine 59).
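
To make the mismatch concrete, here is a tiny, hypothetical check of the guarantee quoted above (the names are made up; the thread and goroutine numbers are the ones from the log):

```go
package main

import "fmt"

// injectionStart records where a call injection protocol was started.
type injectionStart struct {
	goroutineID int64
	threadID    int
}

// stayedOnThread checks the quoted guarantee: the whole protocol should remain
// on the OS thread it started on, even though the runtime may run the injected
// call on a different goroutine.
func stayedOnThread(start injectionStart, stopThreadID int) bool {
	return stopThreadID == start.threadID
}

func main() {
	start := injectionStart{goroutineID: 1, threadID: 550064}
	// First debugCall stop from the log: goroutine 59 on thread 550069.
	fmt.Println(stayedOnThread(start, 550069)) // false: the guarantee did not hold
}
```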

I'm not 100% sure that the bug was introduced by this commit: I think I saw a similar failure in CI before, but I could never reproduce it locally (or see it frequently enough in CI). It could be that the bug already existed and this change made it orders of magnitude more likely.

cc @prattmic @mknyszek @aclements


Labels

FrozenDueToAge, NeedsInvestigation, compiler/runtime, release-blocker
