Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "invalid traceback origin" in TestCgoPprofThread on linux/amd64 #43174

Closed
bcmills opened this issue Dec 14, 2020 · 7 comments
Closed

runtime: "invalid traceback origin" in TestCgoPprofThread on linux/amd64 #43174

bcmills opened this issue Dec 14, 2020 · 7 comments
Labels
NeedsInvestigation okay-after-beta1 release-blocker
Milestone

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 14, 2020

2020-12-11T17:26:14-1341a3d/linux-amd64-wsl

--- FAIL: TestCgoPprofThread (0.76s)
    crash_cgo_test.go:335: [/tmp/workdir-host-linux-amd64-wsl/go/bin/go tool pprof -traces /tmp/workdir-host-linux-amd64-wsl/tmp/go-build1643386062/testprogcgo_.exe /tmp/workdir-host-linux-amd64-wsl/tmp/prof2131107081]:
        File: testprogcgo_.exe
        Build ID: 5bebc0f29d333075b2d2cb6ad739468825910989
        Type: cpu
        Time: Dec 12, 2020 at 1:35am (CST)
        Duration: 200.36ms, Total samples = 110ms (54.90%)
        -----------+-------------------------------------------------------
              10ms   cpuHogThread
                     cpuHogThread2
                     runtime.cgocall
                     main._Cfunc_runCPUHogThread
                     main.pprofThread
                     main.CgoPprofThread
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
             100ms   cpuHogThread
                     cpuHogThread2
        -----------+-------------------------------------------------------
    crash_cgo_test.go:347: invalid traceback origin: got=[cpuHogThread cpuHogThread2 runtime.cgocall main._Cfunc_runCPUHogThread main.pprofThread main.CgoPprofThread main.main runtime.main]; want=[cpuHogThread ... cpuHogThread2]
    crash_cgo_test.go:335: [/tmp/workdir-host-linux-amd64-wsl/go/bin/go tool pprof -traces /tmp/workdir-host-linux-amd64-wsl/tmp/prof2131107081]:
        File: testprogcgo_.exe
        Build ID: 5bebc0f29d333075b2d2cb6ad739468825910989
        Type: cpu
        Time: Dec 12, 2020 at 1:35am (CST)
        Duration: 200.36ms, Total samples = 110ms (54.90%)
        -----------+-------------------------------------------------------
              10ms   cpuHogThread
                     cpuHogThread2
                     runtime.cgocall
                     main._Cfunc_runCPUHogThread
                     main.pprofThread
                     main.CgoPprofThread
                     main.main
                     runtime.main
        -----------+-------------------------------------------------------
             100ms   cpuHogThread
                     cpuHogThread2
        -----------+-------------------------------------------------------
    crash_cgo_test.go:347: invalid traceback origin: got=[cpuHogThread cpuHogThread2 runtime.cgocall main._Cfunc_runCPUHogThread main.pprofThread main.CgoPprofThread main.main runtime.main]; want=[cpuHogThread ... cpuHogThread2]
FAIL
FAIL	runtime	21.528s

CC @hyangah @aclements @ianlancetaylor @mengzhuo

@bcmills bcmills added the NeedsInvestigation label Dec 14, 2020
@bcmills bcmills added this to the Backlog milestone Dec 14, 2020
@bcmills
Copy link
Member Author

@bcmills bcmills commented Sep 2, 2021

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 2, 2021

The failure rate seems to be way up since the 1.18 cycle opened, and the failures are also now affecting other linux-amd64 builders. Marking as release-blocker for Go 1.18 as a likely 1.18 regression.

greplogs --dashboard -md -l -e 'FAIL: TestCgoPprofThread .*\n(?:\s+.+\n)*.*invalid traceback origin' --since=2021-09-03

2021-11-29T00:57:09-78af02e/linux-amd64-longtest
2021-11-16T17:13:33-29ec902/linux-amd64-jessie
2021-11-11T11:00:33-a01a6d6/linux-amd64
2021-10-28T22:38:00-af05d8b/linux-amd64-fedora
2021-10-11T15:28:50-702e337/linux-amd64-wsl
2021-10-09T01:04:29-e1c294a/linux-amd64-fedora
2021-10-06T19:45:35-e38ec96/linux-amd64-longtest
2021-10-05T19:15:03-ae83301/linux-amd64-staticlockranking
2021-09-15T03:29:39-9fc2889/linux-amd64-wsl
2021-09-13T18:02:42-e74e363/linux-amd64-wsl
2021-09-04T19:33:54-28dae3d/linux-amd64-wsl

@bcmills bcmills changed the title runtime: "invalid traceback origin" in TestCgoPprofThread on linux-amd64-wsl builder runtime: "invalid traceback origin" in TestCgoPprofThread on linux/amd64 Dec 2, 2021
@bcmills bcmills added okay-after-beta1 release-blocker labels Dec 2, 2021
@bcmills bcmills removed this from the Backlog milestone Dec 2, 2021
@bcmills bcmills added this to the Go1.18 milestone Dec 2, 2021
@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Dec 6, 2021

The uptick may be related to recent changes to how CPU profiles are obtained. CC @rhysh @prattmic ?

@rhysh
Copy link
Contributor

@rhysh rhysh commented Dec 7, 2021

The timer_create changes landed on 2021-09-27. It looks like the uptick began earlier, near the start of September.

From reading the test, its exact goal isn't clear to me. It looks for stacks in the profile with cpuHogThread at the top. From that set of matching stacks, it considers whichever appears in the text-format profile first. On that single stack, it then checks whether cpuHogThread2 is at the bottom.

The failures I've spot-checked have had two unique stacks with cpuHogThread at the top: one has cpuHogThread2 at the bottom, the other has additional frames below. Is the goal of the test to confirm that [cpuHogThread ... cpuHogThread2] appears somewhere in the profile (probably yes, with the current design)? Or to confirm that whenever cpuHogThread is at the top, that nothing other than cpuHogThread2 is at the bottom?

My first guess was that there should be only one unique stack. (If the test looked for that, it might fail more reliably.) But the cgo traceback is faked:

// pprofCgoThreadTraceback is passed to runtime.SetCgoTraceback.
// For testing purposes it pretends that all CPU hits in C code are in cpuHog.
void pprofCgoThreadTraceback(void* parg) {
struct cgoTracebackArg* arg = (struct cgoTracebackArg*)(parg);
arg->buf[0] = (uintptr_t)(cpuHogThread) + 0x10;
arg->buf[1] = (uintptr_t)(cpuHogThread2) + 0x4;
arg->buf[2] = 0;
}

The goroutine that starts the profile makes a cgo call to create a C-owned thread to burn CPU time. If the Go-owned thread gets a SIGPROF while it's in C code (in the process of creating the C-owned thread), that probably results in the longer stack seen in the failing tests. From what I see, the failures don't indicate a problem for production code.

if err := pprof.StartCPUProfile(f); err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(2)
}
C.runCPUHogThread()

So it looks to me like either:

  • (probably) the test should change to pass if any of the stacks match [cpuHogThread ... cpuHogThread2]
  • or (maybe) the test should change to add a label to the goroutine that makes the first call into cgo, filter the profile to remove any samples that include that label, and then look for all of the stacks with cpuHogThread at the top to have cpuHogThread2 at the bottom.

@gopherbot
Copy link

@gopherbot gopherbot commented Dec 7, 2021

Change https://golang.org/cl/370135 mentions this issue: runtime: fix flake in TestCgoPprofThread

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Dec 7, 2021

@rhysh Oops. You're totally right. Sorry for pinging you on this, I should've checked the timeline of the pprof changes you made. In any case, thank you so much for the fix!

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 8, 2021

The cpuHogThread2 stuff was added in https://golang.org/cl/151917 for #29034. I think that ensuring that both cpuHogThread and cpuHogThread2 appear together will suffice for that issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation okay-after-beta1 release-blocker
Projects
None yet
Development

No branches or pull requests

5 participants