Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestCrashDumpsAllThreads flakes #35356

Closed
FiloSottile opened this issue Nov 4, 2019 · 12 comments
Closed

runtime: TestCrashDumpsAllThreads flakes #35356

FiloSottile opened this issue Nov 4, 2019 · 12 comments

Comments

@FiloSottile FiloSottile added this to the Go1.14 milestone Nov 4, 2019
@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Nov 6, 2019

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 6, 2019

My current guess is that this has changed due to signal preemption. The test only works if each running M is hit with a SIGQUIT while it is running a tight loop. The test is checking that the stack trace for each thread shows the loop. If the thread is preempted when it receives the SIGQUIT, it will instead show a track trace in the runtime with the thread stopped at gcstopm.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 6, 2019

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Nov 7, 2019

Change https://golang.org/cl/205720 mentions this issue: runtime: set GODEBUG=asyncpreemptoff=1 in TestCrashDumpsAllThreads

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Nov 7, 2019

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Nov 7, 2019

If the SIGQUIT lands right after asyncPreempt2 switches to the system stack, but before CASing the G status out of _Grunning, I guess this can result in goroutine running on other thread; stack unavailable on the G and also not being trace backed on the M (as it is running on its G0)? (But the failure stack traces don't look like this.)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 7, 2019

The test is expecting to see that theoretically unpreemptible loop appears the expected number of times in the stack trace. But if the loop is preempted, then the stack trace of that M will show something else, not the loop. So the test will fail.

Non-cooperative preemption will make real code more robust, but it doesn't make this specific test more robust.

@cherrymui It's normal to see goroutine running on other thread in the first set of stack traces. The expectation of the test is that you will then see one more stack trace per thread. The test expects those additional stack traces to fill in the references to the loop. But if the M is preempted when the program dies, then we don't see a reference to the loop in its stack.

@gopherbot gopherbot closed this in 05aa4a7 Nov 7, 2019
@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Nov 7, 2019

@ianlancetaylor when the goroutine is preempted, it will show something like

        goroutine 6 [runnable]:
        runtime.asyncPreempt2()
        	/workdir/go/src/runtime/preempt.go:289 +0x62 fp=0xc000030630 sp=0xc000030610 pc=0x42ba62
        runtime.asyncPreempt()
        	/workdir/go/src/runtime/preempt_amd64.s:45 +0xdb fp=0xc0000307b8 sp=0xc000030630 pc=0x4589fb
        main.loop(0x0, 0xc00005e060)
        	/workdir/tmp/go-build869718673/main.go:36 +0x2d fp=0xc0000307d0 sp=0xc0000307b8 pc=0x48e3cd
        runtime.goexit()
        	/workdir/go/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc0000307d8 sp=0xc0000307d0 pc=0x4570d1
        created by main.main
        	/workdir/tmp/go-build869718673/main.go:17 +0xac

The test counts the number of "main.loop" appearances, so this is counted correctly. On that thread, the additional stack trace (what I meant by saying "being trace backed on the M") will show that M's G0 stack, which does not include main.loop.

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Nov 7, 2019

So, I think,

  • if the goroutine is not preempted, the first set of stack traces will have goroutine running on other thread, and the additional stack trace will show one with main.loop.
  • if the goroutine is preempted, the first set of stack traces will show main.loop (by tracing back through runtime.asyncPreempt), and the additional stack trace will show its G0 stack, which does not include main.loop.

So either way, it should be counted as 1 appearance (except in the case that the SIGQUIT lands in the small window I mentioned in my previous comment).

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 7, 2019

I think that what is happening when the test fails is that 1) the signal is sent; 2) the goroutine is reported as goroutine running on other thread; 3) the goroutine is preempted and the M enters schedule; 4) the M that was running the goroutine is signaled; 5) the M reports a stack trace up through schedule; 6) nothing reports main.loop because that goroutine was already reported in step 2 and the M is now doing something else.

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Nov 7, 2019

@ianlancetaylor I thing you're right. This can result in a missing main.loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.