
runtime: TestCgoCCodeSIGPROF failure due to hang in subprocess #53648

Open
bcmills opened this issue Jul 1, 2022 · 8 comments
Labels: NeedsInvestigation
bcmills commented Jul 1, 2022

https://storage.googleapis.com/go-build-log/4efa4b08/linux-arm-aws_aaf94adf.log (a linux-arm-aws TryBot on unrelated CL 411397):

--- FAIL: TestCgoCCodeSIGPROF (120.01s)
    testenv.go:460: [/workdir/tmp/go-build3722395938/testprogcgo.exe CgoCCodeSIGPROF] exit status: exit status 2
    crash_cgo_test.go:215: expected "OK\n" got SIGQUIT: quit
        PC=0x83aec m=2 sigcode=0
        
        goroutine 0 [idle]:
        runtime.usleep()
        	/workdir/go/src/runtime/sys_linux_arm.s:627 +0x2c fp=0xe7a50d94 sp=0xe7a50d84 pc=0x83aec
        runtime.sysmon()
        	/workdir/go/src/runtime/proc.go:5145 +0xcc fp=0xe7a50df0 sp=0xe7a50d94 pc=0x5e53c
        runtime.mstart1()
        	/workdir/go/src/runtime/proc.go:1425 +0x7c fp=0xe7a50e00 sp=0xe7a50df0 pc=0x549dc
        runtime.mstart0()
        	/workdir/go/src/runtime/proc.go:1382 +0x68 fp=0xe7a50e14 sp=0xe7a50e00 pc=0x54944
        runtime.mstart()
        	/workdir/go/src/runtime/asm_arm.s:210 +0x8 fp=0xe7a50e18 sp=0xe7a50e14 pc=0x80bb0
        
        goroutine 1 [chan receive]:
        runtime.gopark(0x1601c0, 0x24800b0, 0xe, 0x17, 0x2)
        	/workdir/go/src/runtime/proc.go:363 +0x100 fp=0x2436ec0 sp=0x2436eac pc=0x52028
        runtime.chanrecv(0x2480080, 0x0, 0x1)
        	/workdir/go/src/runtime/chan.go:583 +0x530 fp=0x2436f0c sp=0x2436ec0 pc=0x18788
        runtime.chanrecv1(0x2480080, 0x0)
        	/workdir/go/src/runtime/chan.go:442 +0x20 fp=0x2436f20 sp=0x2436f0c pc=0x18228
        main.CgoCCodeSIGPROF()
        	/workdir/go/src/runtime/testdata/testprogcgo/aprof.go:52 +0xb4 fp=0x2436f50 sp=0x2436f20 pc=0x122644
        main.main()
        	/workdir/go/src/runtime/testdata/testprogcgo/main.go:34 +0x158 fp=0x2436fc0 sp=0x2436f50 pc=0x121a00
        runtime.main()
        	/workdir/go/src/runtime/proc.go:250 +0x26c fp=0x2436fec sp=0x2436fc0 pc=0x51b60
        runtime.goexit()
        	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x2436fec sp=0x2436fec pc=0x828a8
        
        goroutine 2 [force gc (idle)]:
        runtime.gopark(0x1602a8, 0x2497e8, 0x11, 0x14, 0x1)
        	/workdir/go/src/runtime/proc.go:363 +0x100 fp=0x2428fd8 sp=0x2428fc4 pc=0x52028
        runtime.goparkunlock(...)
        	/workdir/go/src/runtime/proc.go:369
        runtime.forcegchelper()
        	/workdir/go/src/runtime/proc.go:302 +0xc4 fp=0x2428fec sp=0x2428fd8 pc=0x51e8c
        runtime.goexit()
        	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x2428fec sp=0x2428fec pc=0x828a8
        created by runtime.init.5
        	/workdir/go/src/runtime/proc.go:290 +0x1c
        
        goroutine 3 [GC sweep wait]:
        runtime.gopark(0x1602a8, 0x249908, 0xc, 0x14, 0x1)
        	/workdir/go/src/runtime/proc.go:363 +0x100 fp=0x24297d0 sp=0x24297bc pc=0x52028
        runtime.goparkunlock(...)
        	/workdir/go/src/runtime/proc.go:369
        runtime.bgsweep(0x241c040)
        	/workdir/go/src/runtime/mgcsweep.go:278 +0x94 fp=0x24297e4 sp=0x24297d0 pc=0x3a4a0
        runtime.gcenable.func1()
        	/workdir/go/src/runtime/mgc.go:178 +0x28 fp=0x24297ec sp=0x24297e4 pc=0x2ab74
        runtime.goexit()
        	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x24297ec sp=0x24297ec pc=0x828a8
        created by runtime.gcenable
        	/workdir/go/src/runtime/mgc.go:178 +0x74
        
        goroutine 4 [GC scavenge wait]:
        runtime.gopark(0x1602a8, 0x249aa8, 0xd, 0x14, 0x2)
        	/workdir/go/src/runtime/proc.go:363 +0x100 fp=0x2429fb4 sp=0x2429fa0 pc=0x52028
        runtime.goparkunlock(...)
        	/workdir/go/src/runtime/proc.go:369
        runtime.(*scavengerState).park(0x249aa8)
        	/workdir/go/src/runtime/mgcscavenge.go:389 +0x64 fp=0x2429fc8 sp=0x2429fb4 pc=0x37f30
        runtime.bgscavenge(0x241c040)
        	/workdir/go/src/runtime/mgcscavenge.go:617 +0x3c fp=0x2429fe4 sp=0x2429fc8 pc=0x3865c
        runtime.gcenable.func2()
        	/workdir/go/src/runtime/mgc.go:179 +0x28 fp=0x2429fec sp=0x2429fe4 pc=0x2ab20
        runtime.goexit()
        	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x2429fec sp=0x2429fec pc=0x828a8
        created by runtime.gcenable
        	/workdir/go/src/runtime/mgc.go:179 +0xbc
        
        goroutine 18 [finalizer wait]:
        runtime.gopark(0x1602a8, 0x25b0c0, 0x10, 0x14, 0x1)
        	/workdir/go/src/runtime/proc.go:363 +0x100 fp=0x242878c sp=0x2428778 pc=0x52028
        runtime.goparkunlock(...)
        	/workdir/go/src/runtime/proc.go:369
        runtime.runfinq()
        	/workdir/go/src/runtime/mfinal.go:180 +0x108 fp=0x24287ec sp=0x242878c pc=0x29bb4
        runtime.goexit()
        	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x24287ec sp=0x24287ec pc=0x828a8
        created by runtime.createfing
        	/workdir/go/src/runtime/mfinal.go:157 +0x54
        
        goroutine 19 [running]:
        	goroutine running on other thread; stack unavailable
        created by main.CgoCCodeSIGPROF
        	/workdir/go/src/runtime/testdata/testprogcgo/aprof.go:35 +0x74
        
        goroutine 20 [running]:
        	goroutine running on other thread; stack unavailable
        created by runtime/pprof.StartCPUProfile
        	/workdir/go/src/runtime/pprof/pprof.go:783 +0x134
        
        trap    0x0
        error   0x0
        oldmask 0x0
        r0      0xfffffffc
        r1      0x0
        r2      0x0
        r3      0x2710
        r4      0x0
        r5      0x4
        r6      0x240e020
        r7      0xa2
        r8      0x6
        r9      0x1
        r10     0x24025a0
        fp      0x2710
        ip      0x7
        sp      0xe7a50d84
        lr      0x83acc
        pc      0x83aec
        cpsr    0x20870010
        fault   0x0
FAIL
FAIL	runtime	140.903s

(attn @golang/runtime; CC @golang/arm)

greplogs -l -e 'FAIL: TestCgoCCodeSIGPROF .*(?:\n\s+.*)*SIGQUIT' --since=2022-01-01

(0 matching logs)

bcmills commented Jul 1, 2022

There aren't any of these in the dashboard logs, which suggests that this failure mode is either new or rare. However, CL 412474 is very recent and looks like it could plausibly be related (attn @cherrymui @aclements).

@bcmills bcmills added the NeedsInvestigation label Jul 1, 2022

bcmills commented Jul 1, 2022

linux/arm is a first-class port, so marking as release-blocker until we can determine whether this is a regression in Go 1.19.

(If it is determined unlikely to be a regression, the failure mode may be rare enough to leave on the Backlog for now.)

@bcmills bcmills added this to the Go1.19 milestone Jul 1, 2022

bcmills commented Jul 1, 2022

(#38333 may or may not be related.)


cherrymui commented Jul 1, 2022

CL 412474 shouldn't be related. It is specific to ARM64, and doesn't touch ARM(32) at all.


joedian commented Jul 6, 2022

Not an RC blocker. Austin is reviewing.

aclements commented Jul 6, 2022

I'm not sure we can do much from this traceback.

Goroutine 19 is in a long loop that does a no-op cgo call and is supposed to check the time and exit after 1 second. Given that the traceback says it's running on another thread, I think that means we got the SIGQUIT while in Go code (though I'm not 100% positive about that). There are a few possibilities:

  1. A timing blip held the test up until it timed out. It's supposed to take 1 second and we're giving it 1 minute, so this seems unlikely. On the other hand, this is a virtualized platform, so it's possible there was some very long, very rare descheduling.
  2. The runtime wedged in cgo call or return. We would at least need more traceback or a core dump to make any progress on this possibility.
  3. Something went wrong with the time comparison. This should be using monotonic time, so this seems very unlikely.
  4. A miscompilation caused us to miss the check. Given that this is the only time we've seen the failure, I think we can rule this out.

Given the rareness of this, I think we can move this to backlog. We could add more debugging in case this happens again, but it's not totally clear to me what we would add that would be really valuable. We have general plans to get core files from failures like this, which I think is what we really want to debug this.

@aclements aclements modified the milestones: Go1.19, Backlog Jul 6, 2022

bcmills commented Jul 14, 2022

bcmills commented Jul 14, 2022

> A miscompilation caused us to miss the check. Given that this is the only time we've seen the failure, I think we can rule this out.

With two failures in two weeks, are we still confident in ruling out a miscompilation or other regression?

> The runtime wedged in cgo call or return. We would at least need more traceback or a core dump to make any progress on this possibility.

If that's the case, would it make sense to skip the test on linux/arm until we're equipped to collect core files?
(Or, given the new information about the failure rate, does it seem plausible to obtain a core by running a gomote swarm?)
