Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: TestMorestack failures aix-ppc64 builder #38567

Closed
bcmills opened this issue Apr 21, 2020 · 12 comments
Closed

runtime/pprof: TestMorestack failures aix-ppc64 builder #38567

bcmills opened this issue Apr 21, 2020 · 12 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills
Copy link
Member

bcmills commented Apr 21, 2020

2020-04-21T11:41:40-664d270/aix-ppc64
2020-04-21T08:08:49-17fbc81/aix-ppc64
2020-04-20T22:42:49-2edd351/aix-ppc64

This failure mode started very recently, so it could perhaps be due to CL 228886 (CC @josharian @hyangah). Tentatively marking as release-blocker because this appears to be a regression in 1.15.

#38316 may be related (CC @trex58, @Helflym).

--- FAIL: TestMorestack (6.42s)
    pprof_test.go:394: total 6 CPU profile samples collected:
        1: 0x100032320 (runtime.syscall1:221) 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x1000416b3 (runtime.goexit0:2942) 0x10006b3a7 (runtime.mcall:202)
        
        3: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
        1: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x1000416b3 (runtime.goexit0:2942) 0x10006b3a7 (runtime.mcall:202)
        
        1: 0x10006e450 (asmsyscall6:52) 0x10006d9a7 (runtime.asmcgocall:633)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 200ms duration
    pprof_test.go:394: total 11 CPU profile samples collected:
        11: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 400ms duration
    pprof_test.go:394: total 23 CPU profile samples collected:
        23: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 800ms duration
    pprof_test.go:394: total 46 CPU profile samples collected:
        46: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 1.6s duration
    pprof_test.go:394: total 92 CPU profile samples collected:
        91: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
        1: 0x100032ca3 (runtime.usleep:535) 0x10004557f (runtime.sysmon:4561) 0x10003c173 (runtime.mstart1:1150) 0x10003c05b (runtime.mstart:1115)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 3.2s duration
    pprof_test.go:394: total 184 CPU profile samples collected:
        184: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime.newstack,runtime/pprof.growstack: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime.newstack,runtime/pprof.growstack has 0 samples out of 0, want at least 1, ideally 0
--- FAIL: TestCPUProfileLabel (6.42s)
    pprof_test.go:394: total 5 CPU profile samples collected:
        5: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 200ms duration
    pprof_test.go:394: total 11 CPU profile samples collected:
        11: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 400ms duration
    pprof_test.go:394: total 22 CPU profile samples collected:
        22: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 800ms duration
    pprof_test.go:394: total 46 CPU profile samples collected:
        46: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 1.6s duration
    pprof_test.go:394: total 90 CPU profile samples collected:
        90: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
    pprof_test.go:323: retrying with 3.2s duration
    pprof_test.go:394: total 185 CPU profile samples collected:
        185: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)
        
    pprof_test.go:429: runtime/pprof.cpuHogger;key=value: 0
    pprof_test.go:432: no samples in expected functions
    pprof_test.go:442: runtime/pprof.cpuHogger;key=value has 0 samples out of 0, want at least 1, ideally 0
FAIL
FAIL	runtime/pprof	21.230s
@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Apr 21, 2020
@bcmills bcmills added this to the Go1.15 milestone Apr 21, 2020
@josharian
Copy link
Contributor

Are atomic memory operations implemented using syscalls on aix/ppc64?

Almost every single sample collected has the same stack trace. It’s sitting in a semaphore syscall deep in the scheduler.

I don’t really see how the pprof CL could have caused that, but I’ll revert it to see experimentally whether it fixes it.

pprof_test.go:394: total 184 CPU profile samples collected:
        184: 0x10003343b (runtime.sem_wait:574 runtime.semasleep:75) 0x10000ad0b (runtime.notesleep:181) 0x10003da5f (runtime.stopm:1881) 0x10003f493 (runtime.findrunnable:2413) 0x100040513 (runtime.schedule:2613) 0x100040b57 (runtime.park_m:2783) 0x10006b3a7 (runtime.mcall:202)

@josharian
Copy link
Contributor

I wonder whether we have somehow ended up with the profiling signals synchronizing with something else in the runtime, so that instead of getting a good random sample we are instead always sampling the scheduler.

@josharian
Copy link
Contributor

https://golang.org/cl/228886 has been reverted. Let’s see whether that fixes the builders...

@Helflym
Copy link
Contributor

Helflym commented Apr 22, 2020

Are atomic memory operations implemented using syscalls on aix/ppc64?

If you're speaking of sysMap and similar functions, yes there are using syscalls. However, these syscalls are using asmcgocall directly without a lock, so I don't think there is any semaphores involved in.

@dmitshur dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 14, 2020
@toothrot toothrot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2020
@toothrot
Copy link
Contributor

@bcmills @josharian Can you verify if reverting the CL resolved this issue? Thanks!

@aclements
Copy link
Member

greplogs -dashboard -E "FAIL: TestMorestack" -md shows the last failure was 2020-04-24, which was unfortunately after the CL 228886 revert, though it was a while ago.

$ greplogs -dashboard -E "FAIL: TestMorestack" -l | grep aix | findflakes -paths
First observed 2edd351 20 Apr 22:42 2020 (487 commits ago)
Last observed  82f2989 24 Apr 08:21 2020 (410 commits ago)
0.00% chance failure is still happening
3.9% failure probability (4 of 78 commits)
Likely culprits:
    4% 2edd351 runtime: skip TestBigGOMAXPROCS if it runs out of memory
    4% 0cffc95 Revert "cmd/compile: adjust RISCV64 rewrite rules to use typed aux fields"
    4% 75e79ad cmd/api: limit concurrent 'go list' calls to GOMAXPROCS
    3% 40a144b crypto/tls: add Dialer
    3% 7004be9 cmd/compile: adjust RISCV64 rewrite rules to use typed aux fields
    3% 0239a5c cmd/compile: use fuse to implement shortcircuit loop
    3% c9af552 database/sql: on Tx rollback, retain connection if driver can reset session
    3% d8f0a22 database/sql: prevent Tx statement from committing after rollback
    3% b2cff7e database/sql: check conn expiry when returning to pool, not when handing it out
    3% f8ff12d cmd/compile: use dereference boundedness hint in ssa.addr
No known past failures

@aclements
Copy link
Member

No failures since 2020-04-24. I looked more closely at 2020-04-24T08:21:27-82f2989/aix-ppc64, which happened after CL 228886 was reverted. That failure might actually be different: almost all of the samples are in sysmon -> usleep, whereas in the previous failures before 228886 was reverted, most samples were in stopm -> sem_wait.

@aclements
Copy link
Member

@josharian , since I'm not particularly familiar with the likely culprit CL, could you look at the 2020-04-24T08:21:27-82f2989/aix-ppc64 failure. Does it look like an unrelated flake?

@josharian
Copy link
Contributor

@aclements I was mystified as to how the likely culprit CL caused the failures in the first place, so I am probably not the right person to ask. :)

@trex58
Copy link

trex58 commented Jun 23, 2020

Helflym, who helps for the maintenance of Go on AIX, is in vacations and will be back 6th of July.

@aclements
Copy link
Member

aclements commented Jun 25, 2020

Thanks. Since AIX isn't a first-class port, I'm going to drop the release-blocker tag on this.

@ianlancetaylor
Copy link
Contributor

No failures since April. findflakes estimates a 0.00% change that this is still happening. I'm calling this one fixed.

@golang golang locked and limited conversation to collaborators Jun 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants