Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws #50097

Closed
bcmills opened this issue Dec 10, 2021 · 8 comments
Closed

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 10, 2021

Seen on a linux-arm64-aws TryBot for unrelated CL 370666 (https://storage.googleapis.com/go-build-log/a297f68d/linux-arm64-aws_04471a78.log):

--- FAIL: TestCPUProfileMultithreadMagnitude (0.66s)
    pprof_test.go:127: Running on Linux 4.19.0
    pprof_test.go:526: total 59 CPU profile samples collected:
        5: 0x7566c (runtime.epollwait:731) 0x3f23f (runtime.netpoll:126) 0x4b013 (runtime.findrunnable:2956) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        8: 0x138d10 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x52680 (runtime.pMask.set:5749 runtime.pidleput:5815) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        2: 0x138cfc (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog1:55) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139d83 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.1:176) 0x148d83 (runtime/pprof.diffCPUTimeRUsage:27) 0x13aca7 (runtime/pprof.diffCPUTime:485) 0x139a5f (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2:174) 0x13ab07 (runtime/pprof.testCPUProfile:453) 0x13965b (runtime/pprof.TestCPUProfileMultithreadMagnitude:173) 0xd439f (testing.tRunner:1410) labels: map[]
        
        1: 0x4b02b (runtime.nanotime:19 runtime.findrunnable:2958) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
        31: 0x138d90 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x74b30 (runtime.asyncPreempt:7) 0x138d7b (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        2: 0x74b30 (runtime.asyncPreempt:7) 0x138d8f (runtime/pprof.cpuHog0:62 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        7: 0x138d7c (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x138c57 (runtime/pprof.cpuHogger:41) 0x139c6b (runtime/pprof.TestCPUProfileMultithreadMagnitude.func2.2.1:188) labels: map[]
        
        1: 0x526d8 (runtime.pidleput:5818) 0x4adb7 (runtime.findrunnable:2863) 0x4c497 (runtime.schedule:3376) 0x4ca13 (runtime.park_m:3525) 0x71be3 (runtime.mcall:186) labels: map[]
        
    pprof_test.go:597: runtime/pprof.cpuHog1: 10
    pprof_test.go:597: runtime/pprof.cpuHog3: 41
    --- FAIL: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof (0.00s)
        pprof_test.go:214: compare 183.697ms vs 100ms
        pprof_test.go:226: CPU usage reports are too different (limit -40.0%, got -45.6%)
    pprof_test.go:130: Failure of this test may indicate that your system suffers from a known Linux kernel bug fixed on newer kernels. See https://golang.org/issue/49065.
FAIL
FAIL	runtime/pprof	5.806s

Marking as release-blocker for Go 1.18 because this test is new (as of CL 324129).
(CC @prattmic @rhysh)

@rhysh
Copy link
Contributor

@rhysh rhysh commented Dec 10, 2021

The number of samples in the serial portion (cpuHog1) are spot-on: 10ms per sample, 100ms wall clock, 10 samples. The number of samples in the parallel portion (cpuHog3) are very close to the goal: 10ms per sample, 100ms wall clock, GOMAXPROCS=4 (this builder runs on a m6g.xlarge, 4 cores and no hyperthreading), 41 samples vs an expected 40.

The failure is in the comparison of the serial portion. The profile reports 100ms, but rusage(2) reports 184ms. So this looks like another source of "extra/background/non-application CPU time that the runtime expends". (Earlier flakes were from GC.)

Of the 59 total samples, taking out the 10+41 that are in the CPU hog functions the test expects, the remaining 8 are all in findrunnable. I'm surprised to see any there, and I'm not sure what to make of it.

Of those, five point at the PC right after the epollwait syscall:

MOVW R0, ret+24(FP)

Two are at atomics:
atomic.Or(&p[word], mask)
atomic.Xadd(&sched.npidle, 1) // TODO: fast atomic

One is at a function call boundary, either before or after calling nanotime1:

return nanotime1()

@prattmic prattmic changed the title runtime/pprof: TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof failure with "CPU usage reports are too different" on linux-arm64-aws runtime/pprof: TestCPUProfileMultithreadMagnitude rusage too high on linux-arm64-aws Dec 16, 2021
@gopherbot
Copy link

@gopherbot gopherbot commented Dec 16, 2021

Change https://golang.org/cl/372800 mentions this issue: runtime/pprof: run TestCPUProfileMultithreadMagnitude subtests separately

gopherbot pushed a commit that referenced this issue Jan 10, 2022
…tely

Currently TestCPUProfileMultithreadMagnitude runs two CPU consumption
functions in a single profile and then analyzes the results as separate
subtests.

This works fine, but when debugging failures it makes manual analysis of
the profile dump a bit annoying.

Refactor the test to collect separate profiles for each subtest for
easier future analysis.

For #50097.
For #50232.

Change-Id: Ia1c8bb86aaaf652e64c5e660dcc2da47d2194c2b
Reviewed-on: https://go-review.googlesource.com/c/go/+/372800
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Bryan Mills <bcmills@google.com>
@heschi
Copy link
Contributor

@heschi heschi commented Jan 12, 2022

Was the change above intended to fix the test, or just improve debuggability? This is marked as a release blocker.

@prattmic
Copy link
Member

@prattmic prattmic commented Jan 12, 2022

Just improve debuggability. Fixing this still needs more investigation, which the CL helps with.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 12, 2022

Given that the rate of builder runs has slowed down for the code freeze, and given that these failures appear to be detecting a mostly-successful-but-incomplete fix, would it make sense to add a testenv.SkipFlaky for this test until the tree reopens for 1.19 and then pick it up from there?

@prattmic
Copy link
Member

@prattmic prattmic commented Jan 14, 2022

I ran this test for several hours on a linux-arm64-aws gomote and was unable to reproduce any failures.

I think that adding a temporary skip would be OK given that as @rhysh's mentions, this appears to be unexpected background CPU, rather than an issue with profiling itself. That said, I don't have any particular reason to believe this issue would be limited to arm64, so I'm not sure what constraints to put on the skip.

@rhysh right now the test compares time in runtime/pprof.cpuHog1 samples vs rusage. What would you think about comparing the total sample time (i.e., any samples) vs rusage. It seems that should still cover the core of this test, which is whether we receive enough samples to cover all CPU usage and I believe should solve these kinds of issues with background runtime CPU.

@toothrot
Copy link
Contributor

@toothrot toothrot commented Jan 19, 2022

Any updates on this, @prattmic? Checking in as it's a release-blocker for Go 1.18.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2022

Change https://golang.org/cl/379535 mentions this issue: runtime/pprof: compare all samples vs rusage in TestCPUProfileMultithreadMagnitude

@gopherbot gopherbot closed this in d1640d8 Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants