Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

Open
mpx opened this issue Sep 11, 2020 · 2 comments
Open

cmd/pprof: pprof provides wrong address/line for CPU profiles #41338

mpx opened this issue Sep 11, 2020 · 2 comments
Milestone

Comments

@mpx
Copy link
Contributor

@mpx mpx commented Sep 11, 2020

Using CPU profiling and go tool pprof attributes CPU to the instruction following the one that consumed the time. This can cause CPU time to be attributed to an incorrect line number. For comparison, the Linux perf command works correctly. The example below shows the time is spent in the for loop, when it is actually spent during expensive memory accesses.

AFAICT this issue has existed for a long time (at least since 1.4). I can't see this issue has been logged yet which is surprising. It occurs with recent toolchains as well:

go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64

Tested on: Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Here is a script which demonstrates the fault:

#!/bin/sh

cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof

import "testing"

var buf [1024 * 1024 * 1024]byte

func Benchmark(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Expensive memory accesses.
        buf[(2097169*i)%len(buf)] = 42
    }
}
EOF

go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | cat

And the output:

$ ./run.sh 
goos: linux
goarch: amd64
pkg: prof
Benchmark-8   	314821470	        11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]

=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
     6.34s      6.34s (flat, cum)   100% of Total
         .          .      4:import "testing"
         .          .      5:
         .          .      6:var buf [1024 * 1024 * 1024]byte
         .          .      7:
         .          .      8:func Benchmark(b *testing.B) {
     5.52s      5.52s      9:    for i := 0; i < b.N; i++ {
         .          .     10:        // Expensive memory accesses.
     820ms      820ms     11:        buf[(2097169*i)%len(buf)] = 42
         .          .     12:    }
         .          .     13:}

=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
     6.34s      6.34s (flat, cum)   100% of Total
         .          .     509500: SUBQ $0x18, SP                          ;prof_test.go:8
         .          .     509504: MOVQ BP, 0x10(SP)
         .          .     509509: LEAQ 0x10(SP), BP
         .          .     50950e: MOVQ 0x20(SP), DX                       ;prof_test.go:9
         .          .     509513: XORL AX, AX
         .          .     509515: JMP 0x509525
      30ms       30ms     509517: LEAQ prof.buf(SB), BX                   ;prof.Benchmark prof_test.go:11
     540ms      540ms     50951e: MOVB $0x2a, 0(BX)(SI*1)
     5.49s      5.49s     509522: INCQ AX                                 ;prof.Benchmark prof_test.go:9
      30ms       30ms     509525: CMPQ AX, 0x170(DX)
         .          .     50952c: JLE 0x509559                            ;prof_test.go:9
      10ms       10ms     50952e: IMULQ $0x200011, AX, BX                 ;prof.Benchmark prof_test.go:11
      20ms       20ms     509535: MOVQ BX, SI
      80ms       80ms     509538: SARQ $0x3f, BX
         .          .     50953c: SHRQ $0x22, BX                          ;prof_test.go:11
         .          .     509540: ADDQ SI, BX
      30ms       30ms     509543: SARQ $0x1e, BX                          ;prof.Benchmark prof_test.go:11
      90ms       90ms     509547: SHLQ $0x1e, BX
         .          .     50954b: SUBQ BX, SI                             ;prof_test.go:11
      20ms       20ms     50954e: CMPQ $prof.buf+1067102432(SB), SI       ;prof.Benchmark prof_test.go:11
         .          .     509555: JB 0x509517                             ;prof_test.go:11
         .          .     509557: JMP 0x509563
         .          .     509559: MOVQ 0x10(SP), BP                       ;prof_test.go:9
         .          .     50955e: ADDQ $0x18, SP
         .          .     509562: RET
         .          .     509563: MOVQ SI, AX                             ;prof_test.go:11
         .          .     509566: MOVL $prof.buf+1067102432(SB), CX
         .          .     50956b: CALL runtime.panicIndex(SB)

=== Linux perf correctly shows the "MOV" instruction is expensive. ===
 Percent |	Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
         :
         :
         :
         :            Disassembly of section .text:
         :
         :            0000000000509500 <prof.Benchmark>:
         :            prof.Benchmark():
         :
         :            import "testing"
         :
         :            var buf [1024 * 1024 * 1024]byte
         :
         :            func Benchmark(b *testing.B) {
    0.00 :   509500: sub    $0x18,%rsp
    0.00 :   509504: mov    %rbp,0x10(%rsp)
    0.00 :   509509: lea    0x10(%rsp),%rbp
         :            for i := 0; i < b.N; i++ {
    0.00 :   50950e: mov    0x20(%rsp),%rdx
    0.00 :   509513: xor    %eax,%eax
    0.00 :   509515: jmp    509525 <prof.Benchmark+0x25>
         :            // Expensive memory accesses.
         :            buf[(2097169*i)%len(buf)] = 42
    1.82 :   509517: lea    0x14ba02(%rip),%rbx        # 654f20 <prof.buf>
   92.32 :   50951e: movb   $0x2a,(%rbx,%rsi,1)
         :            for i := 0; i < b.N; i++ {
    0.08 :   509522: inc    %rax
    0.25 :   509525: cmp    %rax,0x170(%rdx)
    0.00 :   50952c: jle    509559 <prof.Benchmark+0x59>
         :            buf[(2097169*i)%len(buf)] = 42
    0.53 :   50952e: imul   $0x200011,%rax,%rbx
    1.74 :   509535: mov    %rbx,%rsi
    0.02 :   509538: sar    $0x3f,%rbx
    0.11 :   50953c: shr    $0x22,%rbx
    0.46 :   509540: add    %rsi,%rbx
    1.93 :   509543: sar    $0x1e,%rbx
    0.03 :   509547: shl    $0x1e,%rbx
    0.12 :   50954b: sub    %rbx,%rsi
    0.00 :   50954e: cmp    $0x40000000,%rsi
    0.61 :   509555: jb     509517 <prof.Benchmark+0x17>
    0.00 :   509557: jmp    509563 <prof.Benchmark+0x63>
         :            for i := 0; i < b.N; i++ {
    0.00 :   509559: mov    0x10(%rsp),%rbp
    0.00 :   50955e: add    $0x18,%rsp
    0.00 :   509562: retq
         :            buf[(2097169*i)%len(buf)] = 42
    0.00 :   509563: mov    %rsi,%rax
    0.00 :   509566: mov    $0x40000000,%ecx
    0.00 :   50956b: callq  46aae0 <runtime.panicIndex>
         :            ./<autogenerated>:1
    0.00 :   509570: nop

Cc @hyangah @randall77

@prattmic
Copy link
Member

@prattmic prattmic commented Sep 11, 2020

This sounds like it is likely due to the proposal in #36821. Go's pprof profiling uses ITIMER_PROF, which can have skid of several instructions from the actual expensive instruction.

Linux's perf, on the other hand, can use Intel PEBS (/AMD equivalent) for precise sampling.

I don't think perf enables PEBS by default (-e cycles:pp is required), but even the standard hardware counter likely has less skid than ITIMER_PROF.

@odeke-em odeke-em changed the title cmd/pprof: Pprof provides wrong address/line for CPU profiles cmd/pprof: pprof provides wrong address/line for CPU profiles Sep 11, 2020
@odeke-em
Copy link
Member

@odeke-em odeke-em commented Sep 11, 2020

/cc @rhysh too.

@toothrot toothrot added this to the Backlog milestone Sep 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.