-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Open
Labels
DocumentationIssues describing a change to documentation.Issues describing a change to documentation.NeedsFixThe path to resolution is known, but the work has not been done.The path to resolution is known, but the work has not been done.
Milestone
Description
Using CPU profiling and go tool pprof attributes CPU to the instruction following the one that consumed the time. This can cause CPU time to be attributed to an incorrect line number. For comparison, the Linux perf command works correctly. The example below shows the time is spent in the for loop, when it is actually spent during expensive memory accesses.
AFAICT this issue has existed for a long time (at least since 1.4). I can't see this issue has been logged yet which is surprising. It occurs with recent toolchains as well:
go version devel +d277a36123 Fri Sep 11 02:58:36 2020 +0000 linux/amd64
go version go1.15.2 linux/amd64
Tested on: Linux localhost.localdomain 5.8.4-200.fc32.x86_64 #1 SMP Wed Aug 26 22:28:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Here is a script which demonstrates the fault:
#!/bin/sh
cat > prof_test.go <<EOF
// Demonstrate that CPU profiles refer to the wrong instruction.
package prof
import "testing"
var buf [1024 * 1024 * 1024]byte
func Benchmark(b *testing.B) {
for i := 0; i < b.N; i++ {
// Expensive memory accesses.
buf[(2097169*i)%len(buf)] = 42
}
}
EOF
go mod init prof
go test -c -o prof.test
perf record -- ./prof.test -test.bench . -test.benchtime 3s -test.cpuprofile cpu.prof
echo
echo '=== pprof list shows the "for" loop is expensive, not the memory access. ==='
go tool pprof -list Benchmark prof.test cpu.prof
echo
echo '=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ==='
go tool pprof -disasm Benchmark prof.test cpu.prof
echo
echo '=== Linux perf correctly shows the "MOV" instruction is expensive. ==='
perf annotate --stdio -s prof.Benchmark | catAnd the output:
$ ./run.sh
goos: linux
goarch: amd64
pkg: prof
Benchmark-8 314821470 11.2 ns/op
PASS
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.995 MB perf.data (25644 samples) ]
=== pprof list shows the "for" loop is expensive, not the memory access. ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark in /home/mark/Prog/golang/src/junk/memperf/prof_test.go
6.34s 6.34s (flat, cum) 100% of Total
. . 4:import "testing"
. . 5:
. . 6:var buf [1024 * 1024 * 1024]byte
. . 7:
. . 8:func Benchmark(b *testing.B) {
5.52s 5.52s 9: for i := 0; i < b.N; i++ {
. . 10: // Expensive memory accesses.
820ms 820ms 11: buf[(2097169*i)%len(buf)] = 42
. . 12: }
. . 13:}
=== pprof disasm shows the loop "INC" instruction is most expensive (not the "MOV"). ===
Total: 6.34s
ROUTINE ======================== prof.Benchmark
6.34s 6.34s (flat, cum) 100% of Total
. . 509500: SUBQ $0x18, SP ;prof_test.go:8
. . 509504: MOVQ BP, 0x10(SP)
. . 509509: LEAQ 0x10(SP), BP
. . 50950e: MOVQ 0x20(SP), DX ;prof_test.go:9
. . 509513: XORL AX, AX
. . 509515: JMP 0x509525
30ms 30ms 509517: LEAQ prof.buf(SB), BX ;prof.Benchmark prof_test.go:11
540ms 540ms 50951e: MOVB $0x2a, 0(BX)(SI*1)
5.49s 5.49s 509522: INCQ AX ;prof.Benchmark prof_test.go:9
30ms 30ms 509525: CMPQ AX, 0x170(DX)
. . 50952c: JLE 0x509559 ;prof_test.go:9
10ms 10ms 50952e: IMULQ $0x200011, AX, BX ;prof.Benchmark prof_test.go:11
20ms 20ms 509535: MOVQ BX, SI
80ms 80ms 509538: SARQ $0x3f, BX
. . 50953c: SHRQ $0x22, BX ;prof_test.go:11
. . 509540: ADDQ SI, BX
30ms 30ms 509543: SARQ $0x1e, BX ;prof.Benchmark prof_test.go:11
90ms 90ms 509547: SHLQ $0x1e, BX
. . 50954b: SUBQ BX, SI ;prof_test.go:11
20ms 20ms 50954e: CMPQ $prof.buf+1067102432(SB), SI ;prof.Benchmark prof_test.go:11
. . 509555: JB 0x509517 ;prof_test.go:11
. . 509557: JMP 0x509563
. . 509559: MOVQ 0x10(SP), BP ;prof_test.go:9
. . 50955e: ADDQ $0x18, SP
. . 509562: RET
. . 509563: MOVQ SI, AX ;prof_test.go:11
. . 509566: MOVL $prof.buf+1067102432(SB), CX
. . 50956b: CALL runtime.panicIndex(SB)
=== Linux perf correctly shows the "MOV" instruction is expensive. ===
Percent | Source code & Disassembly of prof.test for cycles:u (24257 samples, percent: local period)
----------------------------------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 0000000000509500 <prof.Benchmark>:
: prof.Benchmark():
:
: import "testing"
:
: var buf [1024 * 1024 * 1024]byte
:
: func Benchmark(b *testing.B) {
0.00 : 509500: sub $0x18,%rsp
0.00 : 509504: mov %rbp,0x10(%rsp)
0.00 : 509509: lea 0x10(%rsp),%rbp
: for i := 0; i < b.N; i++ {
0.00 : 50950e: mov 0x20(%rsp),%rdx
0.00 : 509513: xor %eax,%eax
0.00 : 509515: jmp 509525 <prof.Benchmark+0x25>
: // Expensive memory accesses.
: buf[(2097169*i)%len(buf)] = 42
1.82 : 509517: lea 0x14ba02(%rip),%rbx # 654f20 <prof.buf>
92.32 : 50951e: movb $0x2a,(%rbx,%rsi,1)
: for i := 0; i < b.N; i++ {
0.08 : 509522: inc %rax
0.25 : 509525: cmp %rax,0x170(%rdx)
0.00 : 50952c: jle 509559 <prof.Benchmark+0x59>
: buf[(2097169*i)%len(buf)] = 42
0.53 : 50952e: imul $0x200011,%rax,%rbx
1.74 : 509535: mov %rbx,%rsi
0.02 : 509538: sar $0x3f,%rbx
0.11 : 50953c: shr $0x22,%rbx
0.46 : 509540: add %rsi,%rbx
1.93 : 509543: sar $0x1e,%rbx
0.03 : 509547: shl $0x1e,%rbx
0.12 : 50954b: sub %rbx,%rsi
0.00 : 50954e: cmp $0x40000000,%rsi
0.61 : 509555: jb 509517 <prof.Benchmark+0x17>
0.00 : 509557: jmp 509563 <prof.Benchmark+0x63>
: for i := 0; i < b.N; i++ {
0.00 : 509559: mov 0x10(%rsp),%rbp
0.00 : 50955e: add $0x18,%rsp
0.00 : 509562: retq
: buf[(2097169*i)%len(buf)] = 42
0.00 : 509563: mov %rsi,%rax
0.00 : 509566: mov $0x40000000,%ecx
0.00 : 50956b: callq 46aae0 <runtime.panicIndex>
: ./<autogenerated>:1
0.00 : 509570: nop
prashantv and akatsuki105
Metadata
Metadata
Assignees
Labels
DocumentationIssues describing a change to documentation.Issues describing a change to documentation.NeedsFixThe path to resolution is known, but the work has not been done.The path to resolution is known, but the work has not been done.