Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: framepointer-based stack unwinding omits memmove's callee #58835

Open
rhysh opened this issue Mar 2, 2023 · 10 comments
Open

runtime: framepointer-based stack unwinding omits memmove's callee #58835

rhysh opened this issue Mar 2, 2023 · 10 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Mar 2, 2023

It looks like the immediate caller of runtime.memmove is reported correctly in runtime/pprof profiles, but is missing in perf's framepointer-based unwinding. I don't know whether this affects callers of other assembly functions, or if it's specific to memmove. CC @golang/runtime , and probably also of interest to @felixge

What version of Go are you using (go version)?

$ go1.20 version
go version go1.20.1 darwin/arm64

Does this issue reproduce with the latest release?

Yes, go1.20.1 is the latest stable release. I first identified this behavior in production use with go1.19.6.

What operating system and processor architecture are you using (go env)?

For the reproducer, I'm cross-compiling from darwin/arm64 to linux/amd64. The production behavior I'm reproducing was with linux/amd64 builds for linux/amd64.

go env Output
$ go1.20 env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/rhys/go/pkg/mod"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/Users/rhys/go/version/go1.20"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/rhys/go/version/go1.20/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.1"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build3223373974=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I used Linux's perf to report collect data on an app's performance using the -g flag to collect stack traces using the frame pointer.

$ cat /tmp/frames_test.go                     
package main

import "testing"

func BenchmarkMemmove(b *testing.B) {
        src := make([]byte, 1<<15)
        dst := make([]byte, len(src))
        for i := 0; i < b.N; i++ {
                root(dst, src)
        }
}

//go:noinline
func root(dst, src []byte) {
        leaf(dst, src)
}

//go:noinline
func leaf(dst, src []byte) {
        copy(dst, src)
}
$ GOOS=linux GOARCH=amd64 go1.20 test -c -o /tmp/memmove-fp /tmp/frames_test.go
# [scp to linux/amd64 machine, run the following there]
$ sudo perf record -o /tmp/mm.perf.data -g -F 99 -- ./memmove-fp -test.bench=. -test.cpuprofile=/tmp/mm.pb.gz
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
BenchmarkMemmove-4       1511648               795.6 ns/op
PASS
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.028 MB /tmp/mm.perf.data (197 samples) ]

What did you expect to see?

I expected go tool pprof and perf record -g+perf script to agree on the call stacks, including the identity of the immediate caller of runtime.memmove. I expected the function named command-line-arguments.leaf to show up in the call stack, with an immediate caller of command-line-arguments.root and an immediate callee of runtime.memmove.

What did you see instead?

With go tool pprof, I see reports of CPU time spent in runtime.memmove, called by command-line-arguments.leaf, called by command-line-arguments.root, as I expect.

With perf record -g and perf script, I see the on-CPU time attributed to runtime.memmove as called by command-line-arguments.root. The ...leaf frame is missing, which I consider to be a bug.

$ perf script -i /tmp/mm.perf.data | head -n 26
memmove-fp  1265 1124104.950646:   10101010 cpu-clock:pppH: 
                  468641 runtime.memmove+0x421 (/home/ubuntu/memmove-fp)
                  4f6d85 command-line-arguments.root+0x25 (/home/ubuntu/memmove-fp)
                  4f6d05 command-line-arguments.BenchmarkMemmove+0x85 (/home/ubuntu/memmove-fp)
                  4ada02 testing.(*B).runN+0x102 (/home/ubuntu/memmove-fp)
                  4ae685 testing.(*B).launch+0x1c5 (/home/ubuntu/memmove-fp)
                  4ae486 testing.(*B).doBench.func1+0x26 (/home/ubuntu/memmove-fp)
                  467381 runtime.goexit.abi0+0x1 (/home/ubuntu/memmove-fp)

memmove-fp  1265 1124104.960746:   10101010 cpu-clock:pppH: 
                  468641 runtime.memmove+0x421 (/home/ubuntu/memmove-fp)
                  4f6d85 command-line-arguments.root+0x25 (/home/ubuntu/memmove-fp)
                  4f6d05 command-line-arguments.BenchmarkMemmove+0x85 (/home/ubuntu/memmove-fp)
                  4ada02 testing.(*B).runN+0x102 (/home/ubuntu/memmove-fp)
                  4ae685 testing.(*B).launch+0x1c5 (/home/ubuntu/memmove-fp)
                  4ae486 testing.(*B).doBench.func1+0x26 (/home/ubuntu/memmove-fp)
                  467381 runtime.goexit.abi0+0x1 (/home/ubuntu/memmove-fp)

memmove-fp  1265 1124104.970847:   10101010 cpu-clock:pppH: 
                  46864b runtime.memmove+0x42b (/home/ubuntu/memmove-fp)
                  4f6d85 command-line-arguments.root+0x25 (/home/ubuntu/memmove-fp)
                  4f6d05 command-line-arguments.BenchmarkMemmove+0x85 (/home/ubuntu/memmove-fp)
                  4ada02 testing.(*B).runN+0x102 (/home/ubuntu/memmove-fp)
                  4ae685 testing.(*B).launch+0x1c5 (/home/ubuntu/memmove-fp)
                  4ae486 testing.(*B).doBench.func1+0x26 (/home/ubuntu/memmove-fp)
                  467381 runtime.goexit.abi0+0x1 (/home/ubuntu/memmove-fp)
$ go1.20 tool pprof -peek=runtime.memmove /tmp/mmpbgz           
File: memmove-fp
Type: cpu
Time: Mar 2, 2023 at 11:40am (PST)
Duration: 2.12s, Total samples = 2010ms (94.82%)
Showing nodes accounting for 2010ms, 100% of 2010ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                            1960ms   100% |   command-line-arguments.leaf
    1960ms 97.51% 97.51%     1960ms 97.51%                | runtime.memmove
----------------------------------------------------------+-------------

$ go1.20 tool pprof -raw /tmp/mmpbgz   
PeriodType: cpu nanoseconds
Period: 10000000
Time: 2023-03-02 11:40:52.391789789 -0800 PST
Duration: 2.11
Samples:
samples/count cpu/nanoseconds
         47  470000000: 1 2 3 4 5 6 
        100 1000000000: 7 2 3 4 5 6 
         24  240000000: 8 2 3 4 5 6 
          3   30000000: 9 5 6 
          6   60000000: 10 2 3 4 5 6 
         12  120000000: 11 2 3 4 5 6 
          3   30000000: 12 2 3 4 5 6 
          1   10000000: 13 2 3 4 5 6 
          2   20000000: 14 5 6 
          3   30000000: 15 2 3 4 5 6 
Locations
     1: 0x468641 M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:369 s=35
     2: 0x4f6e14 M=1 command-line-arguments.leaf /tmp/frames_test.go:20 s=19
     3: 0x4f6d84 M=1 command-line-arguments.root /tmp/frames_test.go:15 s=14
     4: 0x4f6d04 M=1 command-line-arguments.BenchmarkMemmove /tmp/frames_test.go:9 s=5
     5: 0x4ada01 M=1 testing.(*B).runN /usr/local/go/src/testing/benchmark.go:193 s=181
     6: 0x4ae684 M=1 testing.(*B).launch /usr/local/go/src/testing/benchmark.go:334 s=293
     7: 0x46864b M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:371 s=35
     8: 0x468661 M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:376 s=35
     9: 0x4f6ce8 M=1 command-line-arguments.BenchmarkMemmove /tmp/frames_test.go:8 s=5
    10: 0x468657 M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:374 s=35
    11: 0x46865c M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:375 s=35
    12: 0x468638 M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:367 s=35
    13: 0x4685cc M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:302 s=35
    14: 0x4f6d1d M=1 command-line-arguments.BenchmarkMemmove /tmp/frames_test.go:8 s=5
    15: 0x468652 M=1 runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s:373 s=35
Mappings
1: 0x400000/0x4f8000/0x0 /home/ubuntu/memmove-fp  [FN]
2: 0x7fff85d4c000/0x7fff85d4e000/0x0 [vdso]  
3: 0xffffffffff600000/0xffffffffff601000/0x0 [vsyscall]
$ go1.20 tool objdump -s '\.root$' /tmp/memmove-fp 
TEXT command-line-arguments.root(SB) /tmp/frames_test.go
  frames_test.go:14     0x4f6d60                493b6610                CMPQ 0x10(R14), SP                      
  frames_test.go:14     0x4f6d64                7629                    JBE 0x4f6d8f                            
  frames_test.go:14     0x4f6d66                4883ec38                SUBQ $0x38, SP                          
  frames_test.go:14     0x4f6d6a                48896c2430              MOVQ BP, 0x30(SP)                       
  frames_test.go:14     0x4f6d6f                488d6c2430              LEAQ 0x30(SP), BP                       
  frames_test.go:14     0x4f6d74                4889442440              MOVQ AX, 0x40(SP)                       
  frames_test.go:14     0x4f6d79                48897c2458              MOVQ DI, 0x58(SP)                       
  frames_test.go:15     0x4f6d7e                6690                    NOPW                                    
  frames_test.go:15     0x4f6d80                e85b000000              CALL command-line-arguments.leaf(SB)    
  frames_test.go:16     0x4f6d85                488b6c2430              MOVQ 0x30(SP), BP                       
  frames_test.go:16     0x4f6d8a                4883c438                ADDQ $0x38, SP                          
  frames_test.go:16     0x4f6d8e                c3                      RET                                     
  frames_test.go:14     0x4f6d8f                4889442408              MOVQ AX, 0x8(SP)                        
  frames_test.go:14     0x4f6d94                48895c2410              MOVQ BX, 0x10(SP)                       
  frames_test.go:14     0x4f6d99                48894c2418              MOVQ CX, 0x18(SP)                       
  frames_test.go:14     0x4f6d9e                48897c2420              MOVQ DI, 0x20(SP)                       
  frames_test.go:14     0x4f6da3                4889742428              MOVQ SI, 0x28(SP)                       
  frames_test.go:14     0x4f6da8                4c89442430              MOVQ R8, 0x30(SP)                       
  frames_test.go:14     0x4f6dad                e84ee6f6ff              CALL runtime.morestack_noctxt.abi0(SB)  
  frames_test.go:14     0x4f6db2                488b442408              MOVQ 0x8(SP), AX                        
  frames_test.go:14     0x4f6db7                488b5c2410              MOVQ 0x10(SP), BX                       
  frames_test.go:14     0x4f6dbc                488b4c2418              MOVQ 0x18(SP), CX                       
  frames_test.go:14     0x4f6dc1                488b7c2420              MOVQ 0x20(SP), DI                       
  frames_test.go:14     0x4f6dc6                488b742428              MOVQ 0x28(SP), SI                       
  frames_test.go:14     0x4f6dcb                4c8b442430              MOVQ 0x30(SP), R8                       
  frames_test.go:14     0x4f6dd0                eb8e                    JMP command-line-arguments.root(SB)

$ go1.20 tool objdump -s '\.leaf$' /tmp/memmove-fp 
TEXT command-line-arguments.leaf(SB) /tmp/frames_test.go
  frames_test.go:19     0x4f6de0                493b6610                CMPQ 0x10(R14), SP                      
  frames_test.go:19     0x4f6de4                7639                    JBE 0x4f6e1f                            
  frames_test.go:19     0x4f6de6                4883ec20                SUBQ $0x20, SP                          
  frames_test.go:19     0x4f6dea                48896c2418              MOVQ BP, 0x18(SP)                       
  frames_test.go:19     0x4f6def                488d6c2418              LEAQ 0x18(SP), BP                       
  frames_test.go:19     0x4f6df4                4889442428              MOVQ AX, 0x28(SP)                       
  frames_test.go:19     0x4f6df9                48897c2440              MOVQ DI, 0x40(SP)                       
  frames_test.go:20     0x4f6dfe                4839de                  CMPQ BX, SI                             
  frames_test.go:20     0x4f6e01                480f4cde                CMOVL SI, BX                            
  frames_test.go:20     0x4f6e05                4839f8                  CMPQ DI, AX                             
  frames_test.go:20     0x4f6e08                740b                    JE 0x4f6e15                             
  frames_test.go:20     0x4f6e0a                4889d9                  MOVQ BX, CX                             
  frames_test.go:20     0x4f6e0d                4889fb                  MOVQ DI, BX                             
  frames_test.go:20     0x4f6e10                e80b14f7ff              CALL runtime.memmove(SB)                
  frames_test.go:21     0x4f6e15                488b6c2418              MOVQ 0x18(SP), BP                       
  frames_test.go:21     0x4f6e1a                4883c420                ADDQ $0x20, SP                          
  frames_test.go:21     0x4f6e1e                c3                      RET                                     
  frames_test.go:19     0x4f6e1f                4889442408              MOVQ AX, 0x8(SP)                        
  frames_test.go:19     0x4f6e24                48895c2410              MOVQ BX, 0x10(SP)                       
  frames_test.go:19     0x4f6e29                48894c2418              MOVQ CX, 0x18(SP)                       
  frames_test.go:19     0x4f6e2e                48897c2420              MOVQ DI, 0x20(SP)                       
  frames_test.go:19     0x4f6e33                4889742428              MOVQ SI, 0x28(SP)                       
  frames_test.go:19     0x4f6e38                4c89442430              MOVQ R8, 0x30(SP)                       
  frames_test.go:19     0x4f6e3d                0f1f00                  NOPL 0(AX)                              
  frames_test.go:19     0x4f6e40                e8bbe5f6ff              CALL runtime.morestack_noctxt.abi0(SB)  
  frames_test.go:19     0x4f6e45                488b442408              MOVQ 0x8(SP), AX                        
  frames_test.go:19     0x4f6e4a                488b5c2410              MOVQ 0x10(SP), BX                       
  frames_test.go:19     0x4f6e4f                488b4c2418              MOVQ 0x18(SP), CX                       
  frames_test.go:19     0x4f6e54                488b7c2420              MOVQ 0x20(SP), DI                       
  frames_test.go:19     0x4f6e59                488b742428              MOVQ 0x28(SP), SI                       
  frames_test.go:19     0x4f6e5e                4c8b442430              MOVQ 0x30(SP), R8                       
  frames_test.go:19     0x4f6e63                e978ffffff              JMP command-line-arguments.leaf(SB)

$ go1.20 tool objdump -s '^runtime\.memmove$' /tmp/memmove-fp | head -n 30
TEXT runtime.memmove(SB) /usr/local/go/src/runtime/memmove_amd64.s
  memmove_amd64.s:39    0x468220                4889c7                  MOVQ AX, DI                             
  memmove_amd64.s:40    0x468223                4889de                  MOVQ BX, SI                             
  memmove_amd64.s:41    0x468226                4889cb                  MOVQ CX, BX                             
  memmove_amd64.s:54    0x468229                4885db                  TESTQ BX, BX                            
  memmove_amd64.s:55    0x46822c                0f84f7000000            JE 0x468329                             
  memmove_amd64.s:56    0x468232                4883fb02                CMPQ $0x2, BX                           
  memmove_amd64.s:57    0x468236                0f86e0000000            JBE 0x46831c                            
  memmove_amd64.s:58    0x46823c                4883fb04                CMPQ $0x4, BX                           
  memmove_amd64.s:59    0x468240                0f82e9000000            JB 0x46832f                             
  memmove_amd64.s:60    0x468246                0f86de000000            JBE 0x46832a                            
  memmove_amd64.s:61    0x46824c                4883fb08                CMPQ $0x8, BX                           
  memmove_amd64.s:62    0x468250                0f82e6000000            JB 0x46833c                             
  memmove_amd64.s:63    0x468256                0f84ed000000            JE 0x468349                             
  memmove_amd64.s:64    0x46825c                4883fb10                CMPQ $0x10, BX                          
  memmove_amd64.s:65    0x468260                0f86ea000000            JBE 0x468350                            
  memmove_amd64.s:66    0x468266                4883fb20                CMPQ $0x20, BX                          
  memmove_amd64.s:67    0x46826a                0f86f1000000            JBE 0x468361                            
  memmove_amd64.s:68    0x468270                4883fb40                CMPQ $0x40, BX                          
  memmove_amd64.s:69    0x468274                0f86fc000000            JBE 0x468376                            
  memmove_amd64.s:70    0x46827a                4881fb80000000          CMPQ $0x80, BX                          
  memmove_amd64.s:71    0x468281                0f861a010000            JBE 0x4683a1                            
  memmove_amd64.s:72    0x468287                4881fb00010000          CMPQ $0x100, BX                         
  memmove_amd64.s:73    0x46828e                0f8664010000            JBE 0x4683f8                            
  memmove_amd64.s:75    0x468294                f60558331d0001          TESTB $0x1, runtime.useAVXmemmove(SB)   
  memmove_amd64.s:76    0x46829b                0f8525030000            JNE 0x4685c6                            
  memmove_amd64.s:81    0x4682a1                4839fe                  CMPQ DI, SI                             
  memmove_amd64.s:82    0x4682a4                763a                    JBE 0x4682e0                            
  memmove_amd64.s:88    0x4682a6                4881fb00080000          CMPQ $0x800, BX                         
  memmove_amd64.s:89    0x4682ad                0f8609020000            JBE 0x4684bc
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 2, 2023
@prattmic
Copy link
Member

prattmic commented Mar 2, 2023

Is this fixed by https://go.dev/cl/466316?

@cherrymui
Copy link
Member

I don't think so. runtime.memmove is a frameless leaf function and so doesn't save the frame pointer. In general frameless leaf functions tend to not save the frame pointer for performance reasons. I think C compiler tend to do that as well. I'm not sure what we do here.

Do you know if C memmove saves frame pointer (I guess it depends on implementation) and whether perf can handle that correctly?

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 2, 2023
@dmitshur dmitshur added this to the Backlog milestone Mar 2, 2023
@rhysh
Copy link
Contributor Author

rhysh commented Mar 3, 2023

It's not fixed by CL 466316 (details below).

In general frameless leaf functions tend to not save the frame pointer for performance reasons.

For what it's worth, I came across this while asking perf about last-level cache misses (perf record -g --freq 99 --event ...,LLC-load-misses,...). More than 20% of them came from memmove with a particular (reported) caller, and this shortcoming made it a bit trickier to interpret the data. The omitted pre-leaf function in that case has four separate places that it calls memmove, so at this point the best I have is a "pretty good guess" about which of them is responsible for the 20%+ LLC-misses.

This would affect a question @aclements had elsewhere recently about whether, when using perf_event_open in runtime/pprof for some future release, we'd be able to use the stacks that we get from perf, or whether we'd need to ask the OS to send us a (synchronous?) signal so we could do our own traceback. If some frames (callers of frameless leaf functions) are missing from perf's view, we might not be able to use it as the only data source.

I don't know what performance is saved by not saving the frame pointer, but I want to make the point that the costs of various forms of work that take place in those frameless leaf functions can add up.

I don't know my way around stack management, so maybe this is a nonsense suggestion: is there room for a middle-ground where the caller carves out an extra word of stack space (as it would for an additional uintptr argument) that the leaf can use to store the frame pointer without needing to have its own frame?

Thanks!

$ go version /tmp/memmove-fp-tip                                                                                                                 
/tmp/memmove-fp-tip: devel go1.21-0487ba08a6 Thu Mar 2 23:34:34 2023 +0000

$ perf script -i /tmp/mm-tip.perf.data | head -n 8'
memmove-fp-tip  2730 1139297.063377:   10101010 cpu-clock:pppH: 
                  466f0b runtime.memmove+0x42b (/home/ubuntu/memmove-fp-tip)
                  4f3e1d command-line-arguments.root+0x1d (/home/ubuntu/memmove-fp-tip)
                  4f3db9 command-line-arguments.BenchmarkMemmove+0x79 (/home/ubuntu/memmove-fp-tip)
                  4abf42 testing.(*B).runN+0x102 (/home/ubuntu/memmove-fp-tip)
                  4acb5c testing.(*B).launch+0x1bc (/home/ubuntu/memmove-fp-tip)
                  4ac965 testing.(*B).doBench.func1+0x25 (/home/ubuntu/memmove-fp-tip)
                  465d61 runtime.goexit.abi0+0x1 (/home/ubuntu/memmove-fp-tip)

$ go tool pprof -peek=runtime.memmove /tmp/mmtippbgz                     
File: memmove-fp-tip
Type: cpu
Time: Mar 2, 2023 at 3:54pm (PST)
Duration: 2.10s, Total samples = 1990ms (94.57%)
Showing nodes accounting for 1990ms, 100% of 1990ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                            1960ms   100% |   command-line-arguments.leaf
    1960ms 98.49% 98.49%     1960ms 98.49%                | runtime.memmove
----------------------------------------------------------+-------------

@felixge
Copy link
Contributor

felixge commented Mar 3, 2023

In general frameless leaf functions tend to not save the frame pointer for performance reasons. I think C compiler tend to do that as well. I'm not sure what we do here.

Maybe the heuristics for this could be extended to take the "complexity" of the leaf function into account, similar to how it is done for inlining? Not sure how feasible this is, but looking at memmove's implementation, I suspect it wouldn't cause a noticeable perf hit if it pushed a frame pointer. But I haven't measured that 😅.

I don't know my way around stack management, so maybe this is a nonsense suggestion: is there room for a middle-ground where the caller carves out an extra word of stack space (as it would for an additional uintptr argument) that the leaf can use to store the frame pointer without needing to have its own frame?

Not sure if this could work. I think it would cause the positions of the return addr and the frame pointer to be reversed on the stack. Frame pointer unwinders rely on the relative position of these two values on the stack.

@rhysh quick question: Have you tried using the DWARF unwinder in perf for this?

@cherrymui
Copy link
Member

Maybe the heuristics for this could be extended to take the "complexity" of the leaf function into account

Thanks for the suggestion, but I think we'd want to keep the assembler simple and avoid surprises (e.g. it doesn't save frame pointer for a 50-instruction function, but suddenly does for a 51-instruction function?) If you think it is helpful, it wouldn't be hard to manually save the frame pointer, or declare a nonzero frame size (if this comes often, we can think about adding a new text flag, e.g. HAVEFRAME, to force saving frame pointer).

@qmuntal
Copy link
Contributor

qmuntal commented Mar 3, 2023

Agree with @cherrymui, frameless leaf functions should remain frameless, for performance and code size. AFAIK, perf supports DWARF unwinding, so it should be possible to emit CFI information that models frameless leaf functions, if we are not doing that yet.

@rhysh
Copy link
Contributor Author

rhysh commented Mar 4, 2023

If you think it is helpful, it wouldn't be hard to manually save the frame pointer, or declare a nonzero frame size

Indeed @cherrymui , changing the end of the TEXT directive from $0-24 to $8-24 causes the command-line-arguments.leaf frame to show up as the caller of runtime.memmove. Thanks!

What sort of data would motivate accepting this change to runtime.memmove, or to other functions?


Have you tried using the DWARF unwinder in perf for this?

I haven't used that since GOEXPERIMENT=framepointer became available, so I'm not sure what behavior to expect from it @felixge . But the results I get from adding --call-stack=dwarf don't look right at all. I think that the executable includes all the usual DWARF info: on one hand it's structured a test, but I asked to compile it explicitly, and the output of file ./memmove-fp is ./memmove-fp: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped, and addr2line --exe=./memmove-fp 468641 tells me /usr/local/go/src/runtime/memmove_amd64.s:369.

Here's part of what I get from perf script on the output of sudo perf record -o /tmp/mm.perf.data -g --call-graph=dwarf -F 9999 -- ./memmove-fp -test.bench=. -test.cpuprofile=/tmp/mm.pb.gz, compiled with go1.20.1.

memmove-fp 16068 1233896.649417:     100010 cpu-clock:pppH: 
        ffffffff99880353 [unknown] ([kernel.kallsyms])
        ffffffff9988062c [unknown] ([kernel.kallsyms])
        ffffffff99876d94 [unknown] ([kernel.kallsyms])
        ffffffff9a401304 [unknown] ([kernel.kallsyms])
                  42f6ee runtime.(*spanSet).push+0x16e (/home/ubuntu/memmove-fp)

[snip]

memmove-fp 16068 1233896.649517:     100010 cpu-clock:pppH: 
                  42f704 runtime/internal/atomic.(*UnsafePointer).StoreNoWB+0x184 (inlined)
                  42f704 runtime.(*atomicMSpanPointer).StoreNoWB+0x184 (inlined)
                  42f704 runtime.(*spanSet).push+0x184 (/home/ubuntu/memmove-fp)

[snip, fast-forward to a rare sighting of something other than "memmove"]

memmove-fp 16068 1233897.125657:     100010 cpu-clock:pppH: 
                  468661 runtime.memmove+0x441 (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.125757:     100010 cpu-clock:pppH: 
                  4f6ce8 command-line-arguments.BenchmarkMemmove+0x68 (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.125857:     100010 cpu-clock:pppH: 
                  4f6d6f command-line-arguments.root+0xf (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.125957:     100010 cpu-clock:pppH: 
                  468641 runtime.memmove+0x421 (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.126057:     100010 cpu-clock:pppH: 
                  46864b runtime.memmove+0x42b (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.126157:     100010 cpu-clock:pppH: 
                  468641 runtime.memmove+0x421 (/home/ubuntu/memmove-fp)

memmove-fp 16068 1233897.126257:     100010 cpu-clock:pppH: 
                  468641 runtime.memmove+0x421 (/home/ubuntu/memmove-fp)

And with @cherrymui 's suggested workaround and sudo perf record -o /tmp/mm-tip-haveframe.perf.data -g -F 99 -- ./memmove-fp-tip-haveframe -test.bench=. -test.cpuprofile=/tmp/mm-tip-haveframe.pb.gz:

diff --git a/src/runtime/memmove_amd64.s b/src/runtime/memmove_amd64.s
index 018bb0b19d..c599090cc3 100644
--- a/src/runtime/memmove_amd64.s
+++ b/src/runtime/memmove_amd64.s
@@ -32,7 +32,7 @@
 
 // func memmove(to, from unsafe.Pointer, n uintptr)
 // ABIInternal for performance.
-TEXT runtime·memmove<ABIInternal>(SB), NOSPLIT, $0-24
+TEXT runtime·memmove<ABIInternal>(SB), NOSPLIT, $8-24
        // AX = to
        // BX = from
        // CX = n
$ perf script -i /tmp/mm-tip-haveframe.perf.data | head -n 9
memmove-fp-tip- 17664 1235775.612837:   10101010 cpu-clock:pppH: 
                  466f4f runtime.memmove+0x46f (/home/ubuntu/memmove-fp-tip-haveframe)
                  4f3f10 command-line-arguments.leaf+0x30 (/home/ubuntu/memmove-fp-tip-haveframe)
                  4f3e7d command-line-arguments.root+0x1d (/home/ubuntu/memmove-fp-tip-haveframe)
                  4f3e19 command-line-arguments.BenchmarkMemmove+0x79 (/home/ubuntu/memmove-fp-tip-haveframe)
                  4abfa2 testing.(*B).runN+0x102 (/home/ubuntu/memmove-fp-tip-haveframe)
                  4acbbc testing.(*B).launch+0x1bc (/home/ubuntu/memmove-fp-tip-haveframe)
                  4ac9c5 testing.(*B).doBench.func1+0x25 (/home/ubuntu/memmove-fp-tip-haveframe)
                  465d61 runtime.goexit.abi0+0x1 (/home/ubuntu/memmove-fp-tip-haveframe)

@felixge
Copy link
Contributor

felixge commented Mar 4, 2023

I think that the executable includes all the usual DWARF info: on one hand it's structured a test, but I asked to compile it explicitly

go run omits DWARF info, but I'm not sure about go test right now, especially when the -c flag is used.

addr2line --exe=./memmove-fp 468641 tells me /usr/local/go/src/runtime/memmove_amd64.s:369.

I think addr2line uses gopclntab rather than DWARF.

Either way, I'm +1 on your proposed change to memmove if it doesn't cause noticeable performance regressions.

@cherrymui
Copy link
Member

What sort of data would motivate accepting this change to runtime.memmove, or to other functions?

If benchmark results show minimal performance impact, and it helps debugging/profiling, I think it would be acceptable.

go run omits DWARF info, but I'm not sure about go test right now, especially when the -c flag is used.

go test -c should keep the DWARF info (unless you explicitly disable it on the command line).

@mknyszek
Copy link
Contributor

mknyszek commented Mar 8, 2023

In C&RT triage, @rhysh are you planning on working on this? Assigning to you optimistically but feel free to unassign.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

8 participants