Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: runtime.KeepAlive causes useless extra run-time overhead, even though it is an intrinsic? #33442

Open
nsajko opened this issue Aug 2, 2019 · 4 comments

Comments

@nsajko
Copy link
Contributor

commented Aug 2, 2019

$ go version
go version devel +2d6ee6e89a Thu Aug 1 20:37:08 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

Did not check.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/tmp/freedesktopCache/go-build"
GOENV="/home/nsajko/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/nsajko"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/nsajko/goNew"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/nsajko/goNew/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/nsajko/goNew/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build460144380=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I converted the benchmarks of the math package from the standard library to use runtime.KeepAlive instead of package level exported sink variables [1]. The rationale was to get more accurate and more robust benchmarks. See [0] for background (the more broad motivation for this change to package math was to check if runtime.KeepAlive is a possible substitute for the proposed formallyRead builtin func). For accuracy runtime.KeepAlive should not cause a run-time overhead, which I expected to be the case. For this issue I considered and compared only the effect on BenchmarkSignbit:

func BenchmarkSignbit(b *testing.B) {
        for i := 0; i < b.N; i++ {
                runtime.KeepAlive(Signbit(signbitPos))
        }
}

What did you expect to see?

I expected that if the change caused any significant performance differences, they should be speedups. This is because I expected runtime.KeepAlive not to cause a run time overhead.

What did you see instead?

In a lot of the benchmarks for the math package considerable slowdowns happened. In the following data for the changed code (with KeepAlive) one can see a weird sequence of four instructions MOVZX, LEAQ, ADDQ, MOVQ. The MOVQ writes to the stack. In essence, I think these four instruction spill to the stack the address of either the byte 0x00 or 0x01 from the runtime.staticbytes array, which seems like a useless thing to do.

Here is the disassembly and performance data for BenchmarkSignbit (outputs of objdump, pprof, and radare2) before and after the change:

CPUProfileSignbitGitMaster

TEXT math_test.BenchmarkSignbit(SB) /home/nsajko/goNew/src/math/all_test.go
  all_test.go:3556      0x515780                488b442408              MOVQ 0x8(SP), AX
  all_test.go:3556      0x515785                31c9                    XORL CX, CX
  all_test.go:3556      0x515787                31d2                    XORL DX, DX
  all_test.go:3556      0x515789                eb18                    JMP 0x5157a3
  all_test.go:3557      0x51578b                f20f1005d5d91400        MOVSD_XMM math_test.signbitPos(SB), X0
  unsafe.go:23          0x515793                66480f7ec3              MOVQ X0, BX
  signbit.go:9          0x515798                480fbae33f              BTQ $0x3f, BX
  signbit.go:9          0x51579d                0f92c2                  SETB DL
  all_test.go:3556      0x5157a0                48ffc1                  INCQ CX
  all_test.go:3556      0x5157a3                48398808010000          CMPQ CX, 0x108(AX)
  all_test.go:3556      0x5157aa                7fdf                    JG 0x51578b
  all_test.go:3559      0x5157ac                881551841800            MOVB DL, math_test.GlobalB(SB)
  all_test.go:3560      0x5157b2                c3                      RET





Total: 540ms
ROUTINE ======================== math_test.BenchmarkSignbit
     540ms      740ms (flat, cum) 137.04% of Total
         .          .     515780: MOVQ 0x8(SP), AX                        ;all_test.go:3556
         .          .     515785: XORL CX, CX
         .          .     515787: XORL DX, DX
         .          .     515789: JMP 0x5157a3
      80ms      280ms     51578b: MOVSD_XMM math_test.signbitPos(SB), X0  ;math_test.BenchmarkSignbit all_test.go:3557
         .          .     515793: MOVQ X0, BX                             ;unsafe.go:23
     180ms      180ms     515798: BTQ $0x3f, BX                           ;math.Signbit signbit.go:9
      20ms       20ms     51579d: SETB DL
     260ms      260ms     5157a0: INCQ CX                                 ;math_test.BenchmarkSignbit all_test.go:3556
         .          .     5157a3: CMPQ CX, 0x108(AX)                      ;all_test.go:3556
         .          .     5157aa: JG 0x51578b
         .          .     5157ac: MOVB DL, math_test.GlobalB(SB)          ;all_test.go:3559





                        ┌────────────────────────────────────────────────────┐
                        │ [0x515780]                                         │
                        │ ;-- sym.go.math_test.BenchmarkSignbit:             │
                        │ (fcn) sym.math_test.BenchmarkSignbit 51            │
                        │   sym.math_test.BenchmarkSignbit (int32_t arg_8h); │
                        │ ; arg int32_t arg_8h @ rsp+0x8                     │
                        │ ; [0x8:8]=-1                                       │
                        │ ; 8                                                │
                        │ mov rax, qword [arg_8h]                            │
                        │ xor ecx, ecx                                       │
                        │ xor edx, edx                                       │
                        │ jmp 0x5157a3                                       │
                        └────────────────────────────────────────────────────┘
                            v
                            │
                          ┌─┘
 ┌──────────────────────────┐
 │                        │ │
 │                  ┌────────────────────────────────────────────────────────────┐
 │                  │  0x5157a3 [oc]                                             │
 │                  │ ; CODE XREF from sym.math_test.BenchmarkSignbit (0x515789) │
 │                  │ ; [0x108:8]=-1                                             │
 │                  │ ; 264                                                      │
 │                  │ cmp qword [rax + 0x108], rcx                               │
 │                  │ jg 0x51578b                                                │
 │                  └────────────────────────────────────────────────────────────┘
 │                        t f
 │                        │ │
 │    ┌───────────────────┘ │
 │    │                     └───────────────────────────────────────────┐
 │    │                                                                 │
 │┌────────────────────────────────────────────────────────────┐    ┌─────────────────────────────────────────┐
 ││  0x51578b [ob]                                             │    │  0x5157ac [od]                          │
 ││ ; CODE XREF from sym.math_test.BenchmarkSignbit (0x5157aa) │    │ ; [0x69dc03:1]=0                        │
 ││ ; MOV xmm0 = [0x663168] = 0x4004000000000000               │    │ mov byte [obj.math_test.GlobalB], dl    │
 ││ ; [0x663168:8]=0x4004000000000000                          │    │ ret                                     │
 ││ movsd xmm0, qword obj.math_test.signbitPos                 │    └─────────────────────────────────────────┘
 ││ movq rbx, xmm0                                             │
 ││ bt rbx, 0x3f                                               │
 ││ setb dl                                                    │
 ││ inc rcx                                                    │
 │└────────────────────────────────────────────────────────────┘
 │    v
 │    │
 └────┘

CPUProfileSignbitKeepAlive

TEXT math_test.BenchmarkSignbit(SB) /home/nsajko/goNew/src/math/all_test.go
  all_test.go:3442      0x515740                4883ec10                SUBQ $0x10, SP
  all_test.go:3442      0x515744                48896c2408              MOVQ BP, 0x8(SP)
  all_test.go:3442      0x515749                488d6c2408              LEAQ 0x8(SP), BP
  all_test.go:3443      0x51574e                488b442418              MOVQ 0x18(SP), AX
  all_test.go:3443      0x515753                31c9                    XORL CX, CX
  all_test.go:3443      0x515755                eb29                    JMP 0x515780
  all_test.go:3444      0x515757                f20f100509ca1400        MOVSD_XMM math_test.signbitPos(SB), X0
  unsafe.go:23          0x51575f                66480f7ec2              MOVQ X0, DX
  signbit.go:9          0x515764                480fbae23f              BTQ $0x3f, DX
  signbit.go:9          0x515769                0f92c2                  SETB DL
  all_test.go:3444      0x51576c                0fb6d2                  MOVZX DL, DX
  all_test.go:3444      0x51576f                488d1d0a471500          LEAQ runtime.staticbytes(SB), BX
  all_test.go:3444      0x515776                4801da                  ADDQ BX, DX
  all_test.go:3444      0x515779                48891424                MOVQ DX, 0(SP)
  all_test.go:3443      0x51577d                48ffc1                  INCQ CX
  all_test.go:3443      0x515780                48398808010000          CMPQ CX, 0x108(AX)
  all_test.go:3443      0x515787                7fce                    JG 0x515757
  all_test.go:3443      0x515789                488b6c2408              MOVQ 0x8(SP), BP
  all_test.go:3443      0x51578e                4883c410                ADDQ $0x10, SP
  all_test.go:3443      0x515792                c3                      RET





Total: 820ms
ROUTINE ======================== math_test.BenchmarkSignbit
     820ms      970ms (flat, cum) 118.29% of Total
         .          .     515740: SUBQ $0x10, SP                          ;all_test.go:3442
         .          .     515744: MOVQ BP, 0x8(SP)
         .          .     515749: LEAQ 0x8(SP), BP
         .          .     51574e: MOVQ 0x18(SP), AX                       ;all_test.go:3443
         .          .     515753: XORL CX, CX
         .          .     515755: JMP 0x515780
     120ms      270ms     515757: MOVSD_XMM math_test.signbitPos(SB), X0  ;math_test.BenchmarkSignbit all_test.go:3444
         .          .     51575f: MOVQ X0, DX                             ;unsafe.go:23
     150ms      150ms     515764: BTQ $0x3f, DX                           ;math.Signbit signbit.go:9
         .          .     515769: SETB DL                                 ;signbit.go:9
     120ms      120ms     51576c: MOVZX DL, DX                            ;math_test.BenchmarkSignbit all_test.go:3444
      10ms       10ms     51576f: LEAQ runtime.staticbytes(SB), BX
     130ms      130ms     515776: ADDQ BX, DX
      50ms       50ms     515779: MOVQ DX, 0(SP)
     240ms      240ms     51577d: INCQ CX                                 ;math_test.BenchmarkSignbit all_test.go:3443
         .          .     515780: CMPQ CX, 0x108(AX)                      ;all_test.go:3443
         .          .     515787: JG 0x515757
         .          .     515789: MOVQ 0x8(SP), BP
         .          .     51578e: ADDQ $0x10, SP





              ┌─────────────────────────────────────────────────────┐
              │ [0x515740]                                          │
              │ (fcn) sym.math_test.BenchmarkSignbit 83             │
              │   sym.math_test.BenchmarkSignbit (int32_t arg_18h); │
              │ ; var int32_t var_8h @ rsp+0x8                      │
              │ ; arg int32_t arg_18h @ rsp+0x18                    │
              │ sub rsp, 0x10                                       │
              │ mov qword [var_8h], rbp                             │
              │ ; 8                                                 │
              │ lea rbp, [var_8h]                                   │
              │ ; [0x18:8]=-1                                       │
              │ ; 24                                                │
              │ mov rax, qword [arg_18h]                            │
              │ xor ecx, ecx                                        │
              │ jmp 0x515780                                        │
              └─────────────────────────────────────────────────────┘
                  v
                  │
                ┌─┘
  ┌───────────────┐
  │             │ │
  │       ┌────────────────────────────────────────────────────────────┐
  │       │  0x515780 [oc]                                             │
  │       │ ; CODE XREF from sym.math_test.BenchmarkSignbit (0x515755) │
  │       │ ; [0x108:8]=-1                                             │
  │       │ ; 264                                                      │
  │       │ cmp qword [rax + 0x108], rcx                               │
  │       │ jg 0x515757                                                │
  │       └────────────────────────────────────────────────────────────┘
  │             t f
  │             │ │
  │    ┌────────┘ │
  │    │          └────────────────────────────────────────┐
  │    │                                                   │
  │┌───────────────────────────────────────────────┐   ┌────────────────────────────┐
  ││  0x515757 [ob]                                │   │  0x515789 [od]             │
  ││ ; MOV xmm0 = [0x662168] = 0x4004000000000000  │   │ ; [0x8:8]=-1               │
  ││ ; [0x662168:8]=0x4004000000000000             │   │ ; 8                        │
  ││ movsd xmm0, qword obj.math_test.signbitPos    │   │ mov rbp, qword [var_8h]    │
  ││ movq rdx, xmm0                                │   │ add rsp, 0x10              │
  ││ bt rdx, 0x3f                                  │   │ ret                        │
  ││ setb dl                                       │   └────────────────────────────┘
  ││ movzx edx, dl                                 │
  ││ ; 0x669e80                                    │
  ││ lea rbx, obj.runtime.staticbytes              │
  ││ add rdx, rbx                                  │
  ││ mov qword [rsp], rdx                          │
  ││ inc rcx                                       │
  │└───────────────────────────────────────────────┘
  │    v
  │    │
  └────┘

This issue is maybe a little related: [2]

An unrelated possible issue visible in the disassembly is the MOVSD_XMM followed by the MOVQ. Could those two instructions not be replaced by a single MOVQ?

[0] #33325

[1] https://go-review.googlesource.com/c/go/+/188437

[2] #32115

@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2019

I don' think the slowdown of the benchmark itself is reason for concern. The runtime.KeepAlive, like the global variable write or the for i := 0; i < b.N; i++ loop, is just part of the benchmark overhead. As long as it isn't extreme or variable, benchmarking still works.

@nsajko

This comment has been minimized.

Copy link
Contributor Author

commented Aug 2, 2019

I don' think the slowdown of the benchmark itself is reason for concern. The runtime.KeepAlive, like the global variable write or the for i := 0; i < b.N; i++ loop, is just part of the benchmark overhead. As long as it isn't extreme or variable, benchmarking still works.

But still, it would be better if the overhead did not exist, at least for the other possible uses of runtime.KeepAlive. Also, the overhead (the 4 mentioned extra instructions) seems very unnecessary/useless.

@odeke-em odeke-em added this to the Go1.14 milestone Aug 3, 2019

@nsajko

This comment has been minimized.

Copy link
Contributor Author

commented Aug 3, 2019

One thing I am wondering about but did not ask is if spilling data to stack (if it is not on the heap already) is essential to the way runtime.KeepAlive is currently implemented? I guess you guys would not like fixing this issue in that case ...

@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

Yes, currently KeepAlive requires that its argument be converted to an interface, and that interface be written to the stack. We use the regular conversion and spill/restore mechanisms for that.
There's no reason it has to be implemented that way. It's just the easiest thing that gets us the right semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.