
cmd/compile: regression around now inlining code in 1.17 #48067

Closed
MStoykov opened this issue Aug 30, 2021 · 7 comments

Comments


@MStoykov MStoykov commented Aug 30, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/mstoykov/.cache/go-build"
GOENV="/home/mstoykov/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/mstoykov/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/mstoykov/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/mstoykov/.go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/mstoykov/.go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/mstoykov/work/k6io/reproduceScaleRegression/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3001641157=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I ran our project benchmarks and saw quite a bit of a regression in some calculation code (among others).
original issue: grafana/k6#2116
repo reproducing it with much less code: https://github.com/MStoykov/reproduceScaleRegression

name                                    old time/op  new time/op   delta
ExecutionSegmentScale/5523-8            5.77ns ± 4%  10.45ns ± 2%   +81.15%  (p=0.000 n=9+10)
ExecutionSegmentScale/67280421310721-8  10.9ns ± 5%   59.9ns ± 7%  +451.72%  (p=0.000 n=10+10)
ExecutionSegmentScaleEmpty/5523-8       2.84ns ± 5%   0.64ns ± 7%   -77.32%  (p=0.000 n=10+10)

As mentioned in the repo, the commit that makes this particular function inline is adb467f, but even without it, it's possible to get the code inlined, which results in the same regression even with go1.16.7.

What did you expect to see?

I would've wanted all the benchmarks to get faster :P.

In this particular example, I would expect inlining to help, not hinder. Some nice people on Slack theorized that it's likely unaligned jumps or cache-line misses, or (most likely) the branch predictor getting hash collisions as proposed in #18977 (comment), but I don't know enough to confirm or deny any of this.
That said, there is no such regression in the arm64 benchmark results, provided by @vearutop 🙇

What did you see instead?

I saw a decrease in performance after a change that should've done the opposite.

@cuonglm cuonglm changed the title Regression around now inlining code in 1.17 cmd/compile: regression around now inlining code in 1.17 Aug 30, 2021
Contributor

@vearutop vearutop commented Aug 30, 2021

Reduced linux/arm64 (Ampere A1) result (old 1.16.7, new 1.17) for the mentioned reproducer.

name                                    old time/op  new time/op  delta
ExecutionSegmentScale/5523-4            8.24ns ± 0%  5.34ns ± 0%  -35.19%  (p=0.000 n=9+9)
ExecutionSegmentScale/67280421310721-4  37.4ns ± 0%  32.0ns ± 0%  -14.28%  (p=0.006 n=7+9)
ExecutionSegmentScaleEmpty/5523-4       3.59ns ± 0%  0.83ns ± 0%  -76.73%  (p=0.000 n=9+10)


Contributor

@randall77 randall77 commented Aug 30, 2021

I don't see anything obvious. The register allocation looks a bit worse, but not enough to be 400+% worse. There are more reg-reg moves, but still no spills or anything like that.

Where does the cpu profiler say the time is going?


Author

@MStoykov MStoykov commented Aug 30, 2021

Ran only the worst benchmark with go test -bench BenchmarkExecutionSegmentScale/67280421310721$ -count 10 -cpuprofile

1.16.7 Output
$ go tool pprof 1.16.7.pprof
File: regression.test
Type: cpu
Time: Aug 30, 2021 at 8:32pm (EEST)
Duration: 15.34s, Total samples = 15.21s (99.15%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 15140ms, 99.54% of 15210ms total
Dropped 14 nodes (cum <= 76.05ms)
      flat  flat%   sum%        cum   cum%
   13530ms 88.95% 88.95%    13540ms 89.02%  regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64
    1220ms  8.02% 96.98%    14760ms 97.04%  regression.(*ExecutionTuple).ScaleInt64
     390ms  2.56% 99.54%    15150ms 99.61%  regression.BenchmarkExecutionSegmentScale.func1
         0     0% 99.54%    15150ms 99.61%  testing.(*B).launch
         0     0% 99.54%    15150ms 99.61%  testing.(*B).runN
(pprof) list ScaleInt64
Total: 15.21s
ROUTINE ======================== regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 in /home/mstoykov/work/k6io/reproduceScaleRegression/execution_segment.go
    13.53s     13.54s (flat, cum) 89.02% of Total
         .          .      8:   ExecutionSegmentSequence []int64 // not really
         .          .      9:}
         .          .     10:
         .          .     11:// ScaleInt64 ...
         .          .     12:func (essw *ExecutionSegmentSequenceWrapper) ScaleInt64(segmentIndex int, value int64) int64 {
     670ms      670ms     13:   start := essw.offsets[segmentIndex][0]
     570ms      570ms     14:   offsets := essw.offsets[segmentIndex][1:]
     9.48s      9.49s     15:   result := (value / essw.lcd) * int64(len(offsets))
     2.80s      2.80s     16:   for gi, i := 0, start; i < value%essw.lcd; gi, i = gi+1, i+offsets[gi] {
      10ms       10ms     17:           result++
         .          .     18:   }
         .          .     19:   return result
         .          .     20:}
         .          .     21:
         .          .     22:// ExecutionTuple ...
ROUTINE ======================== regression.(*ExecutionTuple).ScaleInt64 in /home/mstoykov/work/k6io/reproduceScaleRegression/execution_segment.go
     1.22s     14.76s (flat, cum) 97.04% of Total
         .          .     26:}
         .          .     27:
         .          .     28://
         .          .     29:
         .          .     30:// ScaleInt64 scales the provided value for our execution segment.
     430ms      430ms     31:func (et *ExecutionTuple) ScaleInt64(value int64) int64 {
     270ms      270ms     32:   if len(et.Sequence.ExecutionSegmentSequence) == 1 {
         .          .     33:           return value // if we don't have any segmentation, just return the original value
         .          .     34:   }
     520ms     14.06s     35:   return et.Sequence.ScaleInt64(et.SegmentIndex, value)
         .          .     36:}
(pprof)

1.17.0 Output
$ go tool pprof 1.17.0.pprof
File: regression.test
Type: cpu
Time: Aug 30, 2021 at 8:31pm (EEST)
Duration: 17.85s, Total samples = 17.76s (99.52%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 17.63s, 99.27% of 17.76s total
Dropped 44 nodes (cum <= 0.09s)
      flat  flat%   sum%        cum   cum%
    17.31s 97.47% 97.47%     17.32s 97.52%  regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 (inline)
     0.23s  1.30% 98.76%     17.64s 99.32%  regression.BenchmarkExecutionSegmentScale.func1
     0.09s  0.51% 99.27%     17.41s 98.03%  regression.(*ExecutionTuple).ScaleInt64 (inline)
         0     0% 99.27%     17.65s 99.38%  testing.(*B).launch
         0     0% 99.27%     17.65s 99.38%  testing.(*B).runN
(pprof) list ScaleInt64
Total: 17.76s
ROUTINE ======================== regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 in /home/mstoykov/work/k6io/reproduceScaleRegression/execution_segment.go
    17.31s     17.32s (flat, cum) 97.52% of Total
         .          .      8:   ExecutionSegmentSequence []int64 // not really
         .          .      9:}
         .          .     10:
         .          .     11:// ScaleInt64 ...
         .          .     12:func (essw *ExecutionSegmentSequenceWrapper) ScaleInt64(segmentIndex int, value int64) int64 {
     100ms      100ms     13:   start := essw.offsets[segmentIndex][0]
     120ms      120ms     14:   offsets := essw.offsets[segmentIndex][1:]
      20ms       20ms     15:   result := (value / essw.lcd) * int64(len(offsets))
    17.07s     17.08s     16:   for gi, i := 0, start; i < value%essw.lcd; gi, i = gi+1, i+offsets[gi] {
         .          .     17:           result++
         .          .     18:   }
         .          .     19:   return result
         .          .     20:}
         .          .     21:
ROUTINE ======================== regression.(*ExecutionTuple).ScaleInt64 in /home/mstoykov/work/k6io/reproduceScaleRegression/execution_segment.go
      90ms     17.41s (flat, cum) 98.03% of Total
         .          .     27:
         .          .     28://
         .          .     29:
         .          .     30:// ScaleInt64 scales the provided value for our execution segment.
         .          .     31:func (et *ExecutionTuple) ScaleInt64(value int64) int64 {
      90ms       90ms     32:   if len(et.Sequence.ExecutionSegmentSequence) == 1 {
         .          .     33:           return value // if we don't have any segmentation, just return the original value
         .          .     34:   }
         .     17.32s     35:   return et.Sequence.ScaleInt64(et.SegmentIndex, value)
         .          .     36:}
(pprof)

It's kind of strange how things moved around: now that it's inlined, the for loop is taking a lot more time 🤷

edit: uploaded the pprof archive pprofs.zip


Contributor

@randall77 randall77 commented Aug 30, 2021

It would help to compare benchmarks using -benchtime so they both do the exact same number of iterations (not sure if you did that or not).
It would also help to look at the assembly output from pprof. The source-level pprof is too coarse for this.

The one thing that stands out to me from the pprof output is that in the 1.16 case, a lot of time went to line 15, probably the value / essw.lcd computation. In the 1.17 case, that line has almost no samples. Maybe that divide got pushed into the following loop somehow? The compiler will try to combine value/essw.lcd and value%essw.lcd into a single op, as the amd64 assembly instruction computes both. Maybe something changed about that?


@MStoykov
Copy link
Author

@MStoykov MStoykov commented Aug 31, 2021

Sorry @randall77, I didn't know I could see the annotated assembly.
I did:
go test -bench BenchmarkExecutionSegmentScale/67280421310721$ -benchtime 100000000x -count 10 -o t16 -cpuprofile 1.16.7.bench

And then copied the output of go tool pprof -disasm Scale t16 1.16.7.bench (and likewise for 1.17)

1.16.7 Output
Total: 10.57s
ROUTINE ======================== regression.(*ExecutionTuple).ScaleInt64
    10.17s     11.13s (flat, cum) 105.30% of Total
      40ms       40ms     50a4c0: SUBQ $0x18, SP                          ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:31
     270ms      270ms     50a4c4: MOVQ BP, 0x10(SP)
      10ms       10ms     50a4c9: LEAQ 0x10(SP), BP
      10ms       10ms     50a4ce: MOVQ 0x20(SP), DX                       ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:32
      30ms       30ms     50a4d3: MOVQ 0(DX), BX
      20ms       20ms     50a4d6: CMPQ $0x1, 0x28(BX)
      90ms       90ms     50a4db: NOPL 0(AX)(AX*1)
         .          .     50a4e0: JE 0x50a589                             ;execution_segment.go:32
     260ms      1.22s     50a4e6: MOVQ 0x8(DX), AX                        ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:35
      30ms       30ms     50a4ea: MOVQ 0x8(BX), DX                        ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:13
         .          .     50a4ee: MOVQ 0x10(BX), CX                       ;execution_segment.go:13
         .          .     50a4f2: CMPQ CX, AX
      10ms       10ms     50a4f5: JAE 0x50a5b1                            ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:13
         .          .     50a4fb: LEAQ 0(AX)(AX*2), SI                    ;execution_segment.go:13
         .          .     50a4ff: MOVQ 0x8(DX)(SI*8), CX
     200ms      200ms     50a504: MOVQ 0(DX)(SI*8), DI                    ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:13
      20ms       20ms     50a508: MOVQ 0x10(DX)(SI*8), DX                 ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:13
     270ms      270ms     50a50d: TESTQ CX, CX                            ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:13
         .          .     50a510: JBE 0x50a5aa                            ;execution_segment.go:13
         .          .     50a516: MOVQ 0(DI), SI
     170ms      170ms     50a519: DECQ DX                                 ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:14
      10ms       10ms     50a51c: NEGQ DX                                 ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:14
         .          .     50a51f: SARQ $0x3f, DX                          ;execution_segment.go:14
      10ms       10ms     50a523: ANDQ $0x8, DX                           ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:14
      10ms       10ms     50a527: ADDQ DI, DX
      20ms       20ms     50a52a: MOVQ 0(BX), BX                          ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:15
     130ms      130ms     50a52d: TESTQ BX, BX                            ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:15
         .          .     50a530: JE 0x50a5a5                             ;execution_segment.go:15
         .          .     50a532: MOVQ 0x28(SP), AX
      10ms       10ms     50a537: MOVQ DX, DI                             ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:14
         .          .     50a53a: NOPW 0(AX)(AX*1)                        ;execution_segment.go:14
      10ms       10ms     50a540: CMPQ $-0x1, BX                          ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:15
         .          .     50a544: JE 0x50a54d                             ;execution_segment.go:15
         .          .     50a546: CQO
         .          .     50a548: IDIVQ BX
     6.16s      6.16s     50a54b: JMP 0x50a552                            ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:15
         .          .     50a54d: NEGQ AX                                 ;execution_segment.go:15
         .          .     50a550: XORL DX, DX
         .          .     50a552: DECQ CX                                 ;execution_segment.go:14
      10ms       10ms     50a555: IMULQ CX, AX                            ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:15
     230ms      230ms     50a559: XORL BX, BX                             ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:15
         .          .     50a55b: JMP 0x50a56e                            ;execution_segment.go:16
         .          .     50a55d: LEAQ 0x1(BX), R8
      50ms       50ms     50a561: MOVQ 0(DI)(BX*8), R9                    ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:16
      20ms       20ms     50a565: ADDQ R9, SI
         .          .     50a568: INCQ AX                                 ;execution_segment.go:17
     720ms      720ms     50a56b: MOVQ R8, BX                             ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:16
     270ms      270ms     50a56e: CMPQ DX, SI                             ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:16
     730ms      730ms     50a571: JGE 0x50a57a
      10ms       10ms     50a573: CMPQ BX, CX                             ;regression.(*ExecutionSegmentSequenceWrapper).ScaleInt64 execution_segment.go:16
      20ms       20ms     50a576: JA 0x50a55d
         .          .     50a578: JMP 0x50a59d                            ;execution_segment.go:16
         .          .     50a57a: MOVQ AX, 0x30(SP)                       ;execution_segment.go:35
     310ms      310ms     50a57f: MOVQ 0x10(SP), BP                       ;regression.(*ExecutionTuple).ScaleInt64 execution_segment.go:35
      10ms       10ms     50a584: ADDQ $0x18, SP
         .          .     50a588: RET                                     ;execution_segment.go:35
         .          .     50a589: MOVQ 0x28(SP), AX                       ;execution_segment.go:33
         .          .     50a58e: MOVQ AX, 0x30(SP)
         .          .     50a593: MOVQ 0x10(SP), BP
         .          .     50a598: ADDQ $0x18, SP
         .          .     50a59c: RET
         .          .     50a59d: MOVQ BX, AX                             ;execution_segment.go:16
         .          .     50a5a0: CALL runtime.panicIndex(SB)
         .          .     50a5a5: CALL runtime.panicdivide(SB)            ;execution_segment.go:15
         .          .     50a5aa: XORL AX, AX                             ;execution_segment.go:13
         .          .     50a5ac: CALL runtime.panicIndex(SB)
         .          .     50a5b1: CALL runtime.panicIndex(SB)
ROUTINE ======================== regression.BenchmarkExecutionSegmentScale.func1
     380ms     10.55s (flat, cum) 99.81% of Total
         .          .     50aac0: MOVQ FS:0xfffffff8, CX                  ;execution_segment_test.go:19
         .          .     50aac9: CMPQ 0x10(CX), SP
         .          .     50aacd: JBE 0x50ab33
         .          .     50aacf: SUBQ $0x38, SP
         .          .     50aad3: MOVQ BP, 0x30(SP)
         .          .     50aad8: LEAQ 0x30(SP), BP
         .          .     50aadd: MOVQ 0x10(DX), AX
         .          .     50aae1: MOVQ AX, 0x18(SP)
         .          .     50aae6: MOVQ 0x8(DX), CX
         .          .     50aaea: MOVQ CX, 0x28(SP)
         .          .     50aaef: XORL DX, DX                             ;execution_segment_test.go:20
         .          .     50aaf1: JMP 0x50ab19
         .          .     50aaf3: MOVQ DX, 0x20(SP)
         .          .     50aaf8: MOVQ CX, 0(SP)                          ;execution_segment_test.go:21
         .          .     50aafc: MOVQ AX, 0x8(SP)
                                                                          ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:21
      20ms     10.19s     50ab01: CALL regression.(*ExecutionTuple).ScaleInt64(SB)
         .          .     50ab06: MOVQ 0x20(SP), AX                       ;execution_segment_test.go:20
      10ms       10ms     50ab0b: LEAQ 0x1(AX), DX                        ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:20
      40ms       40ms     50ab0f: MOVQ 0x18(SP), AX                       ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:21
      10ms       10ms     50ab14: MOVQ 0x28(SP), CX
      20ms       20ms     50ab19: MOVQ 0x40(SP), BX                       ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:20
     220ms      220ms     50ab1e: NOPW
         .          .     50ab20: CMPQ DX, 0x188(BX)                      ;execution_segment_test.go:20
      60ms       60ms     50ab27: JG 0x50aaf3                             ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:20
         .          .     50ab29: MOVQ 0x30(SP), BP                       ;execution_segment_test.go:20
         .          .     50ab2e: ADDQ $0x38, SP
         .          .     50ab32: RET
         .          .     50ab33: CALL runtime.morestack(SB)              ;execution_segment_test.go:19
         .          .     50ab38: ?
1.17.0 Output
Total: 57.63s
ROUTINE ======================== regression.BenchmarkExecutionSegmentScale.func1
    57.53s     57.58s (flat, cum) 99.91% of Total
         .          .     4e3b20: SUBQ $0x18, SP                          ;execution_segment_test.go:19
         .          .     4e3b24: MOVQ BP, 0x10(SP)
         .          .     4e3b29: LEAQ 0x10(SP), BP
         .          .     4e3b2e: MOVQ 0x10(DX), BX
         .          .     4e3b32: MOVQ 0x8(DX), DX
         .          .     4e3b36: XORL CX, CX
         .          .     4e3b38: JMP 0x4e3b40                            ;execution_segment_test.go:20
      70ms       70ms     4e3b3a: INCQ CX                                 ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:20
         .          .     4e3b3d: NOPL 0(AX)                              ;execution_segment_test.go:20
         .          .     4e3b40: CMPQ CX, 0x190(AX)
         .          .     4e3b47: JLE 0x4e3bb5
         .          .     4e3b49: MOVQ 0(DX), SI                          ;execution_segment.go:32
         .          .     4e3b4c: CMPQ $0x1, 0x28(SI)
     410ms      410ms     4e3b51: JE 0x4e3b3a                             ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:32
         .          .     4e3b53: MOVQ 0x8(DX), DI                        ;execution_segment.go:35
         .          .     4e3b57: MOVQ 0x8(SI), R8                        ;execution_segment.go:13
         .          .     4e3b5b: MOVQ 0x10(SI), R9
      90ms       90ms     4e3b5f: NOPL                                    ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:13
         .          .     4e3b60: CMPQ DI, R9                             ;execution_segment.go:13
         .          .     4e3b63: JBE 0x4e3c16
         .          .     4e3b69: LEAQ 0(DI)(DI*2), DI
         .          .     4e3b6d: MOVQ 0x8(R8)(DI*8), R9
     230ms      230ms     4e3b72: MOVQ 0(R8)(DI*8), R10                   ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:13
      80ms       80ms     4e3b76: MOVQ 0x10(R8)(DI*8), DI
         .          .     4e3b7b: NOPL 0(AX)(AX*1)                        ;execution_segment.go:13
         .          .     4e3b80: TESTQ R9, R9
         .          .     4e3b83: JBE 0x4e3c0c
         .          .     4e3b89: MOVQ 0(R10), R8
         .          .     4e3b8c: DECQ DI                                 ;execution_segment.go:14
      40ms       40ms     4e3b8f: NEGQ DI                                 ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:14
         .          .     4e3b92: SARQ $0x3f, DI                          ;execution_segment.go:14
     270ms      270ms     4e3b96: ANDQ $0x8, DI                           ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:14
         .          .     4e3b9a: ADDQ R10, DI                            ;execution_segment.go:14
      80ms       80ms     4e3b9d: MOVQ 0(SI), SI                          ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:15
         .          .     4e3ba0: TESTQ SI, SI                            ;execution_segment.go:15
         .          .     4e3ba3: JE 0x4e3c07
         .          .     4e3ba5: DECQ R9                                 ;execution_segment.go:14
         .          .     4e3ba8: XORL R10, R10
         .          .     4e3bab: JMP 0x4e3bd3                            ;execution_segment.go:16
         .          .     4e3bad: MOVQ R11, AX                            ;execution_segment_test.go:20
         .          .     4e3bb0: MOVQ R12, DX                            ;execution_segment.go:32
         .          .     4e3bb3: JMP 0x4e3b3a                            ;execution_segment_test.go:21
         .          .     4e3bb5: MOVQ 0x10(SP), BP                       ;execution_segment_test.go:23
         .          .     4e3bba: ADDQ $0x18, SP
         .          .     4e3bbe: RET
         .          .     4e3bbf: LEAQ 0x1(R10), R13                      ;execution_segment.go:16
         .          .     4e3bc3: MOVQ 0(DI)(R10*8), R15
      10ms       10ms     4e3bc7: ADDQ R15, R8                            ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
         .          .     4e3bca: MOVQ R11, AX                            ;execution_segment_test.go:20
      30ms       30ms     4e3bcd: MOVQ R12, DX                            ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:32
      90ms       90ms     4e3bd0: MOVQ R13, R10                           ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
     110ms      110ms     4e3bd3: MOVQ AX, R11                            ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:19
     1.64s      1.64s     4e3bd6: MOVQ BX, AX                             ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
      20ms       20ms     4e3bd9: MOVQ DX, R12                            ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment_test.go:19
     400ms      400ms     4e3bdc: NOPL 0(AX)
     120ms      120ms     4e3be0: CMPQ $-0x1, SI                          ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
      70ms       70ms     4e3be4: JE 0x4e3bed
         .          .     4e3be6: CQO                                     ;execution_segment.go:16
         .       40ms     4e3be8: IDIVQ SI                                ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
    50.22s     50.22s     4e3beb: JMP 0x4e3bf2
         .          .     4e3bed: NEGQ AX                                 ;execution_segment.go:16
         .          .     4e3bf0: XORL DX, DX
      90ms      100ms     4e3bf2: CMPQ R8, DX                             ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
     3.37s      3.37s     4e3bf5: JLE 0x4e3bad
         .          .     4e3bf7: CMPQ R9, R10                            ;execution_segment.go:16
      90ms       90ms     4e3bfa: JB 0x4e3bbf                             ;regression.BenchmarkExecutionSegmentScale.func1 execution_segment.go:16
         .          .     4e3bfc: MOVQ R10, AX                            ;execution_segment.go:16
         .          .     4e3bff: MOVQ R9, CX
         .          .     4e3c02: CALL runtime.panicIndex(SB)
         .          .     4e3c07: CALL runtime.panicdivide(SB)            ;execution_segment.go:15
         .          .     4e3c0c: XORL AX, AX                             ;execution_segment.go:13
         .          .     4e3c0e: MOVQ R9, CX
         .          .     4e3c11: CALL runtime.panicIndex(SB)
         .          .     4e3c16: MOVQ DI, AX
         .          .     4e3c19: MOVQ R9, CX
         .          .     4e3c1c: NOPL 0(AX)
         .          .     4e3c20: CALL runtime.panicIndex(SB)


Contributor

@randall77 randall77 commented Aug 31, 2021

It looks to me like in 1.16 the divide is outside the loop, but in 1.17 the divide is inside the loop.
Not sure why yet.


Contributor

@randall77 randall77 commented Sep 1, 2021

I think I understand why this is happening.

In 1.16, the expressions value / essw.lcd and value % essw.lcd are both implemented by amd64's DIVQ. The former is outside of the inner loop, and the latter is inside. Those two DIVQs get CSEd, and the first one is chosen as the representative (it dominates the inner one), which means the divide happens outside the loop.

In 1.17, the expression value / essw.lcd is never used! It is assigned to result and incremented a few times, but the resulting value is just returned up two levels, and then thrown away (by BenchmarkExecutionSegmentScaleEmpty). So when the compiler inlines both of those levels, the compiler realizes it doesn't need that divide at all. Thus the DIVQ implementing the mod never gets CSEd with anything, and as a result stays in the loop.

You can "fix" this by using the result of et.ScaleInt64(value) in BenchmarkExecutionSegmentScaleEmpty in some nontrivial way. (Which is a good idea generally, to avoid the compiler completely optimizing the benchmark away.)

So I think this issue is just a dup of #15808.


@randall77 randall77 closed this Sep 1, 2021