At commit 266b0cf from earlier today (but also with some older toolchains, not claiming the behavior is new), suppose you:
GOOS=linux GOARCH=amd64 go test -c math/big
go tool objdump -s big.nat.scan big.test
I've attached big.zip, which contains the test executable (big.test.old) and the objdump output, lightly annotated (old.asm).
Now consider this line-number-preserving diff, which introduces a new variable (bZ) that must be saved on the stack across a function call.
diff --git a/src/math/big/natconv.go b/src/math/big/natconv.go
index ce94f2cf72..a92343befe 100644
--- a/src/math/big/natconv.go
+++ b/src/math/big/natconv.go
@@ -121,9 +121,9 @@ func (z nat) scan(r io.ByteScanner, base int, fracOk bool) (res nat, b, count in
 	prev := '.'
 	invalSep := false
-	// one char look-ahead
-	ch, err := r.ReadByte()
-
+	bZ := base + 1
+	ch, err := r.ReadByte()
+	base = bZ - 1
 	// determine actual base
 	b, prefix := base, 0
 	if base == 0 {
The big.zip attachment also contains the resulting test executable (big.test.new) and the objdump output (new.asm).
On x86-64, this diff makes the 'Scan/1000/Base10' benchmark run 14% faster! Quite an optimization!
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/1000/Base10-32 4.694µ ± 0% 4.032µ ± 0% -14.11% (p=0.000 n=20)
How can this be? It turns out that the compiler is making poor choices when spilling to the stack, and for whatever reason, that extra temporary causes different, better choices. The nat.scan method has a loop that looks like:
for err == nil {
if ch == '.' && fracOk {
...
} else if ch == '_' && base == 0 {
...
} else {
...
}
ch, err = r.ReadByte()
}
That last ReadByte call is a basic block of its own, with three possible predecessor blocks. There is a slice variable z in the code that is live across that call.
In the slow version of the code (without bZ), the compiler chooses to store len(z) in 0xb0(SP) and cap(z) in 0xb8(SP) pretty consistently throughout the function, except in this one basic block, where it swaps their locations. On entry to this basic block, len(z) is in R8, and cap(z) is in SI. The assembly, lightly annotated, is:
> natconv.go:243 0x5fc72f 4c898424b8000000 MOVQ R8, 0xb8(SP) // len(z) (NOT cap!)
natconv.go:243 0x5fc737 4c89ac24d0000000 MOVQ R13, 0xd0(SP)
> natconv.go:243 0x5fc73f 4889b424b0000000 MOVQ SI, 0xb0(SP) // cap(z) (NOT len!)
natconv.go:243 0x5fc747 4c897c2470 MOVQ R15, 0x70(SP)
natconv.go:242 0x5fc74c 4889542460 MOVQ DX, 0x60(SP)
natconv.go:231 0x5fc751 4889bc24a0000000 MOVQ DI, 0xa0(SP)
natconv.go:171 0x5fc759 4488642443 MOVB R12, 0x43(SP)
natconv.go:219 0x5fc75e 498b4b18 MOVQ 0x18(R11), CX
natconv.go:219 0x5fc762 ffd1 CALL CX
natconv.go:170 0x5fc764 488b9424c0000000 MOVQ 0xc0(SP), DX
natconv.go:170 0x5fc76c 488bb42410010000 MOVQ 0x110(SP), SI // could be just R11
natconv.go:170 0x5fc774 488b7c2458 MOVQ 0x58(SP), DI // could be just R10
natconv.go:170 0x5fc779 4c8b842420010000 MOVQ 0x120(SP), R8 // could be just R9
natconv.go:170 0x5fc781 4c8b8c2418010000 MOVQ 0x118(SP), R9 // dead
natconv.go:170 0x5fc789 4c8b542468 MOVQ 0x68(SP), R10 // dead
natconv.go:170 0x5fc78e 4c8b5c2470 MOVQ 0x70(SP), R11 // dead
natconv.go:170 0x5fc793 4c8ba424b8000000 MOVQ 0xb8(SP), R12 // len(z) (NOT cap!); dead
natconv.go:170 0x5fc79b 4c8bac24d0000000 MOVQ 0xd0(SP), R13
natconv.go:170 0x5fc7a3 4c8bbc24b0000000 MOVQ 0xb0(SP), R15 // cap(z) (NOT len!); dead
natconv.go:110 0x5fc7ab 4d89c1 MOVQ R8, R9
natconv.go:212 0x5fc7ae 4989fa MOVQ DI, R10
natconv.go:219 0x5fc7b1 4989f3 MOVQ SI, R11
natconv.go:233 0x5fc7b4 4c8b642448 MOVQ 0x48(SP), R12
natconv.go:213 0x5fc7b9 4c8b7c2450 MOVQ 0x50(SP), R15
natconv.go:171 0x5fc7be 4189c0 MOVL AX, R8
natconv.go:227 0x5fc7c1 8b742444 MOVL 0x44(SP), SI
natconv.go:231 0x5fc7c5 488bbc24a0000000 MOVQ 0xa0(SP), DI
natconv.go:219 0x5fc7cd 488b842418010000 MOVQ 0x118(SP), AX // dead
> natconv.go:243 0x5fc7d5 488b8424b8000000 MOVQ 0xb8(SP), AX // len(z) (NOT cap!)
> natconv.go:171 0x5fc7dd 4488842480000000 MOVB R8, 0x80(SP) // temporary
> natconv.go:243 0x5fc7e5 4c8b8424b0000000 MOVQ 0xb0(SP), R8 // cap(z) (NOT len!)
> natconv.go:243 0x5fc7ed 4c898424b8000000 MOVQ R8, 0xb8(SP) // cap(z)
> natconv.go:243 0x5fc7f5 48898424b0000000 MOVQ AX, 0xb0(SP) // len(z)
> natconv.go:219 0x5fc7fd 488b842418010000 MOVQ 0x118(SP), AX
> natconv.go:171 0x5fc805 440fb6842480000000 MOVZX 0x80(SP), R8
The important part is the lines marked with a leading >. At the start of the block, the compiler stores len(z) and cap(z) in the slots usually used for cap(z) and len(z). At the end of the block, the compiler must now correct the stores to match where other parts of the function will expect to load from. The marked sequence is a bit confusing but it does:
- Reload len(z) from 0xb8(SP) into AX.
- Spill a byte from R8, which holds the ch result of the call, to 0x80(SP).
- Reload cap(z) from 0xb0(SP) into R8, and then store it into 0xb8(SP).
- Store len(z) from AX into 0xb0(SP).
- Reload AX with the value it held before the marked sequence (0x118(SP)).
- Reload R8 from its spilled stack location, zero-extending the byte into a full word.
If the marked stores at the top of the basic block wrote to 0xb0(SP) and 0xb8(SP) instead of 0xb8(SP) and 0xb0(SP), then all the marked lines at the bottom could be deleted entirely. In the bZ version of the code, something about the extra temporary leads the compiler to make better choices and do exactly that.
We can also make that change directly. The big.zip attachment contains fixbig.go, which reads big.test.old, makes exactly those edits (swap the two store locations and delete the marked lines at the end), and writes big.test.old.fix. Sure enough, big.test.old.fix runs about as fast as big.test.new does.
Aside 1. Note that the line numbers are bogus. The ReadByte call is line 219. The rest is spilling in preparation for the call and reloading in preparation for falling through to the next basic block. I would argue that the initial spills should be given the first line number from the basic block's actual code, and the cleanup should be given the last line number from the actual code, so that every instruction in this basic block should say line 219. It looks like maybe the spill/load code is assigned the line number of the final mention of that variable in the function. As it stands now, the resulting scattershot line numbering makes source-level profiling almost worthless.
Aside 2. Note the number of loads in the disassembly that I've marked // dead. The registers these instructions load are overwritten later in the basic block without being used. It seems like a basic peephole optimizer could remove them, although better would be not to generate them in the first place. There are also a few marked // could be just that load into a register whose only use is to move into a final destination register later. These suboptimal load sequences persist in the bZ version that is 14% faster. I wrote a version of fixbig.go that removes the lines I marked // dead (except the final one before the > section at the bottom, which stops being dead when that section is deleted), and it runs 17% faster than the base:
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ old │ fix │
│ sec/op │ sec/op vs base │
Scan/1000/Base10-32 4.694µ ± 0% 3.859µ ± 1% -17.80% (p=0.000 n=20)
This is not the tightest loop in the world (it contains an indirect call!) and yet these basic cleanups give a very significant speedup. I wonder whether this is an isolated case or this happens in basically all code we generate.