What version of Go are you using (go version)?
$ go version
go1.21-dev +fe5af1532a
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
What did you do?
We have a code generator that generates a struct with setters. To track whether a setter has been called for a given field, we set the corresponding bit in a bitmap. The code looks like this:
	func setBit(part *uint32, num uint32) {
		*part |= 1 << (num % 32)
	}

	type x struct {
		bitmap [4]uint32 // A bitmap containing whether "Set" was called on a given field.
		u      int32     // Imagine this is field number 8.
		v      int32     // Imagine this is field number 38.
	}

	func (m *x) SetV(val int32) {
		m.v = val
		setBit(&(m.bitmap[1]), 37)
	}

	func (m *x) SetU(val int32) {
		m.u = val
		setBit(&(m.bitmap[0]), 7)
	}
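For anyone who wants to compare the two setters locally, here is a minimal sketch of a benchmark harness. It is not the benchmark used for the numbers in this report; the loop shape and the use of `testing.Benchmark` from a `main` package are assumptions made for illustration, and results will depend on the CPU and on how the setters are inlined into the loop.

```go
package main

import (
	"fmt"
	"testing"
)

func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

type x struct {
	bitmap [4]uint32 // One bit per field, set when the field's setter is called.
	u      int32     // Field number 8 (bit 7 of bitmap[0]).
	v      int32     // Field number 38 (bit 5 of bitmap[1]).
}

func (m *x) SetV(val int32) {
	m.v = val
	setBit(&(m.bitmap[1]), 37)
}

func (m *x) SetU(val int32) {
	m.u = val
	setBit(&(m.bitmap[0]), 7)
}

func main() {
	var m x

	// Hypothetical benchmark loops: call each setter b.N times.
	resU := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			m.SetU(int32(i))
		}
	})
	resV := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			m.SetV(int32(i))
		}
	})

	fmt.Println("SetU:", resU)
	fmt.Println("SetV:", resV)
}
```

Building with `go build -gcflags=-S` prints the generated assembly, which makes it possible to check which instruction sequence each setter ends up with inside the loop.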
What did you expect to see?
I expected similar instructions (with different operands) being generated for both setters.
What did you see instead?
SetU is ~30% slower than SetV, as measured in local benchmarks (on a zen4 machine). The relevant difference is (godbolt):
	TEXT main.(*x).SetV(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
		MOVL BX, 20(AX)
		NOP
		ORL $32, 4(AX)
		RET

	TEXT main.(*x).SetU(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
		MOVL BX, 16(AX)
		MOVL (AX), CX
		BTSL $7, CX
		NOP
		MOVL CX, (AX)
		RET
It seems like OR into memory does better than MOV/BTS/MOV.
According to https://www.uops.info/table.html, on both skylake-x and zen4 the OR forms are never worse than the corresponding BTS forms in the figures listed, and are slightly better on several of them:
| Instruction | Lat (skylake-x) | TP (skylake-x) | Uops (skylake-x) | Ports (skylake-x) | Lat (zen4) | TP (zen4) | Uops (zen4) | Ports (zen4) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BTS (M32, I8) | [≤3;≤10] | 1.00 / 1.00 | 3 / 4 | `1*p06+1*p23+1*p237+1*p4` | [5;12] | 2.00 | 4 | |
| OR (M32, I32) | [≤3;≤10] | 1.00 / 1.00 | 2 / 4 | `1*p0156+1*p23+1*p237+1*p4` | [≤1;≤8] | 0.56 | 2 | |
| BTS (R32, I8) | 1 | 0.50 / 0.50 | 1 / 1 | `1*p06` | [1;2] | 1.00 | 2 | |
| OR (R32, I8) | 1 | 0.25 / 0.25 | 1 / 1 | `1*p0156` | 1 | 0.25 | 1 | |
I didn't look up what those MOV instructions cost, but it's difficult to predict overall cost from individual instructions on today's complex processors. Things I didn't test (because the Go compiler doesn't generate/inline them):
- MOV/OR/MOV
- BTS memory,immediate
Some of the speedup may be due to the shorter instruction sequence, too.
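One possible reason the two setters compile differently, offered as a guess rather than something confirmed against the compiler source: the mask for bit 5 is 32, which fits in a sign-extended 8-bit immediate, while the mask for bit 7 is 128, which does not and would need a 32-bit immediate in an OR instruction. The sketch below only checks that arithmetic; `fitsInt8` is a hypothetical helper, not a compiler function.

```go
package main

import "fmt"

// fitsInt8 reports whether mask survives a round trip through a
// sign-extended 8-bit immediate, i.e. whether it lies in [-128, 127]
// when interpreted as a signed 32-bit value.
func fitsInt8(mask uint32) bool {
	return int32(mask) == int32(int8(mask))
}

func main() {
	// Bit 5 is what SetV sets (37 % 32); bit 7 is what SetU sets.
	for _, bit := range []uint32{5, 7} {
		mask := uint32(1) << (bit % 32)
		fmt.Printf("bit %d: mask %d, fits in imm8: %v\n", bit, mask, fitsInt8(mask))
	}
	// Prints:
	// bit 5: mask 32, fits in imm8: true
	// bit 7: mask 128, fits in imm8: false
}
```

If that guess is right, setters for bits 0 through 6 would get the OR-into-memory form and setters for bit 7 and above would get the MOV/BTS/MOV form, which matches the two cases shown in this report but has not been checked for other bit positions.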