Go version
go1.26.1
Output of `go env` in your module/workspace:

```
GOEXPERIMENT=simd
GOARCH=amd64
```
What did you do?
I'm working on a regex engine (coregex) that uses a Teddy multi-pattern SIMD prefilter. I rewrote the core loop from hand-written Go assembly to `archsimd` intrinsics, hoping to get equivalent performance while also unlocking AVX2 (our asm AVX2 path was unusable due to VZEROUPPER overhead at the Go/asm boundary, see #77647).
The hot loop is a standard SIMD nibble-matching pattern: load 32 bytes, extract low/high nibbles, VPSHUFB lookup, AND, check for non-zero. This loop runs ~187K iterations on a 6MB input.
Reproducer showing the two issues:
```go
//go:build goexperiment.simd && amd64

package main

import (
	"simd/archsimd"
	"unsafe"
)

//go:noinline
func teddyLoop(data []byte, loTable, hiTable *[32]int8) int {
	nibbleMask := archsimd.BroadcastInt8x32(0x0F)
	zero := archsimd.Int8x32{}
	lo := archsimd.LoadInt8x32(loTable)
	hi := archsimd.LoadInt8x32(hiTable)
	count := 0
	for i := 0; i+32 <= len(data); i += 32 {
		// Load 32 bytes and split each byte into its low and high nibble.
		chunk := archsimd.LoadInt8x32((*[32]int8)(unsafe.Pointer(&data[i])))
		loNibbles := chunk.And(nibbleMask)
		hiNibbles := chunk.AsUint16x16().ShiftAllRight(4).AsInt8x32().And(nibbleMask) // issue 1: VPSRLW
		// Nibble lookups (VPSHUFB) against the two pattern tables, then AND.
		loResult := lo.PermuteOrZeroGrouped(loNibbles)
		hiResult := hi.PermuteOrZeroGrouped(hiNibbles)
		combined := loResult.And(hiResult)
		// Any non-zero byte in combined marks a candidate position.
		if combined.NotEqual(zero).ToBits() != 0 { // issue 2: NOT costs 2 extra YMM ops
			count++
		}
	}
	return count
}

func main() {
	data := make([]byte, 1024)
	var lo, hi [32]int8
	_ = teddyLoop(data, &lo, &hi)
}
```
What did you see happen?
Two codegen issues in the hot loop (`go build -gcflags='-S'`):
1) `ShiftAllRight(4)` uses register-form VPSRLW instead of immediate

The constant 4 is reloaded into a GP register and moved to XMM every iteration:

```asm
; inside the loop body:
MOVL   $4, CX       ; reload constant
MOVQ   CX, X6       ; GP → XMM
VPSRLW X6, Y4, Y4   ; register-form shift
```
I dug into the SSA rules and found the root cause. The `VPSRLW256const` opcode exists in `simdAMD64ops.go` (line 1448), but the constant-folding rewrite rule is missing. Left shifts have it, right shifts don't:

```
; simdAMD64.rules — left shift constant folding EXISTS:
(VPSLLW256 x (MOVQconst [c])) => (VPSLLW256const [uint8(c)] x)

; right shift constant folding is MISSING:
; (VPSRLW128 x (MOVQconst [c])) => (VPSRLW128const [uint8(c)] x) — not there
; (VPSRLW256 x (MOVQconst [c])) => (VPSRLW256const [uint8(c)] x) — not there
; (VPSRLW512 x (MOVQconst [c])) => (VPSRLW512const [uint8(c)] x) — not there
```

The same gap exists for the non-masked `VPSRLD` and `VPSRLQ` variants: the const opcodes are defined, but the rewrite rules to use them are absent. The masked variants (`VPSRLWMasked*`) have load-merging rules for the const form but also lack the basic `MOVQconst` folding rule.
Expected: `VPSRLW $4, Y4, Y4`, a single instruction.
2) `NotEqual(zero).ToBits()` generates 3 YMM ops instead of 1 scalar op

```asm
VPCMPEQB  Y4, Y4, Y5   ; all-ones via self-compare
VPCMPEQB  Y4, Y1, Y4   ; compare to zero
VPXOR     Y5, Y4, Y4   ; NOT
VPMOVMSKB Y4, DI       ; extract
```

This could be:

```asm
VPCMPEQB  Y1, Y4, Y4   ; compare to zero
VPMOVMSKB Y4, DI       ; extract (bit=1 where zero, bit=0 where candidate)
NOTL      DI           ; scalar NOT — or just branch on the inverted condition
```

The self-compare + VPXOR is 2 wasted YMM operations per iteration to do what a single NOTL (or an inverted branch) could do.
What did you expect to see?
For issue 1: the immediate form `VPSRLW $4, Y-reg, Y-reg`. The fix should be straightforward: add the missing rules to `simdAMD64.rules`, matching the pattern already used for `VPSLLW`/`VPSRAW`/`VPSRAD`/etc.
For issue 2: strength-reduce the NOT from 2 YMM ops to 1 scalar op after VPMOVMSKB. Or recognize that `NotEqual(zero).ToBits() != 0` is equivalent to `Equal(zero).ToBits() != 0xFFFFFFFF` and skip the inversion entirely (see the sketch below).
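The same equivalence also suggests a user-side workaround for issue 2 in the meantime. A minimal sketch, placed in the same file as the reproducer; `anyCandidate` is a hypothetical helper name, and it assumes `Equal`/`ToBits` behave as described above (one mask bit per byte lane, so the 32-bit mask is all-ones only when every byte is zero):

```go
// anyCandidate reports whether any byte of combined is non-zero.
// Hypothetical alternative to combined.NotEqual(zero).ToBits() != 0:
// Equal(zero) sets a mask bit for every zero byte, so a mask that is
// not all-ones means at least one candidate byte is present, with no
// vector-side inversion required.
func anyCandidate(combined, zero archsimd.Int8x32) bool {
	return combined.Equal(zero).ToBits() != 0xFFFFFFFF
}
```

I haven't verified whether the current compiler produces better code for this form; it is included only to show the equivalence.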
Performance impact
Benchmarked on AMD EPYC 7763 (our regex-bench CI, 6MB input, 8-pattern Teddy search):
| Implementation | Time | Notes |
| --- | --- | --- |
| Hand-written Go asm (SSSE3, 16B/iter) | 6.05 ms | baseline run |
| archsimd intrinsics (AVX2, 32B/iter) | 34.57 ms | archsimd run, 5.7x regression |
| Rust regex (AVX2, inlined) | 0.70 ms | same run, reference |
The DNA benchmark (9 Teddy patterns, same 6MB input) shows similar results for baseline vs archsimd: all patterns are 34-71x slower than Rust.
archsimd AVX2 should be roughly 2x faster than the SSSE3 asm (double the vector width), i.e. around 3 ms here, but the extra instructions per iteration and the larger loop body appear to cause significant overhead on this microarchitecture.
Issue 1 alone adds 2 wasted instructions per iteration (the constant reload); issue 2 adds 2 more. Over ~187K iterations that is roughly 750K unnecessary instructions.
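For reference, a minimal sketch of how we drive the hot loop when benchmarking; the input size and zeroed lookup tables are placeholders for the real Teddy pattern tables, and it assumes the `teddyLoop` reproducer above is in the same package:

```go
//go:build goexperiment.simd && amd64

package main

import "testing"

// Illustrative benchmark only: the empty tables stand in for the real
// Teddy pattern tables, and 6,000,000 bytes gives the ~187K 32-byte
// iterations mentioned above.
func BenchmarkTeddyLoop(b *testing.B) {
	data := make([]byte, 6_000_000)
	var lo, hi [32]int8
	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = teddyLoop(data, &lo, &hi)
	}
}
```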
Related
`LoadUint8x32Slice` emitting unnecessary nops (similar codegen quality area)