
cmd/compile: archsimd VPSRLW missing constant-folding rewrite rules + suboptimal NotEqual codegen #78138

@kolkov

Go version

go1.26.1

Output of go env in your module/workspace:

GOEXPERIMENT=simd
GOARCH=amd64

What did you do?

I'm working on a regex engine (coregex) that uses a Teddy multi-pattern SIMD prefilter. I rewrote the core loop from hand-written Go assembly to archsimd intrinsics, hoping for equivalent performance while also unlocking AVX2 (our asm AVX2 path was unusable due to VZEROUPPER overhead at the Go/asm boundary, see #77647).

The hot loop is a standard SIMD nibble-matching pattern — load 32 bytes, extract low/high nibbles, VPSHUFB lookup, AND, check for non-zero. This loop runs ~187K iterations on a 6MB input.

Reproducer showing the two issues:

//go:build goexperiment.simd && amd64

package main

import (
	"simd/archsimd"
	"unsafe"
)

//go:noinline
func teddyLoop(data []byte, loTable, hiTable *[32]int8) int {
	nibbleMask := archsimd.BroadcastInt8x32(0x0F)
	zero := archsimd.Int8x32{}
	lo := archsimd.LoadInt8x32(loTable)
	hi := archsimd.LoadInt8x32(hiTable)

	count := 0
	for i := 0; i+32 <= len(data); i += 32 {
		chunk := archsimd.LoadInt8x32((*[32]int8)(unsafe.Pointer(&data[i])))
		loNibbles := chunk.And(nibbleMask)
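		// Issue 1: the constant 4 below is not folded into an immediate-form VPSRLW.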
		hiNibbles := chunk.AsUint16x16().ShiftAllRight(4).AsInt8x32().And(nibbleMask)

		loResult := lo.PermuteOrZeroGrouped(loNibbles)
		hiResult := hi.PermuteOrZeroGrouped(hiNibbles)
		combined := loResult.And(hiResult)

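		// Issue 2: NotEqual(zero).ToBits() emits a 2-op vector NOT before VPMOVMSKB.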
		if combined.NotEqual(zero).ToBits() != 0 {
			count++
		}
	}
	return count
}

func main() {
	data := make([]byte, 1024)
	var lo, hi [32]int8
	_ = teddyLoop(data, &lo, &hi)
}

What did you see happen?

Two codegen issues in the hot loop (go build -gcflags='-S'):

1) ShiftAllRight(4) uses register-form VPSRLW instead of immediate

The constant 4 is reloaded into a GP register and moved to XMM every iteration:

; inside the loop body:
MOVL    $4, CX          ; reload constant
MOVQ    CX, X6          ; GP → XMM
VPSRLW  X6, Y4, Y4      ; register-form shift

I dug into the SSA rules and found the root cause. The VPSRLW256const opcode exists in simdAMD64ops.go (line 1448), but the constant-folding rewrite rule is missing. Left shifts have it, right shifts don't:

// simdAMD64.rules — left shift constant folding EXISTS:
(VPSLLW256 x (MOVQconst [c])) => (VPSLLW256const [uint8(c)] x)

// right shift constant folding is MISSING:
// (VPSRLW128 x (MOVQconst [c])) => (VPSRLW128const [uint8(c)] x)  — not there
// (VPSRLW256 x (MOVQconst [c])) => (VPSRLW256const [uint8(c)] x)  — not there
// (VPSRLW512 x (MOVQconst [c])) => (VPSRLW512const [uint8(c)] x)  — not there

The same gap exists for the non-masked VPSRLD and VPSRLQ variants: the const opcodes are defined, but the rewrite rules to use them are absent. The masked variants (VPSRLWMasked*) have load-merging rules for the const form but also lack the basic MOVQconst folding rule. A sketch of the full set of missing rules follows.
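Assuming the VPSRLD/VPSRLQ const opcodes take the same uint8 auxint as VPSLLW256const (the *const opcode names below are my guess and should be checked against simdAMD64ops.go), the additions would look like:

(VPSRLW128 x (MOVQconst [c])) => (VPSRLW128const [uint8(c)] x)
(VPSRLW256 x (MOVQconst [c])) => (VPSRLW256const [uint8(c)] x)
(VPSRLW512 x (MOVQconst [c])) => (VPSRLW512const [uint8(c)] x)
(VPSRLD128 x (MOVQconst [c])) => (VPSRLD128const [uint8(c)] x)
(VPSRLD256 x (MOVQconst [c])) => (VPSRLD256const [uint8(c)] x)
(VPSRLD512 x (MOVQconst [c])) => (VPSRLD512const [uint8(c)] x)
(VPSRLQ128 x (MOVQconst [c])) => (VPSRLQ128const [uint8(c)] x)
(VPSRLQ256 x (MOVQconst [c])) => (VPSRLQ256const [uint8(c)] x)
(VPSRLQ512 x (MOVQconst [c])) => (VPSRLQ512const [uint8(c)] x)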

Expected: VPSRLW $4, Y4, Y4 — one instruction.

2) NotEqual(zero).ToBits() spends 3 YMM ops where 1 YMM compare plus 1 scalar NOT would do

VPCMPEQB Y4, Y4, Y5    ; all-ones via self-compare
VPCMPEQB Y4, Y1, Y4    ; compare to zero
VPXOR    Y5, Y4, Y4    ; NOT
VPMOVMSKB Y4, DI       ; extract

This could be:

VPCMPEQB Y1, Y4, Y4    ; compare to zero
VPMOVMSKB Y4, DI       ; extract (bit=1 where zero, bit=0 where candidate)
NOTL    DI              ; scalar NOT — or just branch on the inverted condition

The self-compare + VPXOR are two wasted YMM operations per iteration, doing work that a single NOTL (or an inverted branch) could do.

What did you expect to see?

For issue 1: immediate-form VPSRLW $4, Y-reg, Y-reg. The fix should be straightforward — add the missing rules to simdAMD64.rules, matching the pattern already used for VPSLLW/VPSRAW/VPSRAD/etc.

For issue 2: strength-reduce the NOT from 2 YMM ops to 1 scalar op after VPMOVMSKB. Or recognize that NotEqual(zero).ToBits() != 0 is equivalent to Equal(zero).ToBits() != 0xFFFFFFFF and skip the inversion entirely.
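
In the meantime the inversion can be avoided at the source level. A minimal sketch, assuming the reproducer's API (an Equal method on Int8x32 and an integer-valued ToBits, as used above):

		// All 32 bytes of combined equal zero iff all 32 mask bits are set,
		// so any clear bit signals a candidate match.
		if combined.Equal(zero).ToBits() != 0xFFFFFFFF {
			count++
		}

This trades the vector NOT for a compare against the all-ones mask, which should fold into the existing compare-and-branch.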

Performance impact

Benchmarked on AMD EPYC 7763 (our regex-bench CI, 6MB input, 8-pattern Teddy search):

Implementation                          Time       Notes
Hand-written Go asm (SSSE3, 16B/iter)   6.05 ms    baseline run
archsimd intrinsics (AVX2, 32B/iter)    34.57 ms   archsimd run, 5.7x regression
Rust regex (AVX2, inlined)              0.70 ms    same run, reference

A DNA benchmark (9 Teddy patterns, same 6MB input) shows a similar baseline-vs-archsimd gap, with all patterns 34-71x slower than Rust.

archsimd AVX2 should be roughly 2x faster than SSSE3 asm (double the vector width), but the extra instructions per iteration + larger loop body appear to cause significant overhead on this microarchitecture.

Issue 1 alone adds 2 wasted instructions per iteration (the constant reload). Issue 2 adds 2 more. Over 187K iterations that's ~750K unnecessary instructions.
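
For anyone reproducing the numbers, a minimal harness over the reproducer's hot loop (a sketch, placed in a _test.go file next to the reproducer; the 6MB input size matches the report, the benchmark itself is my own setup):

//go:build goexperiment.simd && amd64

package main

import "testing"

// Drives teddyLoop from the reproducer over a 6MB input,
// i.e. ~187K iterations of the 32-byte loop body per call.
func BenchmarkTeddyLoop(b *testing.B) {
	data := make([]byte, 6_000_000)
	var lo, hi [32]int8
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		_ = teddyLoop(data, &lo, &hi)
	}
}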
