[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

RKSimon · 2023-11-27T12:36:48Z

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.
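To make the "full width now, broadcast later" idea concrete, here is a minimal standalone sketch of the kind of check a late fixup step could perform: given the raw bytes of a full-width vector constant, find the narrowest repeating pattern. This is a toy model under stated assumptions, not the code in X86FixupVectorConstantsPass; the names `smallestSplatBits` and `fromDwords` are hypothetical, and a real pass would also have to check which broadcast instructions the target provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an LLVM API): returns the narrowest width in bits
// (8, 16, 32, ...) whose byte pattern, repeated, reproduces the whole constant.
// Returns the full width if the constant has no narrower repeating pattern.
std::size_t smallestSplatBits(const std::vector<std::uint8_t> &bytes) {
  const std::size_t n = bytes.size();
  assert(n != 0 && (n & (n - 1)) == 0 && "expects a power-of-two byte count");
  for (std::size_t w = 1; w < n; w *= 2) {
    // The constant has period w iff every byte equals the byte w positions back.
    bool repeats = true;
    for (std::size_t i = w; i < n && repeats; ++i)
      repeats = (bytes[i] == bytes[i - w]);
    if (repeats)
      return w * 8;
  }
  return n * 8;
}

// Hypothetical helper: lays out 32-bit elements as little-endian bytes,
// matching how a vector constant sits in the constant pool on x86.
std::vector<std::uint8_t> fromDwords(const std::vector<std::uint32_t> &elts) {
  std::vector<std::uint8_t> out;
  for (std::uint32_t e : elts)
    for (int shift = 0; shift < 32; shift += 8)
      out.push_back(static_cast<std::uint8_t>(e >> shift));
  return out;
}
```

For example, a 256-bit constant with dword elements [0,4,8,12,0,4,8,12] repeats at 128 bits, so in this model it is a candidate for a 128-bit broadcast rather than a full 256-bit load.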

This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether to keep extending these tables manually or to wait for #66360 to be done.

Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now let's keep them as they are (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is a major problem in larger vectorized hot loops. I'm currently thinking we should add an x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling but getting AVX512 out of the way appears to be an easier first step.

github-actions · 2023-11-27T12:39:18Z

✅ With the latest revision this PR passed the C/C++ code formatter.

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/avx512cfma-intrinsics.ll

RKSimon · 2023-12-07T14:23:26Z

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

llvm/test/CodeGen/X86/avx512fp16-arith.ll

goldsteinn · 2023-12-07T18:07:45Z

llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-4.ll

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
 ; AVX512F-NEXT:    vpshufb %ymm0, %ymm1, %ymm2
 ; AVX512F-NEXT:    vmovdqa 64(%rdi), %ymm3
 ; AVX512F-NEXT:    vpshufb %ymm0, %ymm3, %ymm0
-; AVX512F-NEXT:    vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
+; AVX512F-NEXT:    vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]


New broadcast?

Yes - still addressing regressions, that's why it's still a draft :)

goldsteinn · 2023-12-07T18:09:59Z

Can't memory ops microfuse on all targets? Why is this avx512vl only?

RKSimon · 2023-12-08T11:45:44Z

> Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-fold. By prematurely converting to constant broadcasts in the DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast, which then often spills the full-width broadcasted vector...). By keeping constants at full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually disable constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets, but there are still a lot of regressions to deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register. This will significantly improve the 'fixup constant' comments from #73509.

Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register. This will significantly improve the 'fixup constant' comments from #73509.

… targets

On AVX512 targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us. Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.

RKSimon requested review from phoebewang, KanRobert, goldsteinn and yubingex007-a11y November 27, 2023 12:36

phoebewang reviewed Nov 27, 2023

llvm/lib/Target/X86/X86ISelLowering.cpp (outdated)

phoebewang reviewed Nov 27, 2023

llvm/test/CodeGen/X86/avx512cfma-intrinsics.ll (outdated)

RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 Compare November 30, 2023 13:37

RKSimon added a commit that referenced this pull request Nov 30, 2023

[X86] X86InstrFoldTables.cpp - add Op4 Broadcast Fold/Unfold table en…

b8bbd5f

…tries Prep work for #73509 (missed in #73654)

RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 Compare December 7, 2023 14:22

goldsteinn reviewed Dec 7, 2023

llvm/test/CodeGen/X86/avx512fp16-arith.ll (outdated)

goldsteinn reviewed Dec 7, 2023

RKSimon force-pushed the perf/broadcast-avx512 branch from c5db884 to 4fedfe0 Compare December 8, 2023 11:17

RKSimon added a commit that referenced this pull request Dec 8, 2023

[X86] canonicalizeBitSelect - always use VPTERNLOGD for sub-32bit types

5f91335

We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary Noticed in #73509
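Why the dword form is safe for narrower types can be shown with a small standalone emulation. The sketch below models the VPTERNLOG truth-table lookup in plain C++ (no intrinsics); `ternlog`, `bitSelect` and `kBitSelectImm` are hypothetical names for illustration, not LLVM code. Because the operation is purely bitwise, the D and Q forms compute the same bits when no write mask is applied, so the element size only affects how wide an embedded broadcast of a constant operand has to be.

```cpp
#include <cassert>
#include <cstdint>

// Immediate encoding "a ? b : c" per bit: evaluate the function on the
// canonical operand patterns 0xF0 (a), 0xCC (b), 0xAA (c).
constexpr std::uint8_t kBitSelectImm =
    static_cast<std::uint8_t>((0xF0 & 0xCC) | (~0xF0 & 0xAA)); // 0xCA

// Emulates one lane of a three-input truth-table operation: for every bit
// position, the bits of (a, b, c) form a 3-bit index into imm.
template <typename T>
T ternlog(T a, T b, T c, std::uint8_t imm) {
  T result = 0;
  for (unsigned bit = 0; bit < sizeof(T) * 8; ++bit) {
    unsigned idx = static_cast<unsigned>((((a >> bit) & 1) << 2) |
                                         (((b >> bit) & 1) << 1) |
                                         ((c >> bit) & 1));
    result |= static_cast<T>(static_cast<T>((imm >> idx) & 1) << bit);
  }
  return result;
}

// Bit select built on the emulation above; the assert documents that the
// 0xCA immediate matches the usual and/andnot/or formulation.
template <typename T>
T bitSelect(T mask, T x, T y) {
  T r = ternlog<T>(mask, x, y, kBitSelectImm);
  assert(r == static_cast<T>((mask & x) | (~mask & y)) &&
         "0xCA should encode mask ? x : y");
  return r;
}
```

Under this model, a byte-splat mask such as 0x0F repeats every 32 bits, so a 32-bit broadcast operand carries it just as well as a 64-bit one, and splitting a 64-bit lane into two 32-bit lanes gives the same result.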

RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 Compare December 8, 2023 13:21

RKSimon force-pushed the perf/broadcast-avx512 branch 7 times, most recently from 0c80ea8 to 6b2809b Compare December 20, 2023 15:46

RKSimon force-pushed the perf/broadcast-avx512 branch from 6b2809b to 5eff513 Compare January 2, 2024 13:42

RKSimon added a commit that referenced this pull request Jan 3, 2024

[X86] combineConcatVectorOps - fold 512-bit concat(blendi(x,y,c0),ble…

1d27669

…ndi(z,w,c1)) to AVX512BW mask select Yet another yak shaving regression fix for #73509

RKSimon force-pushed the perf/broadcast-avx512 branch from 5eff513 to f738150 Compare January 3, 2024 13:03

RKSimon force-pushed the perf/broadcast-avx512 branch from f738150 to 6d1519e Compare February 5, 2024 12:58

RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 Compare February 5, 2024 18:09

RKSimon force-pushed the perf/broadcast-avx512 branch from 925a8d0 to 77629d5 Compare February 28, 2024 10:58

RKSimon force-pushed the perf/broadcast-avx512 branch from 77629d5 to e8d60f1 Compare April 8, 2024 11:10

RKSimon mentioned this pull request Apr 18, 2024

[X86] Use GFNI for vXi8 shifts/rotates #89115

Merged

RKSimon force-pushed the perf/broadcast-avx512 branch from e8d60f1 to 27c0a8a Compare April 18, 2024 21:14
