Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][X86] lowerBuildVectorAsBroadcast - don't convert constant vectors to broadcasts on AVX512VL targets #73509

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

RKSimon
Copy link
Collaborator

@RKSimon RKSimon commented Nov 27, 2023

On AVX512VL targets we're better off keeping constant vectors at full width to ensure that they can be load folded into vector instructions, reducing register pressure. If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

This is still a WIP patch - as can be seen by the changes to X86InstrFoldTables.cpp, we have very poor coverage for the BroadcastFoldTables (Issue #66360). I don't know whether just to continue manually extending these tables or to wait for #66360 to be done.

Non-VLX AVX512 targets are still seeing some regressions due to main instructions being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so for now lets keep them as it is (same for AVX1/AVX2 targets). For AVX1/AVX2, broadcasting constants via lowerBuildVectorAsBroadcast helps a lot, as long as we don't cause register spills, which is major problem on larger vectorized hot loops. I'm currently thinking we should add a x86 pass, similar to MachineLICM, that unfolds broadcastable constant loads as long as we have spare registers; we could then remove the remaining lowerBuildVectorAsBroadcast constant handling entirely - any thoughts?

My goal is to improve AVX1/AVX2 vector constant handling but getting AVX512 out of the way appears to be an easier first step.

Copy link

github-actions bot commented Nov 27, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 8 times, most recently from 21fad09 to dae6506 Compare November 30, 2023 13:37
RKSimon added a commit that referenced this pull request Nov 30, 2023
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 5 times, most recently from fc410b2 to c5db884 Compare December 7, 2023 14:22
@RKSimon
Copy link
Collaborator Author

RKSimon commented Dec 7, 2023

Added basic handling for non-VLX AVX512 targets when dealing with 512-bit constant vectors

@@ -948,7 +948,8 @@ define void @load_i8_stride4_vf32(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, ptr
; AVX512F-NEXT: vpshufb %ymm0, %ymm1, %ymm2
; AVX512F-NEXT: vmovdqa 64(%rdi), %ymm3
; AVX512F-NEXT: vpshufb %ymm0, %ymm3, %ymm0
; AVX512F-NEXT: vmovdqa {{.*#+}} ymm4 = [0,4,0,4,0,4,8,12]
; AVX512F-NEXT: vbroadcasti128 {{.*#+}} ymm4 = [0,4,8,12,0,4,8,12]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New broadcast?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - still addressing regressions, that's why its still a draft :)

@goldsteinn
Copy link
Contributor

Can't memory ops microfuse on all targets? Why is this avx512vl only?

RKSimon added a commit that referenced this pull request Dec 8, 2023
We were using VPTERNLOGQ for everything but i32 types, which made broadcasts wider than necessary

Noticed in #73509
@RKSimon
Copy link
Collaborator Author

RKSimon commented Dec 8, 2023

Can't memory ops microfuse on all targets? Why is this avx512vl only?

I don't understand what you're asking - we already load-fold for all targets. This patch is about improving broadcast-load-fold. By prematurely converting to constant broadcasts in DAG we're hindering later optimizations - MachineLICM is a good example (we end up hoisting the broadcast which then often spills the full width broadcasted vector......). By keeping to full vector width until X86FixupVectorConstants we avoid a lot of this.

I will eventually be disabling constant broadcasting in lowerBuildVectorAsBroadcast for all AVX targets later but theres a lot of regressions to still deal with - AVX512VL (and AVX512F for 512-bit vectors) is the first step.

@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from adc89f0 to 3af0810 Compare December 8, 2023 13:21
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 7 times, most recently from 0c80ea8 to 6b2809b Compare December 20, 2023 15:46
RKSimon added a commit that referenced this pull request Jan 3, 2024
…ndi(z,w,c1)) to AVX512BW mask select

Yet another yak shaving regression fix for #73509
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
RKSimon added a commit that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from #73509
@RKSimon RKSimon force-pushed the perf/broadcast-avx512 branch 2 times, most recently from 86ed907 to 925a8d0 Compare February 5, 2024 18:09
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated load/broadcasts in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
Handle masked predicated movss/movsd in addConstantComments now that we can generically handle the destination + mask register

This will more significantly help improve 'fixup constant' comments from llvm#73509
… targets

On AVX512 targets we're better off keeping constant vector at full width to ensure that they can be load folded into vector instructions, reducing register pressure.

If a vector constant remains as a basic load, X86FixupVectorConstantsPass will still convert this to a broadcast instruction for us.

Non-VLX targets are still seeing some regressions due to these being implicitly widened to 512-bit ops in isel patterns and not in the DAG, so I've limited this to just 512-bit vectors for now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants