felipealmeida
Contributor

[Vectorize] Add StridedLoopUnroll + Versioning for 2-D strided loop nests

Summary

Introduce two passes that recognize a common 2-D row/column loop idiom (e.g. image/matrix kernels) and produce:

  • a fallback version for LoopVectorize, and
  • a strided version using widened ops, VP strided loads/stores, and controlled unrolling on scalable-vector targets.

What it matches

  • Outer canonical IV (rows) and inner canonical IV (cols), both starting at 0, step = 1, != predicate.
  • Inner loop: unit-stride loads/stores of uniform element size.
  • Outer loop: base pointers advanced by a regular (dynamic) stride SCEV.
  • Single store in inner body drives the producer graph.
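As a rough illustration, the matched shape resembles the loop nest below. This is a hypothetical sketch loosely modelled on a pixel-averaging kernel; it is not code from x264 or from this patch, and the name `avg_rows` and its parameters are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 2-D kernel with the shape described above (not from the patch).
void avg_rows(std::uint8_t *dst, std::ptrdiff_t dst_stride,
              const std::uint8_t *a, std::ptrdiff_t a_stride,
              const std::uint8_t *b, std::ptrdiff_t b_stride,
              std::size_t width, std::size_t height) {
  // Outer canonical IV (rows): starts at 0, step 1, != predicate.
  for (std::size_t y = 0; y != height; ++y) {
    // Inner canonical IV (cols): starts at 0, step 1, != predicate.
    for (std::size_t x = 0; x != width; ++x) {
      // Unit-stride loads of uniform element size feeding a single store.
      dst[x] = static_cast<std::uint8_t>((a[x] + b[x] + 1) >> 1);
    }
    // Base pointers advance by a regular stride known only at run time.
    dst += dst_stride;
    a += a_stride;
    b += b_stride;
  }
}
```

The strides are ordinary function arguments here, which is one way to read "regular (dynamic) stride": constant across outer iterations but not known at compile time.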

What it does

Function pass (StridedLoopUnrollVersioningPass):

  • Builds LAA with AssumptionCache; uses LoopVersioning with runtime pointer checks.
  • Adds guards: inner trip count (TC) divisible by the unroll factor, outer TC divisible by vscale, and alignment if required by the target.
  • Unrolls inner loop (heuristic 8 / elemSize), hoists invariant loads, eliminates duplicate loads.
  • Marks loops (alias scopes, llvm.stride.loop_idiom, etc.).
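The guards can be pictured as a runtime predicate that selects between the two loop versions. The sketch below is a scalar model under stated assumptions, not the pass's code: `vscale` is passed in as a plain integer, the names `unrollFactor` and `useStridedVersion` are hypothetical, the unroll heuristic is read literally as `8 / elemSize`, the pointer checks are collapsed into one boolean, and the alignment guard is omitted.

```cpp
#include <cstddef>

// Unroll heuristic as stated in the description: 8 / elemSize (bytes).
// Returning 0 for elemSize == 0 or elemSize > 8 is an assumption of this sketch.
constexpr std::size_t unrollFactor(std::size_t elemSize) {
  return elemSize == 0 ? 0 : 8 / elemSize;
}

// Hypothetical model of the versioning guards: true selects the strided
// version, false selects the fallback left for LoopVectorize.
bool useStridedVersion(std::size_t innerTC, std::size_t outerTC,
                       std::size_t elemSize, std::size_t vscale,
                       bool pointersMayAlias) {
  const std::size_t unroll = unrollFactor(elemSize);
  if (unroll == 0 || vscale == 0)
    return false;                 // heuristic not applicable in this model
  if (pointersMayAlias)
    return false;                 // runtime pointer checks failed
  return innerTC % unroll == 0    // inner TC divisible by unroll factor
      && outerTC % vscale == 0;   // outer TC divisible by vscale
}
```

For 1-byte elements this model yields an unroll factor of 8, so an 8x8 block passes the inner-TC guard, and it passes the outer-TC guard whenever `vscale` divides 8.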

Loop pass (StridedLoopUnrollPass):

  • On loops marked llvm.stride.loop_idiom, widens supported ops by vscale.
  • Lowers unit-stride memory to experimental.vp.strided_{load,store}, adjusts IV increments, and cleans up dead code.
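To make the strided access pattern concrete, here is a scalar model of what a strided load and store do: lane `i` accesses one element at `base + i * strideBytes`, for the first `evl` lanes. It is a sketch only. The helper names `stridedLoad64` and `stridedStore64` are hypothetical, the mask operand is omitted, and treating each lane as one 8-byte chunk of a row is an assumed reading of the `8 / elemSize` unroll, not something the description confirms.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar model of a strided load: lane i reads 8 bytes from
// base + i * strideBytes. Only the first evl lanes are produced.
std::vector<std::uint64_t> stridedLoad64(const std::uint8_t *base,
                                         std::ptrdiff_t strideBytes,
                                         std::size_t evl) {
  std::vector<std::uint64_t> lanes(evl);
  for (std::size_t i = 0; i != evl; ++i) {
    const std::uint8_t *p = base + static_cast<std::ptrdiff_t>(i) * strideBytes;
    std::memcpy(&lanes[i], p, sizeof(std::uint64_t)); // alignment-safe read
  }
  return lanes;
}

// Scalar model of a strided store: lane i writes 8 bytes to
// base + i * strideBytes.
void stridedStore64(std::uint8_t *base, std::ptrdiff_t strideBytes,
                    const std::vector<std::uint64_t> &lanes) {
  for (std::size_t i = 0; i != lanes.size(); ++i) {
    std::uint8_t *p = base + static_cast<std::ptrdiff_t>(i) * strideBytes;
    std::memcpy(p, &lanes[i], sizeof(std::uint64_t));
  }
}
```

Under this reading, a single call with `evl = 8` and the row stride as `strideBytes` gathers all eight rows of an 8x8 block of bytes, one row per lane.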

Why it matters

x264 calls pixel_avg on image blocks of 8x8 or 16x16 pixels. With this
loop versioning, strided loads/stores can load a whole 8x8 block at
once, depending on the vector size. On the Ventana design this gives a
considerable performance boost in SPEC 2017: a 6% instruction count
reduction on the x264_r test.

Feedback

We would like feedback on whether this can be improved, or whether an alternative implementation would be better.

github-actions bot commented Sep 9, 2025

✅ With the latest revision this PR passed the undef deprecator.

github-actions bot commented Sep 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@felipealmeida felipealmeida force-pushed the felipe_strided_loop_unroll_upstream branch 4 times, most recently from ab898fb to ee48286 Compare September 9, 2025 22:43
…l loops

Introduce two passes that recognize a common 2-D row/column loop idiom (e.g. image/matrix kernels) and produce:

* a **fallback** version for LoopVectorize, and
* a **strided** version using widened ops, VP strided loads/stores, and controlled unrolling on scalable-vector targets.

* Outer canonical IV (rows) and inner canonical IV (cols), both starting at 0, step = 1, `!=` predicate.
* Inner loop: **unit-stride** loads/stores of uniform element size.
* Outer loop: base pointers advanced by a **regular (dynamic) stride** SCEV.
* Single store in inner body drives the producer graph.
* Target supports scalable vectors (`TTI::supportsScalableVectors()`).

Function pass (`StridedLoopUnrollVersioningPass`):

* Builds LAA with **AssumptionCache**; uses `LoopVersioning` with runtime pointer checks.
* Adds guards: inner TC divisible by unroll, outer TC divisible by `vscale`, and alignment if required by target.
* Unrolls inner loop (heuristic `8 / elemSize`), hoists invariant loads, eliminates duplicate loads.
* Marks loops (alias scopes, `llvm.stride.loop_idiom`, etc.).

Loop pass (`StridedLoopUnrollPass`):

* On loops marked `llvm.stride.loop_idiom`, widens supported ops by `vscale`.
* Lowers unit-stride memory to **`experimental.vp.strided_{load,store}`**, adjusts IV increments, and cleans up dead code.
Adds an initial test that shows the difference in code generation and
serves as a regression test for the Strided Loop Unroll passes
@felipealmeida felipealmeida force-pushed the felipe_strided_loop_unroll_upstream branch from ee48286 to 33e47f7 Compare September 10, 2025 12:21
@mshockwave mshockwave self-requested a review September 10, 2025 18:54