[Vectorize] Add StridedLoopUnroll + Versioning for 2-D strided loop nests #157749
Summary
Introduce two passes that recognize a common 2-D row/column loop idiom (e.g. image/matrix kernels) and produce: [...]

What it matches
[...] != predicate.

What it does
Function pass (StridedLoopUnrollVersioningPass):
- [...] LoopVersioning with runtime pointer checks.
- [...] vscale, and alignment if required by target.
- [...] 8 / elemSize), hoists invariant loads, eliminates duplicate loads.
- [...] llvm.stride.loop_idiom, etc.).

Loop pass (StridedLoopUnrollPass):
- [...] llvm.stride.loop_idiom, widens supported ops by vscale.
- [...] experimental.vp.strided_{load,store}, adjusts IV increments, and cleans up dead code.

Why does it matter
x264 calls pixel_avg on image blocks of 8x8 or 16x16 pixels. With this
loop versioning, strided loads/stores can load a whole 8x8 block at
once, depending on the vector size. On the SPEC 2017 x264_r test this
gives a considerable performance boost for the Ventana design (6%
instruction count reduction).
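
To make the idea concrete, here is a minimal C++ sketch of the kind of 2-D strided loop nest described above, next to a hand-versioned equivalent. It is not the code the passes generate. All function names are hypothetical, the kernel is only modeled on a pixel-averaging routine, and the fast path assumes an 8x8 block of 1-byte elements with byte strides of at least 8. The "strided load" is emulated in scalar code: each 8-byte row is treated as one 64-bit element, spaced by the row stride.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar 2-D loop nest: the inner loop walks one row, the outer loop
// advances every pointer by its own row stride (in bytes).
void pixel_avg_scalar(uint8_t *dst, std::ptrdiff_t dst_stride,
                      const uint8_t *src1, std::ptrdiff_t src1_stride,
                      const uint8_t *src2, std::ptrdiff_t src2_stride,
                      int width, int height) {
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x)
      dst[x] = static_cast<uint8_t>((src1[x] + src2[x] + 1) >> 1);
    dst += dst_stride;
    src1 += src1_stride;
    src2 += src2_stride;
  }
}

// Emulates one strided load of eight 64-bit elements:
// element i is the 8 bytes found at base + i * stride.
std::array<uint64_t, 8> strided_load8(const uint8_t *base, std::ptrdiff_t stride) {
  std::array<uint64_t, 8> rows{};
  for (int i = 0; i < 8; ++i)
    std::memcpy(&rows[i], base + i * stride, sizeof(uint64_t));
  return rows;
}

// Emulates the matching strided store.
void strided_store8(uint8_t *base, std::ptrdiff_t stride,
                    const std::array<uint64_t, 8> &rows) {
  for (int i = 0; i < 8; ++i)
    std::memcpy(base + i * stride, &rows[i], sizeof(uint64_t));
}

// Rounded average of the 8 bytes packed in a and b, byte by byte.
uint64_t avg_bytes(uint64_t a, uint64_t b) {
  uint8_t pa[8], pb[8], pr[8];
  std::memcpy(pa, &a, 8);
  std::memcpy(pb, &b, 8);
  for (int i = 0; i < 8; ++i)
    pr[i] = static_cast<uint8_t>((pa[i] + pb[i] + 1) >> 1);
  uint64_t r;
  std::memcpy(&r, pr, 8);
  return r;
}

// Conservative runtime pointer check for two 8x8 blocks with positive
// strides: true when their enclosing byte ranges do not intersect.
bool ranges_disjoint(const uint8_t *a, std::ptrdiff_t a_stride,
                     const uint8_t *b, std::ptrdiff_t b_stride) {
  auto lo_a = reinterpret_cast<std::uintptr_t>(a);
  auto lo_b = reinterpret_cast<std::uintptr_t>(b);
  auto hi_a = lo_a + static_cast<std::uintptr_t>(7 * a_stride + 8);
  auto hi_b = lo_b + static_cast<std::uintptr_t>(7 * b_stride + 8);
  return hi_a <= lo_b || hi_b <= lo_a;
}

// Hand-written "versioned" kernel: runtime checks pick the strided fast
// path, otherwise the original scalar loop nest runs unchanged.
void pixel_avg_versioned(uint8_t *dst, std::ptrdiff_t dst_stride,
                         const uint8_t *src1, std::ptrdiff_t src1_stride,
                         const uint8_t *src2, std::ptrdiff_t src2_stride,
                         int width, int height) {
  bool fast = width == 8 && height == 8 && dst_stride >= 8 &&
              src1_stride >= 8 && src2_stride >= 8 &&
              ranges_disjoint(dst, dst_stride, src1, src1_stride) &&
              ranges_disjoint(dst, dst_stride, src2, src2_stride);
  if (!fast) {
    pixel_avg_scalar(dst, dst_stride, src1, src1_stride, src2, src2_stride,
                     width, height);
    return;
  }
  auto a = strided_load8(src1, src1_stride);  // whole 8x8 block of src1
  auto b = strided_load8(src2, src2_stride);  // whole 8x8 block of src2
  for (int i = 0; i < 8; ++i)
    a[i] = avg_bytes(a[i], b[i]);
  strided_store8(dst, dst_stride, a);
}
```

Calling pixel_avg_versioned on an 8x8 block inside a wider image (for example with all strides equal to 16) should produce the same bytes as pixel_avg_scalar. Any other shape, or overlapping buffers, falls back to the scalar loop, which mirrors the role of the runtime checks in the versioned loop.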
Feedback
We would like feedback on whether this approach can be improved, or whether an alternative implementation would be better.