[Stdlib] Implement SIMD-width-based unroll factor heuristic in elementwise/reduction loops #6084
Open
msaelices wants to merge 2 commits into modular:main from
Contributor
Pull request overview
This PR updates several hot-loop CPU implementations in the stdlib to choose loop unrolling based on the target’s native SIMD width rather than using a hard-coded constant, aiming to better balance throughput and code size across architectures.
Changes:
- Replaced `comptime unroll_factor = 8` with a SIMD-width-based heuristic in `map_reduce` and `reduce_boolean`.
- Replaced `comptime unroll_factor = 8` with the same heuristic in `_elementwise_impl_cpu_1d`, `_elementwise_impl_cpu_nd`, and `_stencil_impl_cpu`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `mojo/stdlib/std/algorithm/reduction.mojo` | Uses a SIMD-width-based unroll heuristic for `map_reduce` and `reduce_boolean` to avoid over-unrolling on wide-SIMD targets. |
| `mojo/stdlib/std/algorithm/functional.mojo` | Applies the SIMD-width-based unroll heuristic to elementwise and stencil CPU vectorized loops. |
80cc91c to
830ed98
Compare
Contributor
I can't comment on whether this is a good heuristic. I think you could use
abduld
reviewed
Mar 7, 2026
    comptime assert rank == 1, "Specialization for 1D"

    - comptime unroll_factor = 8  # TODO: Comeup with a cost heuristic.
    + comptime unroll_factor = max(1, min(8, simd_width // 4))
Contributor
Can you use the `clamp` function for these?
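For reference, the suggestion amounts to the following equivalence. This is a minimal Python sketch of the arithmetic only, not the Mojo code in the PR: the `clamp` helper and its `(value, lower, upper)` argument order are hypothetical, since the signature of the stdlib's own `clamp` is not shown in this thread.

```python
def clamp(value: int, lower: int, upper: int) -> int:
    """Hypothetical helper: restrict value to the closed range [lower, upper]."""
    return max(lower, min(upper, value))


def unroll_factor_min_max(simd_width: int) -> int:
    # The expression as written in the diff under review.
    return max(1, min(8, simd_width // 4))


def unroll_factor_clamped(simd_width: int) -> int:
    # The same bounds expressed through the clamp helper.
    return clamp(simd_width // 4, 1, 8)


# The two forms agree for every power-of-two width from 1 to 64.
widths = [1, 2, 4, 8, 16, 32, 64]
assert all(unroll_factor_min_max(w) == unroll_factor_clamped(w) for w in widths)
```

If the real `clamp` takes its bounds in a different order, only the call site changes; the resulting values stay the same.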
0dd7bf5 to
2bd479d
Compare
Contributor
Author
@abduld could you PTAL?
…twise/reduction loops

Five hot loops in functional.mojo and reduction.mojo had `unroll_factor = 8` with TODO comments asking for a cost heuristic.

Replace the hard-coded 8 with `max(1, min(8, simd_width // 4))`:
- Scales with the native SIMD width of the dtype/target.
- Caps at 8 to avoid excessive code size on wide-SIMD targets.
- Ensures at least 1 to satisfy the unroll-factor > 0 invariant.

Typical values:
- float32 / AVX2 (width=8): heuristic → 2
- float32 / AVX512 (width=16): heuristic → 4
- float64 / AVX2 (width=4): heuristic → 1

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
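The typical values in the commit message can be reproduced with a short Python model. This is a sketch under a stated assumption: it takes the SIMD width to be the vector register size in bits divided by the element size in bits (256 for AVX2, 512 for AVX-512), which matches the widths quoted above but is not how the Mojo stdlib necessarily derives `simd_width`. The function names are illustrative only.

```python
def simd_width(register_bits: int, dtype_bits: int) -> int:
    # Assumption: lanes per vector = register size / element size.
    return register_bits // dtype_bits


def unroll_factor(width: int) -> int:
    # Heuristic from the commit message: width // 4, kept within [1, 8].
    return max(1, min(8, width // 4))


# (label, vector register bits, element bits)
cases = [
    ("float32 / AVX2", 256, 32),
    ("float32 / AVX512", 512, 32),
    ("float64 / AVX2", 256, 64),
]

for label, register_bits, dtype_bits in cases:
    width = simd_width(register_bits, dtype_bits)
    print(f"{label} (width={width}): heuristic -> {unroll_factor(width)}")
```

Under this model the upper bound of 8 only takes effect once the width reaches 32 lanes, and the lower bound of 1 applies to widths below 4.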
2bd479d to
a9457d3
Compare
Summary
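Five hot loops in `algorithm/functional.mojo` and `algorithm/reduction.mojo` had `unroll_factor = 8` hard-coded with TODO comments asking for a cost heuristic.

Replace with `max(1, min(8, simd_width // 4))`:
- Scales with the native SIMD width of the dtype/target.
- Caps at 8 to avoid excessive code size on wide-SIMD targets.
- Ensures at least 1 to satisfy the `unroll_factor > 0` invariant.

Affected sites: `_elementwise_impl_cpu_1d`, `_elementwise_impl_cpu_nd`, `_stencil_impl_cpu` (`functional.mojo`) and `map_reduce`, `reduce_boolean` (`reduction.mojo`).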