[Stdlib] Implement SIMD-width-based unroll factor heuristic in elementwise/reduction loops#6084

Open
msaelices wants to merge 2 commits into modular:main from msaelices:fix/reduction-unroll-heuristic
Conversation

@msaelices
Contributor

Summary

Five hot loops in algorithm/functional.mojo and algorithm/reduction.mojo had unroll_factor = 8 hard-coded with TODO comments asking for a cost heuristic.

Replace with max(1, min(8, simd_width // 4)):

  • Scales unrolling to the native SIMD width of the target.
  • Caps at 8 to avoid excessive code size on wide-SIMD targets (AVX-512, SVE).
  • Clamps to at least 1 to satisfy the unroll_factor > 0 invariant.
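The three properties above can be checked with a small Python sketch. The expression is copied from the PR description; the function name `unroll_factor` and the list of sample widths are illustrative assumptions, not part of the stdlib.

```python
def unroll_factor(simd_width: int) -> int:
    """Heuristic quoted in the PR: max(1, min(8, simd_width // 4))."""
    return max(1, min(8, simd_width // 4))


# Sample widths are illustrative; real values depend on dtype and target.
for width in (1, 2, 4, 8, 16, 32, 64):
    print(f"simd_width={width:>2} -> unroll_factor={unroll_factor(width)}")
```

Widths below 4 would give 0 from the integer division, which is why the lower clamp to 1 matters; widths of 32 and above hit the cap of 8.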

Affected sites: _elementwise_impl_cpu_1d, _elementwise_impl_cpu_nd, _stencil_impl_cpu (functional.mojo) and map_reduce, reduce_boolean (reduction.mojo).

@msaelices msaelices requested a review from a team as a code owner March 6, 2026 15:59
Copilot AI review requested due to automatic review settings March 6, 2026 15:59
@github-actions github-actions bot added the mojo-stdlib (Tag for issues related to standard library) and waiting-on-review labels Mar 6, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates several hot-loop CPU implementations in the stdlib to choose loop unrolling based on the target’s native SIMD width rather than using a hard-coded constant, aiming to better balance throughput and code size across architectures.

Changes:

  • Replaced comptime unroll_factor = 8 with a SIMD-width-based heuristic in map_reduce and reduce_boolean.
  • Replaced comptime unroll_factor = 8 with the same heuristic in _elementwise_impl_cpu_1d, _elementwise_impl_cpu_nd, and _stencil_impl_cpu.
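As a rough illustration of what the unroll factor controls in a vectorized reduction, here is a minimal Python sketch. It is not the stdlib implementation: the function `unrolled_sum`, its parameters, and the use of one accumulator per unrolled lane group are assumptions made for the example.

```python
def unrolled_sum(values, simd_width, unroll_factor):
    """Sum `values` in steps of simd_width * unroll_factor, then handle the tail."""
    step = simd_width * unroll_factor
    # One accumulator per unrolled "vector" so the partial sums are independent.
    accumulators = [0.0] * unroll_factor
    main_end = len(values) - len(values) % step

    for base in range(0, main_end, step):
        for u in range(unroll_factor):
            start = base + u * simd_width
            # Stand-in for a single SIMD load-and-add of `simd_width` elements.
            accumulators[u] += sum(values[start:start + simd_width])

    # Scalar tail for elements that do not fill a whole unrolled step.
    tail = sum(values[main_end:])
    return sum(accumulators) + tail


data = [float(i) for i in range(100)]
print(unrolled_sum(data, simd_width=8, unroll_factor=2))
```

A larger unroll factor means more work per loop iteration and more independent accumulators, at the cost of a bigger loop body and a longer tail, which is the throughput versus code size trade-off the overview refers to.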

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| mojo/stdlib/std/algorithm/reduction.mojo | Uses a SIMD-width-based unroll heuristic for map_reduce and reduce_boolean to avoid over-unrolling on wide-SIMD targets. |
| mojo/stdlib/std/algorithm/functional.mojo | Applies the SIMD-width-based unroll heuristic to elementwise and stencil CPU vectorized loops. |

@msaelices msaelices force-pushed the fix/reduction-unroll-heuristic branch from 80cc91c to 830ed98 Compare March 6, 2026 18:48
@soraros
Contributor

soraros commented Mar 7, 2026

I can't comment on whether this is a good heuristic. I think you could use clamp, though.

comptime assert rank == 1, "Specialization for 1D"

- comptime unroll_factor = 8 # TODO: Comeup with a cost heuristic.
+ comptime unroll_factor = max(1, min(8, simd_width // 4))
Contributor

Can you use the clamp function for these?

Contributor Author

done
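The clamp suggestion in this thread amounts to a rewrite of the nested max/min. The Python sketch below checks that the two forms agree, assuming a `clamp(value, lower, upper)` helper with the usual semantics; the helper is defined here for illustration, and the exact signature of the Mojo function is not shown in the thread.

```python
def clamp(value: int, lower: int, upper: int) -> int:
    """Hypothetical helper: restrict `value` to the closed range [lower, upper]."""
    return max(lower, min(upper, value))


def unroll_factor_nested(simd_width: int) -> int:
    # Form used in the PR description.
    return max(1, min(8, simd_width // 4))


def unroll_factor_clamped(simd_width: int) -> int:
    # Equivalent form using the clamp helper.
    return clamp(simd_width // 4, 1, 8)


# The two forms agree for every width in a generous illustrative range.
assert all(
    unroll_factor_nested(w) == unroll_factor_clamped(w) for w in range(1, 129)
)
```

The clamped form states the bounds once, in order, which tends to read more clearly than nested max and min calls.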

@msaelices msaelices force-pushed the fix/reduction-unroll-heuristic branch 2 times, most recently from 0dd7bf5 to 2bd479d Compare March 9, 2026 15:37
@msaelices msaelices requested a review from abduld March 9, 2026 15:53
@msaelices
Contributor Author

@abduld could you PTAL?

…twise/reduction loops

Five hot loops in functional.mojo and reduction.mojo had
`unroll_factor = 8` with TODO comments asking for a cost heuristic.

Replace the hard-coded 8 with `max(1, min(8, simd_width // 4))`:
- Scales with the native SIMD width of the dtype/target.
- Caps at 8 to avoid excessive code size on wide-SIMD targets.
- Ensures at least 1 to satisfy the unroll-factor > 0 invariant.

Typical values:
  float32 / AVX2  (width=8):  heuristic → 2
  float32 / AVX512 (width=16): heuristic → 4
  float64 / AVX2  (width=4):  heuristic → 1

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
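The typical values in the commit message can be reproduced with a short Python sketch. It assumes the SIMD width is the vector register size in bits divided by the element size in bits (256-bit registers for AVX2, 512-bit for AVX-512); the helper names and the case table are illustrative, not taken from the stdlib.

```python
def simd_width(register_bits: int, dtype_bits: int) -> int:
    """Assumed model: number of elements that fit in one vector register."""
    return register_bits // dtype_bits


def unroll_factor(width: int) -> int:
    """Heuristic from the commit message: max(1, min(8, simd_width // 4))."""
    return max(1, min(8, width // 4))


# (label, register size in bits, element size in bits)
cases = [
    ("float32 / AVX2", 256, 32),
    ("float32 / AVX512", 512, 32),
    ("float64 / AVX2", 256, 64),
]

for label, register_bits, dtype_bits in cases:
    width = simd_width(register_bits, dtype_bits)
    print(f"{label}: width={width}, heuristic={unroll_factor(width)}")
```

Under that model the three cases give widths of 8, 16, and 4 and unroll factors of 2, 4, and 1, matching the values listed in the commit message.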
@msaelices msaelices force-pushed the fix/reduction-unroll-heuristic branch from 2bd479d to a9457d3 Compare March 10, 2026 11:28
Labels

mojo-stdlib Tag for issues related to standard library waiting-on-review

4 participants