SIMD: Explore ARM SVE/SVE2 for predicated sorting networks #33

@jonathanpeppers

Summary

Explore ARM SVE/SVE2 (Scalable Vector Extension) for sorting networks, leveraging predicated operations and scalable vector lengths.

.NET 10 SVE/SVE2 API Status

SVE support is now available in .NET as [Experimental] APIs:

  • System.Runtime.Intrinsics.Arm.Sve — full SVE intrinsics, landed in .NET 9, refined in .NET 10
  • System.Runtime.Intrinsics.Arm.Sve2 — partial SVE2 APIs (non-streaming only) in .NET 10, full coverage expected in .NET 11+
  • Runtime detection: Sve.IsSupported / Sve2.IsSupported
  • Key APIs for sorting networks are available:
    • Sve.Min / Sve.Max: element-wise min/max; predication is applied by wrapping the call in Sve.ConditionalSelect
    • Sve.ConditionalSelect: mask-based element selection (blend)
    • Predicate generation: the Sve.CreateWhileLessThan* family (e.g. Sve.CreateWhileLessThanMask8Bit) deactivates unused lanes, so n=27 needs no padding
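
The runtime-detection bullet can be sketched as a small dispatch helper. This is a minimal sketch, not the project's code: SortPathSelector and Select are hypothetical names, and it assumes the [Experimental] diagnostic on Sve has the ID SYSLIB5003, which has to be suppressed before the class can be referenced.

```csharp
#pragma warning disable SYSLIB5003 // assumption: diagnostic ID for the [Experimental] Sve class

using System;
using System.Runtime.Intrinsics.Arm;

public static class SortPathSelector
{
    // Hypothetical helper: names the code path a generated sort method would take.
    public static string Select()
    {
        if (Sve.IsSupported) return "sve";          // SVE hardware and JIT support present
        if (AdvSimd.IsSupported) return "advsimd";  // existing NEON path
        return "scalar";                            // non-ARM or no SIMD
    }
}
```

On Apple Silicon or x64 the first branch is never taken, so an SVE path added this way stays dormant until it runs on SVE-capable hardware.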

Tracking issues:

Blog post: Engineering the Scalable Vector Extension in .NET

SVE's Key Advantage for Sorting Networks

SVE's predication model is fundamentally different from AdvSimd/NEON and is well-suited for sorting:

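  1. Natural handling of n=27: Sve.CreateWhileLessThan(0, 27) creates a predicate that activates only the lanes holding elements 0 through 26. No need to zero-pad the 28th element or worry about garbage in unused lanes. With byte elements, one vector holds all 27 only at 256 bits or wider; at 128 bits (16 byte lanes) a second vector with its own predicate covers elements 16 through 26.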
  2. Vector-length agnostic (VLA): A single implementation works across all SVE vector widths (128-bit to 2048-bit). On wider hardware, more elements are processed per instruction automatically.
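  3. Predicated min/max: inactive lanes are left untouched, which removes the shuffle/blend complexity of the current AdvSimd path. In the .NET API this is written as Sve.ConditionalSelect(predicate, Sve.Min(a, b), a) rather than as a predicate argument to Sve.Min; the JIT can fold the pair into a single predicated instruction.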
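
To make the n=27 point concrete, here is a scalar model of a "while less than" predicate. It is a sketch that uses no SVE intrinsics; PredicateModel, WhileLessThan and ActiveLanesPerVector are hypothetical names, and byte-sized elements are assumed.

```csharp
using System;
using System.Collections.Generic;

public static class PredicateModel
{
    // Scalar model of a "while less than" predicate: lane i of a vector whose
    // first lane holds element `start` is active when start + i < n.
    public static bool[] WhileLessThan(int start, int n, int lanes)
    {
        var mask = new bool[lanes];
        for (int i = 0; i < lanes; i++)
            mask[i] = start + i < n;
        return mask;
    }

    // Active-lane count for each vector needed to cover n byte elements.
    public static List<int> ActiveLanesPerVector(int n, int vectorBits)
    {
        int lanes = vectorBits / 8; // one byte per lane
        var counts = new List<int>();
        for (int start = 0; start < n; start += lanes)
            counts.Add(Array.FindAll(WhileLessThan(start, n, lanes), b => b).Length);
        return counts;
    }
}
```

For n=27, ActiveLanesPerVector(27, 128) yields two vectors with 16 and 11 active lanes, while ActiveLanesPerVector(27, 256) yields a single vector with 27 active lanes.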

Hardware Landscape (Critical Context)

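| Platform    | Core         | SVE/SVE2 | Vector Width   | Status                |
|-------------|--------------|----------|----------------|-----------------------|
| Graviton 3  | Neoverse V1  | SVE      | 256 bits       | Shipping (AWS, 2022+) |
| Graviton 4  | Neoverse V2  | SVE2     | 128 bits       | Shipping (AWS, 2024+) |
| Nvidia Grace| Neoverse V2  | SVE2     | 128 bits       | Shipping              |
| Apple M1-M4 | Apple custom | No SVE   | NEON 128b only | No SVE planned        |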

Key insight: Neoverse V2, the core in Graviton 4 and Nvidia Grace, implements SVE2 at only 128 bits, the same width as NEON. SVE therefore offers no width advantage over the current AdvSimd path on the most common ARM64 server hardware shipping today. The width benefit is limited to Graviton 3 (256-bit) and future wider implementations.

Apple Silicon (the macos-latest CI runner) has no SVE support at all — SVE code would never run on macOS CI.

Implementation Approach

SVE sorting would look conceptually like:

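if (Sve.IsSupported)
{
    // Conceptual sketch. Intrinsic names follow the .NET Sve class as best understood;
    // check exact overloads against the API reference. Assumes byte elements and
    // that all n elements fit in one vector (256-bit or wider for n=27).
    // Vector length is determined at runtime: could be 128b, 256b, 512b, etc.
    Vector<byte> pred = Sve.CreateWhileLessThanMask8Bit(0, n);  // lanes 0..n-1 active
    Vector<byte> vec = Sve.LoadVector(pred, first);             // first is a byte*

    // Each network step: permute + min/max + select
    Vector<byte> shuffled = Sve.VectorTableLookup(vec, shuffleIndices);
    Vector<byte> mins = Sve.Min(vec, shuffled);
    Vector<byte> maxs = Sve.Max(vec, shuffled);
    vec = Sve.ConditionalSelect(blendMask, maxs, mins);  // max where blendMask is set
}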
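
The network step above can be modelled in plain scalar C#, which is useful for checking generated shuffle indices and blend masks without SVE hardware. This is a sketch under stated assumptions: NetworkStepModel and Step are hypothetical names, lanes hold bytes, and the arrays stand in for the vector, permute indices, blend mask and predicate.

```csharp
using System;

public static class NetworkStepModel
{
    // One compare-exchange layer over vec.Length byte lanes.
    // partner[i] : lane that lane i is compared against (the permute indices)
    // takeMax[i] : true if lane i keeps the larger value (the blend mask)
    // active[i]  : predicate; inactive lanes keep their old value
    public static byte[] Step(byte[] vec, int[] partner, bool[] takeMax, bool[] active)
    {
        var result = new byte[vec.Length];
        for (int i = 0; i < vec.Length; i++)
        {
            byte other = vec[partner[i]];             // permute
            byte min = Math.Min(vec[i], other);       // element-wise min
            byte max = Math.Max(vec[i], other);       // element-wise max
            byte picked = takeMax[i] ? max : min;     // select by blend mask
            result[i] = active[i] ? picked : vec[i];  // predication
        }
        return result;
    }
}
```

With vec = {3, 1, 4, 2}, partner = {1, 0, 3, 2} and takeMax = {false, true, false, true}, an all-true predicate gives {1, 3, 2, 4}; deactivating lanes 2 and 3 gives {1, 3, 4, 2}. Both lanes of a pair need the same predicate value, otherwise one value is duplicated and the other lost.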

Challenges:

  • The generator would need a new WriteArmSveSortMethod path
  • SVE shuffle/permute instructions differ from AdvSimd's TBL approach
  • The VLA model means the number of elements per vector isn't known at compile time, so the sorting network must be structured around the runtime lane count. The architectural minimum of 128 bits guarantees at least 16 byte lanes, so n=27/28 needs two vectors at 128 bits and one vector at 256 bits or wider
  • For 128-bit SVE (Graviton 4), performance would likely be similar to the existing AdvSimd path — the benefit comes from wider implementations
  • Testing requires SVE-capable hardware (CI would need a Graviton instance or QEMU SVE emulation)

Research References

  • Bramas 2021 — "A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)" — PeerJ Computer Science. Demonstrates predicated bitonic sorting networks on SVE with up to 4x speedup. The most directly relevant academic work.
  • Brank 2023 — Thesis on VLA SIMD parallelism with focus on ARM SVE.

Recommendation

Priority: Low-medium. The APIs exist in .NET 10 but are [Experimental]. The hardware picture is mixed: the most common ARM64 server cores (Neoverse V2, as in Graviton 4) only have 128-bit SVE, offering no width advantage over the existing AdvSimd path. The main benefits would be:

  1. Code simplicity — predication eliminates the n=27 vs n=28 split and lane-crossing complexity
  2. Future-proofing — wider SVE implementations (512-bit+) would see automatic speedups
  3. Graviton 3 — the 256-bit SVE width would process more elements per instruction

Consider revisiting when .NET 11 stabilizes SVE2 APIs and wider SVE hardware (Neoverse V3+) ships.
