SIMD: Explore ARM SVE/SVE2 for predicated sorting networks

## Summary

Explore ARM SVE/SVE2 (Scalable Vector Extension) for sorting networks, leveraging predicated operations and scalable vector lengths.

## .NET 10 SVE/SVE2 API Status

**SVE support is now available in .NET as `[Experimental]` APIs:**

- `System.Runtime.Intrinsics.Arm.Sve` — full SVE intrinsics, landed in .NET 9, refined in .NET 10
- `System.Runtime.Intrinsics.Arm.Sve2` — partial SVE2 APIs (non-streaming only) in .NET 10, full coverage expected in .NET 11+
- Runtime detection: `Sve.IsSupported` / `Sve2.IsSupported`
- Key APIs for sorting networks are available:
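  - `Sve.Min` / `Sve.Max`: element-wise min/max. The .NET methods take no predicate argument; predication is expressed by wrapping the call in `Sve.ConditionalSelect`
  - `Sve.ConditionalSelect`: mask-based element selection (blend)
  - Predicate generation: the `Sve.CreateWhileLessThanMask8Bit` family (with 16/32/64-bit variants) naturally handles n=27 without padding or masking unused lanes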

**Tracking issues:**
- [dotnet/runtime#99957](https://github.com/dotnet/runtime/issues/99957) — Implement SVE APIs
- [dotnet/runtime#93095](https://github.com/dotnet/runtime/issues/93095) — Add SVE/SVE2 support
- [dotnet/runtime#109652](https://github.com/dotnet/runtime/issues/109652) — Improve Arm64 Performance in .NET 10

**Blog post:** [Engineering the Scalable Vector Extension in .NET](https://devblogs.microsoft.com/dotnet/engineering-sve-in-dotnet/)

## SVE's Key Advantage for Sorting Networks

SVE's **predication model** is fundamentally different from AdvSimd/NEON and is well-suited for sorting:

1. **Natural handling of n=27**: On a 256-bit or wider vector, `Sve.CreateWhileLessThanMask8Bit(0, 27)` creates a predicate mask that covers exactly the first 27 byte lanes. No need to zero-pad the 28th element or worry about garbage in unused lanes. On 128-bit hardware the same call saturates at 16 lanes, and a second predicate, `Sve.CreateWhileLessThanMask8Bit(16, 27)`, covers the remaining 11.
2. **Vector-length agnostic (VLA)**: A single implementation works across all SVE vector widths (128-bit to 2048-bit, in 128-bit increments). On wider hardware, more elements are processed per instruction automatically.
3. **Predicated min/max**: `Sve.ConditionalSelect(predicate, Sve.Min(a, b), a)` leaves inactive lanes untouched, eliminating the shuffle/blend complexity of the current AdvSimd path.
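
The predicate semantics in the list above can be modeled in scalar C#, which may help when reasoning about lane coverage without SVE hardware. This is only a sketch: `PredicateModel`, `WhileLessThan`, and `MergingMin` are hypothetical names for illustration and are not part of the .NET `Sve` API.

```csharp
using System;

// Scalar model of SVE predication. All names here are hypothetical.
public static class PredicateModel
{
    // Models a WHILELT-style predicate: lane i is active while start + i < limit.
    public static bool[] WhileLessThan(int start, int limit, int laneCount)
    {
        var mask = new bool[laneCount];
        for (int i = 0; i < laneCount; i++)
            mask[i] = start + i < limit;
        return mask;
    }

    // Models a merging predicated min: active lanes get min(a, b),
    // inactive lanes keep the value from a.
    public static byte[] MergingMin(bool[] mask, byte[] a, byte[] b)
    {
        var result = new byte[a.Length];
        for (int i = 0; i < a.Length; i++)
            result[i] = mask[i] ? Math.Min(a[i], b[i]) : a[i];
        return result;
    }
}
```

With 32 lanes (a 256-bit vector of bytes), `PredicateModel.WhileLessThan(0, 27, 32)` activates lanes 0 through 26 and leaves lanes 27 through 31 inactive. With 16 lanes, `WhileLessThan(0, 27, 16)` is all-active and `WhileLessThan(16, 27, 16)` activates the first 11 lanes of the second vector.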

## Hardware Landscape (Critical Context)

| Platform | Core | SVE/SVE2 | Vector Width | Status |
|---|---|---|---|---|
| **Graviton 3** | Neoverse V1 | SVE | **256 bits** | Shipping (AWS, 2022+) |
| **Graviton 4** | Neoverse V2 | SVE2 | **128 bits** | Shipping (AWS, 2024+) |
| **Nvidia Grace** | Neoverse V2 | SVE2 | **128 bits** | Shipping |
| **Apple M1-M4** | Apple custom | **No SVE** | NEON 128b only | No SVE planned |

**Key insight**: Neoverse V2 parts (Graviton 4, Nvidia Grace) implement SVE2 at only **128 bits**, the same width as NEON. This means SVE offers **no width advantage** over the current AdvSimd path on the most common ARM64 server hardware shipping today. The main SVE width benefit is on Graviton 3 (256-bit) and future wider implementations.

Apple Silicon (the `macos-latest` CI runner) has no SVE support at all — SVE code would never run on macOS CI.
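
Given this hardware picture, one possible dispatch policy is to take an SVE path only when the runtime reports vectors wider than 128 bits, and otherwise stay on the existing AdvSimd path. The sketch below is an assumption about how that could look, not the project's actual dispatch code: `SortPathSelector` and `SortPath` are hypothetical names, and the `#pragma` assumes the `[Experimental]` diagnostic ID for `Sve` is `SYSLIB5003`.

```csharp
#pragma warning disable SYSLIB5003 // Sve is [Experimental]; diagnostic ID assumed

using System.Numerics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical path selector; names are illustrative only.
public static class SortPathSelector
{
    public enum SortPath { Scalar, AdvSimd, Sve }

    public static SortPath Choose()
    {
        // Vector<byte>.Count is the number of byte lanes .NET uses for Vector<T>.
        // Prefer SVE only when that is wider than the 16 lanes NEON already offers.
        if (Sve.IsSupported && Vector<byte>.Count > 16)
            return SortPath.Sve;

        if (AdvSimd.IsSupported)
            return SortPath.AdvSimd;

        return SortPath.Scalar;
    }
}
```

On Apple Silicon this would return `SortPath.AdvSimd`, since `Sve.IsSupported` is false there. Whether a 128-bit SVE path is still worth taking for code simplicity is a separate judgment call that this policy does not capture.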

## Implementation Approach

SVE sorting would look conceptually like:

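```csharp
// Conceptual sketch against the [Experimental] System.Runtime.Intrinsics.Arm.Sve API.
// Needs an unsafe context; verify method names against the target SDK.
// shuffleIndices and blendMask are per-step Vector<byte> constants the generator would emit.
if (Sve.IsSupported)
{
    // Vector length is determined at runtime: could be 128b, 256b, 512b, etc.
    Vector<byte> pred = Sve.CreateWhileLessThanMask8Bit(0, n);  // first n byte lanes active
    Vector<byte> vec = Sve.LoadVector(pred, first);             // byte* first; inactive lanes load as zero

    // Each network step: permute + min/max + select
    Vector<byte> shuffled = Sve.VectorTableLookup(vec, shuffleIndices);
    Vector<byte> mins = Sve.Min(vec, shuffled);
    Vector<byte> maxs = Sve.Max(vec, shuffled);
    vec = Sve.ConditionalSelect(blendMask, maxs, mins);
}
```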

**Challenges:**
- The generator would need a new `WriteArmSveSortMethod` path
- SVE shuffle/permute instructions differ from AdvSimd's TBL approach
- The VLA model means the number of elements per vector isn't known at compile time, so the sorting network must be structured around this. For n=27/28 the 128-bit minimum guarantees only 16 byte lanes per vector, so the narrowest hardware still needs two vectors.
- For 128-bit SVE (Graviton 4), performance would likely be similar to the existing AdvSimd path — the benefit comes from wider implementations
- Testing requires SVE-capable hardware (CI would need a Graviton instance or QEMU SVE emulation)
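
Because SVE hardware is scarce in CI, a scalar reference model of one network step (permute, min/max, select) could be used to validate generated shuffle indices and blend masks on any machine. The following is a minimal sketch under that assumption; `NetworkStepModel`, `Step`, `partner`, and `takeMax` are hypothetical names, not part of the existing generator.

```csharp
using System;

// Scalar reference model of one sorting-network step. All names are hypothetical.
public static class NetworkStepModel
{
    // partner[i]: the lane that lane i is compared with (a TBL-style index vector).
    // takeMax[i]: true if lane i keeps the max of the pair, false for the min (the blend mask).
    public static byte[] Step(byte[] vec, int[] partner, bool[] takeMax)
    {
        var result = new byte[vec.Length];
        for (int i = 0; i < vec.Length; i++)
        {
            byte other = vec[partner[i]];        // permute
            byte lo = Math.Min(vec[i], other);   // min
            byte hi = Math.Max(vec[i], other);   // max
            result[i] = takeMax[i] ? hi : lo;    // select
        }
        return result;
    }
}
```

For example, the comparators (0,1) and (2,3) on four lanes correspond to `partner = {1, 0, 3, 2}` and `takeMax = {false, true, false, true}`; applying `Step` to `{9, 3, 5, 7}` yields `{3, 9, 5, 7}`.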

## Research References

- **[Bramas 2021](https://peerj.com/articles/cs-769/)** — "A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)" — PeerJ Computer Science. Demonstrates predicated bitonic sorting networks on SVE with up to 4x speedup. The most directly relevant academic work.
- **[Brank 2023](https://elekpub.bib.uni-wuppertal.de/ubwhs/content/titleinfo/7078341/full.pdf)** — Thesis on VLA SIMD parallelism with focus on ARM SVE.

## Recommendation

**Priority: Low-medium.** The APIs exist in .NET 10 but are `[Experimental]`. The hardware picture is mixed — the most common ARM64 servers (Graviton 4, Neoverse V2) only have 128-bit SVE, offering no width advantage over the existing AdvSimd path. The main benefit would be:

1. **Code simplicity** — predication eliminates the n=27 vs n=28 split and lane-crossing complexity
2. **Future-proofing** — wider SVE implementations (512-bit+) would see automatic speedups
3. **Graviton 3** — the 256-bit SVE width would process more elements per instruction

Consider revisiting if .NET 11 stabilizes the SVE2 APIs and once wider SVE hardware (Neoverse V3+) ships.

