Skip to content

Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

Notifications You must be signed in to change notification settings

imazen/archmage

Repository files navigation

archmage

CI Crates.io docs.rs codecov MSRV License

Safely invoke your intrinsic power, using the tokens granted to you by the CPU.

archmage provides zero-cost capability tokens that prove CPU features are available at runtime, making raw SIMD intrinsics safe to call via the #[arcane] macro.

Quick Start

[dependencies]
archmage = "0.4"
safe_unaligned_simd = "0.2"  # For safe memory operations
use archmage::{Desktop64, SimdToken, arcane};
use std::arch::x86_64::*;

#[arcane]
fn square(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    safe_unaligned_simd::x86_64::_mm256_storeu_ps(&mut out, squared);
    out
}

fn main() {
    if let Some(token) = Desktop64::summon() {
        let result = square(token, &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
        println!("{:?}", result); // [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
    }
}

How It Works

SIMD intrinsics are unsafe for two reasons:

  1. Feature availability: Calling AVX2 instructions on a CPU without AVX2 is undefined behavior
  2. Memory operations: Load/store intrinsics use raw pointers

archmage solves #1 with capability tokens - zero-sized types that can only be created after runtime CPU detection succeeds:

// summon() checks CPUID and returns Some only if features are available
if let Some(token) = Desktop64::summon() {
    // Token exists = CPU definitely has AVX2 + FMA
}

The #[arcane] macro transforms your function to enable #[target_feature], which makes value-based intrinsics safe (Rust 1.85+):

#[arcane]
fn example(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    let v = safe_unaligned_simd::x86_64::_mm256_loadu_ps(data);  // Safe!
    let result = _mm256_mul_ps(v, v);  // Safe! (value-based)
    // ...
}

For memory operations (#2), use the safe_unaligned_simd crate which provides reference-based alternatives.

Token Reference

x86-64 Tokens

Use X64V3Token (or its alias Desktop64) for most applications:

Token Features CPU Support
X64V2Token SSE4.2 + POPCNT Windows 11 minimum, Nehalem 2008+
X64V3Token AVX2 + FMA + BMI2 95%+ of CPUs, Haswell 2013+, Zen 1+
Desktop64 AVX2 + FMA + BMI2 Alias for X64V3Token

x86-64 AVX-512 Tokens (requires avx512 feature)

[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
Token Features CPU Support
X64V4Token AVX-512 F/BW/CD/DQ/VL Intel Skylake-X 2017+, AMD Zen 4 2022+
Avx512ModernToken + VBMI2, VNNI, BF16, etc. Intel Ice Lake 2019+, AMD Zen 4+
Avx512Fp16Token + FP16 Intel Sapphire Rapids 2023+

Note: Intel 12th-14th gen consumer CPUs do NOT have AVX-512.

ARM Tokens

Token Features CPU Support
Arm64 NEON All AArch64 (baseline)
NeonToken NEON Same as Arm64 (alias)
NeonAesToken NEON + AES ARM with crypto extensions
NeonSha3Token NEON + SHA3 ARMv8.2+
NeonCrcToken NEON + CRC Most ARMv8 CPUs

WASM Tokens

Token Features
Simd128Token WASM SIMD

Target Selection

Choosing Your Baseline

x86-64-v2 is the minimum requirement for Windows 11, making it a safe baseline for distributed binaries. However, 95%+ of desktop/laptop CPUs from the last decade support x86-64-v3 (AVX2+FMA), so optimizing for v3 covers nearly all users.

Target Use Case Coverage
x86-64-v2 Maximum compatibility (Windows 11 minimum) ~100%
x86-64-v3 Recommended for most apps ~95%+
x86-64-v4 Server/HPC workloads Xeon, Zen 4+

For most applications, compile a v2 baseline and add v3-optimized paths:

if let Some(token) = X64V3Token::summon() {
    fast_path(token, data);  // 95%+ of users
} else {
    baseline_path(data);      // Fallback
}

Compile-Time Optimization

When you compile with -C target-cpu=native or specify target features that match or exceed a token's requirements, runtime detection is eliminated:

// Compiled with RUSTFLAGS="-C target-cpu=haswell"
if let Some(token) = X64V3Token::summon() {  // Always succeeds, check optimized away
    process(token, data);
} else {
    fallback(data);  // Dead code, optimized away entirely
}

This means:

  • summon() becomes a no-op returning Some
  • The else branch is eliminated by the optimizer
  • Zero runtime overhead for feature detection

Build for your deployment target and let the compiler eliminate unused paths.

Token Hierarchy

Tokens form a hierarchy. Higher-level tokens can extract lower-level ones:

if let Some(v3) = X64V3Token::summon() {
    let v2: X64V2Token = v3.v2();  // v3 implies v2
}

if let Some(v4) = X64V4Token::summon() {
    let v3: X64V3Token = v4.v3();  // v4 implies v3
    let v2: X64V2Token = v4.v2();  // v4 implies v2
}

Trait Bounds

Use trait bounds for generic SIMD code:

use archmage::{HasX64V2, SimdToken, arcane};

// Accept any token with at least v2 features
#[arcane]
fn process<T: HasX64V2>(_token: T, data: &[u8]) {
    // SSE4.2 intrinsics available
}

Available traits:

Trait Meaning
SimdToken Base trait for all tokens
HasX64V2 Has SSE4.2 + POPCNT
HasX64V4 Has AVX-512 (requires avx512 feature)
Has128BitSimd Has 128-bit vectors
Has256BitSimd Has 256-bit vectors
Has512BitSimd Has 512-bit vectors
HasNeon Has ARM NEON
HasNeonAes Has NEON + AES
HasNeonSha3 Has NEON + SHA3

Cross-Platform Code

All tokens compile on all platforms. summon() returns None on unsupported architectures:

use archmage::{Desktop64, Arm64, SimdToken};

fn process(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        process_avx2(token, data);
    } else if let Some(token) = Arm64::summon() {
        process_neon(token, data);
    } else {
        process_scalar(data);
    }
}

SIMD Types

The companion crate magetypes provides token-gated SIMD types with ergonomic operators:

[dependencies]
magetypes = "0.4"
use archmage::{Desktop64, SimdToken};
use magetypes::simd::f32x8;

if let Some(token) = Desktop64::summon() {
    let a = f32x8::splat(token, 2.0);
    let b = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
    let c = a * b + a;  // Operators work naturally
    let result = c.sqrt();
    println!("{:?}", result.to_array());
}

Available Types

Width Float Signed Int Unsigned Int Token Required
128-bit f32x4, f64x2 i8x16, i16x8, i32x4, i64x2 u8x16, u16x8, u32x4, u64x2 X64V3Token
256-bit f32x8, f64x4 i8x32, i16x16, i32x8, i64x4 u8x32, u16x16, u32x8, u64x4 X64V3Token
512-bit f32x16, f64x8 i8x64, i16x32, i32x16, i64x8 u8x64, u16x32, u32x16, u64x8 X64V4Token

Operations

Construction (requires token): splat, from_array, load, zero

Extraction: to_array, as_array, store, raw

Arithmetic: +, -, *, / and assignment variants

Bitwise: &, |, ^ and assignment variants

Math (float): sqrt, abs, floor, ceil, round, min, max, clamp, mul_add, mul_sub, recip, rsqrt

Transcendentals (float): log2_lowp, log2_midp, exp2_lowp, exp2_midp, ln_lowp, ln_midp, exp_lowp, exp_midp, pow_lowp, pow_midp, cbrt_midp

Comparison: simd_eq, simd_ne, simd_lt, simd_le, simd_gt, simd_ge

Reduction: reduce_add, reduce_min, reduce_max

Integer: shl::<N>, shr::<N>, shr_arithmetic::<N>

Feature Flags

Feature Description
std (default) Standard library support
macros (default) #[arcane] macro
avx512 AVX-512 tokens

License

MIT OR Apache-2.0

AI-Generated Code Notice

Developed with Claude (Anthropic). Review critical paths before production use.

About

Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages