Context
The M5 introduces per-core Neural Accelerators delivering 4x peak AI compute vs M4. The KB has performance headlines (3.3-4x TTFT speedup, 627 tok/s on Llama 7B) but lacks the operational constraints needed to write kernels that reliably route to the Neural Accelerator hardware. Current KB depth: ~40% (performance known, constraints unknown).
Gap Description
To build a Metal 4 kernel library that targets M5 Neural Accelerators effectively, we need to know when operations WILL vs WON'T use the accelerator hardware. Without this, kernels may silently fall back to the ALU path (still correct, but missing the potential 4x speedup).
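To make the gap concrete, here is a minimal sketch of the kind of Metal 4 tensor-ops kernel the library needs to write. It uses only names already cited in the findings (matmul2d, MPP, <metal_tensor>, simdgroup execution scopes); the exact headers, descriptor fields, and template signatures are assumptions to be validated against the Xcode 26 SDK headers (Area 2), not confirmed API.

```metal
// Sketch only: descriptor and template signatures are assumed, not verified.
#include <metal_stdlib>
#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

// Fixed-size FP16 GEMM: C (64x64, FP32 accumulation) = A (64x64) * B (64x64).
kernel void gemm_fp16_tensor_ops(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]])
{
    // Assumed constructor order (M, N, K); the real descriptor likely has more fields.
    constexpr tensor_ops::matmul2d_descriptor desc(64, 64, 64);

    // Cooperative execution across 4 simdgroups. Whether this scope (vs
    // execution_thread or a full threadgroup) is required for Neural
    // Accelerator routing is exactly what this gap asks.
    tensor_ops::matmul2d<desc, execution_simdgroups<4>> op;
    op.run(A, B, C);
}
```

Per finding #585, the same source runs on M1-M4 via an ALU fallback; the open question is under which conditions it instead routes to the Neural Accelerator on M5.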
What We Have (8 findings)
- #19: 4x peak AI compute vs M4, 6x vs M1. Doubles FP16 throughput.
- #31: 128 matrix FMAs per compute partition per cycle
- #32: NOT directly exposed in MSL — only via MPP and Metal 4 Tensor APIs
- #33: TTFT 3.3-4x faster, token gen 2-3x faster (bandwidth-bound)
- #214: ~1024 FP16 FLOPS/core/cycle. Projected M5 Max: ~70 TFLOPS FP16, ~130 TFLOPS INT8
- #565: Llama 7B Q4_0: 627.88 t/s. Qwen 14B 4-bit: 4.06x TTFT speedup
- #585: Same shader code works M1-M5. M1-M4: ALU fallback. M5: Neural Accelerator routing.
- #213: llama.cpp cooperative tensors disabled by default on M1-M3 "due to performance considerations"
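As a sanity check, finding #214's per-core figure is consistent with the projected chip-level numbers if one assumes an M4 Max-like configuration (the core count and clock below are assumptions; no M5 Max exists yet):

$$
1024\ \tfrac{\text{FP16 FLOPS}}{\text{core}\cdot\text{cycle}} \times 40\ \text{cores} \times 1.7\ \text{GHz} \approx 70\ \text{TFLOPS}
$$

Doubling the per-core rate for INT8 lands near the ~130 TFLOPS projection.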
What We Need
- Supported operation types: Is it ONLY matmul? Or also convolution, elementwise, reduction? Finding #564 mentions convolution2d exists — does it route to the Neural Accelerator?
- Minimum tensor dimensions: Below what size does the Neural Accelerator not engage? Is there a minimum M/N/K for matmul2d? Small tensors likely route to ALU (see the probe sketch after this list).
- Data type constraints: FP16 confirmed. INT8 mentioned (#214, #560). BFloat16 M5-exclusive (#585). What about FP32? Mixed precision (FP16 × INT8 → INT32, mentioned in #560)?
- Execution scope requirements: Does the Neural Accelerator require `threadgroup` scope? Or does `simdgroups(N)` also route to hardware? What about `execution_thread`?
- Pipeline depth and latency: What's the startup cost of engaging the Neural Accelerator? For small operations, is ALU fallback actually faster due to lower latency?
- Memory access patterns: Does the Neural Accelerator have its own memory path? Or does it use the same L1/threadgroup memory as the ALU? What are the bandwidth implications?
- Concurrent ALU + Neural Accelerator: Can a kernel use BOTH the ALU and Neural Accelerator simultaneously (e.g., dequantize on the ALU while the matrix multiply runs on the accelerator)?
- Power and thermal behavior: Does sustained Neural Accelerator usage cause throttling? What's the sustained vs burst TFLOPS?
- Fallback behavior specifics: Finding #585 says "shader-based fallback, no regression" on M1-M4. But #213 says cooperative tensors are "disabled by default on M1-M3 due to performance considerations." Which is it? Is there a regression on older hardware?
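Several of these items (minimum dimensions, pipeline latency, the ALU crossover) could be answered with one microbenchmark pattern: compile the same tensor-ops matmul at a sweep of sizes, time repeated dispatches of each, and look for the step change where the accelerator engages. A rough sketch of the kernel side follows, reusing the assumed matmul2d API from the Gap Description sketch; the host would build one pipeline per size (e.g. via -D flags to the Metal compiler) and record achieved FLOPS.

```metal
// Probe kernel: one compile per (PROBE_M, PROBE_N, PROBE_K), set by the host
// with -D defines. API details are assumptions, as in the earlier sketch.
#include <metal_stdlib>
#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

#ifndef PROBE_M
#define PROBE_M 16
#endif
#ifndef PROBE_N
#define PROBE_N 16
#endif
#ifndef PROBE_K
#define PROBE_K 16
#endif

kernel void gemm_probe(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]],
    constant uint& iters [[buffer(3)]])
{
    constexpr tensor_ops::matmul2d_descriptor desc(PROBE_M, PROBE_N, PROBE_K);
    tensor_ops::matmul2d<desc, execution_simdgroups<4>> op;

    // Repeat in-kernel so dispatch overhead does not swamp small sizes; a jump
    // in achieved FLOPS across the size sweep marks where routing changes,
    // and a dip below it would expose any accelerator startup latency.
    for (uint i = 0; i < iters; ++i) {
        op.run(A, B, C);
    }
}
```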
Research Areas
Area 1: Apple ML Research Blog — M5 Deep Dive
Source: "Exploring LLMs with MLX and Neural Accelerators in M5 GPU" (Apple ML Research)
Research targets:
- Detailed Neural Accelerator microarchitecture description
- Supported operation matrix (which tensor_ops route to hardware)
- Performance characteristics by tensor size
- Memory hierarchy interaction
Area 2: Metal 4 Tensor API Headers
Source: Xcode 26 SDK — <metal_tensor>, MPP headers
Research targets:
- Which tensor_ops functions exist (complete enumeration)
- Data type support per function
- Descriptor constraints and validation rules
- Execution scope requirements documented in headers
Area 3: Philip Turner's M5 Benchmarks
Source: github.com/philipturner/metal-benchmarks
Research targets:
- M5 microbenchmarks for matrix operations at various sizes
- Crossover point: where Neural Accelerator beats ALU
- Bandwidth measurements for tensor operations
- Register pressure impact of cooperative tensors
Area 4: llama.cpp M5 Performance Analysis
Source: llama.cpp benchmarks, GitHub discussions, PR #16634
Research targets:
- Why cooperative tensors disabled on M1-M3 (performance regression details)
- M5-specific optimizations beyond cooperative tensors
- Tile size selection rationale (128x64x32)
- Prefill vs generation performance breakdown on M5
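One quick check on the 128x64x32 tile size flagged above, assuming llama.cpp stages both operand tiles in threadgroup memory as FP16 (an assumption about the implementation, not a documented fact):

$$
128 \times 32 \times 2\,\text{B} \;+\; 32 \times 64 \times 2\,\text{B} \;=\; 8\,\text{KiB} + 4\,\text{KiB} \;=\; 12\,\text{KiB}
$$

That sits comfortably under the 32 KiB threadgroup-memory limit on Apple GPUs, leaving room for accumulators or double buffering, so the research should establish whether register pressure from cooperative tensors (Area 3), rather than threadgroup memory, is what actually bounds the tile shape.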
Area 5: WWDC25 GPU Architecture Session
Source: WWDC25 sessions on M5 GPU and Metal 4
Research targets:
- Neural Accelerator block diagram and data path
- How the scheduler routes operations (automatic? hint-based?)
- Interaction with Dynamic Caching (M3+ feature)
- Power management and sustained performance characteristics
Impact
- Metal 4 kernel library: Need to know constraints to write kernels that actually use the accelerator
- Performance modeling: Cannot predict inference performance without understanding when 4x speedup applies
- Hardware targeting: Kernel specialization strategy (M1-M4 ALU path vs M5 accelerator path)
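For the hardware-targeting point, the likely shape of the strategy is two kernels built from one source: a plain ALU GEMM that runs everywhere, plus the tensor-ops kernel sketched under Gap Description, with the host selecting per device at pipeline-creation time. The fallback side is standard MSL and can be written today; the kernel name below is illustrative.

```metal
// ALU-path baseline (hypothetical name): runs on M1 through M5, one thread per
// C element. A library would pair this with the tensor_ops kernel above and
// pick between them per device generation when building pipelines.
#include <metal_stdlib>
using namespace metal;

kernel void gemm_fp16_fallback(
    device const half* A    [[buffer(0)]],   // M x K, row-major
    device const half* B    [[buffer(1)]],   // K x N, row-major
    device float*      C    [[buffer(2)]],   // M x N, row-major
    constant uint3&    dims [[buffer(3)]],   // (M, N, K)
    uint2 gid [[thread_position_in_grid]])
{
    const uint M = dims.x, N = dims.y, K = dims.z;
    if (gid.x >= N || gid.y >= M) { return; }

    // FP32 accumulation, mirroring the accumulator precision of the tensor path.
    float acc = 0.0f;
    for (uint k = 0; k < K; ++k) {
        acc += float(A[gid.y * K + k]) * float(B[k * N + gid.x]);
    }
    C[gid.y * N + gid.x] = acc;
}
```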
Recommended KB Addition
Skills: gpu-silicon, metal4-api, gpu-perf
Topics: Neural Accelerator operational limits, tensor dimension thresholds, data type support matrix
Estimated findings: 10-15 new findings from hardware characterization and documentation