Context
The M5 introduces per-core Neural Accelerators delivering 4x peak AI compute vs M4. The KB has performance headlines (3.3-4x TTFT speedup, 627 tok/s on Llama 7B) but lacks the operational constraints needed to write kernels that reliably route to the Neural Accelerator hardware. Current KB depth: ~40% (performance known, constraints unknown).
Gap Description
To build a Metal 4 kernel library that targets M5 Neural Accelerators effectively, we need to know when operations WILL vs WON'T use the accelerator hardware. Without this, kernels may silently fall back to the ALU path (still correct, but missing the potential 4x speedup).
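To make the gap concrete, here is a minimal sketch of the kind of Metal 4 tensor-ops kernel the library needs to write. It uses only names already cited in the findings (matmul2d, MPP, <metal_tensor>, simdgroup execution scopes); the exact headers, descriptor fields, and template signatures are assumptions to be validated against the Xcode 26 SDK headers (Area 2), not confirmed API.

```metal
// Sketch only: descriptor and template signatures are assumed, not verified.
#include <metal_stdlib>
#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

// Fixed-size FP16 GEMM: C (64x64, FP32 accumulation) = A (64x64) * B (64x64).
kernel void gemm_fp16_tensor_ops(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]])
{
    // Assumed constructor order (M, N, K); the real descriptor likely has more fields.
    constexpr tensor_ops::matmul2d_descriptor desc(64, 64, 64);

    // Cooperative execution across 4 simdgroups. Whether this scope (vs
    // execution_thread or a full threadgroup) is required for Neural
    // Accelerator routing is exactly what this gap asks.
    tensor_ops::matmul2d<desc, execution_simdgroups<4>> op;
    op.run(A, B, C);
}
```

Per finding #585, the same source runs on M1-M4 via an ALU fallback; the open question is under which conditions it instead routes to the Neural Accelerator on M5.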
What We Have (8 findings)
- #19: 4x peak AI compute vs M4, 6x vs M1. Doubles FP16 throughput.
- #31: 128 matrix FMAs per compute partition per cycle
- #32: NOT directly exposed in MSL — only via MPP and Metal 4 Tensor APIs
- #33: TTFT 3.3-4x faster, token gen 2-3x faster (bandwidth-bound)
- #214: ~1024 FP16 FLOPS/core/cycle. Projected M5 Max: ~70 TFLOPS FP16, ~130 TFLOPS INT8
- #565: Llama 7B Q4_0: 627.88 t/s. Qwen 14B 4-bit: 4.06x TTFT speedup
- #585: Same shader code works M1-M5. M1-M4: ALU fallback. M5: Neural Accelerator routing.
- #213: llama.cpp cooperative tensors disabled by default on M1-M3 "due to performance considerations"
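As a sanity check, finding #214's per-core figure is consistent with the projected chip-level numbers if one assumes an M4 Max-like configuration (the core count and clock below are assumptions; no M5 Max exists yet):

$$
1024\ \tfrac{\text{FP16 FLOPS}}{\text{core}\cdot\text{cycle}} \times 40\ \text{cores} \times 1.7\ \text{GHz} \approx 70\ \text{TFLOPS}
$$

Doubling the per-core rate for INT8 lands near the ~130 TFLOPS projection.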
What We Need
- Supported operation types: Is it ONLY matmul? Or also convolution, elementwise, reduction? Finding #564 mentions convolution2d exists — does it route to the Neural Accelerator?
- Minimum tensor dimensions: Below what size does the Neural Accelerator not engage? Is there a minimum M/N/K for matmul2d? Small tensors likely route to ALU (see the probe sketch after this list).
- Data type constraints: FP16 confirmed. INT8 mentioned (#214, #560). BFloat16 M5-exclusive (#585). What about FP32? Mixed precision (FP16 × INT8 → INT32, mentioned in #560)?
- Execution scope requirements: Does the Neural Accelerator require `threadgroup` scope? Or does `simdgroups(N)` also route to hardware? What about `execution_thread`?
- Pipeline depth and latency: What's the startup cost of engaging the Neural Accelerator? For small operations, is ALU fallback actually faster due to lower latency?
- Memory access patterns: Does the Neural Accelerator have its own memory path? Or does it use the same L1/threadgroup memory as the ALU? What are the bandwidth implications?
- Concurrent ALU + Neural Accelerator: Can a kernel use BOTH the ALU and Neural Accelerator simultaneously (e.g., dequantize on the ALU while the matrix multiply runs on the accelerator)?
- Power and thermal behavior: Does sustained Neural Accelerator usage cause throttling? What's the sustained vs burst TFLOPS?
- Fallback behavior specifics: Finding #585 says "shader-based fallback, no regression" on M1-M4. But #213 says cooperative tensors are "disabled by default on M1-M3 due to performance considerations." Which is it? Is there a regression on older hardware?
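Several of these items (minimum dimensions, pipeline latency, the ALU crossover) could be answered with one microbenchmark pattern: compile the same tensor-ops matmul at a sweep of sizes, time repeated dispatches of each, and look for the step change where the accelerator engages. A rough sketch of the kernel side follows, reusing the assumed matmul2d API from the Gap Description sketch; the host would build one pipeline per size (e.g. via -D flags to the Metal compiler) and record achieved FLOPS.

```metal
// Probe kernel: one compile per (PROBE_M, PROBE_N, PROBE_K), set by the host
// with -D defines. API details are assumptions, as in the earlier sketch.
#include <metal_stdlib>
#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;

#ifndef PROBE_M
#define PROBE_M 16
#endif
#ifndef PROBE_N
#define PROBE_N 16
#endif
#ifndef PROBE_K
#define PROBE_K 16
#endif

kernel void gemm_probe(
    tensor<device half,  dextents<int, 2>> A [[buffer(0)]],
    tensor<device half,  dextents<int, 2>> B [[buffer(1)]],
    tensor<device float, dextents<int, 2>> C [[buffer(2)]],
    constant uint& iters [[buffer(3)]])
{
    constexpr tensor_ops::matmul2d_descriptor desc(PROBE_M, PROBE_N, PROBE_K);
    tensor_ops::matmul2d<desc, execution_simdgroups<4>> op;

    // Repeat in-kernel so dispatch overhead does not swamp small sizes; a jump
    // in achieved FLOPS across the size sweep marks where routing changes,
    // and a dip below it would expose any accelerator startup latency.
    for (uint i = 0; i < iters; ++i) {
        op.run(A, B, C);
    }
}
```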
Research Areas
Area 1: Apple ML Research Blog — M5 Deep Dive
Source: "Exploring LLMs with MLX and Neural Accelerators in M5 GPU" (Apple ML Research)
Research targets:
- Detailed Neural Accelerator microarchitecture description
- Supported operation matrix (which tensor_ops route to hardware)
- Performance characteristics by tensor size
- Memory hierarchy interaction
Area 2: Metal 4 Tensor API Headers
Source: Xcode 26 SDK — <metal_tensor>, MPP headers
Research targets:
- Which tensor_ops functions exist (complete enumeration)
- Data type support per function
- Descriptor constraints and validation rules
- Execution scope requirements documented in headers
Area 3: Philip Turner's M5 Benchmarks
Source: github.com/philipturner/metal-benchmarks
Research targets:
- M5 microbenchmarks for matrix operations at various sizes
- Crossover point: where Neural Accelerator beats ALU
- Bandwidth measurements for tensor operations
- Register pressure impact of cooperative tensors
Area 4: llama.cpp M5 Performance Analysis
Source: llama.cpp benchmarks, GitHub discussions, PR #16634
Research targets:
- Why cooperative tensors disabled on M1-M3 (performance regression details)
- M5-specific optimizations beyond cooperative tensors
- Tile size selection rationale (128x64x32)
- Prefill vs generation performance breakdown on M5
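One quick check on the 128x64x32 tile size flagged above, assuming llama.cpp stages both operand tiles in threadgroup memory as FP16 (an assumption about the implementation, not a documented fact):

$$
128 \times 32 \times 2\,\text{B} \;+\; 32 \times 64 \times 2\,\text{B} \;=\; 8\,\text{KiB} + 4\,\text{KiB} \;=\; 12\,\text{KiB}
$$

That sits comfortably under the 32 KiB threadgroup-memory limit on Apple GPUs, leaving room for accumulators or double buffering, so the research should establish whether register pressure from cooperative tensors (Area 3), rather than threadgroup memory, is what actually bounds the tile shape.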
Area 5: WWDC25 GPU Architecture Session
Source: WWDC25 sessions on M5 GPU and Metal 4
Research targets:
- Neural Accelerator block diagram and data path
- How the scheduler routes operations (automatic? hint-based?)
- Interaction with Dynamic Caching (M3+ feature)
- Power management and sustained performance characteristics
Impact
- Metal 4 kernel library: Need to know constraints to write kernels that actually use the accelerator
- Performance modeling: Cannot predict inference performance without understanding when 4x speedup applies
- Hardware targeting: Kernel specialization strategy (M1-M4 ALU path vs M5 accelerator path)
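For the hardware-targeting point, the likely shape of the strategy is two kernels built from one source: a plain ALU GEMM that runs everywhere, plus the tensor-ops kernel sketched under Gap Description, with the host selecting per device at pipeline-creation time. The fallback side is standard MSL and can be written today; the kernel name below is illustrative.

```metal
// ALU-path baseline (hypothetical name): runs on M1 through M5, one thread per
// C element. A library would pair this with the tensor_ops kernel above and
// pick between them per device generation when building pipelines.
#include <metal_stdlib>
using namespace metal;

kernel void gemm_fp16_fallback(
    device const half* A    [[buffer(0)]],   // M x K, row-major
    device const half* B    [[buffer(1)]],   // K x N, row-major
    device float*      C    [[buffer(2)]],   // M x N, row-major
    constant uint3&    dims [[buffer(3)]],   // (M, N, K)
    uint2 gid [[thread_position_in_grid]])
{
    const uint M = dims.x, N = dims.y, K = dims.z;
    if (gid.x >= N || gid.y >= M) { return; }

    // FP32 accumulation, mirroring the accumulator precision of the tensor path.
    float acc = 0.0f;
    for (uint k = 0; k < K; ++k) {
        acc += float(A[gid.y * K + k]) * float(B[k * N + gid.x]);
    }
    C[gid.y * N + gid.x] = acc;
}
```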
Recommended KB Addition
Skills: gpu-silicon, metal4-api, gpu-perf
Topics: Neural Accelerator operational limits, tensor dimension thresholds, data type support matrix
Estimated findings: 10-15 new findings from hardware characterization and documentation