
ANE private API research: chaining, E5 runtime, custom MIL compilation#40

Open
dev-erik wants to merge 1 commit into maderix:main from dev-erik:ane-chaining-research

Conversation


dev-erik commented Mar 4, 2026

TL;DR

We reverse-engineered three paths to direct ANE access on macOS 15 (M4 Max):

  1. _ANEChainingRequest (legacy) -- multi-kernel pipelining API. We got it to validate, but it requires Espresso IR that the in-memory MIL path cannot produce; a dead end on macOS 15+.
  2. E5 runtime (MLE5Engine) -- the modern ANE execution path used by CoreML internally. We validated its behaviour and found that CoreML's MLDelegateModel caching outperforms direct engine calls.
  3. Custom MIL compilation (breakthrough) -- we can write arbitrary MIL text programs and compile them directly to the ANE via MLE5ProgramLibraryOnDeviceAOTCompilationImpl. We verified attention, linear layers, full transformer blocks, and backward pass matmuls all execute correctly on ANE hardware.

Bottom line: Custom MIL compilation is the viable path for direct ANE compute. The legacy chaining API is obsolete on macOS 15+. Training on ANE is theoretically possible but impractical due to read-only weights requiring recompilation (~10-50ms) after every update.
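The impracticality claim is just arithmetic on the numbers measured below. A back-of-envelope sketch in Python (the step composition is illustrative -- only the ~10-50ms recompile cost and the Phase 3 kernel latencies come from the experiments):

```python
# Why ANE training is impractical: weights are baked in as consts, so
# every optimizer step forces a recompile (~10-50 ms per the experiments
# below), dwarfing the sub-millisecond kernel latencies measured in Phase 3.

def steps_per_second(recompile_ms, compute_ms):
    """Training throughput ceiling when each step pays a recompile."""
    return 1000.0 / (recompile_ms + compute_ms)

compute_ms = 0.21 + 0.06  # Y3 transformer block fwd + Z1 backward matmul
for recompile_ms in (10.0, 50.0):
    print(f"recompile {recompile_ms:>4.0f} ms -> "
          f"{steps_per_second(recompile_ms, compute_ms):6.1f} steps/s")
```

Even at the optimistic 10ms recompile, the step rate is capped near ~100 steps/s regardless of how fast the kernels themselves run.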


Experiments & Results

Phase 1: ChainingRequest API (Experiments A-P)

Systematically probed 12+ private Obj-C classes to understand the _ANEChainingRequest pipeline for multi-kernel pipelining (running ANE ops back-to-back without CPU round-trips).

| Finding | Detail |
| --- | --- |
| `_ANEChainingRequest.validate` | Succeeds when using `_ANEBuffer` (wraps IOSurface with symbolIndex) instead of `_ANEIOSurfaceObject` |
| `_ANEIOSurfaceOutputSets` | Works with any non-NULL IOSurface as `statsSurRef` |
| `prepareChainingWithModel:` | Requires `_ANEModel` (disk-compiled Espresso IR); crashes with `_ANEInMemoryModel` |
| `_ANEClient.evaluateRealTimeWithModel:` | ~1.7x faster than `evaluateWithQoS:` at small dims (64x64); no advantage at production dims (768x256) |
| `_ANESharedSignalEvent` / `_ANESharedWaitEvent` | Hardware fence primitives -- require IOSurfaceSharedEvent, work with MTLSharedEvent |
| Conclusion | Legacy API. On macOS 15+, CoreML uses the E5 runtime instead. Blocked at Code=15 (`ANEProgramChainingPrepare Failed`) |

Phase 2: E5 Runtime Validation (Experiments W1-W5)

Validated the modern E5 execution path that CoreML uses internally on macOS 15+.

| Finding | Detail |
| --- | --- |
| Runtime architecture | `MLModel` -> `MLDelegateModel` -> `MLE5Engine` -> `MLE5ExecutionStream` -> `e5rt_program_library` |
| MLDelegateModel caching | Faster than calling `MLE5Engine.predictionFromFeatures:` directly, due to internal stream/operation caching |
| Manual `_executeStream:` | With fabricated `MLE5ExecutionStreamOperation` objects (handle=0x0) these are no-ops -- output validation is critical |
| Key classes | `MLE5ProgramLibrary`, `MLE5StaticShapeExecutionStreamOperationPool`, `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` |

Phase 3: Custom MIL -> ANE Execution (Experiments X1, Y1-Y3, Z1)

Breakthrough: write MIL text, compile to e5rt_program_library, execute on ANE via MLE5Engine.

| Experiment | Operation | Result | Accuracy | Latency |
| --- | --- | --- | --- | --- |
| X1 | ReLU, GELU, Softmax, LayerNorm | All PASSED | Verified vs CPU reference | < 0.1ms |
| Y1 | `scaled_dot_product_attention` (self-attn, 4 heads) | PASSED | max_diff = 0.000027 | 0.17ms |
| Y2 | `linear` with embedded const weights (64x32 -> 64x16) | PASSED | max_diff = 0.001660 | 0.06ms |
| Y3 | Full transformer block (LN + SDPA + Residual + FFN + GELU) | PASSED | Verified | 0.21ms |
| Z1 | Backward pass matmul (dX = dY @ W, dW = X^T @ dY) | PASSED | dX_diff=0.002, dW_diff=0.013 | 0.06ms |
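For reference, the CPU checks behind Y1 and Z1 reduce to a few lines of NumPy. A sketch (single-head, illustrative shapes, not byte-identical to the test programs; for Y = X @ W the gradients are dX = dY W^T and dW = X^T dY -- Z1's "dX = dY @ W" reading assumes W is stored pre-transposed):

```python
import numpy as np

# CPU reference math for Y1 (SDPA) and Z1 (backward matmul).

def sdpa(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, computed with the usual max-subtraction."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def matmul_backward(x, w, d_y):
    """Gradients of Y = X @ W: dX = dY W^T, dW = X^T dY."""
    return d_y @ w.T, x.T @ d_y

rng = np.random.default_rng(0)
x, w = rng.standard_normal((64, 32)), rng.standard_normal((32, 16))
d_y = rng.standard_normal((64, 16))
d_x, d_w = matmul_backward(x, w, d_y)
assert d_x.shape == (64, 32) and d_w.shape == (32, 16)
```

The ANE outputs in X1/Y1-Y3/Z1 were compared elementwise against this kind of reference, which is where the max_diff figures in the table come from.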

The compilation pipeline: MIL text -> `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` -> `createProgramLibraryHandleWithRespecialization:` -> `MLE5ProgramLibrary` -> `MLE5Engine` (7-arg init) -> `predictionFromFeatures:`.

Additional Benchmarks

| Experiment | What | Result |
| --- | --- | --- |
| Throughput ceiling (Exp I) | Sequential 12-kernel execution overhead | ~0.2ms/kernel at production dims; ~4.8ms total for 24 evals |
| Bench paths | Standard vs RT vs processRequest at 64x32, 256x128, 768x256 | RT only helps at small dims; all paths converge at production dims |

Files Added

Test Programs

| File | Lines | What it does |
| --- | --- | --- |
| `training/test_chaining_v2.m` | 1,700 | Deep 6-phase probe of `_ANEChainingRequest` and 12+ private ANE classes. Dumps methods, type encodings, properties. Tests `_ANEBuffer`, `_ANEIOSurfaceOutputSets`, `_ANEProgramIOSurfacesMapper`, `_ANESharedSignalEvent`/`WaitEvent`. Benchmarks standard vs RT eval paths. |
| `training/test_ane_model.m` | 2,260 | Experiments E-P: `_ANEModel` factory methods, `_ANECompiler` compilation, `prepareChainingWithModel:` crash investigation, `_ANEInputBuffersReady`/`_ANEOutputSetEnqueue` type encoding, `_ANEProgramForEvaluation.processRequest`, shared event construction, IOSurface mapper exploration. |
| `training/test_coreml_chaining.m` | 1,003 | Experiments Q-S: Uses CoreML-compiled models (via `MLModel compileModelAtURL:`) to extract `_ANEModel` objects. Tests `_ANEBuffer` creation with symbolIndex, `_ANEIOSurfaceOutputSets` with stats surfaces, chaining request validation, `prepareChainingWithModel:` with various parameter combinations. |
| `training/test_e5_validate.m` | 817 | Experiments W1-W5: E5 runtime validation. Extracts `MLE5Engine` and `MLE5ProgramLibrary` from compiled CoreML models. Tests `_executeStream:` with fabricated operations. Profiles `MLDelegateModel` vs direct `MLE5Engine`. Dumps all E5 class methods and properties. |
| `training/test_mil_custom.m` | 915 | Experiments X1, Y1-Y3, Z1: Custom MIL text compilation to ANE. Contains the `compileAndCreateEngine` helper (the full MIL -> ANE pipeline) and `findE5Container` for extracting `MLProgramE5Container`. Runs SDPA, linear-with-weights, full transformer block, and backward pass matmul -- all verified against CPU reference implementations. |
| `training/test_throughput_ceiling.m` | 238 | Experiment I: Compiles 12 identical conv kernels and measures sequential ANE throughput ceiling at production dimensions (768x256). Quantifies CPU round-trip overhead per kernel. |
| `training/test_bench_paths.m` | 148 | Benchmarks standard `evaluateWithQoS:` vs RT `evaluateRealTimeWithModel:` vs `processRequest` at three dimension sets (64x32, 256x128, 768x256). Shows the RT advantage disappears at production dims. |

Documentation

| File | Lines | What it covers |
| --- | --- | --- |
| `docs/ANE_CHAINING_RESEARCH.md` | 1,112 | Complete experiment logs with raw output for all phases. Class hierarchy diagram. MIL operations reference table (all ops verified on ANE). Architecture diagrams for chaining data flow. |
| `docs/ANE_INTERNALS.md` | 464 | ANE hardware architecture reference (cores, TOPS, SRAM by chip). Compilation pipeline (MIL -> coremlc -> espresso -> .hwx -> firmware). Two runtime paths (legacy Espresso vs modern E5). IOSurface I/O conventions. Community tools (ANETools, anecc, coreml_to_ane_hwx). |

Modified Files

| File | Change |
| --- | --- |
| `training/Makefile` | Added build targets and clean rules for all 7 new test programs |
| `training/ane_runtime.h` | Added `ane_eval_rt()` -- wrapper for `_ANEClient.evaluateRealTimeWithModel:` with fallback to standard eval |

MIL Syntax Lessons Learned

These are non-obvious and not documented by Apple anywhere:

  • `layer_norm` epsilon type must match the gamma/beta dtype (fp16, not fp32)
  • `matmul` requires both `transpose_x` and `transpose_y` as bool consts
  • `concat` requires an `interleave` (bool) param and `axis` as an int32 scalar (not a tensor)
  • `MLE5Engine` uses a 7-argument initializer: `initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:`
  • `MLProgramE5Container` can be created via `initWithModelAssetPath:configuration:` from a `.mlmodelc` path
  • E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for the ANE specialization cache
  • MIL I/O names and shapes must exactly match the `MLModelDescription` passed to `MLE5Engine`
  • ANE I/O convention: fp32 at boundaries, fp16 internally -- use `cast` ops
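To make these lessons concrete, here is a hypothetical MIL text fragment applying several of them at once. The grammar follows public coremltools MIL text conventions and is a sketch only -- identifier names and the `dense<...>` weight payloads are placeholders, and it may not match the test programs byte-for-byte:

```
// Hypothetical sketch: fp32 boundary casts, fp16 layer_norm epsilon,
// and bool transpose consts on matmul.
main(%x: tensor<fp32, [64, 32]>) {
  block0() {
    %xh  = cast(x=%x, dtype="fp16")                  // fp32 boundary -> fp16 internal
    %g   = const(val=dense<...> : tensor<fp16, [32]>)
    %b   = const(val=dense<...> : tensor<fp16, [32]>)
    %eps = const(val=1e-3 : fp16)                    // epsilon dtype must match gamma/beta
    %ln  = layer_norm(x=%xh, axes=[-1], gamma=%g, beta=%b, epsilon=%eps)
    %tx  = const(val=false : bool)
    %ty  = const(val=true : bool)                    // both transpose flags are required
    %w   = const(val=dense<...> : tensor<fp16, [16, 32]>)
    %y   = matmul(x=%ln, y=%w, transpose_x=%tx, transpose_y=%ty)
    %out = cast(x=%y, dtype="fp32")                  // fp16 internal -> fp32 boundary
  } -> (%out)
}
```

Note the output name and shape (`%out`, fp32 [64, 16]) would have to match the `MLModelDescription` handed to `MLE5Engine` exactly, per the I/O lesson above.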

Build & Run

```sh
cd training

# Build all experiments
make test_chaining_v2 test_ane_model test_coreml_chaining \
     test_e5_validate test_mil_custom test_throughput_ceiling test_bench_paths

# Run (each is standalone, pick any)
./test_mil_custom           # Custom MIL: SDPA, transformer block, backward pass
./test_e5_validate          # E5 runtime validation
./test_ane_model            # _ANEModel and chaining exploration
./test_chaining_v2          # ChainingRequest deep probe
./test_coreml_chaining      # CoreML-compiled model chaining
./test_throughput_ceiling   # Multi-kernel throughput ceiling
./test_bench_paths          # Standard vs RT eval path comparison
```

No external dependencies. System frameworks only (Foundation, CoreML, IOSurface, Metal, Accelerate). Requires macOS 15+ on Apple Silicon.

dev-erik force-pushed the ane-chaining-research branch from f9c7183 to be80e51 on March 4, 2026 at 20:23
dev-erik changed the title from "Add ANE ChainingRequest API prototype (test_chaining.m)" to "ANE chaining research: E5 runtime, custom MIL compilation, training feasibility" on Mar 4, 2026
dev-erik force-pushed the ane-chaining-research branch from be80e51 to dff5a68 on March 4, 2026 at 20:26
dev-erik changed the title from "ANE chaining research: E5 runtime, custom MIL compilation, training feasibility" to "ANE private API research: chaining, E5 runtime, custom MIL compilation" on Mar 4, 2026
dev-erik force-pushed the ane-chaining-research branch from dff5a68 to 65d7813 on March 4, 2026 at 20:30
dev-erik force-pushed the ane-chaining-research branch from 65d7813 to 99ba013 on March 4, 2026 at 20:39
