
ANE private API research: chaining, E5 runtime, custom MIL compilation#40

Open
dev-erik wants to merge 1 commit into maderix:main from dev-erik:ane-chaining-research

Conversation


dev-erik commented Mar 4, 2026

TL;DR

We reverse-engineered three paths to direct ANE access on macOS 15 (M4 Max):

  1. _ANEChainingRequest (legacy) -- multi-kernel pipelining API. We got it to validate, but it requires Espresso IR that the in-memory MIL path cannot produce; a dead end on macOS 15+.
  2. E5 runtime (MLE5Engine) -- the modern ANE execution path used by CoreML internally. We validated its behaviour and found that CoreML's MLDelegateModel caching outperforms direct engine calls.
  3. Custom MIL compilation (breakthrough) -- we can write arbitrary MIL text programs and compile them directly to the ANE via MLE5ProgramLibraryOnDeviceAOTCompilationImpl. We verified attention, linear layers, full transformer blocks, and backward pass matmuls all execute correctly on ANE hardware.

Bottom line: Custom MIL compilation is the viable path for direct ANE compute. The legacy chaining API is obsolete on macOS 15+. Training on ANE is theoretically possible but impractical due to read-only weights requiring recompilation (~10-50ms) after every update.
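The impracticality claim is just arithmetic on the numbers measured below. A back-of-envelope sketch in Python (the step composition is illustrative -- only the ~10-50ms recompile cost and the Phase 3 kernel latencies come from the experiments):

```python
# Why ANE training is impractical: weights are baked in as consts, so
# every optimizer step forces a recompile (~10-50 ms per the experiments
# below), dwarfing the sub-millisecond kernel latencies measured in Phase 3.

def steps_per_second(recompile_ms, compute_ms):
    """Training throughput ceiling when each step pays a recompile."""
    return 1000.0 / (recompile_ms + compute_ms)

compute_ms = 0.21 + 0.06  # Y3 transformer block fwd + Z1 backward matmul
for recompile_ms in (10.0, 50.0):
    print(f"recompile {recompile_ms:>4.0f} ms -> "
          f"{steps_per_second(recompile_ms, compute_ms):6.1f} steps/s")
```

Even at the optimistic 10ms recompile, the step rate is capped near ~100 steps/s regardless of how fast the kernels themselves run.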


Experiments & Results

Phase 1: ChainingRequest API (Experiments A-P)

Systematically probed 12+ private Obj-C classes to understand the _ANEChainingRequest pipeline for multi-kernel pipelining (running ANE ops back-to-back without CPU round-trips).

| Finding | Detail |
| --- | --- |
| `_ANEChainingRequest.validate` | Succeeds when using `_ANEBuffer` (wraps IOSurface with symbolIndex) instead of `_ANEIOSurfaceObject` |
| `_ANEIOSurfaceOutputSets` | Works with any non-NULL IOSurface as `statsSurRef` |
| `prepareChainingWithModel:` | Requires `_ANEModel` (disk-compiled Espresso IR); crashes with `_ANEInMemoryModel` |
| `_ANEClient.evaluateRealTimeWithModel:` | ~1.7x faster than `evaluateWithQoS:` at small dims (64x64); no advantage at production dims (768x256) |
| `_ANESharedSignalEvent` / `_ANESharedWaitEvent` | Hardware fence primitives -- require IOSurfaceSharedEvent, work with MTLSharedEvent |
| Conclusion | Legacy API. On macOS 15+, CoreML uses the E5 runtime instead. Blocked at Code=15 (`ANEProgramChainingPrepare Failed`) |

Phase 2: E5 Runtime Validation (Experiments W1-W5)

Validated the modern E5 execution path that CoreML uses internally on macOS 15+.

| Finding | Detail |
| --- | --- |
| Runtime architecture | `MLModel` -> `MLDelegateModel` -> `MLE5Engine` -> `MLE5ExecutionStream` -> `e5rt_program_library` |
| MLDelegateModel caching | Faster than calling `MLE5Engine.predictionFromFeatures:` directly, due to internal stream/operation caching |
| Manual `_executeStream:` | With fabricated `MLE5ExecutionStreamOperation` objects (handle=0x0) these are no-ops -- output validation is critical |
| Key classes | `MLE5ProgramLibrary`, `MLE5StaticShapeExecutionStreamOperationPool`, `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` |

Phase 3: Custom MIL -> ANE Execution (Experiments X1, Y1-Y3, Z1)

Breakthrough: write MIL text, compile to e5rt_program_library, execute on ANE via MLE5Engine.

| Experiment | Operation | Result | Accuracy | Latency |
| --- | --- | --- | --- | --- |
| X1 | ReLU, GELU, Softmax, LayerNorm | All PASSED | Verified vs CPU reference | < 0.1ms |
| Y1 | `scaled_dot_product_attention` (self-attn, 4 heads) | PASSED | max_diff = 0.000027 | 0.17ms |
| Y2 | `linear` with embedded const weights (64x32 -> 64x16) | PASSED | max_diff = 0.001660 | 0.06ms |
| Y3 | Full transformer block (LN + SDPA + Residual + FFN + GELU) | PASSED | Verified | 0.21ms |
| Z1 | Backward pass matmul (dX = dY @ W, dW = X^T @ dY) | PASSED | dX_diff=0.002, dW_diff=0.013 | 0.06ms |
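For reference, the CPU checks behind Y1 and Z1 reduce to a few lines of NumPy. A sketch (single-head, illustrative shapes, not byte-identical to the test programs; for Y = X @ W the gradients are dX = dY W^T and dW = X^T dY -- Z1's "dX = dY @ W" reading assumes W is stored pre-transposed):

```python
import numpy as np

# CPU reference math for Y1 (SDPA) and Z1 (backward matmul).

def sdpa(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, computed with the usual max-subtraction."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def matmul_backward(x, w, d_y):
    """Gradients of Y = X @ W: dX = dY W^T, dW = X^T dY."""
    return d_y @ w.T, x.T @ d_y

rng = np.random.default_rng(0)
x, w = rng.standard_normal((64, 32)), rng.standard_normal((32, 16))
d_y = rng.standard_normal((64, 16))
d_x, d_w = matmul_backward(x, w, d_y)
assert d_x.shape == (64, 32) and d_w.shape == (32, 16)
```

The ANE outputs in X1/Y1-Y3/Z1 were compared elementwise against this kind of reference, which is where the max_diff figures in the table come from.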

The compilation pipeline: MIL text -> `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` -> `createProgramLibraryHandleWithRespecialization:` -> `MLE5ProgramLibrary` -> `MLE5Engine` (7-arg init) -> `predictionFromFeatures:`.

Additional Benchmarks

| Experiment | What | Result |
| --- | --- | --- |
| Throughput ceiling (Exp I) | Sequential 12-kernel execution overhead | ~0.2ms/kernel at production dims; ~4.8ms total for 24 evals |
| Bench paths | Standard vs RT vs processRequest at 64x32, 256x128, 768x256 | RT only helps at small dims; all paths converge at production dims |

Files Added

Test Programs

| File | Lines | What it does |
| --- | --- | --- |
| `training/test_chaining_v2.m` | 1,700 | Deep 6-phase probe of `_ANEChainingRequest` and 12+ private ANE classes. Dumps methods, type encodings, properties. Tests `_ANEBuffer`, `_ANEIOSurfaceOutputSets`, `_ANEProgramIOSurfacesMapper`, `_ANESharedSignalEvent`/`WaitEvent`. Benchmarks standard vs RT eval paths. |
| `training/test_ane_model.m` | 2,260 | Experiments E-P: `_ANEModel` factory methods, `_ANECompiler` compilation, `prepareChainingWithModel:` crash investigation, `_ANEInputBuffersReady`/`_ANEOutputSetEnqueue` type encoding, `_ANEProgramForEvaluation.processRequest`, shared event construction, IOSurface mapper exploration. |
| `training/test_coreml_chaining.m` | 1,003 | Experiments Q-S: Uses CoreML-compiled models (via `MLModel compileModelAtURL:`) to extract `_ANEModel` objects. Tests `_ANEBuffer` creation with symbolIndex, `_ANEIOSurfaceOutputSets` with stats surfaces, chaining request validation, `prepareChainingWithModel:` with various parameter combinations. |
| `training/test_e5_validate.m` | 817 | Experiments W1-W5: E5 runtime validation. Extracts `MLE5Engine` and `MLE5ProgramLibrary` from compiled CoreML models. Tests `_executeStream:` with fabricated operations. Profiles `MLDelegateModel` vs direct `MLE5Engine`. Dumps all E5 class methods and properties. |
| `training/test_mil_custom.m` | 915 | Experiments X1, Y1-Y3, Z1: Custom MIL text compilation to ANE. Contains the `compileAndCreateEngine` helper (the full MIL -> ANE pipeline) and `findE5Container` for extracting `MLProgramE5Container`. Runs SDPA, linear-with-weights, full transformer block, and backward pass matmul -- all verified against CPU reference implementations. |
| `training/test_throughput_ceiling.m` | 238 | Experiment I: Compiles 12 identical conv kernels and measures sequential ANE throughput ceiling at production dimensions (768x256). Quantifies CPU round-trip overhead per kernel. |
| `training/test_bench_paths.m` | 148 | Benchmarks standard `evaluateWithQoS:` vs RT `evaluateRealTimeWithModel:` vs `processRequest` at three dimension sets (64x32, 256x128, 768x256). Shows the RT advantage disappears at production dims. |

Documentation

| File | Lines | What it covers |
| --- | --- | --- |
| `docs/ANE_CHAINING_RESEARCH.md` | 1,112 | Complete experiment logs with raw output for all phases. Class hierarchy diagram. MIL operations reference table (all ops verified on ANE). Architecture diagrams for chaining data flow. |
| `docs/ANE_INTERNALS.md` | 464 | ANE hardware architecture reference (cores, TOPS, SRAM by chip). Compilation pipeline (MIL -> coremlc -> espresso -> .hwx -> firmware). Two runtime paths (legacy Espresso vs modern E5). IOSurface I/O conventions. Community tools (ANETools, anecc, coreml_to_ane_hwx). |

Modified Files

| File | Change |
| --- | --- |
| `training/Makefile` | Added build targets and clean rules for all 7 new test programs |
| `training/ane_runtime.h` | Added `ane_eval_rt()` -- wrapper for `_ANEClient.evaluateRealTimeWithModel:` with fallback to standard eval |

MIL Syntax Lessons Learned

These are non-obvious and not documented by Apple anywhere:

  • `layer_norm` epsilon type must match the gamma/beta dtype (fp16, not fp32)
  • `matmul` requires both `transpose_x` and `transpose_y` as bool consts
  • `concat` requires an `interleave` (bool) param and `axis` as an int32 scalar (not a tensor)
  • `MLE5Engine` uses a 7-argument initializer: `initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:`
  • `MLProgramE5Container` can be created via `initWithModelAssetPath:configuration:` from a `.mlmodelc` path
  • E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for the ANE specialization cache
  • MIL I/O names and shapes must exactly match the `MLModelDescription` passed to `MLE5Engine`
  • ANE I/O convention: fp32 at boundaries, fp16 internally -- use `cast` ops
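To make these lessons concrete, here is a hypothetical MIL text fragment applying several of them at once. The grammar follows public coremltools MIL text conventions and is a sketch only -- identifier names and the `dense<...>` weight payloads are placeholders, and it may not match the test programs byte-for-byte:

```
// Hypothetical sketch: fp32 boundary casts, fp16 layer_norm epsilon,
// and bool transpose consts on matmul.
main(%x: tensor<fp32, [64, 32]>) {
  block0() {
    %xh  = cast(x=%x, dtype="fp16")                  // fp32 boundary -> fp16 internal
    %g   = const(val=dense<...> : tensor<fp16, [32]>)
    %b   = const(val=dense<...> : tensor<fp16, [32]>)
    %eps = const(val=1e-3 : fp16)                    // epsilon dtype must match gamma/beta
    %ln  = layer_norm(x=%xh, axes=[-1], gamma=%g, beta=%b, epsilon=%eps)
    %tx  = const(val=false : bool)
    %ty  = const(val=true : bool)                    // both transpose flags are required
    %w   = const(val=dense<...> : tensor<fp16, [16, 32]>)
    %y   = matmul(x=%ln, y=%w, transpose_x=%tx, transpose_y=%ty)
    %out = cast(x=%y, dtype="fp32")                  // fp16 internal -> fp32 boundary
  } -> (%out)
}
```

Note the output name and shape (`%out`, fp32 [64, 16]) would have to match the `MLModelDescription` handed to `MLE5Engine` exactly, per the I/O lesson above.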

Build & Run

```sh
cd training

# Build all experiments
make test_chaining_v2 test_ane_model test_coreml_chaining \
     test_e5_validate test_mil_custom test_throughput_ceiling test_bench_paths

# Run (each is standalone, pick any)
./test_mil_custom           # Custom MIL: SDPA, transformer block, backward pass
./test_e5_validate          # E5 runtime validation
./test_ane_model            # _ANEModel and chaining exploration
./test_chaining_v2          # ChainingRequest deep probe
./test_coreml_chaining      # CoreML-compiled model chaining
./test_throughput_ceiling   # Multi-kernel throughput ceiling
./test_bench_paths          # Standard vs RT eval path comparison
```

No external dependencies. System frameworks only (Foundation, CoreML, IOSurface, Metal, Accelerate). Requires macOS 15+ on Apple Silicon.

dev-erik force-pushed the ane-chaining-research branch from f9c7183 to be80e51 on March 4, 2026 at 20:23
dev-erik changed the title from "Add ANE ChainingRequest API prototype (test_chaining.m)" to "ANE chaining research: E5 runtime, custom MIL compilation, training feasibility" on Mar 4, 2026
dev-erik force-pushed the ane-chaining-research branch from be80e51 to dff5a68 on March 4, 2026 at 20:26
dev-erik changed the title from "ANE chaining research: E5 runtime, custom MIL compilation, training feasibility" to "ANE private API research: chaining, E5 runtime, custom MIL compilation" on Mar 4, 2026
dev-erik force-pushed the ane-chaining-research branch from dff5a68 to 65d7813 on March 4, 2026 at 20:30
dev-erik force-pushed the ane-chaining-research branch from 65d7813 to 99ba013 on March 4, 2026 at 20:39
