ANE probe tests + training telemetry for M5 optimization #2

Merged
maderix merged 2 commits into maderix:main from m0at:m5-maximized
Mar 2, 2026

m0at commented Mar 2, 2026

Summary

  • 4 standalone probe tests to characterize ANE behavior on M5 (H16 family), informing optimization paths to push utilization beyond the current 11.2%:
    • test_weight_reload — tests if weights can be hot-swapped via unload+load without recompilation (would eliminate the compilation bottleneck entirely)
    • test_perf_stats — enumerates _ANEPerformanceStats methods/properties and hardware counters
    • test_qos_sweep — measures compile/load/eval latency across QoS 0-63
    • test_ane_advanced — probes _ANESharedEvents, weightsBuffer IOSurface, procedureIndex, _ANEVirtualClient, _ANEChainingRequest
  • JSON telemetry on stderr from train_large.m — per-step timing breakdown and per-batch TFLOPS metrics, enabling real-time monitoring of ANE utilization during training
  • Makefile targets: make probes builds all tests, make clean removes them

Motivation

The current training pipeline achieves only 11.2% ANE utilization (1.78 of 15.8 TFLOPS) due to the compilation bottleneck — every weight update requires recompiling all 60 weight-bearing kernels via exec() restart. These probes determine whether faster paths exist (weight reload, weightsBuffer, async compile) before modifying the core training loop.
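The utilization figure above is the ratio of achieved to peak throughput. A minimal Python sketch of that arithmetic follows, assuming throughput is derived from a FLOP count and an elapsed time. The helper names `tflops` and `utilization` are hypothetical and are not taken from train_large.m; only the 1.78 and 15.8 TFLOPS figures come from the description.

```python
PEAK_TFLOPS = 15.8  # peak ANE throughput quoted in the PR description


def tflops(flop_count: float, seconds: float) -> float:
    """Convert a FLOP count and an elapsed time into TFLOPS."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e12


def utilization(achieved_tflops: float, peak_tflops: float = PEAK_TFLOPS) -> float:
    """Fraction of peak throughput that was actually achieved."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # 1.78 TFLOPS achieved, as reported in the PR description.
    print(f"utilization: {utilization(1.78):.0%}")
```

The same `tflops` helper could in principle be fed a hardware execution time converted from nanoseconds to seconds, though how the telemetry actually computes its numbers is not shown here.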

Test plan

  • make probes compiles all 4 test programs
  • make train_large compiles with telemetry additions
  • Each probe runs standalone and prints results to stdout
  • Telemetry output parseable as JSON lines on stderr: ./train_large 2>telem.jsonl
  • Existing training behavior unchanged (telemetry only adds to stderr)
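As a rough illustration of how the JSON-lines telemetry captured with `2>telem.jsonl` might be consumed, here is a small Python sketch. The field names `step` and `tflops` are assumptions chosen for the example; the actual keys emitted by train_large.m are not listed in this PR description. The parser skips non-JSON lines on the assumption that stderr may also carry ordinary log output.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON-object line, silently skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # assumed: stderr can also contain plain-text log lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(record, dict):
            yield record


def mean_field(records: Iterable[dict], key: str) -> Optional[float]:
    """Average a numeric field over all records that carry it."""
    values = [r[key] for r in records if isinstance(r.get(key), (int, float))]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    # Hypothetical sample lines; real telemetry keys may differ.
    sample = [
        '{"step": 1, "tflops": 1.5}',
        "some plain-text log line",
        '{"step": 2, "tflops": 2.5}',
    ]
    records = list(parse_jsonl(sample))
    print(len(records), mean_field(records, "tflops"))
```

To run it against a real capture, replace `sample` with an open file handle on `telem.jsonl`.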

🤖 Generated with Claude Code

noreply and others added 2 commits March 1, 2026 22:54
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… found

Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights
  are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.
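
The weight-reload finding above rests on comparing model output before and after the weight file is overwritten: identical output means the new weights were not picked up. A minimal Python sketch of that comparison follows, with plain float lists standing in for ANE output buffers. The function names are hypothetical and this is not the code in test_weight_reload.

```python
from typing import Sequence


def max_abs_diff(before: Sequence[float], after: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(before) != len(after):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(before, after)), default=0.0)


def reload_took_effect(before: Sequence[float], after: Sequence[float],
                       tol: float = 0.0) -> bool:
    """True if the output changed by more than `tol` after the reload attempt."""
    return max_abs_diff(before, after) > tol


if __name__ == "__main__":
    baseline = [0.25, 0.5, 0.75]
    # Identical output after overwriting the weight file: reload had no effect.
    print(reload_took_effect(baseline, [0.25, 0.5, 0.75]))
```

A nonzero `tol` may be useful if the two runs are expected to differ slightly for reasons unrelated to the weights.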

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

maderix commented Mar 2, 2026

Hi, thanks for contributing the probe tests. Could you please also share a screenshot or the output from your system?

Curious to know the results from your setup.

Edit: my bad, just saw the results.md 😅

maderix merged commit 893f58e into maderix:main on Mar 2, 2026