ANE probe tests + training telemetry for M5 optimization #2
Merged
maderix merged 2 commits into maderix:main on Mar 2, 2026
Conversation
Four standalone probe tests to characterize the M5 ANE:

- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):

- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
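The JSON-lines telemetry described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual train_large.m code; the field names (`step`, `step_ms`, `tflops`) are assumptions, not the real schema.

```python
# Minimal sketch of per-step JSON-lines telemetry, assuming
# illustrative field names (step, step_ms, tflops) rather than
# the actual schema emitted by train_large.m.
import json
import sys

def log_step(step, flops, elapsed_s, stream=sys.stderr):
    """Emit one telemetry record as a single JSON line on `stream`."""
    record = {
        "step": step,
        "step_ms": elapsed_s * 1e3,
        "tflops": flops / elapsed_s / 1e12,
    }
    stream.write(json.dumps(record) + "\n")
    return record

# Example: a 2 GFLOP batch that took 1.25 ms corresponds to 1.6 TFLOPS
r = log_step(0, flops=2e9, elapsed_s=1.25e-3)
```

One record per line keeps the stream trivially tail-able, so an external monitor can parse each line independently as it arrives.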
Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07 ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching the inmem_peak pattern) and 64+ channel kernels (ANE minimum size requirement). Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
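The per-QoS latency comparison above rests on timing many evaluations and taking a robust summary statistic. A hedged sketch of that measurement harness, with a stand-in workload instead of a real ANE eval call:

```python
# Sketch of the latency measurement behind a QoS sweep: time many
# evals and report the median, which is robust to scheduler noise.
# `evaluate` is a stand-in workload here, not a real ANE call.
import statistics
import time

def median_latency_ms(evaluate, iters=100):
    """Run `evaluate` `iters` times and return the median wall-clock
    latency in milliseconds."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        evaluate()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Example: measure a trivial CPU workload
lat = median_latency_ms(lambda: sum(range(1000)), iters=50)
```

With a harness like this, "no measurable difference across QoS 0-63" means the per-QoS medians all land within the noise floor of a single configuration.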
Owner
Hi, thanks for the probe tests contribution. Can you please share a screenshot or output from your system as well? Curious to know the results from your setup.
Edit: my bad, just saw the results.md 😅
maderix approved these changes on Mar 2, 2026
Summary
- test_weight_reload — tests whether weights can be hot-swapped via unload+load without recompilation (would eliminate the compilation bottleneck entirely)
- test_perf_stats — enumerates _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep — measures compile/load/eval latency across QoS 0-63
- test_ane_advanced — probes _ANESharedEvents, weightsBuffer IOSurface, procedureIndex, _ANEVirtualClient, _ANEChainingRequest
- train_large.m — per-step timing breakdown and per-batch TFLOPS metrics, enabling real-time monitoring of ANE utilization during training
- make probes builds all tests; make clean removes them

Motivation
The current training pipeline achieves only 11.2% ANE utilization (1.78 of 15.8 TFLOPS) due to the compilation bottleneck — every weight update requires recompiling all 60 weight-bearing kernels via an exec() restart. These probes determine whether faster paths exist (weight reload, weightsBuffer, async compile) before modifying the core training loop.

Test plan
- make probes compiles all 4 test programs
- make train_large compiles with telemetry additions
- ./train_large 2>telem.jsonl

🤖 Generated with Claude Code
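Once telemetry has been captured via `2>telem.jsonl`, a downstream tool only needs to parse one JSON object per line. A minimal sketch of such a consumer, assuming a `tflops` field per record (the real schema may differ) and the 15.8 TFLOPS peak cited in the motivation:

```python
# Sketch of consuming the captured telemetry: parse each JSON line
# and report mean TFLOPS plus utilization against the 15.8 TFLOPS
# peak cited in the motivation. The `tflops` field name is assumed.
import json

PEAK_TFLOPS = 15.8  # M5 ANE peak figure from the motivation section

def summarize(lines):
    """Return (mean TFLOPS, utilization %) over non-empty JSONL records."""
    tflops = [json.loads(line)["tflops"] for line in lines if line.strip()]
    mean = sum(tflops) / len(tflops)
    return mean, 100.0 * mean / PEAK_TFLOPS

# Example: two steps sustaining 1.78 TFLOPS, the figure quoted above
mean, util = summarize(['{"step": 0, "tflops": 1.78}',
                        '{"step": 1, "tflops": 1.78}'])
# 1.78 / 15.8 gives roughly 11% utilization, consistent with the
# 11.2% figure in the motivation section
```

In practice the same loop would read the file incrementally (or follow it tail-style) so the utilization plot updates live during training.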