fix: correctness and safety improvements for training by alvgeppetto-debug · Pull Request #31 · maderix/ANE

alvgeppetto-debug · 2026-03-03T19:58:40Z

Fixes 7 correctness and safety issues identified during M3 Ultra benchmarking and code review.

Changes:

ANE eval error logging in ane_runtime.h
Forward pass error checking (ane_conv_eval returns bool) in forward.h
Thread-safe RMSNorm (removed global buffer) in stories_cpu_ops.h
Bounds checks on vocabulary indices in stories_cpu_ops.h
fread validation macro in model.h and tiny_train.m
Non-destructive recompile (compile to temp, swap on success) in model.h
Atomic checkpoints (tmp+rename pattern) in tiny_train.m

Both make train and make train_large compile cleanly on macOS.

- Validate all fread() return values in model_load_weights (model.h) - Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m) - Log error details on ANE eval failure (ane_runtime.h) - Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h) - Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward - Atomic checkpoint writes via tmp+rename pattern (tiny_train.m) - Non-destructive recompile: compile new kernels first, swap only on success (model.h) - Validate fread() in load_checkpoint (tiny_train.m)

alvgeppetto-debug · 2026-03-03T20:01:07Z

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

Metric	main (baseline)	fix/safety-correctness
Avg train	101.5 ms/step	101.2 ms/step
ANE TFLOPS	0.92	0.92
Total TFLOPS	1.72	1.72
Wall time (30 steps)	16.2s	16.2s
Compile overhead	77.6%	77.6%

No performance regression. Safety and correctness fixes have zero measurable overhead.

maderix · 2026-03-04T12:39:54Z

Good stuff — non-destructive recompile, atomic checkpoints, fread validation, and thread-safe RMSNorm are all real improvements. Merging now. We'll swap the assert() calls on token bounds to non-fatal logging (uint16_t is always >= 0 anyway, and a bad token should skip, not abort). Thanks!

Follow-up to PR #31 — assert() aborts on bad tokens, which is too harsh for training. Skip bad tokens with a warning instead.

alvgeppetto-debug · 2026-03-05T02:58:50Z

Your welcome! I think we should thank Claude ;-) under some nice stress prompts :D

maderix merged commit 05fc8f8 into maderix:main Mar 4, 2026

maderix pushed a commit that referenced this pull request Mar 4, 2026

Replace assert() with non-fatal bounds checks on token IDs

e986572

Follow-up to PR #31 — assert() aborts on bad tokens, which is too harsh for training. Skip bad tokens with a warning instead.

maderix mentioned this pull request Mar 4, 2026

Community contributions: M1-M4 compat, security fixes, docs, benchmarks, and community dashboard #25

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: correctness and safety improvements for training#31

fix: correctness and safety improvements for training#31
maderix merged 1 commit intomaderix:mainfrom
alvgeppetto-debug:fix/safety-correctness

alvgeppetto-debug commented Mar 3, 2026

Uh oh!

alvgeppetto-debug commented Mar 3, 2026 •

edited

Loading

Uh oh!

maderix commented Mar 4, 2026

Uh oh!

alvgeppetto-debug commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alvgeppetto-debug commented Mar 3, 2026

Uh oh!

alvgeppetto-debug commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results

Uh oh!

maderix commented Mar 4, 2026

Uh oh!

alvgeppetto-debug commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alvgeppetto-debug commented Mar 3, 2026 •

edited

Loading