Skip to content

fix: correctness and safety improvements for training#31

Merged
maderix merged 1 commit intomaderix:mainfrom
alvgeppetto-debug:fix/safety-correctness
Mar 4, 2026
Merged

fix: correctness and safety improvements for training#31
maderix merged 1 commit intomaderix:mainfrom
alvgeppetto-debug:fix/safety-correctness

Conversation

@alvgeppetto-debug
Copy link

Fixes 7 correctness and safety issues identified during M3 Ultra benchmarking and code review.

Changes:

  • ANE eval error logging in ane_runtime.h
  • Forward pass error checking (ane_conv_eval returns bool) in forward.h
  • Thread-safe RMSNorm (removed global buffer) in stories_cpu_ops.h
  • Bounds checks on vocabulary indices in stories_cpu_ops.h
  • fread validation macro in model.h and tiny_train.m
  • Non-destructive recompile (compile to temp, swap on success) in model.h
  • Atomic checkpoints (tmp+rename pattern) in tiny_train.m

Both make train and make train_large compile cleanly on macOS.

- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
@alvgeppetto-debug
Copy link
Author

alvgeppetto-debug commented Mar 3, 2026

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

Metric main (baseline) fix/safety-correctness
Avg train 101.5 ms/step 101.2 ms/step
ANE TFLOPS 0.92 0.92
Total TFLOPS 1.72 1.72
Wall time (30 steps) 16.2s 16.2s
Compile overhead 77.6% 77.6%

No performance regression. Safety and correctness fixes have zero measurable overhead.

@maderix
Copy link
Owner

maderix commented Mar 4, 2026

Good stuff — non-destructive recompile, atomic checkpoints, fread validation, and thread-safe RMSNorm are all real improvements. Merging now. We'll swap the assert() calls on token bounds to non-fatal logging (uint16_t is always >= 0 anyway, and a bad token should skip, not abort). Thanks!

@maderix maderix merged commit 05fc8f8 into maderix:main Mar 4, 2026
maderix pushed a commit that referenced this pull request Mar 4, 2026
Follow-up to PR #31 — assert() aborts on bad tokens, which is too
harsh for training. Skip bad tokens with a warning instead.
@alvgeppetto-debug
Copy link
Author

Your welcome! I think we should thank Claude ;-) under some nice stress prompts :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants