test(autograd): end-to-end tiny-transformer trains to decreasing loss — all params update (PMAT-921) by noahgift · Pull Request #2213 · paiml/aprender

noahgift · 2026-06-24T09:44:40Z

END-TO-END capability proof (PMAT-921)

Closes the gap the autograd severed-graph sweep (PMAT-907/911/913/914) left open: the graph was verified only by per-layer finite-difference gradchecks, never by training a real model to a loss target. A composition of individually-correct layers can still freeze a parameter on the integration path that no per-layer gradcheck exercises.

What this beat adds

A tiny transformer assembled from apr's own nn modules — one-hot embedding → TransformerEncoderLayer{LayerNorm + MHA + LayerNorm + FFN} → lm_head (vocab=hidden=32, heads=2, seq=8) — trained 200 Adam steps on a fixed deterministic memorize task. Two independent falsifier guards:

(a) loss collapse: final ≪ initial. Observed init ≈ 3.565, close to ln(32) ≈ 3.466 (the loss of a uniform prediction over the 32-token vocabulary) → final ≈ 1.4e-5.
(b) every param updates: for EVERY trainable group — embedding weight, attention q/k/v/out weight+bias, both LayerNorm γ+β, FFN linear1/linear2 weight+bias, lm_head weight+bias — ||p_final − p_init|| > 1e-6 AND a finite non-zero gradient was received.
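The two guards can be sketched in plain Rust, independent of apr's tensor types. This is only an illustration under stated assumptions: `ParamGroup`, `loss_collapsed`, `frozen_groups`, and the `ratio` threshold are hypothetical names and values, not the test's actual code; only the 1e-6 movement threshold and the "finite non-zero gradient" condition come from the description above.

```rust
/// Loss of a uniform prediction over `vocab` classes: -ln(1/vocab) = ln(vocab).
fn uniform_cross_entropy(vocab: usize) -> f32 {
    (vocab as f32).ln()
}

/// Euclidean norm of the difference between two parameter snapshots.
fn l2_delta(init: &[f32], end: &[f32]) -> f32 {
    assert_eq!(init.len(), end.len());
    init.iter()
        .zip(end)
        .map(|(a, b)| (b - a) * (b - a))
        .sum::<f32>()
        .sqrt()
}

/// Hypothetical record for one trainable parameter group.
struct ParamGroup<'a> {
    name: &'a str,
    init: &'a [f32],
    end: &'a [f32],
    /// `None` models "no gradient was received".
    grad: Option<&'a [f32]>,
}

/// Guard (a): the final loss must be far below the initial loss.
/// `ratio` is an assumed threshold, e.g. 0.01.
fn loss_collapsed(initial: f32, end: f32, ratio: f32) -> bool {
    initial.is_finite() && end.is_finite() && end < initial * ratio
}

/// Guard (b): names of groups that did not move by more than `eps`
/// or did not receive a finite, non-zero gradient.
fn frozen_groups<'a>(groups: &[ParamGroup<'a>], eps: f32) -> Vec<&'a str> {
    groups
        .iter()
        .filter(|g| {
            let moved = l2_delta(g.init, g.end) > eps;
            let has_grad = match g.grad {
                Some(gr) => {
                    gr.iter().all(|v| v.is_finite()) && gr.iter().any(|v| *v != 0.0)
                }
                None => false,
            };
            !(moved && has_grad)
        })
        .map(|g| g.name)
        .collect()
}
```

A test in this style passes only when `loss_collapsed` is true and `frozen_groups` returns an empty list; returning the offending names makes a failure point directly at the severed edge.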

Everything is LCG-seeded for CI determinism; tiny + bounded so it is a fast per-PR test, not a bench.
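For readers unfamiliar with LCG seeding, a minimal sketch of a linear congruential generator used for reproducible weight initialization follows. The constants are Knuth's MMIX multiplier and increment, picked here purely for illustration; the PR does not say which constants or value range the test uses, and `Lcg` and `init_weights` are hypothetical names.

```rust
/// Minimal 64-bit linear congruential generator (modulus 2^64 via wrapping).
/// Constants are Knuth's MMIX parameters, an assumption for this sketch.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }

    /// Roughly uniform value in [-scale, scale), built from the top 24 bits
    /// (the high bits of an LCG are the better-distributed ones).
    fn next_f32(&mut self, scale: f32) -> f32 {
        let bits = (self.next_u64() >> 40) as f32; // integer in 0..2^24
        let unit = bits / (1u32 << 24) as f32; // in [0, 1)
        (2.0 * unit - 1.0) * scale
    }
}

/// Deterministic weight vector: the same seed always yields the same values.
fn init_weights(seed: u64, n: usize, scale: f32) -> Vec<f32> {
    let mut rng = Lcg(seed);
    (0..n).map(|_| rng.next_f32(scale)).collect()
}
```

Because the generator has no hidden global state, two CI runs that call `init_weights` with the same seed start training from bit-identical parameters.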

REAL BUG FOUND (the proof did its job)

TransformerEncoderLayer's FFN called nn::functional::gelu, which builds its output via Tensor::from_vec and severs the autograd graph — freezing ffn.linear1 (weight+bias) and norm2 (γ+β) in every end-to-end training run, while the isolated attention gradcheck stayed green.
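The failure mode can be reproduced on a toy scalar tape. The sketch below is not apr's autograd; `Tape`, `tanh_tracked`, and `tanh_rebuilt` are hypothetical, and tanh stands in for the activation. It only shows why an op that rebuilds its output as a fresh leaf leaves the forward value intact while zeroing every upstream gradient.

```rust
/// One scalar on a toy tape: its value, plus an optional edge to the input
/// it was computed from (parent index, local derivative dy/dx).
struct Node {
    value: f64,
    parent: Option<(usize, f64)>,
}

/// Append-only tape; every op has a single input to keep the sketch short.
struct Tape {
    nodes: Vec<Node>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: Vec::new() }
    }

    fn push(&mut self, value: f64, parent: Option<(usize, f64)>) -> usize {
        self.nodes.push(Node { value, parent });
        self.nodes.len() - 1
    }

    /// A leaf, standing in for a trainable parameter.
    fn leaf(&mut self, value: f64) -> usize {
        self.push(value, None)
    }

    /// Tracked op y = k * x (stand-in for a linear layer).
    fn scale(&mut self, x: usize, k: f64) -> usize {
        let v = self.nodes[x].value * k;
        self.push(v, Some((x, k)))
    }

    /// Tracked activation: records the edge back to its input.
    fn tanh_tracked(&mut self, x: usize) -> usize {
        let y = self.nodes[x].value.tanh();
        self.push(y, Some((x, 1.0 - y * y)))
    }

    /// Same forward value, but rebuilt as a fresh leaf: the edge is lost.
    fn tanh_rebuilt(&mut self, x: usize) -> usize {
        let y = self.nodes[x].value.tanh();
        self.push(y, None)
    }

    /// Reverse sweep from `out`; parents always have smaller indices.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut grad = vec![0.0; self.nodes.len()];
        grad[out] = 1.0;
        for i in (0..=out).rev() {
            if let Some((p, d)) = self.nodes[i].parent {
                grad[p] += grad[i] * d;
            }
        }
        grad
    }
}
```

Chaining `leaf → scale → tanh_rebuilt → scale` yields the same output value as the tracked chain, but `backward` reports a zero gradient at the leaf. A gradcheck of the layer downstream of the activation never notices, which is why an end-to-end check is needed.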

Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths use the identical tanh GELU approximation, so forward numerics are unchanged — all 14004 aprender-core lib tests still pass; only the backward edge is restored.
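For reference, the standard tanh GELU approximation and its derivative can be written out and checked with a central finite difference, the same kind of check a per-layer gradcheck performs. This is a standalone sketch in f64 with hypothetical function names; it does not reproduce apr's `Tensor::gelu` implementation.

```rust
use std::f64::consts::PI;

/// Cubic coefficient of the standard tanh GELU approximation.
const GELU_COEFF: f64 = 0.044715;

/// gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
fn gelu_tanh(x: f64) -> f64 {
    let c = (2.0 / PI).sqrt();
    let u = c * (x + GELU_COEFF * x * x * x);
    0.5 * x * (1.0 + u.tanh())
}

/// Analytic derivative of `gelu_tanh` via the product and chain rules.
fn gelu_tanh_grad(x: f64) -> f64 {
    let c = (2.0 / PI).sqrt();
    let u = c * (x + GELU_COEFF * x * x * x);
    let t = u.tanh();
    let du_dx = c * (1.0 + 3.0 * GELU_COEFF * x * x);
    0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx
}

/// Central finite difference, the reference value a gradcheck compares against.
fn central_difference(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}
```

Comparing `gelu_tanh_grad(x)` with `central_difference(gelu_tanh, x, 1e-5)` at a few points confirms the backward formula, but note that such a check only validates the op in isolation; it says nothing about whether the surrounding layer actually routes through the tracked op.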

RED-confirmed two ways

(1) Original functional-gelu FFN → ffn.linear1.weight/bias and norm2.gamma/beta report NO gradient (guard b RED).
(2) Detaching the attention output edge → all attention q/k/v/out + norm1 params frozen (guard b RED) even though loss still drops via FFN+lm_head. This proves guard (b) is an independent severed-graph detector that catches what per-layer gradchecks miss in composition.

Contract

OBLIG-TRANSFORMER-END-TO-END-TRAINABLE — contracts/transformer-end-to-end-trainable-v1.yaml (pv validate + pv lint contracts/ pass, single-line falsifier ref).

🤖 Generated with Claude Code

… — all params update (PMAT-921)

END-TO-END capability proof closing the gap the severed-graph sweep (PMAT-907/911/913/914) left open: the autograd graph was verified only by PER-LAYER finite-difference gradchecks, never by training a real model to a loss target. A composition of individually-correct layers can still freeze a parameter on the INTEGRATION path that no per-layer gradcheck exercises.

This beat builds a tiny transformer from apr's own nn modules (one-hot embedding -> TransformerEncoderLayer{LayerNorm+MHA+LayerNorm+FFN} -> lm_head; vocab=hidden=32, heads=2, seq=8) and trains it 200 Adam steps on a fixed deterministic memorize task. Two independent falsifier guards:

(a) final loss << initial (observed init ~3.57 ~= ln(32), final ~1e-5), and
(b) EVERY trainable param group (embedding, attn q/k/v/out weight+bias, both LayerNorm gamma+beta, FFN linear1/linear2 weight+bias, lm_head weight+bias) genuinely CHANGED from init AND received a finite non-zero gradient.

Everything is LCG-seeded for CI determinism; the model is tiny + bounded so it runs as a fast per-PR test, not a slow bench.

REAL BUG FOUND (the proof did its job): TransformerEncoderLayer's FFN called nn::functional::gelu, which builds its output via Tensor::from_vec and SEVERS the autograd graph, freezing ffn.linear1 (weight+bias) and norm2 (gamma+beta) in EVERY end-to-end training run, while the isolated attention gradcheck stayed green.

Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths use the identical tanh GELU approximation, so forward numerics are unchanged (all 14004 aprender-core lib tests still pass); only the backward edge is restored.
RED-confirmed two ways:

(1) the original functional-gelu FFN reports NO gradient for linear1/norm2;
(2) detaching the attention output edge freezes all attention q/k/v/out + norm1 params even though loss still drops via FFN+lm_head, proving guard (b) is an independent severed-graph detector that catches what per-layer gradchecks miss in composition.

Contract: OBLIG-TRANSFORMER-END-TO-END-TRAINABLE (contracts/transformer-end-to-end-trainable-v1.yaml; pv validate + pv lint pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rad-aware Tensor ops (sever-graph sweep, PMAT-922)

PMAT-921 (#2213) proved the ENCODER FFN's nn::functional::gelu severed the autograd graph: it builds its output via Tensor::from_vec (no grad_fn), so gradient could not flow past it, freezing ffn.linear1 + norm2 in every real end-to-end training run while the isolated per-layer gradcheck stayed green. That is a CLASS: any nn LAYER whose forward calls a functional helper that rebuilds its output as a fresh leaf (Tensor::from_vec / Tensor::new) severs autograd for everything upstream. This sweep enumerates the rest and fixes them.

ENUMERATION (functional::* called from crates/aprender-core/src/nn/** forward paths) and verdict:

- rnn.rs sigmoid/tanh -> functional::sigmoid/tanh -> x.sigmoid()/x.tanh_() = AUTOGRAD-AWARE, not severed.
- transformer/mod.rs softmax_last_dim -> records SoftmaxLastDimBackward (PMAT-914) = not severed.
- normalization layer_norm / group_norm rms_norm = already fixed (PMAT-907).
- encoder FFN gelu = fixed by PMAT-921.
- SEVERED (this PR):
  1. TransformerDecoderLayer::forward_with_memory FFN `gelu(&ff_out)` (positional_encoding.rs), the exact decoder twin of the PMAT-921 encoder bug. Routed through the autograd-aware Tensor::gelu.
  2. Dropout::forward (dropout/mod.rs) in TRAINING mode (p>0) rebuilt the scaled output via Tensor::new (severed). Now builds the inverted-dropout mask as a constant tensor and applies it via Tensor::mul (records MulBackward).
  3. nn::functional::dropout (functional.rs), used by attention's apply_dropout, same Tensor::from_vec sever, same mask+mul fix.

All three fixes preserve forward numerics exactly (per-element x*mask equals the old scaled value; Tensor::gelu uses the identical tanh GELU approximation as functional::gelu). Only the backward edge is restored.
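The mask-and-multiply formulation of inverted dropout can be sketched on plain slices. The function names below are hypothetical and no apr types are used. The sketch also assumes the old path computed `x * scale` for kept elements, which is what makes the two forward results bit-identical; the source states the equality but not the old expression.

```rust
/// Inverted-dropout mask: kept elements carry 1/(1-p), dropped elements 0.
/// `keep` is supplied by the caller so the sketch stays deterministic.
fn inverted_dropout_mask(keep: &[bool], p: f32) -> Vec<f32> {
    assert!((0.0..1.0).contains(&p));
    let scale = 1.0 / (1.0 - p);
    keep.iter().map(|&k| if k { scale } else { 0.0 }).collect()
}

/// Forward: y = x * mask, element-wise.
fn mul(x: &[f32], mask: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), mask.len());
    x.iter().zip(mask).map(|(a, m)| a * m).collect()
}

/// Backward of y = x * mask with a constant mask: dL/dx = dL/dy * mask.
fn mul_backward(grad_out: &[f32], mask: &[f32]) -> Vec<f32> {
    mul(grad_out, mask)
}

/// Assumed shape of the old path: rebuild the scaled output directly,
/// with no record of how it depends on x.
fn dropout_rebuilt(x: &[f32], keep: &[bool], p: f32) -> Vec<f32> {
    assert_eq!(x.len(), keep.len());
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .zip(keep)
        .map(|(a, &k)| if k { a * scale } else { 0.0 })
        .collect()
}
```

With the mask held constant, the multiplication is an ordinary differentiable op, so the input receives `grad_out * mask` rather than nothing.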
RED/GREEN + mutation-verified (revert one fix -> its falsifier goes RED):

- decoder_ffn_gelu_grad_flows_to_linear1_and_norm3: with dropout off, the severed decoder gelu gives linear2 (downstream) a gradient but leaves linear1.weight + norm3.gamma (upstream) frozen; the fix restores both.
- dropout_layer_grad_flows_to_input_in_training_mode
- functional_dropout_grad_flows_to_input

All 14007 aprender-core lib tests pass.

Contract: transformer-end-to-end-trainable-v1.yaml extended with OBLIG-FUNCTIONAL-GELU-BACKWARD-GRAD (decoder) and OBLIG-FUNCTIONAL-DROPOUT-BACKWARD-GRAD plus single-line falsifier refs 002/003/004 (pv validate: 0 err/0 warn; pv lint contracts/: PASS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

noahgift enabled auto-merge June 24, 2026 09:44

noahgift added this pull request to the merge queue Jun 24, 2026

noahgift mentioned this pull request Jun 24, 2026

fix(autograd): route nn-layer functional::* activations through autograd-aware Tensor ops (sever-graph sweep, PMAT-922) #2214

Merged

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 24, 2026

noahgift added this pull request to the merge queue Jun 24, 2026

Merged via the queue into main with commit 1b0d01a Jun 24, 2026
11 checks passed

noahgift deleted the beat/e2e-training-smoke-pmat921 branch June 24, 2026 11:33

noahgift mentioned this pull request Jun 24, 2026

chore(release): 0.55.0 — PMAT-918..921 wave (convert/export runnable + GPU parity reconciled + autograd training proven) #2218

Merged

noahgift merged 1 commit into main from beat/e2e-training-smoke-pmat921
