test(autograd): end-to-end tiny-transformer trains to decreasing loss, all params update (PMAT-921) #2213
Merged
… — all params update (PMAT-921)
END-TO-END capability proof closing the gap the severed-graph sweep
(PMAT-907/911/913/914) left open: the autograd graph was verified only by
PER-LAYER finite-difference gradchecks, never by training a real model to a
loss target. A composition of individually-correct layers can still freeze a
parameter on the INTEGRATION path that no per-layer gradcheck exercises.
This beat builds a tiny transformer from apr's own nn modules
(one-hot embedding -> TransformerEncoderLayer{LayerNorm+MHA+LayerNorm+FFN}
-> lm_head; vocab=hidden=32, heads=2, seq=8) and trains it 200 Adam steps on a
fixed deterministic memorize task. Two independent falsifier guards:
(a) final loss << initial (observed init ~3.57 ~= ln(32), final ~1e-5), and
(b) EVERY trainable param group — embedding, attn q/k/v/out weight+bias, both
LayerNorm gamma+beta, FFN linear1/linear2 weight+bias, lm_head weight+bias
— genuinely CHANGED from init AND received a finite non-zero gradient.
Everything is LCG-seeded for CI determinism; the model is tiny + bounded so it
runs as a fast per-PR test, not a slow bench.
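The two guards above can be sketched as plain functions over recorded values. This is a minimal sketch, not the test added by this PR: `ParamReport`, `loss_decreased`, `frozen_params`, and the `ratio` threshold are hypothetical names introduced for illustration, and the sketch assumes the test has already captured each parameter group's initial values, trained values, and one gradient.

```rust
/// Hypothetical record for one parameter group (not an aprender type).
pub struct ParamReport {
    pub name: &'static str,
    pub init: Vec<f32>,
    pub trained: Vec<f32>,
    /// Gradient captured during training; `None` if autograd never reached the group.
    pub grad: Option<Vec<f32>>,
}

/// Guard (a): the last loss must be finite and far below the initial loss.
/// `ratio` is an assumed threshold, e.g. 0.1 demands a 10x reduction.
pub fn loss_decreased(initial: f32, last: f32, ratio: f32) -> bool {
    initial.is_finite() && last.is_finite() && last < initial * ratio
}

/// L2 norm of (trained - init) for one group.
fn l2_delta(p: &ParamReport) -> f32 {
    p.init
        .iter()
        .zip(&p.trained)
        .map(|(a, b)| (b - a) * (b - a))
        .sum::<f32>()
        .sqrt()
}

/// Guard (b): names of groups that did NOT move by more than `eps`
/// or did NOT receive a finite, non-zero gradient.
pub fn frozen_params(reports: &[ParamReport], eps: f32) -> Vec<&'static str> {
    reports
        .iter()
        .filter(|p| {
            let moved = l2_delta(p) > eps;
            let grad_ok = p.grad.as_ref().map_or(false, |g| {
                g.iter().all(|v| v.is_finite()) && g.iter().any(|v| *v != 0.0)
            });
            !(moved && grad_ok)
        })
        .map(|p| p.name)
        .collect()
}
```

Keeping the guards separate matters: a run can satisfy `loss_decreased` while `frozen_params` still returns a non-empty list, which is the failure mode this PR targets.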
REAL BUG FOUND (the proof did its job): TransformerEncoderLayer's FFN called
nn::functional::gelu, which builds its output via Tensor::from_vec and SEVERS
the autograd graph — freezing ffn.linear1 (weight+bias) and norm2 (gamma+beta)
in EVERY end-to-end training run, while the isolated attention gradcheck stayed
green. Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths
use the identical tanh GELU approximation, so forward numerics are unchanged
(all 14004 aprender-core lib tests still pass); only the backward edge is
restored.
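The failure mode can be reproduced on a toy scalar tape. The sketch below is an illustration only and uses none of the aprender API: `Tape`, `gelu_tracked`, `gelu_severed`, and `run` are hypothetical names, and the GELU constants are the standard tanh formulation, assumed here to match what the crate uses. It builds y = w2 * gelu(w1 * x) twice, once recording the GELU backward edge and once rebuilding the output as a fresh leaf.

```rust
// Toy scalar autograd tape (hypothetical; not the aprender API).

/// Standard tanh approximation of GELU.
fn gelu(x: f64) -> f64 {
    let c = (2.0 / std::f64::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Analytic derivative of the tanh approximation above.
fn gelu_grad(x: f64) -> f64 {
    let c = (2.0 / std::f64::consts::PI).sqrt();
    let t = (c * (x + 0.044715 * x * x * x)).tanh();
    0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * c * (1.0 + 3.0 * 0.044715 * x * x)
}

#[derive(Default)]
struct Tape {
    values: Vec<f64>,
    /// Per node: (parent index, d(node)/d(parent)). Empty for leaves.
    edges: Vec<Vec<(usize, f64)>>,
}

impl Tape {
    fn leaf(&mut self, v: f64) -> usize {
        self.values.push(v);
        self.edges.push(Vec::new());
        self.values.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        let id = self.leaf(va * vb);
        self.edges[id] = vec![(a, vb), (b, va)];
        id
    }
    /// Autograd-aware GELU: records the backward edge to its input.
    fn gelu_tracked(&mut self, a: usize) -> usize {
        let x = self.values[a];
        let id = self.leaf(gelu(x));
        self.edges[id] = vec![(a, gelu_grad(x))];
        id
    }
    /// Severed GELU: same forward value, but the output is a fresh leaf.
    fn gelu_severed(&mut self, a: usize) -> usize {
        let x = self.values[a];
        self.leaf(gelu(x))
    }
    /// Reverse sweep; tape order is topological because parents are created first.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.values.len()];
        grads[out] = 1.0;
        for id in (0..=out).rev() {
            let g = grads[id];
            for &(p, d) in &self.edges[id] {
                grads[p] += g * d;
            }
        }
        grads
    }
}

/// Builds y = w2 * gelu(w1 * x) and returns (y, dy/dw1, dy/dw2).
fn run(severed: bool) -> (f64, f64, f64) {
    let mut t = Tape::default();
    let w1 = t.leaf(0.7);
    let x = t.leaf(1.5);
    let w2 = t.leaf(-0.3);
    let h = t.mul(w1, x);
    let a = if severed { t.gelu_severed(h) } else { t.gelu_tracked(h) };
    let y = t.mul(w2, a);
    let g = t.backward(y);
    (t.values[y], g[w1], g[w2])
}
```

Calling `run(false)` and `run(true)` gives the same forward value and the same gradient for the downstream `w2`, but the severed variant leaves the upstream `w1` with a zero gradient. That is why a forward-only test suite stays green while training silently freezes the upstream parameters.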
RED-confirmed two ways: (1) the original functional-gelu FFN reports NO gradient
for linear1/norm2; (2) detaching the attention output edge freezes all
attention q/k/v/out + norm1 params even though loss still drops via FFN+lm_head
— proving guard (b) is an independent severed-graph detector that per-layer
gradchecks miss in composition.
Contract: OBLIG-TRANSFORMER-END-TO-END-TRAINABLE
(contracts/transformer-end-to-end-trainable-v1.yaml; pv validate + pv lint pass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Jun 24, 2026
…rad-aware Tensor ops (sever-graph sweep, PMAT-922)
PMAT-921 (#2213) proved the ENCODER FFN's nn::functional::gelu severed the autograd graph: it builds its output via Tensor::from_vec (no grad_fn), so gradient could not flow past it, freezing ffn.linear1 + norm2 in every real end-to-end training run while the isolated per-layer gradcheck stayed green. That is a CLASS: any nn LAYER whose forward calls a functional helper that rebuilds its output as a fresh leaf (Tensor::from_vec / Tensor::new) severs autograd for everything upstream. This sweep enumerates the rest and fixes them.
ENUMERATION (functional::* called from crates/aprender-core/src/nn/** forward paths) and verdict:
- rnn.rs sigmoid/tanh -> functional::sigmoid/tanh -> x.sigmoid()/x.tanh_() = AUTOGRAD-AWARE, not severed.
- transformer/mod.rs softmax_last_dim -> records SoftmaxLastDimBackward (PMAT-914) = not severed.
- normalization layer_norm / group_norm / rms_norm = already fixed (PMAT-907).
- encoder FFN gelu = fixed by PMAT-921.
- SEVERED (this PR):
  1. TransformerDecoderLayer::forward_with_memory FFN `gelu(&ff_out)` (positional_encoding.rs): the exact decoder twin of the PMAT-921 encoder bug. Routed through the autograd-aware Tensor::gelu.
  2. Dropout::forward (dropout/mod.rs) in TRAINING mode (p>0) rebuilt the scaled output via Tensor::new, which severed the graph. Now builds the inverted-dropout mask as a constant tensor and applies it via Tensor::mul (records MulBackward).
  3. nn::functional::dropout (functional.rs), used by attention's apply_dropout, had the same Tensor::from_vec sever and gets the same mask+mul fix.
All three fixes preserve forward numerics exactly (per-element x*mask equals the old scaled value; Tensor::gelu uses the identical tanh GELU approximation as functional::gelu). Only the backward edge is restored.
RED/GREEN + mutation-verified (revert one fix -> its falsifier goes RED):
- decoder_ffn_gelu_grad_flows_to_linear1_and_norm3: with dropout off, the severed decoder gelu gives linear2 (downstream) a gradient but leaves linear1.weight + norm3.gamma (upstream) frozen; the fix restores both.
- dropout_layer_grad_flows_to_input_in_training_mode
- functional_dropout_grad_flows_to_input
All 14007 aprender-core lib tests pass.
Contract: transformer-end-to-end-trainable-v1.yaml extended with OBLIG-FUNCTIONAL-GELU-BACKWARD-GRAD (decoder) and OBLIG-FUNCTIONAL-DROPOUT-BACKWARD-GRAD plus single-line falsifier refs 002/003/004 (pv validate: 0 err/0 warn; pv lint contracts/: PASS).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
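The mask+mul dropout fix described in this commit can be illustrated on plain slices. This is a sketch under stated assumptions rather than the crate's code: `Lcg`, `dropout_mask`, `apply_mask`, and `mask_backward` are hypothetical names, the LCG constants are Knuth's MMIX values and not necessarily the generator the tests use, and the old path is assumed to have computed `x * scale` for kept elements and `0` for dropped ones.

```rust
/// Minimal LCG for reproducible masks (stand-in generator; constants are an assumption).
pub struct Lcg(pub u64);

impl Lcg {
    /// Next uniform sample in [0, 1).
    pub fn next_f32(&mut self) -> f32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The top 24 bits are exactly representable in an f32.
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Inverted-dropout mask: 0 for dropped elements, 1/(1-p) for kept ones.
pub fn dropout_mask(len: usize, p: f32, rng: &mut Lcg) -> Vec<f32> {
    let scale = 1.0 / (1.0 - p);
    (0..len)
        .map(|_| if rng.next_f32() < p { 0.0 } else { scale })
        .collect()
}

/// Forward pass as an elementwise product with a constant mask.
pub fn apply_mask(x: &[f32], mask: &[f32]) -> Vec<f32> {
    x.iter().zip(mask).map(|(a, m)| a * m).collect()
}

/// Backward pass of y = x * mask for a constant mask: dL/dx = dL/dy * mask.
pub fn mask_backward(grad_out: &[f32], mask: &[f32]) -> Vec<f32> {
    grad_out.iter().zip(mask).map(|(g, m)| g * m).collect()
}
```

Expressing dropout as a multiply means the backward rule is the same multiply, so kept inputs receive a scaled gradient and dropped inputs receive zero, while the forward value for each element is unchanged from the assumed old `x * scale` path.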
END-TO-END capability proof (PMAT-921)
Closes the gap the autograd severed-graph sweep (PMAT-907/911/913/914) left open: the graph was verified only by per-layer finite-difference gradchecks, never by training a real model to a loss target. A composition of individually-correct layers can still freeze a parameter on the integration path that no per-layer gradcheck exercises.
What this beat adds
A tiny transformer assembled from apr's own nn modules: one-hot embedding → TransformerEncoderLayer{LayerNorm + MHA + LayerNorm + FFN} → lm_head (vocab=hidden=32, heads=2, seq=8), trained 200 Adam steps on a fixed deterministic memorize task. Two independent falsifier guards:
(a) Final loss << initial: init ≈ 3.565 (≈ ln(32) = near-uniform) → final ≈ 1.4e-5.
(b) Every trainable param group changed from init, ||p_final − p_init|| > 1e-6, AND received a finite non-zero gradient.
Everything is LCG-seeded for CI determinism; tiny + bounded so it is a fast per-PR test, not a bench.
REAL BUG FOUND (the proof did its job)
TransformerEncoderLayer's FFN called nn::functional::gelu, which builds its output via Tensor::from_vec and severs the autograd graph, freezing ffn.linear1 (weight+bias) and norm2 (γ+β) in every end-to-end training run, while the isolated attention gradcheck stayed green.
Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths use the identical tanh GELU approximation, so forward numerics are unchanged (all 14004 aprender-core lib tests still pass); only the backward edge is restored.
RED-confirmed two ways
(1) With the original functional-gelu FFN, ffn.linear1.weight/bias and norm2.gamma/beta report NO gradient (guard b RED).
(2) Detaching the attention output edge freezes all attention q/k/v/out + norm1 params even though loss still drops via FFN + lm_head.
Contract
OBLIG-TRANSFORMER-END-TO-END-TRAINABLE: contracts/transformer-end-to-end-trainable-v1.yaml (pv validate + pv lint contracts/ pass, single-line falsifier ref).
🤖 Generated with Claude Code