test(autograd): end-to-end tiny-transformer trains to decreasing loss — all params update (PMAT-921) #2213

Merged: noahgift merged 1 commit into main from beat/e2e-training-smoke-pmat921 on Jun 24, 2026

@noahgift (Contributor)

END-TO-END capability proof (PMAT-921)

Closes the gap the autograd severed-graph sweep (PMAT-907/911/913/914) left open: the graph was verified only by per-layer finite-difference gradchecks, never by training a real model to a loss target. A composition of individually-correct layers can still freeze a parameter on the integration path that no per-layer gradcheck exercises.

What this beat adds

A tiny transformer assembled from apr's own nn modules (one-hot embedding → TransformerEncoderLayer{LayerNorm + MHA + LayerNorm + FFN} → lm_head; vocab=hidden=32, heads=2, seq=8), trained for 200 Adam steps on a fixed deterministic memorize task. Two independent falsifier guards:

  • (a) loss collapse: final ≪ initial. Observed init ≈ 3.565, close to ln(32) ≈ 3.466 (the cross-entropy of a uniform prediction over the 32-entry vocabulary) → final ≈ 1.4e-5.
  • (b) every param updates: for EVERY trainable group — embedding weight, attention q/k/v/out weight+bias, both LayerNorm γ+β, FFN linear1/linear2 weight+bias, lm_head weight+bias — ||p_final − p_init|| > 1e-6 AND a finite non-zero gradient was received.
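
The two guards can be sketched in plain Rust as follows. This is only an illustration of the checks described above, not the test in this PR: `ParamGroup`, `l2_distance`, `loss_collapsed`, `frozen_groups`, and the `ratio` threshold are hypothetical names and parameters, and the snapshots are assumed to have been copied out of the model before and after training.

```rust
/// Hypothetical snapshot of one named parameter group (not an aprender type).
struct ParamGroup {
    name: &'static str,
    init: Vec<f32>,         // values before training
    fin: Vec<f32>,          // values after training
    grad: Option<Vec<f32>>, // None models "no gradient was received"
}

/// Euclidean distance ||a - b|| between two flat parameter vectors.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Guard (a): the final loss must be finite and far below the initial loss.
/// `ratio` is an assumed threshold, e.g. 0.01 for "at least 100x smaller".
fn loss_collapsed(initial: f32, final_loss: f32, ratio: f32) -> bool {
    initial.is_finite() && final_loss.is_finite() && final_loss < initial * ratio
}

/// Guard (b): names of groups that did not move by more than `min_delta`
/// or that did not receive a finite, non-zero gradient.
fn frozen_groups(groups: &[ParamGroup], min_delta: f32) -> Vec<&'static str> {
    groups
        .iter()
        .filter(|g| {
            let moved = l2_distance(&g.init, &g.fin) > min_delta;
            let grad_ok = match &g.grad {
                Some(v) => v.iter().all(|x| x.is_finite()) && v.iter().any(|x| *x != 0.0),
                None => false,
            };
            !(moved && grad_ok)
        })
        .map(|g| g.name)
        .collect()
}
```

A test built on this shape would assert that `frozen_groups(&groups, 1e-6)` is empty and that `loss_collapsed(init, fin, ratio)` holds. Reporting the names of the frozen groups rather than a single boolean makes a RED run point directly at the severed edge.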

Everything is LCG-seeded for CI determinism; tiny + bounded so it is a fast per-PR test, not a bench.
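
For readers unfamiliar with LCG seeding, here is a minimal sketch of a linear congruential generator used for deterministic weight initialization. The PR does not state which constants or output mapping the test uses; the multiplier and increment below are Knuth's MMIX constants, chosen only for illustration, and `Lcg` and `init_weights` are hypothetical names.

```rust
/// Minimal 64-bit linear congruential generator (illustrative constants).
struct Lcg(u64);

impl Lcg {
    /// Advances the state and returns its high 32 bits, which have the
    /// best statistical quality in a power-of-two-modulus LCG.
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    /// Uniform f32 in [-scale, scale), built from 24 random bits so the
    /// intermediate value is exactly representable in f32.
    fn next_f32(&mut self, scale: f32) -> f32 {
        let unit = (self.next_u32() >> 8) as f32 / (1u32 << 24) as f32; // [0, 1)
        (unit * 2.0 - 1.0) * scale
    }
}

/// Fills a weight vector deterministically from a seed.
fn init_weights(seed: u64, n: usize, scale: f32) -> Vec<f32> {
    let mut rng = Lcg(seed);
    (0..n).map(|_| rng.next_f32(scale)).collect()
}
```

Because the generator depends only on the seed and integer arithmetic with wrapping overflow, two calls with the same seed produce the same weights on every machine, which is what makes a fixed loss target usable in CI.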

REAL BUG FOUND (the proof did its job)

TransformerEncoderLayer's FFN called nn::functional::gelu, which builds its output via Tensor::from_vec and severs the autograd graph — freezing ffn.linear1 (weight+bias) and norm2 (γ+β) in every end-to-end training run, while the isolated attention gradcheck stayed green.

Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths use the identical tanh GELU approximation, so forward numerics are unchanged — all 14004 aprender-core lib tests still pass; only the backward edge is restored.
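
The failure mode can be reproduced on a toy scalar tape. The sketch below is not aprender's autograd; `Tape`, `Op`, `gelu_detached`, and `ffn_grads` are hypothetical names. It models y = w2 * gelu(w1 * x) and compares a GELU that records its input with one that stores the same forward value as a fresh leaf, which is the assumed analogue of building the output via Tensor::from_vec.

```rust
#[derive(Clone, Copy)]
enum Op {
    Leaf,
    Mul(usize, usize),
    Gelu(usize),
}

/// Tanh approximation of GELU.
fn gelu_tanh(x: f64) -> f64 {
    let c = (2.0 / std::f64::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Derivative of the tanh approximation with respect to x.
fn gelu_tanh_grad(x: f64) -> f64 {
    let c = (2.0 / std::f64::consts::PI).sqrt();
    let t = (c * (x + 0.044715 * x * x * x)).tanh();
    0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * c * (1.0 + 3.0 * 0.044715 * x * x)
}

/// Toy reverse-mode tape over scalars; node ids are indices into `vals`.
struct Tape {
    vals: Vec<f64>,
    ops: Vec<Op>,
}

impl Tape {
    fn new() -> Self {
        Tape { vals: Vec::new(), ops: Vec::new() }
    }
    fn push(&mut self, v: f64, op: Op) -> usize {
        self.vals.push(v);
        self.ops.push(op);
        self.vals.len() - 1
    }
    fn leaf(&mut self, v: f64) -> usize {
        self.push(v, Op::Leaf)
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let v = self.vals[a] * self.vals[b];
        self.push(v, Op::Mul(a, b))
    }
    /// Autograd-aware GELU: records its input so backward can reach it.
    fn gelu(&mut self, a: usize) -> usize {
        let v = gelu_tanh(self.vals[a]);
        self.push(v, Op::Gelu(a))
    }
    /// Severing GELU: identical forward value, but stored as a fresh leaf.
    fn gelu_detached(&mut self, a: usize) -> usize {
        let v = gelu_tanh(self.vals[a]);
        self.leaf(v)
    }
    /// Returns d(out)/d(node) for every node on the tape.
    fn backward(&self, out: usize) -> Vec<f64> {
        let mut g = vec![0.0; self.vals.len()];
        g[out] = 1.0;
        for i in (0..=out).rev() {
            let gi = g[i];
            match self.ops[i] {
                Op::Leaf => {} // nothing upstream: gradient stops here
                Op::Mul(a, b) => {
                    g[a] += gi * self.vals[b];
                    g[b] += gi * self.vals[a];
                }
                Op::Gelu(a) => g[a] += gi * gelu_tanh_grad(self.vals[a]),
            }
        }
        g
    }
}

/// Returns (dy/dw1, dy/dw2, y) for y = w2 * gelu(w1 * x).
fn ffn_grads(severed: bool) -> (f64, f64, f64) {
    let mut t = Tape::new();
    let x = t.leaf(0.7);
    let w1 = t.leaf(1.3);
    let w2 = t.leaf(-0.4);
    let h = t.mul(w1, x);
    let a = if severed { t.gelu_detached(h) } else { t.gelu(h) };
    let y = t.mul(w2, a);
    let g = t.backward(y);
    (g[w1], g[w2], t.vals[y])
}
```

Calling `ffn_grads(true)` and `ffn_grads(false)` yields the same forward value and the same gradient for `w2` (downstream of the activation), but the gradient for `w1` (upstream) is zero in the severed case. That is why forward-only tests and a gradcheck of a different sublayer can stay green while the upstream parameters never train.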

RED-confirmed two ways

  1. Original functional-gelu FFN → ffn.linear1.weight/bias, norm2.gamma/beta report NO gradient (guard b RED).
  2. Detaching the attention output edge → all attention q/k/v/out + norm1 params frozen (guard b RED) even though loss still drops via FFN+lm_head — proving guard (b) is an independent severed-graph detector that per-layer gradchecks miss in composition.

Contract

OBLIG-TRANSFORMER-END-TO-END-TRAINABLE in contracts/transformer-end-to-end-trainable-v1.yaml (pv validate + pv lint contracts/ pass, single-line falsifier ref).

🤖 Generated with Claude Code

… — all params update (PMAT-921)

END-TO-END capability proof closing the gap the severed-graph sweep
(PMAT-907/911/913/914) left open: the autograd graph was verified only by
PER-LAYER finite-difference gradchecks, never by training a real model to a
loss target. A composition of individually-correct layers can still freeze a
parameter on the INTEGRATION path that no per-layer gradcheck exercises.

This beat builds a tiny transformer from apr's own nn modules
(one-hot embedding -> TransformerEncoderLayer{LayerNorm+MHA+LayerNorm+FFN}
-> lm_head; vocab=hidden=32, heads=2, seq=8) and trains it 200 Adam steps on a
fixed deterministic memorize task. Two independent falsifier guards:
  (a) final loss << initial (observed init ~3.57 ~= ln(32), final ~1e-5), and
  (b) EVERY trainable param group — embedding, attn q/k/v/out weight+bias, both
      LayerNorm gamma+beta, FFN linear1/linear2 weight+bias, lm_head weight+bias
      — genuinely CHANGED from init AND received a finite non-zero gradient.
Everything is LCG-seeded for CI determinism; the model is tiny + bounded so it
runs as a fast per-PR test, not a slow bench.

REAL BUG FOUND (the proof did its job): TransformerEncoderLayer's FFN called
nn::functional::gelu, which builds its output via Tensor::from_vec and SEVERS
the autograd graph — freezing ffn.linear1 (weight+bias) and norm2 (gamma+beta)
in EVERY end-to-end training run, while the isolated attention gradcheck stayed
green. Fix: route the FFN through the autograd-aware Tensor::gelu. Both paths
use the identical tanh GELU approximation, so forward numerics are unchanged
(all 14004 aprender-core lib tests still pass); only the backward edge is
restored.

RED-confirmed two ways: (1) the original functional-gelu FFN reports NO gradient
for linear1/norm2; (2) detaching the attention output edge freezes all
attention q/k/v/out + norm1 params even though loss still drops via FFN+lm_head
— proving guard (b) is an independent severed-graph detector that per-layer
gradchecks miss in composition.

Contract: OBLIG-TRANSFORMER-END-TO-END-TRAINABLE
(contracts/transformer-end-to-end-trainable-v1.yaml; pv validate + pv lint pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@noahgift enabled auto-merge June 24, 2026 09:44
@noahgift added this pull request to the merge queue Jun 24, 2026
@github-merge-queue (bot) removed this pull request from the merge queue due to failed status checks Jun 24, 2026
@noahgift added this pull request to the merge queue Jun 24, 2026
Merged via the queue into main with commit 1b0d01a Jun 24, 2026
11 checks passed
@noahgift deleted the beat/e2e-training-smoke-pmat921 branch June 24, 2026 11:33
noahgift added a commit that referenced this pull request Jun 24, 2026
…rad-aware Tensor ops (sever-graph sweep, PMAT-922)

PMAT-921 (#2213) proved the ENCODER FFN's nn::functional::gelu severed the
autograd graph — it builds its output via Tensor::from_vec (no grad_fn), so
gradient could not flow past it, freezing ffn.linear1 + norm2 in every real
end-to-end training run while the isolated per-layer gradcheck stayed green.
That is a CLASS: any nn LAYER whose forward calls a functional helper that
rebuilds its output as a fresh leaf (Tensor::from_vec / Tensor::new) severs
autograd for everything upstream. This sweep enumerates the rest and fixes them.

ENUMERATION (functional::* called from crates/aprender-core/src/nn/** forward
paths) and verdict:
  - rnn.rs sigmoid/tanh -> functional::sigmoid/tanh -> x.sigmoid()/x.tanh_()
    = AUTOGRAD-AWARE, not severed.
  - transformer/mod.rs softmax_last_dim -> records SoftmaxLastDimBackward (PMAT-914)
    = not severed.
  - normalization layer_norm / group_norm / rms_norm = already fixed (PMAT-907).
  - encoder FFN gelu = fixed by PMAT-921.
  - SEVERED (this PR):
    1. TransformerDecoderLayer::forward_with_memory FFN `gelu(&ff_out)`
       (positional_encoding.rs) — the exact decoder twin of the PMAT-921 encoder
       bug. Routed through the autograd-aware Tensor::gelu.
    2. Dropout::forward (dropout/mod.rs) in TRAINING mode (p>0) rebuilt the scaled
       output via Tensor::new — severed. Now builds the inverted-dropout mask as a
       constant tensor and applies it via Tensor::mul (records MulBackward).
    3. nn::functional::dropout (functional.rs), used by attention's apply_dropout,
       same Tensor::from_vec sever — same mask+mul fix.

All three fixes preserve forward numerics exactly (per-element x*mask equals the
old scaled value; Tensor::gelu uses the identical tanh GELU approximation as
functional::gelu). Only the backward edge is restored.
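
The mask+mul idea for dropout can be sketched on plain slices. This is an illustration under assumptions, not the aprender code: the function names are hypothetical, the keep/drop decisions are passed in rather than sampled, and the old path is assumed to have computed `x * scale` with `scale = 1/(1-p)` for kept elements.

```rust
/// Inverted-dropout mask: 0.0 for dropped elements, 1/(1-p) for kept ones.
/// The mask is a constant with respect to the input.
fn dropout_mask(keep: &[bool], p: f32) -> Vec<f32> {
    let scale = 1.0 / (1.0 - p);
    keep.iter().map(|&k| if k { scale } else { 0.0 }).collect()
}

/// Assumed shape of the old path: rebuild the scaled output element by element.
fn forward_rebuilt(x: &[f32], keep: &[bool], p: f32) -> Vec<f32> {
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .zip(keep)
        .map(|(&v, &k)| if k { v * scale } else { 0.0 })
        .collect()
}

/// New path: an elementwise multiply by the constant mask.
fn forward_masked(x: &[f32], mask: &[f32]) -> Vec<f32> {
    x.iter().zip(mask).map(|(v, m)| v * m).collect()
}

/// Backward of the elementwise multiply with respect to x:
/// grad_x = grad_out * mask, so kept elements pass gradient upstream.
fn backward_masked(grad_out: &[f32], mask: &[f32]) -> Vec<f32> {
    grad_out.iter().zip(mask).map(|(g, m)| g * m).collect()
}
```

With p = 0.5 and `keep = [true, false, true, true]`, both forward paths map `[1.0, -2.0, 3.0, 0.5]` to `[2.0, 0.0, 6.0, 1.0]`, and `backward_masked` turns an all-ones upstream gradient into `[2.0, 0.0, 2.0, 2.0]`. Expressing the op as a multiply lets the existing multiply backward carry the gradient, instead of requiring a dedicated dropout backward.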

RED/GREEN + mutation-verified (revert one fix -> its falsifier goes RED):
  - decoder_ffn_gelu_grad_flows_to_linear1_and_norm3: with dropout off, the
    severed decoder gelu gives linear2 (downstream) a gradient but leaves
    linear1.weight + norm3.gamma (upstream) frozen; the fix restores both.
  - dropout_layer_grad_flows_to_input_in_training_mode
  - functional_dropout_grad_flows_to_input
All 14007 aprender-core lib tests pass.

Contract: transformer-end-to-end-trainable-v1.yaml extended with
OBLIG-FUNCTIONAL-GELU-BACKWARD-GRAD (decoder) and
OBLIG-FUNCTIONAL-DROPOUT-BACKWARD-GRAD plus single-line falsifier refs 002/003/004
(pv validate: 0 err/0 warn; pv lint contracts/: PASS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>