A from-scratch GPT that learns to write like a dead U.S. president, trained on the inaugural and State of the Union addresses of presidents who are no longer with us — George Washington (1790) through George H.W. Bush (1992).
It started as a "dead presidents" riff on Andrej Karpathy's
microGPT: a tiny scalar
autograd engine, a GPT-2–style pre-norm transformer (multi-head causal
self-attention + MLP + RMSNorm), an Adam optimizer, and a train/sample loop —
all in a few hundred lines of standard-library Python. That engine
(src/microgpt.py) is still here, unchanged and dependency-free. It is gloriously
slow, and that's the point: every multiply and every gradient is a visible
Python object.
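To make "every gradient is a visible Python object" concrete, here is a minimal sketch of a scalar autograd node in the micrograd style. It is an illustration only, not the contents of src/microgpt.py; the class name `Value` matches the layout comment further down, but the attribute names and the two supported operations are assumptions.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
loss = a * b + c   # 2 * -3 + 10
loss.backward()
print(loss.data, a.grad, b.grad, c.grad)  # 4.0 -3.0 2.0 1.0
```

Every intermediate result is its own object with its own `grad`, which is exactly why this style is easy to trace by hand and slow to run.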
But "gloriously slow" (≈ 8 seconds per training step) means real weights — the kind that produce recognizable words — are days of CPU away, and it makes a Karpathy-style autoresearch loop (edit → train a few minutes → keep if the metric improved → repeat) impossible. So this repo now has two engines for one model:
| engine | file | deps | speed | role |
|---|---|---|---|---|
| scalar | src/microgpt.py | none (python3) | ~8 s/step | the exhibit: hand-traceable scalar autograd, and the canonical sampler |
| fast | src/fast.py | NumPy | ~5,400 examples/s | the workhorse: same math, batched, for actually training good weights |
They are the same model: identical architecture, identical math, a shared
JSON checkpoint format. dp-verify proves it — given the same weights the two
engines produce identical logits to ~1e-16 (machine precision), and the fast
backend's hand-written gradients match finite differences to ~1e-5. So you
can train fast and sample faithfully through the slow, legible scalar engine — or
just read src/microgpt.py to see exactly what the fast one is doing underneath.
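The gradient half of that check can be sketched in a few lines. This is not dp-verify itself; it is a standalone illustration of comparing a hand-written gradient against central finite differences, using a toy softmax cross-entropy loss. All function names here are hypothetical.

```python
import math

def loss_fn(w):
    """Toy scalar loss: softmax cross-entropy of logits w against class 0."""
    m = max(w)
    log_z = m + math.log(sum(math.exp(x - m) for x in w))
    return log_z - w[0]

def analytic_grad(w):
    """Hand-written gradient: softmax(w) minus the one-hot target."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    z = sum(exps)
    return [e / z - (1.0 if i == 0 else 0.0) for i, e in enumerate(exps)]

def numeric_grad(f, w, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

w = [0.3, -1.2, 2.0, 0.1]
max_err = max(abs(a - n) for a, n in zip(analytic_grad(w), numeric_grad(loss_fn, w)))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

A full-model check works the same way, just with many more coordinates; the achievable agreement is limited by the step size `eps` and floating-point rounding rather than by the gradient code.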
Two public-domain sources of U.S. presidential rhetoric (works of the federal government, 17 U.S.C. § 105):
- Inaugural addresses: the canonical NLTK inaugural corpus (51 addresses).
- State of the Union addresses: from stdlib-js/datasets-sotu (204 addresses).
We deliberately include only deceased presidents. The five still living —
Bill Clinton, George W. Bush, Barack Obama, Donald Trump, Joe Biden — are
excluded from both sources; everyone else is in, including George H.W. Bush
(d. 2018), Jimmy Carter (d. 2024), Gerald Ford (d. 2006), and Ronald Reagan
(d. 2004). Note that the corpus builder distinguishes the two Bushes: george_bush
(H.W.) is kept; only george_w_bush (W.) is dropped.
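A sketch of that filter, assuming each address is tagged with a president slug. Only `george_bush` and `george_w_bush` are named in this README; the other slugs and the `(year, slug)` tuple shape are hypothetical, and this is not the code in src/corpus.py.

```python
# Slugs are hypothetical except george_bush / george_w_bush, which the text names.
LIVING = {"bill_clinton", "george_w_bush", "barack_obama", "donald_trump", "joe_biden"}

def is_included(president_slug: str) -> bool:
    """Keep an address only if its president is not in the living set.

    The comparison is an exact match on the whole slug, so 'george_bush'
    (H.W.) is never confused with 'george_w_bush'.
    """
    return president_slug not in LIVING

addresses = [("1989", "george_bush"), ("2001", "george_w_bush"), ("1981", "ronald_reagan")]
kept = [(year, who) for year, who in addresses if is_included(who)]
print(kept)  # [('1989', 'george_bush'), ('1981', 'ronald_reagan')]
```

The point of the exact match is that a substring test such as `"bush" in slug` would drop both Bushes.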
That's 255 addresses normalized into 30,725 lowercase sentences (~3.0M
characters; data/presidents.txt), a 33-character vocabulary plus a begin/end
marker (34 tokens total). The State of the Union addresses are the bulk of it,
and they shift the diction toward the administrative register — secretary of
the treasury, expenditures, appropriations, the present session of Congress —
where the inaugurals lean loftier.
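A character-level vocabulary with a single begin/end marker can be sketched as follows. The marker's name and its id (0) are assumptions for illustration; the real mapping lives in src/data.py and may differ.

```python
BOS = "<bos>"  # hypothetical name for the begin/end marker

def build_vocab(sentences):
    """Map each distinct character to an id; id 0 is reserved for the marker."""
    chars = sorted(set("".join(sentences)))
    itos = [BOS] + chars
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(sentence, stoi):
    """Wrap a sentence in the marker on both sides."""
    return [stoi[BOS]] + [stoi[ch] for ch in sentence] + [stoi[BOS]]

def decode(ids, itos):
    """Drop the markers and join the characters back together."""
    return "".join(itos[i] for i in ids if itos[i] != BOS)

sentences = ["we the people", "fellow citizens"]
stoi, itos = build_vocab(sentences)
ids = encode(sentences[0], stoi)
print(len(itos), ids[0], ids[-1], decode(ids, itos))
```

On the full corpus the same construction would give 33 characters plus the marker, i.e. the 34 tokens quoted above.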
Dependencies are managed with uv. The scalar engine needs nothing but Python; everything else uses NumPy.
uv sync # create the env, install NumPy, install the dp-* commands
# --- the exhibit: pure-Python scalar engine (no NumPy) -----------------------
python src/microgpt.py # train a tiny model the slow, legible way, then sample
# --- the workhorse: NumPy fast backend ---------------------------------------
uv run dp-build-corpus # (re)download + rebuild data/presidents.txt (committed, so optional)
uv run dp-train --steps 3000 --n-embd 40 --n-layer 3 --block-size 112 --save out.json
uv run dp-sample # sample from checkpoints/best.json (the trained winner)
uv run dp-sample --check-parity # re-confirm scalar == fast on this exact checkpoint
# --- the autoresearch sweep --------------------------------------------------
uv run dp-autoresearch --mode grid # reproducible baseline grid
uv run dp-autoresearch --mode collect # fold runs/ into a results.tsv leaderboard
uv run dp-optimize # train bigger candidates in parallel, adopt the best
# --- the equivalence gate ----------------------------------------------------
uv run dp-verify
A checkpoint trained by either engine loads in the other:
# train slowly in pure Python and save:
python src/microgpt.py --steps 500 --save ckpt.json
# ...or load a fast-trained checkpoint into the scalar engine and sample:
python src/microgpt.py --init-from checkpoints/best.json --steps 0 --n-samples 3
The trained model is tiny (~465k params, ~1.9 MB as float32), so it runs entirely
client-side — no server. web/ is not tracked in this repo (it's
gitignored and built separately); uv run dp-export-web regenerates it locally
from the current checkpoint. It is a self-contained static app: a classic Web
Worker holds the weights and answers requests behind an OpenAI-shaped API
(client.chat.completions.create({messages, stream}) returning standard
chat.completion chunks with usage token counts). The chat UI seeds generation
from your message and continues it in dead-president voice, with live token/throughput
stats. It is not an assistant — a séance, not a chatbot.
uv run dp-export-web # checkpoints/best.json -> web/model.bin + web/model.json
python3 web/serve.py # serves web/ at http://localhost:8137 with caching OFF
Use web/serve.py, not python3 -m http.server: the plain server lets the
browser cache the Worker/engine, which causes stale-mix bugs after you re-export
or edit (you'll see 304s and no model.bin fetch). serve.py sends
Cache-Control: no-store. If you ever do hit a stale page, hard-reload
(Cmd/Ctrl+Shift+R).
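For reference, a static server that sends `Cache-Control: no-store` on every response takes only a few lines of standard-library Python. This is a sketch of the idea, not the contents of web/serve.py; the class and function names are hypothetical, and the port simply mirrors the one mentioned above.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class NoStoreHandler(SimpleHTTPRequestHandler):
    """Static file handler that tells the browser not to cache anything."""

    def end_headers(self):
        # Added to every response just before the header block is closed.
        self.send_header("Cache-Control", "no-store")
        super().end_headers()

def serve(directory="web", port=8137):
    """Serve `directory` on localhost until interrupted (Ctrl+C)."""
    handler = partial(NoStoreHandler, directory=directory)
    with ThreadingHTTPServer(("127.0.0.1", port), handler) as httpd:
        httpd.serve_forever()

# To run it: serve()
```

Because the browser stores nothing, every reload refetches the Worker, the engine, and model.bin, which is what you want right after a re-export.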
Both engines share these flags. The scalar default is an 8,568-parameter model; the
trained winner (below) is 464,928 (n_embd=96, n_layer=4, block_size=192, continuous).
| flag | scalar default | meaning |
|---|---|---|
| --n-embd | 24 | embedding width |
| --n-layer | 1 | transformer blocks |
| --n-head | 4 | attention heads (n_embd % n_head == 0) |
| --block-size | 32 | context length in characters |
| --lr | 0.02 | base learning rate (cosine-decayed) |
| --temperature | 0.8 | sampling temperature |
The fast backend adds --batch-size, --warmup, --min-lr-frac, and a held-out
validation split reported as bits per character (val_bpc) — the metric the
autoresearch loop minimizes.
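The parameter counts quoted in this README can be reproduced with a short formula. The layout below is an inference from those numbers, not something the README states: bias-free linear layers, a 4x MLP, token embeddings tied to the output projection, learned position embeddings, and one RMSNorm gain vector per norm (two per block plus a final one). The function name is hypothetical.

```python
def count_params(n_embd, n_layer, block_size, vocab_size=34):
    """Parameter count under the assumed layout described above."""
    token_emb = vocab_size * n_embd      # assumed tied with the output projection
    pos_emb = block_size * n_embd
    attn = 4 * n_embd * n_embd           # Q, K, V, and output projections
    mlp = 2 * n_embd * (4 * n_embd)      # up- and down-projection
    norms = 2 * n_embd                   # two RMSNorm gains per block
    final_norm = n_embd
    return token_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm

print(count_params(24, 1, 32))    # 8568   (scalar default)
print(count_params(40, 3, 112))   # 63720
print(count_params(72, 4, 112))   # 259992 (assuming block_size=112 for e72/l4)
print(count_params(96, 4, 192))   # 464928
```

Note that `--n-head` does not appear: splitting the same projections into more heads changes how they are used, not how many weights there are.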
The metric is val_bpc, bits per character on a held-out 10% of sentences.
An untrained model sits at log2(34) ≈ 5.09. The original scalar README
celebrated merely descending from there; the question is how low you can get,
and what architecture gets you there.
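Bits per character is just the mean cross-entropy expressed in base 2. The helpers below are a small sketch (the function names are hypothetical) showing the uniform baseline and the bits-to-nats conversion used in the results.

```python
import math

def bpc_to_nats(bpc):
    """Convert bits/char to nats/char."""
    return bpc * math.log(2)

def nats_to_bpc(nats_per_char):
    """Convert a mean cross-entropy in nats/char to bits/char."""
    return nats_per_char / math.log(2)

def mean_bpc(target_probs):
    """Bits/char given the probability the model gave each true next character."""
    return -sum(math.log2(p) for p in target_probs) / len(target_probs)

uniform = mean_bpc([1 / 34] * 100)    # equal mass on all 34 tokens
print(round(uniform, 2))              # 5.09
print(round(bpc_to_nats(2.029), 2))   # 1.41
```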
We ran a Karpathy-style sweep (dp-autoresearch): one metric, a fixed
3,000-step budget per experiment for fair ranking, keep-if-better. Seven
parallel "researcher" agents each explored one region — width, depth,
learning-rate schedule, batch size, context length, attention heads, and a
wildcard combo — for 59 experiments total, then the top configs were
re-run across 3 seeds to reject seed-luck (each seed also reshuffles the
train/val split, so this is a strong robustness test).
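The keep-if-better rule at the heart of such a sweep fits in one short loop. The sketch below is not src/research.py: `toy_val_bpc` is a made-up stand-in for a fixed-budget training run (its numbers are not real results), and `propose` is a hypothetical mutation step.

```python
import random

def toy_val_bpc(config):
    """Stand-in for a fixed-budget training run (toy numbers, not real results)."""
    lr_penalty = abs(config["lr"] - 0.018) * 10
    warmup_penalty = 0.0 if config["warmup"] > 0 else 0.3
    return 2.0 + lr_penalty + warmup_penalty

def propose(config, rng):
    """Change exactly one hyperparameter of the current best config."""
    new = dict(config)
    if rng.random() < 0.5:
        new["lr"] = config["lr"] * rng.choice([0.8, 1.25])
    else:
        new["warmup"] = rng.choice([0, 150, 250])
    return new

def keep_if_better(start, n_experiments, evaluate, seed=0):
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    history = [(start, best_score)]
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        history.append((candidate, score))
        if score < best_score:   # one metric, lower is better
            best, best_score = candidate, score
    return best, best_score, history

start = {"lr": 0.05, "warmup": 0}
best, best_score, history = keep_if_better(start, 59, toy_val_bpc)
print(best, round(best_score, 3))
```

In a real sweep `evaluate` is the expensive part, which is why a fixed step budget per experiment matters: it keeps scores comparable and the loop fast enough to iterate.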
What actually moved the needle:
- Warmup was the master key. At a fixed learning rate with no warmup, every wider / deeper / longer-context model looked worse — apparent collapse. That was an init-time instability artifact, not a capacity ceiling. A short 150–250-step warmup flipped every region's verdict and unlocked the rest.
- Context length is the biggest single lever. block_size 24 → 112 dropped val_bpc monotonically (~2.47 → 2.09), plateauing by ~112.
- Depth beats width at matched (~63k) parameters: narrow-and-deep (n_embd=40, n_layer=3) beat wide-and-shallow (n_embd=64, n_layer=1) by ~0.04 bpc; depth composes long-range features better than raw width here.
- The optimizer wants a small, fully-decayed LR (cosine to zero); nonzero floors and hot LRs both hurt.
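The schedule these findings point to, a short warmup followed by cosine decay to zero, can be sketched as below. The parameter names echo the `--warmup` and `--min-lr-frac` flags, but the exact shape (linear warmup, where the cosine starts) is an assumption, not a copy of src/fast.py.

```python
import math

def lr_at(step, total_steps, base_lr=0.018, warmup=250, min_lr_frac=0.0):
    """Linear warmup to base_lr, then cosine decay to base_lr * min_lr_frac."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    floor = base_lr * min_lr_frac
    return floor + (base_lr - floor) * cosine

for s in (0, 249, 250, 1625, 3000):
    print(s, round(lr_at(s, 3000), 6))
```

With `min_lr_frac=0.0` the rate reaches zero at the last step; setting it above zero gives the "nonzero floor" variant that the sweep found to hurt.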
The winner (seed-robust mean val_bpc = 2.029 ± 0.011, ≈ 1.41 nats):
n_embd=40 n_layer=3 n_head=4 block_size=112 lr=0.018 batch_size=32 warmup=250
63,720 parameters
On the inaugural-only corpus this overfit: validation bottomed at step ~4,000 at val_bpc 2.04 and then rose while train-loss kept falling — textbook overfitting of a ~2,350-sentence training set.
The original sweep ran before the corpus included SOTU. Re-running the same winner config on the full ~30k-sentence corpus, the overfitting simply disappears — validation descends monotonically the whole way and is still improving at the end:
step 500 val_bpc 2.65
step 2500 val_bpc 1.97 (already past the inaugural-only best)
step 5000 val_bpc 1.85
step 10000 val_bpc 1.72
step 15000 val_bpc 1.667 ← still the minimum at the last step
At 63,720 parameters this reaches val_bpc 1.667 (≈ 1.16 nats) but is still descending at step 15,000. The lesson is the data, not the dial: ~12× more text turned an overfit model into an underfit one. So the next lever is scale.
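The overfit-versus-underfit call comes down to where the validation minimum sits on the curve. A small sketch using the checkpoints logged above (the helper name is hypothetical):

```python
def diagnose(curve):
    """curve: list of (step, val_bpc) in step order. Report where the minimum falls."""
    best_step, best_val = min(curve, key=lambda point: point[1])
    last_step = curve[-1][0]
    if best_step == last_step:
        return f"still improving at step {last_step} (val_bpc {best_val})"
    return f"minimum at step {best_step} (val_bpc {best_val}); later checkpoints are worse"

full_corpus = [(500, 2.65), (2500, 1.97), (5000, 1.85), (10000, 1.72), (15000, 1.667)]
print(diagnose(full_corpus))  # still improving at step 15000 (val_bpc 1.667)
```

A minimum at the final step says the run was cut short or the model is too small; a minimum in the middle, as on the inaugural-only corpus, says it overfit.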
A second sweep on the full corpus (dp-autoresearch scout grid, 4,000-step
budget, then the top configs trained to 15,000) searched larger models. Scaling
helps cleanly — e64/l4 at just 4k steps (1.58) already beat the e40/l3 winner —
and the converged finalists:
| config | params | val_bpc | steps |
|---|---|---|---|
| e72 / l4 — winner | 259,992 | 1.530 | 15,000 |
| e64 / l4 + batch 64 | 206,528 | 1.534 | 8,000 |
| e64 / l4 | 206,528 | 1.549 | 15,000 |
That e72/l4 model (val_bpc 1.530) held until v3. A follow-up isolated the
real remaining lever: continuous cross-sentence training. (A v2 experiment
added per-document "voice" conditioning + more scale and regressed — the
headers put the model out-of-distribution at eval and it overfit; dropped.)
Re-running at e96/l4, block 192, trained continuously with no conditioning
converged to val_bpc 1.475 (465k params) — the current checkpoints/best.json.
The progression, in one line: 5.09 → 2.04 → 1.667 → 1.530 → 1.475 bits/char — untrained → inaugural-only winner → + State of the Union → scaled up → continuous long-context.
1.475 bits/char (≈ 1.02 nats) is well below the old "real words" threshold of ~1.8 nats. From
the current checkpoints/best.json (e96/l4, continuous) at temperature 0.5,
seeded by a prompt:
the economy → of the state of the union and last year the chinese means to the
state of the progress of the solution.
my fellow americans → have been so much in exchange and attentioned political
expenditures are in the fiscal year ending june...
we must → continue to be no progress with regard to the provision of the nation.
Not grammatical — but unmistakably presidential, spanning two centuries of
State of the Union diction: fiscal year ending june, postmaster-general,
expenditures, the state of the union, the soviet union. Continuous training
gives noticeably cleaner runs (fewer made-up words) than the per-sentence v1.
Lower --temperature (e.g. 0.4) for more conservative diction; raise it (0.7)
for more invention.
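What the temperature knob does can be seen in a few lines: logits are divided by the temperature before the softmax, so lower values concentrate probability on the top characters. This is a generic sketch with hypothetical function names, not the sampler in either engine.

```python
import math
import random

def probs(logits, temperature):
    """softmax(logits / temperature), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max before exponentiating
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    weights = probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0]
print([round(p, 3) for p in probs(logits, 0.4)])
print([round(p, 3) for p in probs(logits, 0.7)])
print(sample_with_temperature(logits, 0.5, random.Random(0)))
```

At 0.4 the top logit takes a larger share of the probability mass than at 0.7, which is the "more conservative diction" effect described above.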
Sampling through the scalar engine is numerically identical (logits match to
~1.8e-14 on this checkpoint) but impractically slow: minutes per character with
a 192-token context and 4 layers. That is the "gloriously slow" exhibit taken to
its logical end. Use dp-sample (the fast backend) for the trained model; keep the
tiny scalar default for hand-traceable runs.
src/ # the deadpres package (import name: src)
├── microgpt.py # scalar Value-autograd GPT — dependency-free, self-contained, the readable exhibit
├── fast.py # NumPy backend: FastGPT, Adam, train, sample
├── data.py # Vocab, corpus load, train/val split, batching, bits/char metric
├── checkpoint.py # shared JSON checkpoint save/load
├── corpus.py # build the corpus (download/filter/normalize inaugural + SOTU)
├── research.py # autoresearch sweep + results.tsv + parallel train-and-adopt
└── cli/ # one thin entry point per dp-* command
web/ # client-side browser app (Web Worker + OpenAI-shaped API)
data/ # presidents.txt + raw/ (inaugural) + raw_sotu/ (State of the Union)
checkpoints/best.json # the trained winner's weights (e96/l4 continuous, val_bpc 1.475)
Commands (installed by uv sync): dp-build-corpus, dp-train, dp-sample,
dp-autoresearch, dp-optimize, dp-verify, dp-export-web. The scalar exhibit
runs with plain Python: python src/microgpt.py.
The presidential addresses are U.S. government works in the public domain. The code here is offered under the MIT license. Architecture and philosophy are indebted to Andrej Karpathy's micrograd / microGPT, and the search loop to his autoresearch.