iammatthias/dead-presidents

🪦 Dead Presidents GPT

A from-scratch GPT that learns to write like a dead U.S. president, trained on the inaugural and State of the Union addresses of presidents who are no longer with us — George Washington (1790) through George H.W. Bush (1992).

It started as a "dead presidents" riff on Andrej Karpathy's microGPT: a tiny scalar autograd engine, a GPT-2–style pre-norm transformer (multi-head causal self-attention + MLP + RMSNorm), an Adam optimizer, and a train/sample loop — all in a few hundred lines of standard-library Python. That engine (src/microgpt.py) is still here, unchanged and dependency-free. It is gloriously slow, and that's the point: every multiply and every gradient is a visible Python object.

But "gloriously slow" (≈ 8 seconds per training step) means real weights — the kind that produce recognizable words — are days of CPU away, and it makes a Karpathy-style autoresearch loop (edit → train a few minutes → keep if the metric improved → repeat) impossible. So this repo now has two engines for one model:

| engine | file | deps | speed | role |
| --- | --- | --- | --- | --- |
| scalar | src/microgpt.py | none (python3) | ~8 s/step | the exhibit: hand-traceable scalar autograd, and the canonical sampler |
| fast | src/fast.py | NumPy | ~5,400 examples/s | the workhorse: same math, batched, for actually training good weights |

They are the same model: identical architecture, identical math, a shared JSON checkpoint format. dp-verify proves it — given the same weights the two engines produce identical logits to ~1e-16 (machine precision), and the fast backend's hand-written gradients match finite differences to ~1e-5. So you can train fast and sample faithfully through the slow, legible scalar engine — or just read src/microgpt.py to see exactly what the fast one is doing underneath.
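The gradient half of that check can be illustrated with a central-difference comparison. This is a minimal sketch of the general technique, not the code behind dp-verify: the toy loss and the names `loss`, `analytic_grad`, and `finite_diff_grad` are assumptions made up for the example.

```python
import math

def loss(w):
    """Toy scalar loss standing in for a model's loss (hypothetical)."""
    return sum(math.tanh(x) ** 2 for x in w)

def analytic_grad(w):
    """Hand-written gradient: d/dx tanh(x)^2 = 2 * tanh(x) * (1 - tanh(x)^2)."""
    return [2.0 * math.tanh(x) * (1.0 - math.tanh(x) ** 2) for x in w]

def finite_diff_grad(f, w, eps=1e-5):
    """Central-difference estimate of df/dw, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def max_abs_diff(a, b):
    """Largest coordinate-wise disagreement between two gradients."""
    return max(abs(x - y) for x, y in zip(a, b))

w = [0.3, -1.2, 0.7]
err = max_abs_diff(analytic_grad(w), finite_diff_grad(loss, w))
print(f"max |analytic - numeric| = {err:.2e}")
```

A hand-written backward pass with a sign or indexing bug typically shows up here as a disagreement many orders of magnitude larger than the step size.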

The corpus

Two public-domain sources of U.S. presidential rhetoric (works of the federal government, 17 U.S.C. § 105):

  • Inaugural addresses — the canonical NLTK inaugural corpus (51 addresses).
  • State of the Union addresses — from stdlib-js/datasets-sotu (204 addresses).

We deliberately include only deceased presidents. The five still living (Bill Clinton, George W. Bush, Barack Obama, Donald Trump, Joe Biden) are excluded from both sources; everyone else is in, including George H.W. Bush (d. 2018), Jimmy Carter (d. 2024), Gerald Ford (d. 2006), and Ronald Reagan (d. 2004). Note that the corpus builder distinguishes the two Bushes: george_bush (H.W.) is kept, and only george_w_bush (W.) is dropped.
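The Bush distinction comes down to matching whole identifiers rather than substrings. A minimal sketch, assuming slug-style names: only george_bush and george_w_bush appear in the text above, and the other slugs, the `LIVING` set, and the `keep` helper are hypothetical.

```python
# Hypothetical slugs for illustration; the real corpus builder's naming may differ.
LIVING = {"bill_clinton", "george_w_bush", "barack_obama", "donald_trump", "joe_biden"}

def keep(president_slug):
    """Keep an address only if its president is not in the living set (exact match)."""
    return president_slug not in LIVING

candidates = ["george_washington", "george_bush", "george_w_bush", "jimmy_carter"]
kept = [slug for slug in candidates if keep(slug)]
print(kept)
```

A substring test such as `"bush" in slug` would drop both Bushes, which is why the exact set membership matters.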

That's 255 addresses normalized into 30,725 lowercase sentences (~3.0M characters; data/presidents.txt), a 33-character vocabulary plus a begin/end marker (34 tokens total). The State of the Union addresses are the bulk of it, and they shift the diction toward the administrative register — secretary of the treasury, expenditures, appropriations, the present session of Congress — where the inaugurals lean loftier.
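To make the token count concrete, here is a rough sketch of a character-level vocabulary with one shared begin/end marker. The function names and the choice to give the marker the last id are assumptions for illustration; the repo's own Vocab in src/data.py may be organized differently.

```python
def build_vocab(sentences):
    """Map each distinct character to an id, reserving one extra id for the marker."""
    chars = sorted(set("".join(sentences)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    marker_id = len(chars)  # single begin/end token (assumed to take the last id)
    return stoi, marker_id

def encode(sentence, stoi, marker_id):
    """Wrap a sentence in the marker so the model sees where sentences start and stop."""
    return [marker_id] + [stoi[ch] for ch in sentence] + [marker_id]

sentences = ["we the people.", "fellow citizens."]
stoi, marker_id = build_vocab(sentences)
tokens = encode(sentences[0], stoi, marker_id)
vocab_size = len(stoi) + 1  # characters plus the marker
print(vocab_size, tokens)
```

On the full corpus the same arithmetic gives 33 characters plus the marker, or 34 tokens.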

Quick start

Dependencies are managed with uv. The scalar engine needs nothing but Python; everything else uses NumPy.

uv sync                      # create the env, install NumPy, install the dp-* commands

# --- the exhibit: pure-Python scalar engine (no NumPy) -----------------------
python src/microgpt.py       # train a tiny model the slow, legible way, then sample

# --- the workhorse: NumPy fast backend ---------------------------------------
uv run dp-build-corpus       # (re)download + rebuild data/presidents.txt (committed, so optional)
uv run dp-train --steps 3000 --n-embd 40 --n-layer 3 --block-size 112 --save out.json
uv run dp-sample             # sample from checkpoints/best.json (the trained winner)
uv run dp-sample --check-parity   # re-confirm scalar == fast on this exact checkpoint

# --- the autoresearch sweep --------------------------------------------------
uv run dp-autoresearch --mode grid    # reproducible baseline grid
uv run dp-autoresearch --mode collect # fold runs/ into a results.tsv leaderboard
uv run dp-optimize                    # train bigger candidates in parallel, adopt the best

# --- the equivalence gate ----------------------------------------------------
uv run dp-verify

A checkpoint trained by either engine loads in the other:

# train slowly in pure Python and save:
python src/microgpt.py --steps 500 --save ckpt.json
# ...or load a fast-trained checkpoint into the scalar engine and sample:
python src/microgpt.py --init-from checkpoints/best.json --steps 0 --n-samples 3

Run it in your browser

The trained model is tiny (~465k params, ~1.9 MB as float32), so it runs entirely client-side with no server. web/ is not tracked in this repo (it's gitignored and built separately); uv run dp-export-web regenerates it locally from the current checkpoint. It is a self-contained static app: a classic Web Worker holds the weights and answers requests behind an OpenAI-shaped API (client.chat.completions.create({messages, stream}) returning standard chat.completion chunks with usage token counts). The chat UI seeds generation from your message and continues it in dead-president voice, with live token/throughput stats. It is not an assistant: a séance, not a chatbot.

uv run dp-export-web         # checkpoints/best.json -> web/model.bin + web/model.json
python3 web/serve.py         # serves web/ at http://localhost:8137 with caching OFF

Use web/serve.py, not python3 -m http.server — the plain server lets the browser cache the Worker/engine, which causes stale-mix bugs after you re-export or edit (you'll see 304s and no model.bin fetch). serve.py sends Cache-Control: no-store. If you ever do hit a stale page, hard-reload (Cmd/Ctrl+Shift+R).
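For readers curious what a no-cache static server involves, the sketch below adds the header with the standard library's http.server. It is an assumption about the general approach, not the contents of web/serve.py; the names `NoStoreHandler` and `make_server` are invented here.

```python
import http.server

class NoStoreHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that tells the browser never to cache responses."""

    def end_headers(self):
        # Runs for every response, just before the header block is closed.
        self.send_header("Cache-Control", "no-store")
        super().end_headers()

def make_server(port=8137):
    """Bind to localhost and serve the current working directory."""
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), NoStoreHandler)

# To run it from the directory you want to serve (blocks until interrupted):
#     make_server().serve_forever()
```

With no-store on every response, the browser has nothing cached to revalidate, so a re-exported model.bin is fetched fresh on the next load.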

Architecture knobs

Both engines share these. The scalar default is an 8,568-parameter model; the trained winner (below) has 464,928 parameters (n_embd=96, n_layer=4, block_size=192, continuous).

| flag | scalar default | meaning |
| --- | --- | --- |
| --n-embd | 24 | embedding width |
| --n-layer | 1 | transformer blocks |
| --n-head | 4 | attention heads (n_embd % n_head == 0) |
| --block-size | 32 | context length in characters |
| --lr | 0.02 | base learning rate (cosine-decayed) |
| --temperature | 0.8 | sampling temperature |

The fast backend adds --batch-size, --warmup, --min-lr-frac, and a held-out validation split reported as bits per character (val_bpc) — the metric the autoresearch loop minimizes.

Autoresearch: how we found good weights

The metric is val_bpc, bits per character on a held-out 10% of sentences. An untrained model sits at log2(34) ≈ 5.09. The original scalar README celebrated merely descending from there; the question is how low you can get, and what architecture gets you there.
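The metric and its untrained baseline are easy to reproduce by hand. The sketch below is illustrative only (the helper names are not taken from the repo): it computes bits per character from the probabilities a model assigned to the true characters, and converts between nats and bits.

```python
import math

def nats_to_bits(nats):
    """Convert a cross-entropy in nats (natural log) to bits (log base 2)."""
    return nats / math.log(2)

def bits_per_char(char_probs):
    """Mean negative log2-probability assigned to each true character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

vocab_size = 34
# An untrained model is roughly a uniform guess over the vocabulary.
uniform_bpc = bits_per_char([1.0 / vocab_size] * 10)
print(round(uniform_bpc, 2))          # 5.09
print(round(nats_to_bits(1.41), 2))   # 2.03
```

Lower is better: each bit shaved off halves the model's effective uncertainty about the next character.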

We ran a Karpathy-style sweep (dp-autoresearch): one metric, a fixed 3,000-step budget per experiment for fair ranking, keep-if-better. Seven parallel "researcher" agents each explored one region — width, depth, learning-rate schedule, batch size, context length, attention heads, and a wildcard combo — for 59 experiments total, then the top configs were re-run across 3 seeds to reject seed-luck (each seed also reshuffles the train/val split, so this is a strong robustness test).

What actually moved the needle:

  1. Warmup was the master key. At a fixed learning rate with no warmup, every wider / deeper / longer-context model looked worse — apparent collapse. That was an init-time instability artifact, not a capacity ceiling. A short 150–250-step warmup flipped every region's verdict and unlocked the rest.
  2. Context length is the biggest single lever. block_size 24 → 112 dropped val_bpc monotonically (~2.47 → 2.09), plateauing by ~112.
  3. Depth beats width at matched (~63k) parameters: narrow-and-deep (n_embd=40, n_layer=3) beat wide-and-shallow (n_embd=64, n_layer=1) by ~0.04 bpc — depth composes long-range features better than raw width here.
  4. The optimizer wants a small, fully-decayed LR (cosine to zero); nonzero floors and hot LRs both hurt.
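Findings 1 and 4 together describe a warmup-then-cosine schedule. A minimal sketch follows, assuming a linear warmup shape (the text above does not say which shape the repo uses); the function name `lr_at` is invented, and the default values are the winner's lr and warmup from this section.

```python
import math

def lr_at(step, total_steps, base_lr=0.018, warmup=250, min_lr_frac=0.0):
    """Linear warmup to base_lr, then cosine decay to base_lr * min_lr_frac."""
    if step < warmup:
        # Ramp up from a small value so early updates stay gentle (assumed linear).
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = base_lr * min_lr_frac
    return floor + (base_lr - floor) * cosine

for step in (0, 249, 1500, 3000):
    print(step, f"{lr_at(step, total_steps=3000):.5f}")
```

Setting min_lr_frac above zero gives the "nonzero floor" variant that the sweep found harmful.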

The winner (seed-robust mean val_bpc = 2.029 ± 0.011, ≈ 1.41 nats):

n_embd=40  n_layer=3  n_head=4  block_size=112  lr=0.018  batch_size=32  warmup=250
63,720 parameters

On the inaugural-only corpus this overfit: validation bottomed at step ~4,000 at val_bpc 2.04 and then rose while train-loss kept falling — textbook overfitting of a ~2,350-sentence training set.

Adding the State of the Union addresses fixed that

The original sweep ran before the corpus included SOTU. Re-running the same winner config on the full ~30k-sentence corpus, the overfitting simply disappears — validation descends monotonically the whole way and is still improving at the end:

step   500   val_bpc 2.65
step  2500   val_bpc 1.97   (already past the inaugural-only best)
step  5000   val_bpc 1.85
step 10000   val_bpc 1.72
step 15000   val_bpc 1.667  ← still the minimum at the last step

At 63,720 parameters this reaches val_bpc 1.667 (≈ 1.16 nats) but is still descending at step 15,000. The lesson is the data, not the dial: ~12× more text turned an overfit model into an underfit one. So the next lever is scale.

Re-sweeping for scale

A second sweep on the full corpus (dp-autoresearch scout grid, 4,000-step budget, then the top configs trained to 15,000) searched larger models. Scaling helps cleanly: e64/l4 at just 4k steps (1.58) already beat the e40/l3 winner (1.667 at 15,000 steps). The converged finalists:

| config | params | val_bpc | steps |
| --- | --- | --- | --- |
| e72 / l4 (winner) | 259,992 | 1.530 | 15,000 |
| e64 / l4 + batch 64 | 206,528 | 1.534 | 8,000 |
| e64 / l4 | 206,528 | 1.549 | 15,000 |

That e72/l4 model (val_bpc 1.530) stayed the best until v3. A v2 experiment added per-document "voice" conditioning plus more scale and regressed: the headers put the model out-of-distribution at eval and it overfit, so it was dropped. The v3 follow-up isolated the real remaining lever, continuous cross-sentence training. Re-running at e96/l4 with block_size 192, trained continuously with no conditioning, converged to val_bpc 1.475 (465k params), which is the current checkpoints/best.json.

The progression, in one line: 5.09 → 2.04 → 1.667 → 1.530 → 1.475 bits/char — untrained → inaugural-only winner → + State of the Union → scaled up → continuous long-context.

What the samples actually look like

1.475 bits/char (≈ 1.02 nats) is well below the old "real words" threshold of ~1.8 nats (≈ 2.6 bits/char). From the current checkpoints/best.json (e96/l4, continuous) at temperature 0.5, seeded by a prompt:

the economy → of the state of the union and last year the chinese means to the
              state of the progress of the solution.
my fellow americans → have been so much in exchange and attentioned political
              expenditures are in the fiscal year ending june...
we must → continue to be no progress with regard to the provision of the nation.

Not grammatical — but unmistakably presidential, spanning two centuries of State of the Union diction: fiscal year ending june, postmaster-general, expenditures, the state of the union, the soviet union. Continuous training gives noticeably cleaner runs (fewer made-up words) than the per-sentence v1. Lower --temperature (e.g. 0.4) for more conservative diction; raise it (0.7) for more invention.
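The effect of --temperature can be seen in a few lines. This is a generic sketch of temperature-scaled softmax sampling, not the repo's sampler; the function names and the three made-up logits are assumptions for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
cool = softmax_with_temperature(logits, 0.4)
warm = softmax_with_temperature(logits, 0.7)
token = sample_token(logits, 0.5, random.Random(0))
print(f"top-token probability: {cool[0]:.3f} at T=0.4, {warm[0]:.3f} at T=0.7")
```

A lower temperature concentrates probability on the highest-scoring character, which is why it yields more conservative diction and fewer invented words.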

Sampling through the scalar engine matches the fast backend to floating-point precision (logits agree to ~1.8e-14 on this checkpoint) but is impractically slow: minutes per character with a 192-token context and 4 layers, the "gloriously slow" exhibit taken to its logical end. Use dp-sample (the fast backend) for the trained model; keep the tiny scalar default for hand-traceable runs.

Layout

src/                       # the deadpres package (import name: src)
├── microgpt.py            # scalar Value-autograd GPT — dependency-free, self-contained, the readable exhibit
├── fast.py                # NumPy backend: FastGPT, Adam, train, sample
├── data.py                # Vocab, corpus load, train/val split, batching, bits/char metric
├── checkpoint.py          # shared JSON checkpoint save/load
├── corpus.py              # build the corpus (download/filter/normalize inaugural + SOTU)
├── research.py            # autoresearch sweep + results.tsv + parallel train-and-adopt
└── cli/                   # one thin entry point per dp-* command
web/                       # client-side browser app (Web Worker + OpenAI-shaped API)
data/                      # presidents.txt + raw/ (inaugural) + raw_sotu/ (State of the Union)
checkpoints/best.json      # the trained winner's weights (e96/l4 continuous, val_bpc 1.475)

Commands (installed by uv sync): dp-build-corpus, dp-train, dp-sample, dp-autoresearch, dp-optimize, dp-verify, dp-export-web. The scalar exhibit runs with plain Python: python src/microgpt.py.

Provenance & license

The presidential addresses are U.S. government works in the public domain. The code here is offered under the MIT license. Architecture and philosophy are indebted to Andrej Karpathy's micrograd / microGPT, and the search loop to his autoresearch.
