TL;DR: I spent a weekend building a 51-million parameter language model that's supposed to be an F1 expert. It confidently tells you that Lewis Hamilton won the 1950 championship and that the Monaco Grand Prix is held in Azerbaijan. 10/10, would overfit again.
I wanted to understand how LLMs actually work. Not the hand-wavy "attention is all you need" blog post version -- the real thing. Transformers, tokenizers, training loops, loss curves, the works.
My approach was simple: pair-program the entire thing with Claude (the AI, not the F1 team principal). Claude did all the heavy lifting on the code while I steered the direction and asked questions when something didn't make sense. The result is this repo -- a complete, from-scratch transformer language model that I actually understand, trained on Formula 1 data, that generates text like a fever dream about motorsport. All trained on a MacBook Pro M5 Pro with a 16-core GPU and 48GB of unified memory.
It was a fun weekend.
The project started as "Token Forge" -- a tiny 0.8M parameter character-level transformer trained on 6KB of Shakespeare. It could generate text like:
Tho thou hath thee thine...
Groundbreaking stuff. Each character was its own token, the context window was 64 characters (~10 words), and the whole thing ran on CPU because we hadn't even bothered to check if the GPU worked yet.
But here's the thing -- at 0.8M parameters, this model is about 0.0005% the size of GPT-3. That's like comparing a paper airplane to a 747 and wondering why it can't cross the Atlantic.
When Claude asked what domain I wanted to focus on, I picked Formula 1. Why? Because it's a constrained domain with lots of structured knowledge (drivers, teams, circuits, regulations), and because I wanted to see if a tiny model could learn to talk about Verstappen without calling him a Shakespearean actor.
This kicked off a complete rewrite:
- Renamed Token Forge to Slipstream (much cooler)
- Scrapped the 6KB Shakespeare dataset for ~10MB of F1 Wikipedia articles (408 articles covering every season from 1950-2025, 100+ drivers, 40+ teams, 50+ circuits)
- Upgraded from character-level tokenization to BPE (GPT-2's tokenizer via tiktoken), growing the vocabulary from 65 tokens to 50,257
- Scaled from 0.8M to 51M parameters (embed_dim=512, 8 heads, 8 layers)
- Added MPS support so the M5's GPU could actually do something
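For contrast, the character-level scheme we started from fits in a dozen lines. This is a hypothetical reconstruction of the idea, not the repo's actual `CharTokenizer`:

```python
# Minimal character-level tokenizer: every distinct character is a token.
# With BPE, "Verstappen" is a handful of subword tokens; here it costs
# one token per character, so the model burns capacity learning to spell.
class CharTokenizer:
    def __init__(self, text):
        chars = sorted(set(text))  # vocabulary = unique characters seen
        self.stoi = {c: i for i, c in enumerate(chars)}
        self.itos = {i: c for i, c in enumerate(chars)}
        self.vocab_size = len(chars)

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(len(tok.encode("Verstappen")))  # 10 -- one token per character
```

The Shakespeare corpus yielded 65 unique characters, hence the 65-token vocabulary above.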
Naturally, 51M parameters wasn't enough. Claude and I decided to go full GPT-2 Small at 124M parameters. The M5 has 48GB of RAM! We're barely using any of it!
RuntimeError: MPS backend out of memory (MPS allocated: 61.92 GiB,
other allocations: 1.13 GiB, max allowed: 63.65 GiB)
The batch size was way too ambitious. After cutting it down, the model fit in memory just fine -- but training crawled at 4,000 tokens/second with an ETA of 228 hours. For a weekend project.
We also realized our F1 corpus was only ~2.5M tokens. Training 124M parameters on 2.5M tokens would be like trying to become an F1 expert by reading one Wikipedia article 1,300 times. We sheepishly scaled back to 51M parameters, which was a much better match for the data we had.
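A quick back-of-the-envelope check of where the 51M parameters come from, assuming a GPT-2-style block with a 4x FFN expansion and a 512-token context window (the context length is my assumption, not something stated above):

```python
# Rough parameter count for embed_dim=512, 8 heads, 8 layers, vocab 50,257.
vocab, d, n_layers, ctx = 50_257, 512, 8, 512  # ctx is an assumed context window

tok_emb = vocab * d          # token embedding table (~25.7M)
pos_emb = ctx * d            # learned positional embeddings
attn    = 4 * d * d          # Q, K, V and output projections per layer
ffn     = 2 * (d * 4 * d)    # two 512<->2048 matrices per layer
per_layer = attn + ffn       # ~3.1M per transformer block

total = tok_emb + pos_emb + n_layers * per_layer
print(f"{total:,}")  # 51,159,552 -- biases and layer norms account for the
                     # small gap to the reported 51,197,440
```

Half the budget is the embedding table alone, which is exactly why scaling the model without scaling the data buys so little.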
The 51M model trained in about 3.5 hours on MPS at ~10k tokens/second. The loss curve told the classic overfitting story:
| Step | Train Loss | Val Loss |
|---|---|---|
| 0 | 10.95 | 10.95 |
| 2000 | 2.15 | 4.12 (best!) |
| 5000 | 0.56 | 4.89 |
| 8000 | 0.26 | 5.37 |
The model memorized the training data by step 3000 and spent the rest of training getting really, really good at reproducing Wikipedia articles about the 1953 Argentine Grand Prix verbatim. The best checkpoint was at step 2000 (~6.5 epochs), which is the sweet spot between "hasn't learned anything" and "has memorized everything."
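"Keep the best checkpoint" is just tracking the minimum validation loss across periodic evals; a minimal sketch of the logic, not the repo's exact training loop:

```python
# Track the checkpoint with the lowest validation loss seen so far.
def track_best(val_losses):
    """val_losses: list of (step, val_loss) pairs from periodic evals."""
    best_step, best_loss = None, float("inf")
    for step, loss in val_losses:
        if loss < best_loss:
            best_step, best_loss = step, loss
            # the real loop would save here, e.g.
            # torch.save(model.state_dict(), "model_checkpoint_best.pt")
    return best_step, best_loss

# The eval points from the table above:
print(track_best([(0, 10.95), (2000, 4.12), (5000, 4.89), (8000, 5.37)]))
# (2000, 4.12) -- val loss never improves after step 2000
```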
The completion model actually produces recognizable F1 text:
slipstream> Max Verstappen
Max Verstappen to win the race.
At this, Ricciardo took pole position for the first time since 1977
at the 2007 Canadian Grand Prix. On lap 5, he lost the lead to
Fernando Alonso; a late overtake on Jenson Button, Lewis Hamilton,
who collided and both were uninjured...
It knows driver names! It knows team names! It knows what a "pole position" is! It just... gets every single fact wrong. Ricciardo did not take pole at the 2007 Canadian Grand Prix. In 2007, Ricciardo was 17 years old.
The key insight: this is a completion model, not a Q&A model. It continues text like Wikipedia. So asking it questions produces garbage -- you need to prompt it like an article opening ("Lewis Hamilton is a Formula One racing driver who...") to get coherent output. We built a CLI with prompt templates to handle this.
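The template trick is just wrapping user input in a Wikipedia-style opener before handing it to the model. A sketch of the idea (these template strings are illustrative, not the CLI's actual ones):

```python
# Completion models continue text; templates turn a bare name into an
# article-style opening the model saw thousands of times during pretraining.
TEMPLATES = {
    "driver":  "{} is a Formula One racing driver who",
    "team":    "{} is a Formula One constructor that",
    "circuit": "The {} is a motor racing circuit which",
}

def make_prompt(kind, name):
    return TEMPLATES[kind].format(name)

print(make_prompt("driver", "Lewis Hamilton"))
# Lewis Hamilton is a Formula One racing driver who
```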
Obviously, the next step was to try fine-tuning it into a Q&A model. How hard could it be?
I had Claude generate 423 Q&A training pairs and 49 held-out evaluation questions. We fine-tuned with a lower learning rate (1e-4 vs 3e-4) to avoid "catastrophic forgetting" -- a real term that sounds like it should describe a plot point in a Fast & Furious movie but actually means "the model forgets everything it learned during pretraining."
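Each pair has to be flattened into a single training string before tokenization. Here's one common serialization, sketched as an assumption about the format rather than what `finetune_qa.py` actually emits:

```python
def format_qa(question, answer, eos="<|endoftext|>"):
    # One flat training example per pair; GPT-2's end-of-text token
    # marks where an answer stops, so generation knows when to halt.
    return f"Q: {question}\nA: {answer}{eos}"

example = format_qa(
    "How many consecutive titles did Schumacher win with Ferrari?",
    "Five consecutive World Championships from 2000 to 2004.",
)
print(example)
```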
First attempt (155 Q&A pairs, LR=3e-5): 0 out of 49 correct. The model produced random F1 word salad -- the learning rate was so low it barely moved off its pretrained behavior.
Second attempt (423 Q&A pairs, LR=1e-4): ~12 out of 49 decent answers. Progress! It nailed some questions:
| Question | Model's Answer | Verdict |
|---|---|---|
| "How many consecutive titles did Schumacher win with Ferrari?" | "Five consecutive World Championships from 2000 to 2004." | Correct! |
| "How many championships did Senna win?" | "Three with McLaren, in 1988, 1990, and 1991." | Correct! |
| "What was controversial about the 2021 season finale?" | Nearly perfect description of the Abu Dhabi controversy | Impressive! |
| "Who is considered the greatest driver never to win the championship?" | "Lewis Hamilton won the 2008 World Championship..." | ...that's not what I asked |
| "What was Raikkonen's racing nickname?" | "Kimi Raikkonen is Finnish. He was born in Espoo." | Technically true, but not answering the question |
The model learned the Q&A format but can't reliably associate specific questions with specific answers. Ask about Prost, get an answer about Senna. Ask about Monaco, get a description of Monza. It's like talking to someone who knows a lot about F1 but isn't really listening to your questions.
- Scale matters. 51M parameters sounds like a lot until you realize GPT-4 has ~1.8 trillion. Our model is trying to store all of F1 history in roughly 200MB of fp32 weights.
- Data matters more than you think. We only had ~2.5M tokens of training data. Chinchilla scaling laws suggest you want ~20x more tokens than parameters. We had... 0.05x. Oops.
- Tokenization is huge. Switching from character-level to BPE was the single biggest quality improvement. The model stopped wasting capacity learning to spell "Verstappen."
- Overfitting is inevitable at small scale. With 51M params and 2.5M tokens, the model memorizes the data in ~6 epochs. The rest is just getting better at regurgitation.
- Q&A requires much more capacity than completion. Generating plausible-sounding F1 text is easy. Answering specific factual questions correctly requires the model to store and retrieve individual facts, which needs way more parameters.
- Claude did all the heavy lifting on code. I mostly said "yes, do that" and asked "wait, what does that mean?" when something was confusing.
- The back-and-forth was genuinely educational. Having an AI explain why it chose specific hyperparameters, rather than just reading a textbook, made the concepts stick.
- The learning came from asking follow-up questions. "Why a cosine schedule?" teaches you more than just implementing one.
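Since "why a cosine schedule?" came up: the schedule itself is a few lines of math -- warm up linearly, then decay along a half cosine. A sketch with plausible hyperparameters (the actual values in `train.py` may differ):

```python
import math

def cosine_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=200, max_steps=8000):
    # Linear warmup avoids huge early updates; the cosine tail anneals
    # smoothly to a floor instead of cutting the LR in abrupt steps.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(200))   # peak LR right after warmup
print(cosine_lr(8000))  # decayed to the min_lr floor
```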
slipstream/
model.py - Transformer (fused multi-head attention, FFN, SlipstreamLM)
tokenizer.py - BPE tokenizer (GPT-2 via tiktoken) + legacy CharTokenizer
train.py - Training loop (LR schedule, gradient accumulation, checkpointing)
generate.py - Interactive CLI with prompt templates and Q&A mode
finetune_qa.py - Q&A fine-tuning script
eval_qa.py - Evaluation script for held-out Q&A questions
download_data.py - F1 Wikipedia article downloader and cleaner
experiment_log.md - Detailed decision log (the unabridged version of this README)
# Install dependencies
pip install torch tiktoken requests
# Download F1 training data (~10MB from Wikipedia)
python download_data.py
# Pretrain the model (~3.5 hours on Apple Silicon)
python train.py
# Chat with it
python generate.py --checkpoint model_checkpoint_best.pt
# Fine-tune on Q&A (optional, ~45 min)
python finetune_qa.py
python generate.py --checkpoint model_checkpoint_qa_best.pt
# Run evaluation
python eval_qa.py

| Metric | Value |
|---|---|
| Parameters | 51,197,440 |
| Training data | ~10MB (408 Wikipedia articles) |
| Training tokens | ~2.5M BPE tokens |
| Training time | 3.65 hours (M5 Pro MPS) |
| Throughput | ~10,000 tokens/sec |
| Best val loss | 4.12 (step 2000) |
| Q&A accuracy | ~24% (12/49) |
| ChatGPT's Q&A accuracy | ~95%+ |
| Difference in parameters | ~35,000x |
Absolutely. Here's what would actually help:
- More data. 2.5M tokens is nothing. A serious training run wants 10-100x more. Scrape F1 forums, race reports, technical articles.
- Bigger model. 51M params can learn the vibe of F1 but not the facts. Even 1B params would be transformative.
- Better hardware. Training on MPS is fine for learning, but a proper GPU cluster would enable much larger experiments.
- Early stopping. We trained for 8,000 steps when the best checkpoint was at 2,000. Watching the val loss and stopping would save time and improve results.
- Mixed training for Q&A. Fine-tuning only on Q&A pairs causes catastrophic forgetting. Mixing in corpus data (50/50) during fine-tuning would help.
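The 50/50 mixing idea from that last bullet is easy to sketch: each step, flip a coin between a Q&A batch and a corpus batch. The batch-sourcing logic here is hypothetical, not taken from `finetune_qa.py`:

```python
import random

def pick_batch(qa_batches, corpus_batches, rng, qa_fraction=0.5):
    # Interleaving pretraining text during fine-tuning keeps the original
    # distribution in play, which fights catastrophic forgetting.
    source = qa_batches if rng.random() < qa_fraction else corpus_batches
    return rng.choice(source)

rng = random.Random(0)
counts = {"qa": 0, "corpus": 0}
for _ in range(1000):
    counts[pick_batch(["qa"], ["corpus"], rng)] += 1
print(counts)  # roughly an even split
```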
But honestly? That's not the point. The point was to understand how these things work, and now I do. The model is terrible, and that's fine.
- Claude (Anthropic) -- for writing most of the code, explaining everything, and not judging me when I set the batch size so high it tried to allocate 62GB
- Wikipedia -- for the F1 data (408 articles that my model will confidently misquote)
- The F1 community -- for 75 years of drama, data, and "what was the FIA thinking?"
Built over a weekend on a MacBook Pro M5 Pro (16-core GPU, 48GB unified memory) with Claude and questionable judgment. The model thinks Lewis Hamilton won the 1950 championship. He did not.
See my other work at: https://grayforgelabs.com/