Single script to download and manage all 22 WHATWG specifications, optimized for LLM token consumption.
# Install dependencies
npm install
# Set your Anthropic API key
export ANTHROPIC_API_KEY="your-api-key-here"
# Download and optimize all 22 specs
node specs.js download
# Or use npm scripts
npm run download
# Check what's downloaded
npm run status
# Remove all specs
npm run clean

| Command | Description |
|---|---|
| node specs.js download | Download and optimize all 22 WHATWG specs |
| node specs.js status | Show which specs are downloaded with sizes and token counts |
| node specs.js list | List all 22 available specs with URLs |
| node specs.js clean | Remove all downloaded specs from working directory |
| node specs.js help | Show help message |
Or use npm scripts: npm run download, npm run status, npm run list, npm run clean
When you run node specs.js download:
- ✅ Creates temporary working directory in /tmp
- ✅ Downloads each spec HTML to /tmp (not your working directory)
- ✅ Converts HTML to markdown in /tmp using pandoc
- ✅ Uses opencode AI to intelligently optimize while preserving ALL technical content:
  - Removes table of contents
  - Removes references sections
  - Removes acknowledgments / acknowledgements
  - Removes intellectual property sections
  - Removes licensing / copyright sections
  - Removes index sections
  - Removes all {#anchor-id} and {.class} metadata
  - Removes section numbering [1.2.3]
  - Removes base64 images and SVG diagrams
  - Removes decorative elements
  - Preserves 100% of technical specifications, algorithms, examples, and definitions
- ✅ Saves optimized <spec>.md to working directory
- ✅ Automatically cleans up entire /tmp directory
Result: Only clean, optimized .md files with complete technical content in your working directory!
| Spec | Original HTML | Optimized MD | Reduction | Tokens (est) |
|---|---|---|---|---|
| html | ~14.0 MB | ~5.4 MB | ~61% | ~1,800,000 |
| webidl | ~2.5 MB | ~400 KB | ~84% | ~133,000 |
| streams | ~1.8 MB | ~375 KB | ~79% | ~125,000 |
| dom | ~2.9 MB | ~340 KB | ~88% | ~113,000 |
| fetch | ~1.9 MB | ~250 KB | ~87% | ~83,000 |
| encoding | ~450 KB | ~105 KB | ~77% | ~35,000 |
| url | ~710 KB | ~105 KB | ~85% | ~35,000 |
| urlpattern | ~350 KB | ~80 KB | ~77% | ~27,000 |
| infra | ~280 KB | ~72 KB | ~74% | ~24,000 |
| xhr | ~340 KB | ~70 KB | ~79% | ~23,000 |
| mimesniff | ~260 KB | ~53 KB | ~80% | ~18,000 |
| cookiestore | ~240 KB | ~50 KB | ~79% | ~17,000 |
| websockets | ~180 KB | ~33 KB | ~82% | ~11,000 |
| storage | ~110 KB | ~27 KB | ~75% | ~9,000 |
| quirks | ~90 KB | ~22 KB | ~76% | ~7,000 |
| fullscreen | ~95 KB | ~22 KB | ~77% | ~7,000 |
| notifications | ~90 KB | ~21 KB | ~77% | ~7,000 |
| console | ~90 KB | ~12 KB | ~87% | ~4,000 |
| compat | ~50 KB | ~10 KB | ~80% | ~3,000 |
| compression | ~40 KB | ~8 KB | ~80% | ~3,000 |
| fs | ~150 KB | ~5 KB | ~97% | ~2,000 |
| testutils | ~10 KB | ~1 KB | ~90% | ~300 |
| Metric | Before | After | Saved |
|---|---|---|---|
| Total Size | ~27.4 MB | ~8.5 MB | ~18.9 MB (69%) |
| Total Tokens | ~9.1M | ~2.8M | ~6.3M tokens (69%) |
Average reduction: ~70% across all specifications!
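The figures in the tables above can be reproduced with simple arithmetic. The sketch below assumes roughly 3 bytes of markdown per token, a ratio inferred from the table (for example ~400 KB to ~133,000 tokens) and not taken from specs.js; the names `BYTES_PER_TOKEN`, `estimateTokens`, and `reductionPercent` are hypothetical.

```javascript
// Assumption: about 3 bytes of markdown per token, inferred from the table
// above. Real tokenizers will give different counts.
const BYTES_PER_TOKEN = 3;

// Estimate the token count for a file of the given size in bytes.
function estimateTokens(bytes) {
  return Math.round(bytes / BYTES_PER_TOKEN);
}

// Percentage saved going from `before` to `after` (same unit for both).
function reductionPercent(before, after) {
  return Math.round((1 - after / before) * 100);
}

console.log(estimateTokens(400 * 1000));  // webidl.md at ~400 KB
console.log(reductionPercent(14.0, 5.4)); // html: ~14.0 MB down to ~5.4 MB
```

Because the ratio is only an estimate, treat the resulting token counts as a planning aid and not as exact billing figures.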
After running node specs.js download, you'll have these 22 files:
compat.md ~10 KB ~3K tokens
compression.md ~8 KB ~3K tokens
console.md ~12 KB ~4K tokens
cookiestore.md ~50 KB ~17K tokens
dom.md ~340 KB ~113K tokens
encoding.md ~105 KB ~35K tokens
fetch.md ~250 KB ~83K tokens
fs.md ~5 KB ~2K tokens
fullscreen.md ~22 KB ~7K tokens
html.md ~5.4 MB ~1.8M tokens
infra.md ~72 KB ~24K tokens
mimesniff.md ~53 KB ~18K tokens
notifications.md ~21 KB ~7K tokens
quirks.md ~22 KB ~7K tokens
storage.md ~27 KB ~9K tokens
streams.md ~375 KB ~125K tokens
testutils.md ~1 KB ~300 tokens
url.md ~105 KB ~35K tokens
urlpattern.md ~80 KB ~27K tokens
webidl.md ~400 KB ~133K tokens
websockets.md ~33 KB ~11K tokens
xhr.md ~70 KB ~23K tokens
Total: ~8.5 MB, ~2.8M tokens (down from ~27.4 MB, ~9.1M tokens)
✅ All specification content:
- All technical prose and definitions
- All algorithms and processing models
- All normative requirements
- All code examples and IDL interfaces
- Complete section hierarchy
- External reference links
✅ 100% specification quality - zero loss of technical content
❌ Non-specification sections:
- Table of contents
- References sections
- Acknowledgments / Acknowledgements
- Intellectual property sections
- Licensing / Copyright sections
- Index sections
❌ Metadata and formatting:
- All {#anchor-id} patterns
- All {.css-class} attributes
- All {x-internal="..."} metadata
- Section numbering [1.2.3]
- Base64-encoded images
- SVG diagrams
- Decorative separator lines
- Excessive whitespace
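To make the metadata patterns listed above concrete, here is a purely pattern-based sketch. It is not how the tool works (the optimization is delegated to an AI model); the function name `stripMetadata` and every regular expression are assumptions about what pandoc's markdown output might look like.

```javascript
// Illustrative only: the regexes below are assumptions, not code from specs.js.
function stripMetadata(markdown) {
  return markdown
    // Attribute blocks such as {#anchor-id}, {.css-class}, {x-internal="..."}
    .replace(/\{(?:[#.]|x-internal=)[^{}\n]*\}/g, "")
    // Section numbering like [1.2.3] at the start of a heading
    .replace(/^(#{1,6}[ \t]+)\[\d+(?:\.\d+)*\][ \t]*/gm, "$1")
    // Inline base64-encoded images
    .replace(/!\[[^\]]*\]\(data:image\/[^)]*\)/g, "")
    // Trailing spaces left behind by the removals above
    .replace(/[ \t]+$/gm, "")
    // Collapse runs of blank lines into a single blank line
    .replace(/\n{3,}/g, "\n\n");
}

const sample =
  "## [1.2.3]{.secno} Events {#events .heading}\n\n\n\n" +
  "![diagram](data:image/png;base64,AAAA)\n\n" +
  "Dispatch the event.\n";
console.log(stripMetadata(sample));
```

A regex pass like this cannot decide which sections are boilerplate, which is the part the AI step is described as handling.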
- 70% token reduction - fit more specs in context
- Lower API costs - pay for fewer tokens
- Faster processing - less data to parse
- Better context utilization - pure technical content
- Clean references - no metadata clutter
- Easy searching - pure specification prose
- Version control - efficient diffs
- Fast loading - smaller file sizes
- Readable markdown - clean formatting
- Complete content - all technical details
- Portable - standard markdown format
- Focused - specification content only
- curl - for downloading specs
- pandoc - for HTML to markdown conversion
- node - for running the script
- Anthropic API key - for Claude Sonnet 4.5 (1M context window)
macOS:
brew install pandoc node
npm install
Ubuntu/Debian:
sudo apt install curl pandoc nodejs npm
npm install
# List all 22 available specs
node specs.js list
# Download and optimize all specs (takes 5-15 minutes)
node specs.js download
# Your working directory stays clean during the process!
# All temporary work happens in /tmp
# Check what was downloaded with sizes and token counts
node specs.js status
# Use the optimized specs
cat dom.md | head -n 50
# Clean up when done
node specs.js clean
- Creates unique temp directory: /tmp/tmp.XXXXXX
- All HTML downloads go to /tmp (never your working directory)
- All intermediate markdown files stay in /tmp
- Automatic cleanup via trap on script exit
- Your working directory only receives final optimized .md files
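The isolation described above can be sketched in Node.js with a temporary directory that is always removed. This is an assumption about one way to do it, using try/finally in place of a shell trap; the helper name `withTempDir` and the "specs-" prefix are hypothetical and may not match specs.js.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Run `work(tmpDir)` inside a fresh temporary directory and always remove the
// directory afterwards, even if `work` throws.
function withTempDir(work) {
  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "specs-"));
  try {
    return work(tmpDir);
  } finally {
    fs.rmSync(tmpDir, { recursive: true, force: true });
  }
}

// Example: an intermediate file lives only inside the temporary directory.
const leftover = withTempDir((tmpDir) => {
  const htmlPath = path.join(tmpDir, "dom.html");
  fs.writeFileSync(htmlPath, "<h1>DOM</h1>");
  return htmlPath;
});
console.log(fs.existsSync(leftover)); // the file is gone once withTempDir returns
```

Keeping the cleanup in a `finally` block means a failed download or conversion still leaves nothing behind.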
- Download HTML to /tmp
- Convert with pandoc in /tmp
- Optimize with opencode SDK:
  - Uses Claude AI to intelligently analyze the specification
  - Removes boilerplate (TOC, references, acknowledgments, metadata)
  - Preserves ALL technical content (specs, algorithms, examples, definitions)
  - Smart detection of what's essential vs. removable
  - Context-aware optimization (not just pattern matching)
- Save optimized .md to working directory
- Clean up entire /tmp directory automatically
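The download and conversion steps above might look roughly like the following. This is a sketch under stated assumptions: the URL pattern `https://<name>.spec.whatwg.org/`, the helper names `specUrl`, `buildCommands`, and `downloadAndConvert`, and the exact curl and pandoc flags are illustrative, and the AI optimization step is left out.

```javascript
const path = require("node:path");
const { execFileSync } = require("node:child_process");

// Assumption: each spec is served from https://<name>.spec.whatwg.org/.
function specUrl(name) {
  return `https://${name}.spec.whatwg.org/`;
}

// Build the curl and pandoc invocations for one spec without running them.
function buildCommands(name, tmpDir) {
  const htmlPath = path.join(tmpDir, `${name}.html`);
  const mdPath = path.join(tmpDir, `${name}.md`);
  return [
    // -f: fail on HTTP errors, -sS: quiet but report errors, -L: follow redirects
    { cmd: "curl", args: ["-fsSL", specUrl(name), "-o", htmlPath] },
    // Convert the downloaded HTML to pandoc markdown
    { cmd: "pandoc", args: ["-f", "html", "-t", "markdown", htmlPath, "-o", mdPath] },
  ];
}

// Run the steps in order; execFileSync throws if a command exits non-zero.
function downloadAndConvert(name, tmpDir) {
  for (const { cmd, args } of buildCommands(name, tmpDir)) {
    execFileSync(cmd, args, { stdio: "inherit" });
  }
}
```

Calling `downloadAndConvert("dom", tmpDir)` would need network access plus curl and pandoc on the PATH; separating `buildCommands` makes the steps easy to inspect without running anything.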
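With 200K token context:
- Any single spec except HTML (the next largest, WebIDL, is ~133K tokens)
- OR: a few smaller specs together, such as Fetch + URL + Encoding (~153K tokens)
With 1M token context:
- All 21 specs other than HTML (about 690K tokens by the per-spec estimates above)
With 2M+ token context:
- The HTML spec (~1.8M tokens) on its own; all 22 specs together (~2.8M tokens) need a larger window or must be split across requests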
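To check which specs fit a given context window, a small budgeting helper can be useful. The sketch below copies a subset of the token estimates from the table earlier in this README; the function name `selectSpecs` and the 20,000-token reserve for the prompt and reply are assumptions, not part of specs.js.

```javascript
// Token estimates copied from the table above (a subset, for brevity).
const TOKENS = {
  html: 1800000, webidl: 133000, streams: 125000, dom: 113000,
  fetch: 83000, url: 35000, encoding: 35000, infra: 24000,
};

// Greedily keep the requested specs, in order, while they still fit.
// `reserve` leaves room for the prompt and the reply (arbitrary default).
function selectSpecs(wanted, contextTokens, reserve = 20000) {
  let remaining = contextTokens - reserve;
  const chosen = [];
  for (const name of wanted) {
    const cost = TOKENS[name];
    if (cost !== undefined && cost <= remaining) {
      chosen.push(name);
      remaining -= cost;
    }
  }
  return chosen;
}

console.log(selectSpecs(["html", "fetch", "url", "encoding", "dom"], 200000));
```

Specs that do not fit are skipped, not truncated, so the order of `wanted` acts as a priority list.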
- Processing all 22 specs takes 5-15 minutes (depending on connection)
- Requires active internet connection
- All temporary files automatically deleted from /tmp
- Safe to re-run download anytime to update specs
- clean command removes specs from working directory only
- No temporary files ever appear in your working directory
- Version: 2.0.0 (pure Node.js using opencode SDK for intelligent optimization)
The script downloads all 22 specs, but you can modify the SPECS array in specs.js to download only specific ones:
// Edit specs.js and modify the SPECS array
const SPECS = ["dom", "fetch", "url"]; // Only these 3
Use status command to see exactly what you have:
node specs.js status
Shows each spec with file size and estimated token count.
"pandoc: command not found"
brew install pandoc # macOS
sudo apt install pandoc # Linux
"curl: command not found"
# Curl is pre-installed on most systems
# If missing, install via package manager
sudo apt install curl # Linux
Failed download
- Check internet connection
- Verify spec.whatwg.org is accessible
- Check firewall settings
- Try again (network issues are often transient)
Not enough disk space in /tmp
- The script needs ~300MB free in /tmp
- Clean up /tmp manually if needed
- Downloads happen one at a time to minimize space usage
This tool processes publicly available WHATWG specifications.
Processed specifications retain their original WHATWG licenses:
- Most specs: Creative Commons Attribution 4.0 International License
- Code portions: BSD 3-Clause License
This tool itself:
- Use freely for any purpose
- No warranty provided
- Provided as-is
Ready to optimize? Run npm run download to get started!
Download all 22 WHATWG specifications, optimized and ready for LLM processing with 70% fewer tokens.
Complete list from https://spec.whatwg.org/ in alphabetical order.