Second-semester core course in the Columbia Journalism School Data Journalism MS program. This course teaches students to work with datasets that require more than pandas and a laptop: government databases with millions of records, document dumps from investigations, and long-term data projects that multiple journalists need to access.
Under active development and so, so, so much of the writing is lazily done via Claude Code. Have I verified that any of these investigations it's talking about exist? Absolutely not!
- 📚 Curriculum: Week-by-week topics, concepts, and skills
- 📝 Assignments: Progressive exercises with Foundation/Extension/Innovation tiers
- 🔧 Tech Stack: All tools and technologies with documentation links
- 📖 Readings: Investigations and methodologies for each week
- 🏃 Speed Run: Self-study guide for learning the tech stack
Weeks 1-2: Large, Large Databases
- "Look, you just queried 10 million rows on your laptop"
- DuckDB, PostgreSQL
- Assignment • Readings • Tech
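To give a flavor of the Weeks 1-2 "aha" moment, here's a minimal sketch of querying a large CSV directly with DuckDB from Python. The file name and column names are hypothetical placeholders, not course data.

```python
# Minimal sketch: aggregating a multi-million-row CSV on a laptop with DuckDB.
# The file name and columns ("federal_contracts.csv", agency, amount) are invented.
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB scans the CSV directly; no need to load it into pandas first.
top_agencies = con.execute("""
    SELECT agency, COUNT(*) AS contracts, SUM(amount) AS total_spent
    FROM read_csv_auto('federal_contracts.csv')
    GROUP BY agency
    ORDER BY total_spent DESC
    LIMIT 10
""").df()

print(top_agencies)
```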
Weeks 3-4: Cloud Infrastructure
- "You just shared a giant database with zero configuration"
- Datasette Cloud, Backblaze B2, DigitalOcean
- Assignment • Readings • Tech
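A rough sketch of the Weeks 3-4 workflow, assuming you start from a list of records in Python: bundle them into a SQLite file with sqlite-utils, then serve or publish it with Datasette so colleagues can browse and query it in a browser. The rows below are invented placeholders.

```python
# Build a SQLite database that Datasette can serve; the example rows are made up.
import sqlite_utils

db = sqlite_utils.Database("contracts.db")
db["contracts"].insert_all(
    [
        {"id": 1, "agency": "DOT", "vendor": "Acme Paving", "amount": 1_250_000},
        {"id": 2, "agency": "HHS", "vendor": "MedSupply Co", "amount": 480_000},
    ],
    pk="id",
)

# From here the file can be served locally ("datasette contracts.db") or
# uploaded to a hosted instance such as Datasette Cloud for sharing.
```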
Weeks 5-6: Long-term data projects
- "Your scraper ran automatically while you slept"
- GitHub Actions scrapers, versioning, documenting
- Assignment • Readings • Tech
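The Weeks 5-6 idea in miniature: a small Python script that a scheduled GitHub Actions workflow could run (say, on a daily cron) and commit the output back to the repo, so the git history becomes the archive. The URL and field names here are hypothetical.

```python
# Sketch of a "runs while you sleep" scraper; the API URL and fields are invented.
import csv
import datetime

import requests

response = requests.get("https://example.gov/api/inspections.json", timeout=30)
response.raise_for_status()
records = response.json()

# Append today's snapshot; the GitHub Actions workflow would commit this file,
# turning each scheduled run into a versioned record of the data over time.
with open("inspections.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for record in records:
        writer.writerow(
            [datetime.date.today().isoformat(), record["facility"], record["score"]]
        )
```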
Weeks 7-8: Collaborative Investigation
- "Cross-newroom, cross-border, cross-language, cross-everything investigations"
- OpenAleph, DocumentCloud, Datasette
- Assignment • Readings • Tech
Week 9: Graph Databases
- "Connecting the dots of people and companies"
- Neo4j, Cypher
- Assignment • Readings • Tech
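A toy illustration of Week 9: using the Neo4j Python driver to run a Cypher query that walks ownership links between people and companies. The connection details, labels, and property names are placeholders for whatever graph you load in class.

```python
# Connect to a local Neo4j instance and traverse Person -> Company -> Jurisdiction
# relationships. Labels, relationship types, and credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person)-[:OWNS]->(c:Company)-[:REGISTERED_IN]->(j:Jurisdiction)
WHERE j.name = 'Panama'
RETURN p.name AS owner, c.name AS company
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["owner"], "->", record["company"])

driver.close()
```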
Weeks 10-11: Public-Facing Tools
- "Your investigation tool is live on the internet"
- Flask, Jinja2, render.com
- Assignment • Readings • Tech
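A minimal Flask sketch for Weeks 10-11: a search form over a SQLite table, rendered with an inline Jinja2 template. The database and column names are invented for illustration; a real app would use a template file and, on Render.com, be served by a production server such as gunicorn.

```python
# Tiny searchable front end over a SQLite database; table and columns are made up.
import sqlite3

from flask import Flask, render_template_string, request

app = Flask(__name__)

TEMPLATE = """
<form><input name="q" value="{{ q }}" placeholder="Search vendors"></form>
<ul>{% for row in rows %}<li>{{ row["vendor"] }}: ${{ row["amount"] }}</li>{% endfor %}</ul>
"""

@app.route("/")
def search():
    q = request.args.get("q", "")
    conn = sqlite3.connect("contracts.db")
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT vendor, amount FROM contracts WHERE vendor LIKE ?", (f"%{q}%",)
    ).fetchall()
    return render_template_string(TEMPLATE, q=q, rows=rows)

if __name__ == "__main__":
    app.run(debug=True)
```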
Weeks 12-13: AI Document Processing
- "AI just read 100 documents in 30 seconds"
- LLMs via API, NotebookLM, LM Studio
- Assignment • Readings • Tech
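A sketch of the Weeks 12-13 pattern: send a document to an LLM over an API and ask for structured facts back. This example uses the OpenAI Python client as one possibility; the model name, prompt, and file are placeholders, and any provider with an API (or a local model served through LM Studio) could stand in.

```python
# Ask an LLM to pull structured facts out of one document; model and file are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("inspection_report.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the facility name, inspection date, and violations as JSON."},
        {"role": "user", "content": document},
    ],
)

print(response.choices[0].message.content)
```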
Week 14: Sustainability & Handoffs
- "You're free!"
- GitHub Actions, Documentation
- Assignment • Readings