Wiki Builder is a specialized toolkit designed to create raw material for the Karathy LLM Wiki. Its primary goal is to provide high-quality, structured, and clean learning materials for AI agents by crawling and converting online documentation into LLM-friendly Markdown format.
- Specialized Crawlers: Support for 15+ documentation sites and blogs (e.g., linux.die.net, nextjs.org, ruanyifeng.com) using Crawlee.
- Intelligent Cleaning: Automated framework detection (Hexo, Docusaurus, Next.js, etc.) and content cleaning for high-quality Markdown.
- Zero-Dependency Converter: The html2text utility transforms complex HTML into clean Markdown.
- Anti-Crawling Strategies: Integrated rate limiting, concurrency control, and browser-like headers.
- Structured Data: Automatically organizes crawled content into a hierarchical directory structure.
- Modern Tech Stack: Built with Next.js, TypeScript, and Tailwind CSS.
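As a rough illustration of the framework detection mentioned above, the sketch below guesses the site generator from raw HTML. It is not the project's actual implementation: `detectFramework`, `generatorOf`, and the `Framework` type are hypothetical names, and the heuristics (a `generator` meta tag for Hexo and Docusaurus, `__NEXT_DATA__` or `/_next/static/` markers for Next.js) are assumptions about what such a detector might check.

```typescript
// Hypothetical helper names; the README does not show the real detection code.
type Framework = "hexo" | "docusaurus" | "nextjs" | "unknown";

// Return the lower-cased content of <meta name="generator" ...>, or "" if absent.
function generatorOf(html: string): string {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    if (/name=["']generator["']/i.test(tag)) {
      const content = /content=["']([^"']*)["']/i.exec(tag);
      if (content) return content[1].toLowerCase();
    }
  }
  return "";
}

function detectFramework(html: string): Framework {
  const generator = generatorOf(html);
  // Assumption: Hexo and Docusaurus announce themselves in the generator meta tag.
  if (/^hexo/.test(generator)) return "hexo";
  if (/^docusaurus/.test(generator)) return "docusaurus";
  // Assumption: Next.js pages embed __NEXT_DATA__ or load assets from /_next/static/.
  if (/__NEXT_DATA__|\/_next\/static\//.test(html)) return "nextjs";
  return "unknown";
}
```

A cleaner could then branch on the result, for example to pick a framework-specific content selector before conversion to Markdown.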
wiki-builder/
├── crawler/ # Crawler scripts for various sites
├── utils/
│ ├── html2text/ # HTML-to-Markdown converter
│ └── clean/ # Post-processing and cleaning utilities
├── data/ # Generated Markdown files (ignored)
├── plans/ # Implementation plans and design docs
├── storage/ # Crawlee local storage (ignored)
├── app/ # Next.js frontend for browsing the wiki
└── public/ # Static assets
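To make the hierarchical layout under data/ more concrete, here is a minimal sketch of how a crawled URL could be mapped to a Markdown file path. The function name `urlToMarkdownPath` and the exact layout (`data/<hostname>/<path>.md`) are assumptions for illustration; the actual naming scheme used by the crawlers is not described in this README.

```typescript
// Hypothetical mapping from a crawled URL to a Markdown path under data/.
function urlToMarkdownPath(rawUrl: string, rootDir = "data"): string {
  const url = new URL(rawUrl);
  const segments = url.pathname
    .split("/")
    .filter((segment) => segment.length > 0)
    // Replace characters that are awkward in file names with underscores.
    .map((segment) => segment.replace(/[^\w.-]+/g, "_"));

  // Treat an empty path or a trailing slash as a directory index page.
  if (segments.length === 0 || /\/$/.test(url.pathname)) segments.push("index");

  // Swap a source extension such as .html for .md.
  const last = segments.length - 1;
  segments[last] = segments[last].replace(/\.(html?|php)$/i, "") + ".md";

  return [rootDir, url.hostname, ...segments].join("/");
}

// Example: a trailing-slash docs URL becomes an index.md inside a mirrored directory.
const examplePath = urlToMarkdownPath("https://nextjs.org/docs/app/getting-started/");
```

Mirroring the URL path keeps related pages in the same directory, which makes the output easy to browse and to feed to an agent section by section.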
- Node.js (v20 or later)
- npm, pnpm, or yarn
npm install
# or
pnpm install
To run a specific crawler (e.g., ruanyifeng.com):
npm run crawl:ruanyifeng
Or run any crawler script directly:
npm run crawl crawler/some-site.ts
Note: Available crawler scripts are located in the crawler/ directory.
Note: You can configure maxRequestsPerCrawl in the crawler scripts for testing.
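For a quick test run, the limits might be collected in one place, as in the sketch below. It assumes the crawler scripts construct a Crawlee crawler such as `CheerioCrawler`, whose options include `maxRequestsPerCrawl`, `maxConcurrency`, and `maxRequestsPerMinute`. The `crawlLimits` helper, the `CrawlLimits` interface, and the numeric values are hypothetical and only illustrate the idea.

```typescript
// Sketch only: a plain options object shaped like a subset of Crawlee's crawler options.
interface CrawlLimits {
  maxRequestsPerCrawl?: number; // stop after this many requests (handy for testing)
  maxConcurrency: number;       // upper bound on parallel requests
  maxRequestsPerMinute: number; // simple rate limit
}

// Hypothetical helper: polite defaults, plus a hard cap for test runs.
function crawlLimits(testRun: boolean): CrawlLimits {
  const polite: CrawlLimits = { maxConcurrency: 2, maxRequestsPerMinute: 60 };
  return testRun ? { ...polite, maxRequestsPerCrawl: 20 } : polite;
}

// In a crawler script this might be spread into the crawler options, e.g.
//   new CheerioCrawler({ ...crawlLimits(true), requestHandler })
const testLimits = crawlLimits(true);
```

Leaving `maxRequestsPerCrawl` unset for full runs lets the crawl continue until the request queue is empty.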
npm run dev
Open http://localhost:3000 to browse your local wiki.
Located in utils/html2text, this is a standalone TypeScript library that converts an HTML AST into a Markdown AST and then serializes that AST to Markdown text. It handles:
- Tables
- Nested lists
- Code blocks (with syntax highlighting support)
- Definition lists
- Images and Links
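The two-stage design (HTML AST to Markdown AST, then serialization) can be sketched in a few lines. This is a toy version under stated assumptions, not the library's real code: the node shapes `HtmlNode` and `MdNode` and the functions `toMd` and `serialize` are hypothetical, and only headings, paragraphs, links, and inline code are covered.

```typescript
// Illustrative node shapes; the library's real AST types are not shown in this README.
type HtmlNode =
  | { type: "text"; value: string }
  | { type: "element"; tag: string; attrs?: Record<string, string | undefined>; children: HtmlNode[] };

type MdNode =
  | { type: "text"; value: string }
  | { type: "heading"; depth: number; children: MdNode[] }
  | { type: "paragraph"; children: MdNode[] }
  | { type: "link"; url: string; children: MdNode[] }
  | { type: "inlineCode"; value: string };

function textOf(node: HtmlNode): string {
  return node.type === "text" ? node.value : node.children.map(textOf).join("");
}

function convertAll(nodes: HtmlNode[]): MdNode[] {
  const out: MdNode[] = [];
  for (const node of nodes) out.push(...toMd(node));
  return out;
}

// Stage 1: HTML AST -> Markdown AST.
function toMd(node: HtmlNode): MdNode[] {
  if (node.type === "text") return [{ type: "text", value: node.value }];
  const heading = /^h([1-6])$/.exec(node.tag);
  if (heading) return [{ type: "heading", depth: Number(heading[1]), children: convertAll(node.children) }];
  switch (node.tag) {
    case "p":
      return [{ type: "paragraph", children: convertAll(node.children) }];
    case "a":
      return [{ type: "link", url: (node.attrs ?? {}).href ?? "", children: convertAll(node.children) }];
    case "code":
      return [{ type: "inlineCode", value: textOf(node) }];
    default:
      // Unknown wrappers (div, span, ...) are unwrapped and their children kept.
      return convertAll(node.children);
  }
}

// Stage 2: Markdown AST -> Markdown text.
function serialize(nodes: MdNode[]): string {
  return nodes.map(serializeNode).join("");
}

function serializeNode(node: MdNode): string {
  switch (node.type) {
    case "text":
      return node.value;
    case "heading":
      return "######".slice(0, node.depth) + " " + serialize(node.children) + "\n\n";
    case "paragraph":
      return serialize(node.children) + "\n\n";
    case "link":
      return "[" + serialize(node.children) + "](" + node.url + ")";
    case "inlineCode":
      return "`" + node.value + "`";
  }
}

// Usage: a small hand-built HTML tree converted to Markdown.
const html: HtmlNode = {
  type: "element", tag: "div", children: [
    { type: "element", tag: "h2", children: [{ type: "text", value: "Install" }] },
    { type: "element", tag: "p", children: [
      { type: "text", value: "Run " },
      { type: "element", tag: "code", children: [{ type: "text", value: "npm install" }] },
      { type: "text", value: " or see " },
      { type: "element", tag: "a", attrs: { href: "https://nextjs.org/docs" }, children: [{ type: "text", value: "the docs" }] },
      { type: "text", value: "." },
    ] },
  ],
};
const markdown = serialize(toMd(html)).trim();
```

Keeping an intermediate Markdown AST separates "what the content is" from "how it is printed", so features such as tables or nested lists can be added as new node types without touching the HTML traversal.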
Licensed under the MIT License.