Professional scraping tool for GitBook sites - Handles versioned documentation, API discovery, and dynamic content. Built specifically for GitBook's unique architecture.
GitBook is Different:
- ✅ JSON-based navigation (manifest.json, summary.json)
- ✅ API-driven content loading
- ✅ Version management and spaces
- ✅ Client-side rendering
- ✅ Different HTML structure than Mintlify
This Tool Handles:
- ✅ Multiple discovery methods (GitBook Content API, API manifest, summary.json, sitemap, crawling) - see the sketch after this list
- ✅ Deep sidebar crawling - Recursively discovers all pages from navigation
- ✅ GitBook-specific selectors (.markdown-section, .page-inner)
- ✅ Metadata extraction (space ID, page ID, last modified)
- ✅ Versioned documentation support
- ✅ MCP tool generation
- ✅ Code example extraction with filename support
- ✅ Configurable crawl depth - Control how deep to follow links
- ✅ JSON-LD extraction - Discovers pages from structured data
- ✅ Automatic zip archiving - Create compressed archives of scraped docs
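In practice, discovery boils down to "ask the structured sources first, crawl last". The following is a minimal TypeScript sketch of that idea, not the scraper's actual implementation; it assumes Node 18+ (built-in fetch) and that the site exposes summary.json and/or sitemap.xml:

```typescript
// Sketch: discover page URLs from summary.json, falling back to sitemap.xml.
// Field names inside summary.json vary by site, so the walk is deliberately loose.
async function discoverPages(baseUrl: string): Promise<string[]> {
  // 1. Try GitBook's JSON navigation
  const summaryRes = await fetch(new URL('summary.json', baseUrl));
  if (summaryRes.ok) {
    const urls: string[] = [];
    const walk = (node: unknown): void => {
      if (Array.isArray(node)) return node.forEach(walk);
      if (node && typeof node === 'object') {
        const { path, url, ...rest } = node as Record<string, unknown>;
        const href = typeof url === 'string' ? url : typeof path === 'string' ? path : undefined;
        if (href) urls.push(new URL(href, baseUrl).href);
        Object.values(rest).forEach(walk);
      }
    };
    walk(await summaryRes.json());
    if (urls.length) return urls;
  }

  // 2. Fall back to sitemap.xml
  const sitemapRes = await fetch(new URL('sitemap.xml', baseUrl));
  if (sitemapRes.ok) {
    const xml = await sitemapRes.text();
    return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
  }

  // 3. Last resort: start at the root and let recursive sidebar crawling discover links
  return [baseUrl];
}
```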
Note: While GitBook offers PDF export (/~gitbook/pdf?page=...), this scraper extracts structured markdown, code examples, and metadata - far more useful for AI agents and programmatic access than PDFs.
npm install

# Scrape Sentry docs
npm run scrape -- https://docs.sentry.io --output ./output/sentry
# Scrape Stripe docs
npm run scrape -- https://docs.stripe.com --output ./output/stripe
# Scrape Gitcoin docs
npm run scrape -- https://docs.gitcoin.co --output ./output/gitcoin

# Conservative rate limiting for large sites
npm run scrape -- https://docs.sentry.io \
--output ./output/sentry \
--concurrent 2 \
--delay 2000
# Deep crawling with recursive link following (default)
npm run scrape -- https://docs.monad.xyz \
--output ./output/monad \
--crawl-depth 5
# Disable recursive crawling (API discovery only)
npm run scrape -- https://docs.stripe.com \
--output ./output/stripe \
--no-follow-links
# Create a zip archive of the scraped documentation
npm run scrape -- https://docs.example.com \
--output ./output/example \
--zip
# Use headless browser for JavaScript-heavy sites
npm run scrape -- https://docs.stripe.com \
--output ./output/stripe \
--use-browser
# Scrape specific version
npm run scrape -- https://docs.api.com \
--output ./output/api-v2 \
--version v2
# Complete deep crawl with all sidebar pages and zip
npm run scrape -- https://docs.example.com \
--output ./output/example \
--crawl-depth 10 \
--follow-links \
--concurrent 3 \
--zip

Documentation:

- Deep Crawling Guide - Complete guide to sidebar crawling and recursive discovery
- Output Formats - Different output formats for various use cases
- Zip Archive Guide - How to create and use compressed archives
- Examples - Real-world scraping examples (Sentry, Stripe, etc.)
Output structure:

output/sentry/
├── COMPLETE.md # All docs in one file (AI-optimized!)
├── INDEX.md # Navigation index
├── metadata.json # Structured metadata with GitBook IDs
├── mcp-tools.json # Auto-generated MCP tools
├── examples/ # Extracted code examples
│ ├── javascript-examples.md
│ ├── python-examples.md
│ ├── curl-examples.md
│ └── INDEX.md
└── api/ # Individual API pages
├── events.md
├── projects.md
└── ...
Use the --zip flag to automatically create a compressed archive:
npm run scrape -- https://docs.example.com --output ./docs --zip

Output:
- Creates the documentation in ./docs/
- Automatically generates docs.zip in the same parent directory
- Displays archive size after creation
- Perfect for sharing or archiving documentation (see the archiving sketch below)
Benefits:
- 📦 Easy distribution and sharing
- 💾 Compressed storage (typically 60-80% size reduction)
- 📤 Quick upload/download
- 🔒 Single-file archiving for backup
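For reference, here is one way the --zip step could be implemented with the archiver package - a hedged sketch matching the behavior described above (zip next to the output folder, size printed afterwards), not necessarily the code this tool uses:

```typescript
// Sketch: zip a scraped output directory with the archiver package.
// Assumes `npm install archiver`; uses Node's built-in fs/path modules.
import { createWriteStream } from 'node:fs';
import path from 'node:path';
import archiver from 'archiver';

async function zipOutput(outputDir: string): Promise<string> {
  const zipPath = path.join(path.dirname(outputDir), `${path.basename(outputDir)}.zip`);
  const archive = archiver('zip', { zlib: { level: 9 } }); // maximum compression
  const stream = createWriteStream(zipPath);

  await new Promise<void>((resolve, reject) => {
    stream.on('close', resolve);
    archive.on('error', reject);
    archive.pipe(stream);
    archive.directory(outputDir, false); // add folder contents at the archive root
    archive.finalize();
  });

  console.log(`Created ${zipPath} (${(archive.pointer() / 1024 / 1024).toFixed(1)} MB)`);
  return zipPath;
}
```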
Example metadata.json:

{
"baseUrl": "https://docs.sentry.io",
"scrapedAt": "2025-11-24T...",
"totalPages": 234,
"sections": ["api", "platforms", "product"],
"pages": [
{
"title": "Event Ingestion",
"url": "https://docs.sentry.io/api/events/",
"section": "api",
"hasApi": true,
"metadata": {
"spaceId": "xxx",
"pageId": "yyy",
"lastModified": "2025-11-20T..."
}
}
]
}
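Because the metadata is plain JSON, it is easy to query programmatically. A small sketch (the path and field names follow the metadata.json layout shown above) that lists every page flagged as API documentation:

```typescript
// Sketch: read metadata.json and list the pages flagged as API reference pages.
import { readFileSync } from 'node:fs';

interface PageMeta {
  title: string;
  url: string;
  section: string;
  hasApi: boolean;
}

const metadata = JSON.parse(readFileSync('./output/sentry/metadata.json', 'utf8'));
const apiPages = (metadata.pages as PageMeta[]).filter((p) => p.hasApi);

console.log(`${apiPages.length} of ${metadata.totalPages} pages document an API:`);
for (const page of apiPages) {
  console.log(`- [${page.section}] ${page.title} -> ${page.url}`);
}
```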
# 1. Scrape Sentry documentation
npm run scrape -- https://docs.sentry.io --output ./output/sentry
# 2. Generate MCP tools
npm run generate-mcp -- ./output/sentry
# 3. Review generated tools
cat ./output/sentry/mcp-tools/mcp-tools.json
# 4. Implement MCP server using the definitions
# (All API signatures, parameters, and examples are ready!)
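The generated definitions are plain JSON Schema, so a hand-written MCP server can validate incoming tool arguments directly. A minimal sketch using the ajv validator - an illustration only, since this tool generates definitions but does not ship a server:

```typescript
// Sketch: load mcp-tools.json and validate a tool call's arguments with ajv.
// Assumes `npm install ajv` and that mcp-tools.json is an array of tool definitions.
import { readFileSync } from 'node:fs';
import Ajv from 'ajv';

const tools = JSON.parse(readFileSync('./output/sentry/mcp-tools/mcp-tools.json', 'utf8'));
const ajv = new Ajv();

function validateToolCall(name: string, args: unknown): void {
  const tool = tools.find((t: { name: string }) => t.name === name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);

  const validate = ajv.compile(tool.inputSchema);
  if (!validate(args)) {
    throw new Error(`Invalid arguments for ${name}: ${ajv.errorsText(validate.errors)}`);
  }
}

// Example: passes only when both required fields are present
validateToolCall('api_create_event', { project_id: '42', event_type: 'error' });
```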
# Extract all code examples
npm run extract-examples -- ./output/sentry
# Filter by language
npm run extract-examples -- ./output/sentry --language python
# Output: organized by language in examples/

# Scrape complete docs for offline use
npm run scrape -- https://docs.stripe.com --output ./offline/stripe
# Use COMPLETE.md for full-text search
grep -i "webhook" ./offline/stripe/COMPLETE.md

# Load into AI context
cat output/sentry/COMPLETE.md
# Perfect for:
# - Asking questions about the API
# - Generating integration code
# - Understanding authentication flows
# - Building MCP tools

GitBook vs. Mintlify:

| Feature | GitBook | Mintlify |
|---|---|---|
| Navigation | manifest.json, summary.json | sitemap.xml |
| Content | API-driven, client-side | Server-rendered |
| Versioning | Built-in spaces/versions | URL-based |
| Metadata | spaceId, pageId | Basic |
| Rendering | Dynamic (may need browser) | Static HTML |
| Rate Limits | More conservative needed | Standard |
Key Differences:
- GitBook uses slower rate limits (2 concurrent, 1500ms default)
- GitBook may require --use-browser for JavaScript-rendered content
- GitBook provides richer metadata (space IDs, page IDs, versioning)
- GitBook has different content selectors (.markdown-section vs main), as sketched below
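A hedged sketch of what that selector fallback can look like with cheerio (the actual extractor may differ; the selector list is illustrative):

```typescript
// Sketch: pick the main content element using GitBook-style selectors first,
// then fall back to generic ones. Assumes `npm install cheerio`.
import * as cheerio from 'cheerio';

const CONTENT_SELECTORS = ['.markdown-section', '.page-inner', 'main', 'article'];

function extractContent(html: string): string {
  const $ = cheerio.load(html);
  for (const selector of CONTENT_SELECTORS) {
    const el = $(selector).first();
    if (el.length && el.text().trim().length > 0) {
      return el.html() ?? '';
    }
  }
  return $('body').html() ?? ''; // last resort: the whole body
}
```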
Example GitBook sites:

- Sentry - Error tracking
- Stripe - Payment APIs
- Gitcoin - Web3 grants
- Linear - Issue tracking
- Zapier - Automation
And hundreds more GitBook sites...
# Light rate limiting (small sites)
npm run scrape -- <url> --concurrent 3 --delay 1000
# Medium rate limiting (default)
npm run scrape -- <url> --concurrent 2 --delay 1500
# Heavy rate limiting (large sites, respectful)
npm run scrape -- <url> --concurrent 1 --delay 3000
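Under the hood, --concurrent and --delay amount to a small request throttle. A dependency-free sketch of the idea (names and shape are illustrative only; Node 18+ assumed for fetch):

```typescript
// Sketch: fetch URLs with a concurrency cap and a fixed delay between requests.
// Plain Promises only; real scrapers often use a library such as p-limit instead.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchThrottled(urls: string[], concurrent = 2, delayMs = 1500): Promise<string[]> {
  const results: string[] = new Array(urls.length);
  let next = 0;

  const worker = async () => {
    while (next < urls.length) {
      const index = next++; // synchronous claim, so workers never double-fetch
      results[index] = await (await fetch(urls[index])).text();
      await sleep(delayMs); // be polite between requests
    }
  };

  // Run `concurrent` workers in parallel over the shared queue
  await Promise.all(Array.from({ length: concurrent }, worker));
  return results;
}
```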
# Use when JavaScript rendering is required
npm run scrape -- <url> --use-browser
# When to use:
# - Content doesn't load without JavaScript
# - Navigation is dynamically rendered
# - API calls are made client-side
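For JavaScript-heavy sites, browser-based fetching boils down to rendering the page and handing the final HTML to the extractor. A sketch with Puppeteer (full Puppeteer integration is still listed under future enhancements below, so treat this as illustrative):

```typescript
// Sketch: render a page in headless Chrome and return the final HTML.
// Assumes `npm install puppeteer`; wait strategy and timeout are illustrative.
import puppeteer from 'puppeteer';

async function fetchRendered(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side content has loaded
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60_000 });
    return await page.content(); // fully rendered HTML, ready for the extractor
  } finally {
    await browser.close();
  }
}
```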
Example generated tool:

{
"name": "api_create_event",
"description": "Create Event",
"inputSchema": {
"type": "object",
"properties": {
"project_id": {
"type": "string",
"description": "The ID of the project"
},
"event_type": {
"type": "string",
"description": "Type of event (error, transaction)"
}
},
"required": ["project_id", "event_type"]
}
}

Project structure:

tools/gitbook-scraper/
├── src/
│ ├── scraper.ts # Main GitBook scraper
│ ├── mcp-generator.ts # MCP tool generator
│ └── example-extractor.ts # Code example extractor
├── examples/
│ ├── sentry.md # Sentry workflow
│ └── stripe.md # Stripe workflow
├── templates/
│ ├── mcp-tool-rest-api.md
│ └── mcp-server-template.md
├── output/ # Scraped docs go here
├── package.json
└── README.md
# Test with Sentry docs
npm test
# Custom test
npm run scrape -- https://docs.your-gitbook.com --output ./test-output

Contributions welcome! Areas for improvement:
- Puppeteer Integration - Full browser-based scraping
- Version Detection - Auto-detect and scrape all versions
- Authentication - Support for private GitBook spaces
- Incremental Updates - Only scrape changed pages
- PDF Export - Generate PDF from scraped docs
MIT License - see LICENSE file for details
Built with ❤️ by nich 👉 nich on X
- Mintlify Scraper - For Mintlify documentation sites
Please respect the sites you scrape:
- ✅ Use reasonable rate limits (default: 2 concurrent, 1500ms delay)
- ✅ Only scrape publicly accessible documentation
- ✅ Respect robots.txt
- ✅ Cache results to avoid re-scraping
- ✅ Use longer delays for large scraping jobs
- ❌ Don't scrape private or paywalled content
- ❌ Don't overload servers with aggressive scraping
Example: Respectful large-scale scraping
# Scrape 500+ pages with conservative settings
npm run scrape -- https://docs.large-site.com \
--output ./output \
--concurrent 1 \
--delay 3000

Different users, different needs. This scraper generates multiple output formats optimized for various use cases.
| User Type | Primary Need | Best Output Format |
|---|---|---|
| AI Agents | Full context, single file | COMPLETE.md |
| MCP Servers | Tool definitions, API schemas | mcp-tools.json |
| Developers | Searchable docs, code examples | INDEX.md + sections |
| RAG Systems | Chunked content, embeddings | chunks/ (JSON) |
| API Clients | Endpoints, parameters, schemas | api-spec.json |
| Data Scientists | Structured data, analytics | metadata.json |
| Documentation Sites | Static site generation | docusaurus/ |
| LLM Fine-tuning | Q&A pairs, conversations | training-data.jsonl |
For: Claude, GPT, AI agents needing full context
Format:
# Complete Documentation
> All pages in one file for easy context loading
## Section: Getting Started
### Page: Installation
Content here...
## Section: API Reference
### Page: Authentication
Content here...

Use Cases:
- Load entire docs into AI context
- Semantic search across all content
- Complete offline reference
For: Developers browsing documentation
Format:
# Documentation Index
## Table of Contents
### API Reference
- [Authentication](api/authentication.md)
- [Endpoints](api/endpoints.md)
### Guides
- [Quick Start](guides/quick-start.md)

Use Cases:
- Browse documentation structure
- Find specific pages quickly
- Link to individual files
For: Analytics, dashboards, data processing
Format:
{
"baseUrl": "https://docs.example.com",
"scrapedAt": "2025-11-24T...",
"totalPages": 42,
"sections": ["api", "guides"],
"pages": [
{
"title": "Authentication",
"url": "https://...",
"section": "api",
"hasApi": true,
"codeExamplesCount": 5,
"metadata": {
"spaceId": "xxx",
"pageId": "yyy"
}
}
]
}

Use Cases:
- Generate analytics dashboards
- Track documentation changes
- Build documentation graphs
For: Static site generators, version control
Format:
output/
├── api/
│ ├── authentication.md
│ └── endpoints.md
└── guides/
└── quick-start.md
Use Cases:
- Import into Docusaurus/MkDocs
- Track changes with git
- Selective content loading
For: Building MCP servers
Format:
[
{
"name": "api_create_resource",
"description": "Create a new resource",
"inputSchema": {
"type": "object",
"properties": {
"name": { "type": "string" }
},
"required": ["name"]
}
}
]

Use Cases:
- Auto-generate MCP servers
- API client generation
- Type-safe tool definitions
For: Vector databases, semantic search, embeddings
Format:
[
{
"id": "auth_001",
"section": "api",
"title": "Authentication",
"chunk": "Authentication uses OAuth 2.0...",
"metadata": {
"url": "https://...",
"type": "concept",
"keywords": ["auth", "oauth", "security"]
},
"embedding": null
}
]

Size: 500-1000 tokens per chunk (optimal for embeddings)
Use Cases:
- Load into Pinecone/Weaviate
- Semantic search
- RAG applications
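chunks.json generation is still on the roadmap (see the status list below). The core idea - heading-aware splitting into roughly 500-1000-token pieces - can be sketched like this, with token counts approximated from word counts (illustrative only):

```typescript
// Sketch: split a markdown document into ~chunkSize-token chunks on heading boundaries.
interface Chunk {
  id: string;
  title: string;
  chunk: string;
}

const approxTokens = (text: string) => Math.ceil(text.split(/\s+/).length * 1.3);

function chunkMarkdown(markdown: string, chunkSize = 800): Chunk[] {
  const sections = markdown.split(/^(?=#{1,3} )/m); // keep each heading with its body
  const chunks: Chunk[] = [];
  let buffer = '';
  let title = 'untitled';

  const flush = () => {
    if (buffer.trim()) {
      chunks.push({ id: `chunk_${chunks.length + 1}`, title, chunk: buffer.trim() });
    }
    buffer = '';
  };

  for (const section of sections) {
    if (approxTokens(buffer + section) > chunkSize) flush();
    title = section.match(/^#{1,3} (.+)$/m)?.[1] ?? title;
    buffer += section;
  }
  flush();
  return chunks;
}
```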
Generation:
npm run generate-chunks -- ./output/sentry --size 800

For: API client generators, Postman, testing
Format:
{
"openapi": "3.0.0",
"info": {
"title": "Sentry API",
"version": "1.0.0"
},
"paths": {
"/api/0/projects/{id}/": {
"get": {
"summary": "Retrieve Project",
"parameters": [...],
"responses": {...}
}
}
}
}

Use Cases:
- Generate SDK clients
- Import to Postman
- API testing tools
Generation:
npm run generate-api-spec -- ./output/stripe

For: Fine-tuning language models
Format:
{"prompt": "How do I authenticate?", "completion": "Authentication uses OAuth 2.0. First, obtain a client ID..."}
{"prompt": "What are rate limits?", "completion": "Rate limits are 100 requests per second..."}Use Cases:
- Fine-tune GPT models
- Train domain-specific chatbots
- Q&A dataset creation
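training-data.jsonl generation is also still planned; the heading-as-question idea it describes can be sketched in a few lines (paths and the prompt phrasing are illustrative):

```typescript
// Sketch: turn "## Heading" + body pairs into prompt/completion JSONL lines.
import { readFileSync, writeFileSync } from 'node:fs';

function toTrainingData(markdownPath: string, outPath: string): void {
  const markdown = readFileSync(markdownPath, 'utf8');
  const lines: string[] = [];

  for (const section of markdown.split(/^(?=## )/m)) {
    const [heading, ...body] = section.split('\n');
    const completion = body.join(' ').trim();
    if (!heading.startsWith('## ') || !completion) continue;

    // Phrase the heading as a question-style prompt
    const prompt = `How does "${heading.replace('## ', '')}" work?`;
    lines.push(JSON.stringify({ prompt, completion }));
  }

  writeFileSync(outPath, lines.join('\n'));
}

toTrainingData('./output/sentry/COMPLETE.md', './output/sentry/training-data.jsonl');
```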
Generation:
npm run generate-training-data -- ./output/sentry --pairs 1000

For: Publishing documentation sites
Structure:
docusaurus/
├── docs/
│ ├── intro.md
│ ├── api/
│ └── guides/
├── sidebars.js
└── docusaurus.config.js
Use Cases:
- Deploy to Netlify/Vercel
- Custom documentation site
- Offline documentation
Generation:
npm run generate-docusaurus -- ./output/stripe

For: Developers learning APIs
Format:
examples/
├── javascript/
│ ├── authentication.js
│ └── create-resource.js
├── python/
│ ├── authentication.py
│ └── create_resource.py
└── INDEX.md
Use Cases:
- Copy-paste examples
- IDE snippets
- Tutorial creation
Already implemented!
npm run extract-examples -- ./output/sentry

Implementation status:

- ✅ COMPLETE.md - Implemented
- ✅ metadata.json - Implemented
- ✅ mcp-tools.json - Implemented
- ✅ Individual files - Implemented
- ✅ examples/ - Implemented
- 🔜 chunks.json - High priority (RAG use case)
- 🔜 api-spec.json - Medium priority (API clients)
- 🔜 training-data.jsonl - Medium priority (LLM training)
- 🔜 docusaurus/ - Low priority (manual setup easy)
Planned generator sketches:

class ChunkGenerator {
async generateChunks(docsPath: string, chunkSize: number = 800) {
// Split content into semantic chunks
// Add metadata and keywords
// Optimize for embeddings
// Generate chunks.json
}
}

class ApiSpecGenerator {
async generateOpenApiSpec(docsPath: string) {
// Parse API endpoints from metadata
// Extract parameters and responses
// Generate OpenAPI 3.0 spec
}
}

class TrainingDataGenerator {
async generateQAPairs(docsPath: string) {
// Extract headings as questions
// Use content as answers
// Generate conversation pairs
}
}

# Get everything in one file
cat output/sentry/COMPLETE.md

# Generate tools, then build server
npm run generate-mcp -- ./output/stripe

# Generate embeddings-ready chunks
npm run generate-chunks -- ./output/sentry --size 800

# Generate OpenAPI spec, then use with codegen
npm run generate-api-spec -- ./output/stripe
openapi-generator generate -i api-spec.json -g typescript-axios

# Generate Q&A pairs for fine-tuning
npm run generate-training-data -- ./output/sentry --pairs 1000

Size comparison:

| Format | Size (Sentry) | Size (Stripe) | Use Case |
|---|---|---|---|
| COMPLETE.md | ~2.5 MB | ~4.2 MB | AI context |
| metadata.json | ~45 KB | ~78 KB | Analytics |
| mcp-tools.json | ~120 KB | ~215 KB | MCP servers |
| Individual files | ~2.8 MB | ~4.5 MB | Static sites |
| chunks.json | ~3.2 MB* | ~5.1 MB* | RAG (estimated) |
| api-spec.json | ~85 KB* | ~145 KB* | API clients (estimated) |
| training-data.jsonl | ~1.8 MB* | ~3.0 MB* | LLM training (estimated) |
npm run scrape -- <url> --output ./docs
npm run generate-mcp -- ./docs
# Use mcp-tools.json + examples/

npm run scrape -- <url> --output ./docs
npm run generate-chunks -- ./docs --size 800
# Use chunks.json with vector DB

npm run scrape -- <url> --output ./docs
npm run generate-api-spec -- ./docs
# Use api-spec.json with codegen

npm run scrape -- <url> --output ./docs
# Use COMPLETE.md directly

npm run scrape -- <url> --output ./docs
npm run generate-docusaurus -- ./docs
cd docusaurus && npm run build

Which output format should we prioritize next?
This tool exists to make documentation more accessible for AI agents and developers. Use it responsibly!
Questions? Open an issue or reach out!