nirholas/gitbook-ai-toolkit

GitBook documentation scraper with 10 output formats: OpenAPI specs, RAG chunks, MCP tools, training data & more. Built for AI agents, developers & ML engineers.

🔧 GitBook Documentation Scraper & MCP Development Tool

Professional scraping tool for GitBook sites - Handles versioned documentation, API discovery, and dynamic content. Built specifically for GitBook's unique architecture.

License: MIT


🌟 Why GitBook Scraper?

GitBook is Different:

  • ✅ JSON-based navigation (manifest.json, summary.json)
  • ✅ API-driven content loading
  • ✅ Version management and spaces
  • ✅ Client-side rendering
  • ✅ Different HTML structure than Mintlify
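
Those JSON manifests make page discovery scriptable. As a rough illustration (a hypothetical helper, not part of this tool's API), these are the kinds of endpoints a discovery pass can probe before falling back to crawling:

```typescript
// Hypothetical helper: candidate navigation endpoints to probe on a
// GitBook site, roughly in the order the discovery methods try them.
function discoveryUrls(baseUrl: string): string[] {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return [
    `${base}/manifest.json`, // GitBook API manifest
    `${base}/summary.json`,  // GitBook summary / navigation
    `${base}/sitemap.xml`,   // generic fallback
  ];
}

console.log(discoveryUrls("https://docs.sentry.io/"));
```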

This Tool Handles:

  1. ✅ Multiple discovery methods - GitBook Content API, API manifest, summary.json, sitemap, crawling
  2. ✅ Deep sidebar crawling - Recursively discovers all pages from navigation
  3. ✅ GitBook-specific selectors - .markdown-section, .page-inner
  4. ✅ Metadata extraction - Space ID, page ID, last modified
  5. ✅ Versioned documentation support
  6. ✅ MCP tool generation
  7. ✅ Code example extraction - With filename support
  8. ✅ Configurable crawl depth - Control how deep to follow links
  9. ✅ JSON-LD extraction - Discovers pages from structured data
  10. ✅ Automatic zip archiving - Create compressed archives of scraped docs

Note: While GitBook offers PDF export (/~gitbook/pdf?page=...), this scraper extracts structured markdown, code examples, and metadata - far more useful for AI agents and programmatic access than PDFs.


🚀 Quick Start

Installation

npm install

Basic Scraping

# Scrape Sentry docs
npm run scrape -- https://docs.sentry.io --output ./output/sentry

# Scrape Stripe docs
npm run scrape -- https://docs.stripe.com --output ./output/stripe

# Scrape Gitcoin docs
npm run scrape -- https://docs.gitcoin.co --output ./output/gitcoin

Advanced Usage

# Conservative rate limiting for large sites
npm run scrape -- https://docs.sentry.io \
  --output ./output/sentry \
  --concurrent 2 \
  --delay 2000

# Deep crawling with recursive link following (default)
npm run scrape -- https://docs.monad.xyz \
  --output ./output/monad \
  --crawl-depth 5

# Disable recursive crawling (API discovery only)
npm run scrape -- https://docs.stripe.com \
  --output ./output/stripe \
  --no-follow-links

# Create a zip archive of the scraped documentation
npm run scrape -- https://docs.example.com \
  --output ./output/example \
  --zip

# Use headless browser for JavaScript-heavy sites
npm run scrape -- https://docs.stripe.com \
  --output ./output/stripe \
  --use-browser

# Scrape specific version
npm run scrape -- https://docs.api.com \
  --output ./output/api-v2 \
  --version v2

# Complete deep crawl with all sidebar pages and zip
npm run scrape -- https://docs.example.com \
  --output ./output/example \
  --crawl-depth 10 \
  --follow-links \
  --concurrent 3 \
  --zip

📚 Documentation


📁 Output Structure

output/sentry/
├── COMPLETE.md                 # All docs in one file (AI-optimized!)
├── INDEX.md                    # Navigation index
├── metadata.json               # Structured metadata with GitBook IDs
├── mcp-tools.json             # Auto-generated MCP tools
├── examples/                   # Extracted code examples
│   ├── javascript-examples.md
│   ├── python-examples.md
│   ├── curl-examples.md
│   └── INDEX.md
└── api/                        # Individual API pages
    ├── events.md
    ├── projects.md
    └── ...

Zip Archive Creation

Use the --zip flag to automatically create a compressed archive:

npm run scrape -- https://docs.example.com --output ./docs --zip

Output:

  • Creates the documentation in ./docs/
  • Automatically generates docs.zip in the same parent directory
  • Displays archive size after creation
  • Perfect for sharing or archiving documentation

Benefits:

  • 📦 Easy distribution and sharing
  • 💾 Compressed storage (typically 60-80% size reduction)
  • 📤 Quick upload/download
  • 🔒 Single-file archiving for backup

Metadata Format

{
  "baseUrl": "https://docs.sentry.io",
  "scrapedAt": "2025-11-24T...",
  "totalPages": 234,
  "sections": ["api", "platforms", "product"],
  "pages": [
    {
      "title": "Event Ingestion",
      "url": "https://docs.sentry.io/api/events/",
      "section": "api",
      "hasApi": true,
      "metadata": {
        "spaceId": "xxx",
        "pageId": "yyy",
        "lastModified": "2025-11-20T..."
      }
    }
  ]
}
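
That structure is easy to query programmatically. As a small sketch (assuming the metadata.json shape shown above), a few lines can list every page flagged as an API page:

```typescript
// Assumes the metadata.json page shape shown above.
interface PageMeta {
  title: string;
  url: string;
  section: string;
  hasApi: boolean;
}

// Return the URLs of pages that document API endpoints.
function apiPages(pages: PageMeta[]): string[] {
  return pages.filter((p) => p.hasApi).map((p) => p.url);
}

// In practice, load the real file:
//   JSON.parse(fs.readFileSync("output/sentry/metadata.json", "utf8")).pages
const sample: PageMeta[] = [
  { title: "Event Ingestion", url: "https://docs.sentry.io/api/events/", section: "api", hasApi: true },
  { title: "Quick Start", url: "https://docs.sentry.io/guides/start/", section: "guides", hasApi: false },
];
console.log(apiPages(sample)); // ["https://docs.sentry.io/api/events/"]
```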

🎯 Use Cases

1. Build MCP Servers from GitBook APIs

# 1. Scrape Sentry documentation
npm run scrape -- https://docs.sentry.io --output ./output/sentry

# 2. Generate MCP tools
npm run generate-mcp -- ./output/sentry

# 3. Review generated tools
cat ./output/sentry/mcp-tools/mcp-tools.json

# 4. Implement MCP server using the definitions
# (All API signatures, parameters, and examples are ready!)

2. Extract Code Examples

# Extract all code examples
npm run extract-examples -- ./output/sentry

# Filter by language
npm run extract-examples -- ./output/sentry --language python

# Output: organized by language in examples/

3. Offline Documentation

# Scrape complete docs for offline use
npm run scrape -- https://docs.stripe.com --output ./offline/stripe

# Use COMPLETE.md for full-text search
grep -i "webhook" ./offline/stripe/COMPLETE.md

4. AI Agent Context

# Load into AI context
cat output/sentry/COMPLETE.md

# Perfect for:
# - Asking questions about the API
# - Generating integration code
# - Understanding authentication flows
# - Building MCP tools

🔄 GitBook vs Mintlify Comparison

| Feature | GitBook | Mintlify |
|---------|---------|----------|
| Navigation | manifest.json, summary.json | sitemap.xml |
| Content | API-driven, client-side | Server-rendered |
| Versioning | Built-in spaces/versions | URL-based |
| Metadata | spaceId, pageId | Basic |
| Rendering | Dynamic (may need browser) | Static HTML |
| Rate limits | More conservative needed | Standard |

Key Differences:

  • GitBook needs more conservative rate limits (default: 2 concurrent, 1500ms delay)
  • GitBook may require --use-browser for JavaScript-rendered content
  • GitBook provides richer metadata (space IDs, page IDs, versioning)
  • GitBook has different content selectors (.markdown-section vs main)

🌍 Community Tested GitBooks

Developer Tools

Web3 & Blockchain

Infrastructure

And hundreds more GitBook sites...


⚙️ Configuration

Rate Limiting

# Light rate limiting (small sites)
npm run scrape -- <url> --concurrent 3 --delay 1000

# Medium rate limiting (default)
npm run scrape -- <url> --concurrent 2 --delay 1500

# Heavy rate limiting (large sites, respectful)
npm run scrape -- <url> --concurrent 1 --delay 3000
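
Under the hood, --concurrent and --delay presumably combine as a small worker pool with a per-request sleep. A rough sketch of that pattern (not this scraper's actual internals):

```typescript
// Hypothetical illustration of combining a concurrency cap (--concurrent)
// with a per-request delay (--delay). Not the tool's real implementation.
async function scrapeAll(
  urls: string[],
  concurrent: number,
  delayMs: number,
  fetchPage: (url: string) => Promise<void>,
): Promise<void> {
  const queue = [...urls];
  const worker = async (): Promise<void> => {
    // Each worker drains the shared queue, pausing between requests.
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      await fetchPage(url);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  };
  // Spawn `concurrent` workers over the same queue.
  await Promise.all(Array.from({ length: concurrent }, worker));
}
```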

Browser Mode

# Use when JavaScript rendering is required
npm run scrape -- <url> --use-browser

# When to use:
# - Content doesn't load without JavaScript
# - Navigation is dynamically rendered
# - API calls are made client-side

📚 Generated MCP Tools

Example generated tool:

{
  "name": "api_create_event",
  "description": "Create Event",
  "inputSchema": {
    "type": "object",
    "properties": {
      "project_id": {
        "type": "string",
        "description": "The ID of the project"
      },
      "event_type": {
        "type": "string",
        "description": "Type of event (error, transaction)"
      }
    },
    "required": ["project_id", "event_type"]
  }
}
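
A generated inputSchema can be checked before dispatching a tool call. A minimal validator sketch (hypothetical; a real MCP server would use a full JSON Schema library such as Ajv), covering only required fields and primitive types:

```typescript
// Matches the inputSchema shape of the generated tools above.
interface ToolSchema {
  type: string;
  properties: Record<string, { type: string; description?: string }>;
  required?: string[];
}

// Minimal check: required properties exist, and primitive types match.
// (typeof only lines up with JSON Schema for string/number/boolean.)
function validateArgs(schema: ToolSchema, args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const name of schema.required ?? []) {
    if (!(name in args)) errors.push(`missing required argument: ${name}`);
  }
  for (const [name, value] of Object.entries(args)) {
    const prop = schema.properties[name];
    if (prop && typeof value !== prop.type) {
      errors.push(`argument ${name} should be ${prop.type}`);
    }
  }
  return errors;
}
```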

🔧 Development

Project Structure

tools/gitbook-scraper/
├── src/
│   ├── scraper.ts           # Main GitBook scraper
│   ├── mcp-generator.ts     # MCP tool generator
│   └── example-extractor.ts # Code example extractor
├── examples/
│   ├── sentry.md            # Sentry workflow
│   └── stripe.md            # Stripe workflow
├── templates/
│   ├── mcp-tool-rest-api.md
│   └── mcp-server-template.md
├── output/                   # Scraped docs go here
├── package.json
└── README.md

Running Tests

# Test with Sentry docs
npm test

# Custom test
npm run scrape -- https://docs.your-gitbook.com --output ./test-output

🤝 Contributing

Contributions welcome! Areas for improvement:

  1. Puppeteer Integration - Full browser-based scraping
  2. Version Detection - Auto-detect and scrape all versions
  3. Authentication - Support for private GitBook spaces
  4. Incremental Updates - Only scrape changed pages
  5. PDF Export - Generate PDF from scraped docs

📄 License

MIT License - see LICENSE file for details

Built with ❤️ by nich 👉 nich on X


🔗 Related Tools


🚨 Responsible Use

Please respect the sites you scrape:

  • ✅ Use reasonable rate limits (default: 2 concurrent, 1500ms delay)
  • ✅ Only scrape publicly accessible documentation
  • ✅ Respect robots.txt
  • ✅ Cache results to avoid re-scraping
  • ✅ Use longer delays for large scraping jobs
  • ❌ Don't scrape private or paywalled content
  • ❌ Don't overload servers with aggressive scraping

Example: Respectful large-scale scraping

# Scrape 500+ pages with conservative settings
npm run scrape -- https://docs.large-site.com \
  --output ./output \
  --concurrent 1 \
  --delay 3000

GitBook Scraper Output Formats

Different users, different needs. This scraper generates multiple output formats optimized for various use cases.


🎯 Use Case Matrix

| User Type | Primary Need | Best Output Format |
|-----------|--------------|--------------------|
| AI Agents | Full context, single file | COMPLETE.md |
| MCP Servers | Tool definitions, API schemas | mcp-tools.json |
| Developers | Searchable docs, code examples | INDEX.md + sections |
| RAG Systems | Chunked content, embeddings | chunks/ (JSON) |
| API Clients | Endpoints, parameters, schemas | api-spec.json |
| Data Scientists | Structured data, analytics | metadata.json |
| Documentation Sites | Static site generation | docusaurus/ |
| LLM Fine-tuning | Q&A pairs, conversations | training-data.jsonl |

📦 Current Outputs (v1.0)

1. COMPLETE.md - AI Agent Optimized

For: Claude, GPT, AI agents needing full context

Format:

# Complete Documentation

> All pages in one file for easy context loading

## Section: Getting Started
### Page: Installation
Content here...

## Section: API Reference
### Page: Authentication
Content here...

Use Cases:

  • Load entire docs into AI context
  • Semantic search across all content
  • Complete offline reference

2. INDEX.md - Human Navigation

For: Developers browsing documentation

Format:

# Documentation Index

## Table of Contents

### API Reference
- [Authentication](api/authentication.md)
- [Endpoints](api/endpoints.md)

### Guides
- [Quick Start](guides/quick-start.md)

Use Cases:

  • Browse documentation structure
  • Find specific pages quickly
  • Link to individual files

3. metadata.json - Structured Data

For: Analytics, dashboards, data processing

Format:

{
  "baseUrl": "https://docs.example.com",
  "scrapedAt": "2025-11-24T...",
  "totalPages": 42,
  "sections": ["api", "guides"],
  "pages": [
    {
      "title": "Authentication",
      "url": "https://...",
      "section": "api",
      "hasApi": true,
      "codeExamplesCount": 5,
      "metadata": {
        "spaceId": "xxx",
        "pageId": "yyy"
      }
    }
  ]
}

Use Cases:

  • Generate analytics dashboards
  • Track documentation changes
  • Build documentation graphs

4. Individual Markdown Files - Modular Content

For: Static site generators, version control

Format:

output/
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── guides/
    └── quick-start.md

Use Cases:

  • Import into Docusaurus/MkDocs
  • Track changes with git
  • Selective content loading

5. mcp-tools.json - MCP Tool Definitions

For: Building MCP servers

Format:

[
  {
    "name": "api_create_resource",
    "description": "Create a new resource",
    "inputSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" }
      },
      "required": ["name"]
    }
  }
]

Use Cases:

  • Auto-generate MCP servers
  • API client generation
  • Type-safe tool definitions

🚀 Proposed New Outputs (v2.0)

6. chunks.json - RAG Optimized

For: Vector databases, semantic search, embeddings

Format:

[
  {
    "id": "auth_001",
    "section": "api",
    "title": "Authentication",
    "chunk": "Authentication uses OAuth 2.0...",
    "metadata": {
      "url": "https://...",
      "type": "concept",
      "keywords": ["auth", "oauth", "security"]
    },
    "embedding": null
  }
]

Size: 500-1000 tokens per chunk (optimal for embeddings)

Use Cases:

  • Load into Pinecone/Weaviate
  • Semantic search
  • RAG applications

Generation:

npm run generate-chunks -- ./output/sentry --size 800
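
Since generate-chunks is still proposed, a naive splitter over COMPLETE.md can stand in for now. This sketch uses fixed-size windows with a ~4 characters-per-token heuristic; a real implementation would split on headings and paragraphs to keep chunks semantically coherent:

```typescript
// Naive chunker sketch (stand-in for the proposed generate-chunks).
// Fixed-size character windows sized from a target token count.
interface Chunk {
  id: string;
  chunk: string;
  embedding: null;
}

function chunkText(text: string, targetTokens = 800): Chunk[] {
  const step = targetTokens * 4; // ~4 chars per token heuristic
  const chunks: Chunk[] = [];
  for (let i = 0; i < text.length; i += step) {
    chunks.push({
      id: `chunk_${chunks.length.toString().padStart(3, "0")}`,
      chunk: text.slice(i, i + step),
      embedding: null, // filled in later by an embedding pipeline
    });
  }
  return chunks;
}
```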

7. api-spec.json - OpenAPI Style

For: API client generators, Postman, testing

Format:

{
  "openapi": "3.0.0",
  "info": {
    "title": "Sentry API",
    "version": "1.0.0"
  },
  "paths": {
    "/api/0/projects/{id}/": {
      "get": {
        "summary": "Retrieve Project",
        "parameters": [...],
        "responses": {...}
      }
    }
  }
}

Use Cases:

  • Generate SDK clients
  • Import to Postman
  • API testing tools

Generation:

npm run generate-api-spec -- ./output/stripe

8. training-data.jsonl - LLM Training

For: Fine-tuning language models

Format:

{"prompt": "How do I authenticate?", "completion": "Authentication uses OAuth 2.0. First, obtain a client ID..."}
{"prompt": "What are rate limits?", "completion": "Rate limits are 100 requests per second..."}

Use Cases:

  • Fine-tune GPT models
  • Train domain-specific chatbots
  • Q&A dataset creation

Generation:

npm run generate-training-data -- ./output/sentry --pairs 1000
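
generate-training-data is likewise proposed, not shipped. A sketch of the heading-as-question idea (see Phase 3 of the implementation plan below): each markdown heading becomes the prompt and the body under it becomes the completion:

```typescript
interface QAPair {
  prompt: string;
  completion: string;
}

// Split markdown on ## headings; heading text becomes the question,
// the content beneath it becomes the answer. A hypothetical sketch,
// not the tool's actual generator.
function toQAPairs(markdown: string): QAPair[] {
  const pairs: QAPair[] = [];
  const sections = markdown.split(/^##+\s+/m).slice(1);
  for (const section of sections) {
    const [heading, ...body] = section.split("\n");
    const completion = body.join("\n").trim();
    if (completion) pairs.push({ prompt: heading.trim(), completion });
  }
  return pairs;
}

// Emit JSONL: pairs.map((p) => JSON.stringify(p)).join("\n")
```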

9. docusaurus/ - Static Site Ready

For: Publishing documentation sites

Structure:

docusaurus/
├── docs/
│   ├── intro.md
│   ├── api/
│   └── guides/
├── sidebars.js
└── docusaurus.config.js

Use Cases:

  • Deploy to Netlify/Vercel
  • Custom documentation site
  • Offline documentation

Generation:

npm run generate-docusaurus -- ./output/stripe

10. examples/ - Code Examples Library

For: Developers learning APIs

Format:

examples/
├── javascript/
│   ├── authentication.js
│   └── create-resource.js
├── python/
│   ├── authentication.py
│   └── create_resource.py
└── INDEX.md

Use Cases:

  • Copy-paste examples
  • IDE snippets
  • Tutorial creation

Already implemented!

npm run extract-examples -- ./output/sentry

🎨 Custom Output Generators

Priority Order (Based on User Demand)

  1. COMPLETE.md - Implemented
  2. metadata.json - Implemented
  3. mcp-tools.json - Implemented
  4. Individual files - Implemented
  5. examples/ - Implemented
  6. 🔜 chunks.json - High priority (RAG use case)
  7. 🔜 api-spec.json - Medium priority (API clients)
  8. 🔜 training-data.jsonl - Medium priority (LLM training)
  9. 🔜 docusaurus/ - Low priority (manual setup easy)

💡 Implementation Plan

Phase 1: RAG Optimization (chunks.json)

class ChunkGenerator {
  async generateChunks(docsPath: string, chunkSize: number = 800) {
    // Split content into semantic chunks
    // Add metadata and keywords
    // Optimize for embeddings
    // Generate chunks.json
  }
}

Phase 2: API Specification (api-spec.json)

class ApiSpecGenerator {
  async generateOpenApiSpec(docsPath: string) {
    // Parse API endpoints from metadata
    // Extract parameters and responses
    // Generate OpenAPI 3.0 spec
  }
}

Phase 3: Training Data (training-data.jsonl)

class TrainingDataGenerator {
  async generateQAPairs(docsPath: string) {
    // Extract headings as questions
    // Use content as answers
    // Generate conversation pairs
  }
}

🔧 Usage Examples

For AI Agents

# Get everything in one file
cat output/sentry/COMPLETE.md

For MCP Servers

# Generate tools, then build server
npm run generate-mcp -- ./output/stripe

For RAG Systems

# Generate embeddings-ready chunks
npm run generate-chunks -- ./output/sentry --size 800

For API Clients

# Generate OpenAPI spec, then use with codegen
npm run generate-api-spec -- ./output/stripe
openapi-generator generate -i api-spec.json -g typescript-axios

For LLM Training

# Generate Q&A pairs for fine-tuning
npm run generate-training-data -- ./output/sentry --pairs 1000

📊 Output Size Comparison

| Format | Size (Sentry) | Size (Stripe) | Use Case |
|--------|---------------|---------------|----------|
| COMPLETE.md | ~2.5 MB | ~4.2 MB | AI context |
| metadata.json | ~45 KB | ~78 KB | Analytics |
| mcp-tools.json | ~120 KB | ~215 KB | MCP servers |
| Individual files | ~2.8 MB | ~4.5 MB | Static sites |
| chunks.json | ~3.2 MB* | ~5.1 MB* | RAG (estimated) |
| api-spec.json | ~85 KB* | ~145 KB* | API clients (estimated) |
| training-data.jsonl | ~1.8 MB* | ~3.0 MB* | LLM training (estimated) |

*Estimated based on content analysis

🎯 Recommendation by Use Case

Building an MCP Server

npm run scrape -- <url> --output ./docs
npm run generate-mcp -- ./docs
# Use mcp-tools.json + examples/

RAG/Semantic Search

npm run scrape -- <url> --output ./docs
npm run generate-chunks -- ./docs --size 800
# Use chunks.json with vector DB

API Client Library

npm run scrape -- <url> --output ./docs
npm run generate-api-spec -- ./docs
# Use api-spec.json with codegen

AI Agent Context

npm run scrape -- <url> --output ./docs
# Use COMPLETE.md directly

Documentation Website

npm run scrape -- <url> --output ./docs
npm run generate-docusaurus -- ./docs
cd docusaurus && npm run build

Which output format should we prioritize next?

This tool exists to make documentation more accessible for AI agents and developers. Use it responsibly!


Questions? Open an issue or reach out!
