mdscrape

A fast, concurrent CLI tool for scraping documentation websites and converting them to clean Markdown files. Perfect for building knowledge bases for AI agents.

mdscrape in action

Features

  • Fast concurrent scraping - configurable thread pool (default: 10 threads)
  • Smart content extraction - automatically finds main content, removes nav/footer/ads
  • Preserves formatting - code blocks with syntax hints, tables, lists, inline code, links, headings
  • Mirror folder structure - /docs/api/auth/ becomes docs/api/auth.md
  • YAML frontmatter - includes title and source URL for reference
  • Beautiful progress UI - real-time stats, per-thread activity display (auto-detects TTY)
  • Flexible filtering - limit by URL prefix, exclude patterns, set max depth

Installation

Homebrew (macOS/Linux)

brew tap jotka/tap
brew install mdscrape

From source

git clone https://github.com/jotka/mdscrape.git
cd mdscrape
go build -o mdscrape .
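
With Go installed you can also build and install in one step. This is a sketch that assumes the Go module path matches the repository URL:

# Install directly with the Go toolchain (assumes module path github.com/jotka/mdscrape)
go install github.com/jotka/mdscrape@latest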

Quick start

# Scrape Next.js documentation
mdscrape https://nextjs.org/docs/

# Scrape Docker reference with custom output folder
mdscrape https://docs.docker.com/reference/ -o ./docker-docs

# Limit depth and threads for gentler scraping
mdscrape https://react.dev/reference/ -d 5 -t 3

# Preview what would be scraped (dry run)
mdscrape https://docs.python.org/3/library/ --dry-run

Usage

mdscrape <url> [options]

Options

| Option       | Short | Default   | Description                                  |
|--------------|-------|-----------|----------------------------------------------|
| `<url>`      |       | required  | Starting URL to scrape (positional argument) |
| `--limit`    | `-l`  | start URL | Only scrape URLs with this prefix            |
| `--output`   | `-o`  | auto      | Output directory                             |
| `--threads`  | `-t`  | 10        | Concurrent download threads                  |
| `--depth`    | `-d`  | 50        | Maximum link depth to follow                 |
| `--selector` | `-s`  | auto      | CSS selector for content (e.g., `article`)   |
| `--exclude`  | `-e`  | none      | URL patterns to skip (repeatable)            |
| `--delay`    |       | 100       | Milliseconds between requests                |
| `--dry-run`  |       | false     | Show plan without downloading                |
| `--quiet`    | `-q`  | false     | Minimal output, no progress UI               |
| `--verbose`  | `-v`  | false     | Show every file downloaded                   |

Examples

# Scrape only the API section
mdscrape https://docs.example.com/api/

# Limit to specific subsection
mdscrape https://docs.example.com/ -l https://docs.example.com/api/

# Use specific content selector
mdscrape https://docs.example.com/ -s "main.docs-content"

# Exclude changelog and blog
mdscrape https://docs.example.com/ -e "/changelog" -e "/blog"

# Gentle scraping with delays
mdscrape https://docs.example.com/ -t 2 --delay 500
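
The flags compose freely, so a single invocation can restrict scope, throttle requests, and skip noise at once (URLs here are illustrative):

# Focused, polite scrape of one subsection
mdscrape https://docs.example.com/ -l https://docs.example.com/api/ -o ./api-docs -t 3 --delay 300 -e "/changelog"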

Output

File structure

URLs are converted to a matching folder hierarchy:

https://docs.docker.com/reference/cli/docker/run/
                                ↓
output/cli/docker/run.md

A full scrape produces a tree like this:

output/
├── index.md
├── getting-started.md
├── cli/
│   └── docker/
│       ├── build.md
│       ├── run.md
│       └── compose/
│           ├── up.md
│           └── down.md
├── api/
│   └── engine/
│       └── index.md
└── reference/
    └── dockerfile.md
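
Because every file records its source URL in frontmatter, standard Unix tools are enough to sanity-check a finished scrape. A quick sketch:

# Count scraped pages and list where each came from
find output -name '*.md' | wc -l
grep -r --include='*.md' '^source:' output | head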

Markdown format

Each page includes YAML frontmatter:

---
title: "docker run"
source: "https://docs.docker.com/reference/cli/docker/run/"
---

# docker run

Run a command in a new container.

## Usage

```bash
docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
```

## Options

| Option | Description |
|--------|-------------|
| `-d` | Run in detached mode |
| `-p` | Publish port |
...

How it works

  1. Crawl - Discovers pages using Colly, respecting depth limits and URL filters
  2. Extract - Parses HTML with GoQuery, removes boilerplate (nav, footer, scripts)
  3. Convert - Transforms to Markdown using html-to-markdown
  4. Save - Writes files with folder structure matching the URL path

Smart content detection

The scraper automatically tries these selectors to find main content:

main, article, [role="main"], .content, .main-content,
.post-content, .article-content, .markdown-body, .docs-content

Override with --selector if needed.
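
If auto-detection picks the wrong region, one way to find a better selector is to inspect the raw HTML before committing to a full run. A rough sketch, assuming curl is available and using an illustrative selector:

# Peek at candidate content containers, then test the selector with a dry run
curl -s https://docs.example.com/ | grep -oE '<(main|article|div class="[^"]*")' | sort | uniq -c
mdscrape https://docs.example.com/ -s "div.docs-content" --dry-run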

Elements removed

Navigation, headers, footers, sidebars, ads, scripts, and other non-content elements are stripped:

nav, header, footer, .sidebar, .toc, .breadcrumb,
.pagination, .comments, .advertisement, script, style

Use cases

  • AI context - Build knowledge bases for Claude, GPT, or other LLMs
  • Offline docs - Read documentation without internet
  • Documentation migration - Convert sites to Markdown for static site generators
  • Archiving - Preserve documentation snapshots

Building AI agents with scraped docs

The scraped markdown files are ideal for creating specialized AI coding agents. This approach is often more effective than using MCP (Model Context Protocol) servers or real-time web fetching, which can waste context window space on navigation elements, ads, and irrelevant content with each request. Pre-scraped markdown files are clean, deduplicated, and always available - the AI can reference them instantly without network latency or token overhead from fetching raw HTML.

For example, to build a "Next.js expert" agent:

# Scrape Next.js documentation
mdscrape https://nextjs.org/docs/ -o nextjs-docs

# Create a specialized agent in .claude/agents/
mkdir -p .claude/agents
cat > .claude/agents/nextjs-expert.md << 'EOF'
You are an expert Next.js developer. When answering questions about Next.js,
refer to the documentation in the `nextjs-docs/` folder for accurate,
up-to-date information about APIs, patterns, and best practices.
EOF

Now you can invoke this agent in Claude Code with /nextjs-expert. The AI will have access to the complete, current documentation as context, enabling more accurate and framework-specific responses. The markdown format preserves code examples, API references, and structural information that helps the AI understand and apply the documentation correctly.

You can combine multiple documentation sources to create specialized agents:

mdscrape https://nextjs.org/docs/ -o docs/nextjs
mdscrape https://tailwindcss.com/docs/ -o docs/tailwind
mdscrape https://www.prisma.io/docs/ -o docs/prisma

Then reference all three in your project instructions to create a full-stack expert agent.
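
Following the same pattern as above, a minimal sketch of such a combined agent file (the agent name and wording are illustrative):

cat > .claude/agents/fullstack-expert.md << 'EOF'
You are an expert full-stack developer. For framework-specific questions,
consult the documentation in `docs/nextjs/`, `docs/tailwind/`, and
`docs/prisma/` before answering.
EOF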

Limitations

  • Static HTML only - Cannot scrape JavaScript-rendered content (SPAs, React apps). Use browser-based tools for sites that require JS execution.
  • No authentication - Cannot access pages behind login. For private documentation, consider exporting directly from the source.
  • No robots.txt support - Does not check robots.txt. Be respectful and use appropriate delays when scraping.
  • No image downloads - Images remain as remote URLs. Markdown files reference original image locations.
  • No resume capability - Interrupted scrapes must restart from the beginning. For large sites, consider scraping in sections, as shown below.
  • Rate limiting - Some sites may block or throttle requests. Use --delay and reduce --threads if you encounter 429 errors.
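
To work around the lack of resume support, one practical approach is to split a large site into independent runs, so an interruption only costs one section (paths are illustrative):

# Scrape a large site section by section
mdscrape https://docs.example.com/api/ -o docs/api
mdscrape https://docs.example.com/guides/ -o docs/guides
mdscrape https://docs.example.com/reference/ -o docs/reference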

License

MIT

Examples

Folder structure mirroring the documentation site:

Folder structure

Markdown output with preserved code blocks and formatting:

Markdown output

Using scraped docs with Claude Code agents:

Claude Code example
