Turn your Confluence space into Markdown files on your hard drive and use that local copy to:
- build a second brain
- feed AI/RAG pipelines
- version docs in Git
- browse offline in your preferred Markdown editor
- power local full-text search across docs
- support onboarding with a portable docs snapshot
- preserve runbooks for incident response and disaster recovery
- export knowledge for compliance and audit evidence trails
- reduce vendor lock-in by keeping docs in an open format.
What you get:
- One local .md file per crawled page, with stable filenames that include page IDs.
- Links between crawled pages rewritten to relative local links.
- External or out-of-scope links preserved as original URLs.
- Page comments appended under a ## Comments section.
- A metadata.json index with page metadata plus incoming/outgoing link graph data.
- Starts from configured seed pages and traverses linked Confluence pages up to a configurable depth.
- Converts each page from Confluence storage format to Markdown and writes it to the local filesystem.
- Runs a second pass to rewrite links between crawled pages as relative local links.
- Leaves all other URLs unchanged — including links to uncrawled Confluence pages and external sites.
- Downloads page attachments and rewrites attachment references to point to the downloaded files.
- Appends page comments under a ## Comments section in each page file.
- Writes a single metadata.json with crawl metadata and a bidirectional link graph.
- Supports two run modes:
  - full: crawl all pages reachable from seeds up to max depth.
  - updates: run the same seed-based traversal as full mode, but selectively re-process dirty pages while reusing clean-page artifacts.
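The seed-based traversal described above can be sketched as a depth-limited breadth-first walk. This is a minimal illustration, not the tool's actual code: the `crawlOrder` function and the in-memory `links` map are hypothetical stand-ins for the page links the crawler would discover through the Confluence API, and breadth-first ordering is an assumption.

```go
package main

import "fmt"

// crawlOrder returns the page IDs reachable from seeds within maxDepth link
// hops, in breadth-first order. links is a hypothetical in-memory stand-in
// for the outgoing page links a real crawl would discover via the API.
func crawlOrder(seeds []string, links map[string][]string, maxDepth int) []string {
	type item struct {
		id    string
		depth int
	}
	visited := make(map[string]bool)
	var queue []item
	for _, s := range seeds {
		if !visited[s] {
			visited[s] = true
			queue = append(queue, item{s, 0})
		}
	}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.id)
		if cur.depth == maxDepth {
			continue // do not follow links beyond the depth limit
		}
		for _, next := range links[cur.id] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	links := map[string][]string{
		"100": {"200", "300"},
		"200": {"400"},
		"400": {"500"},
	}
	// With max depth 2, page 500 (three hops from the seed) is not crawled.
	fmt.Println(crawlOrder([]string{"100"}, links, 2)) // prints [100 200 300 400]
}
```

With a max depth of 0, only the seed pages themselves are returned, which matches the "0 = seed pages only" setting.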
Pre-built binaries are available on the Releases page.
- Download the archive for your platform.
- Extract the binary (confluence2md or confluence2md.exe).
- Run it from the directory containing your config.yaml.
- A valid Atlassian API token with read access to the target spaces
You can generate an Atlassian API token from your Atlassian account security page: https://id.atlassian.com/manage-profile/security/api-tokens
If you run into authentication issues, see Operations and Troubleshooting.
Copy config.example.yaml to config.yaml and fill in the required values.
confluence:
  # Your Atlassian account email
  username: you@example.com
  # Atlassian API token (https://id.atlassian.com/manage-profile/security/api-tokens)
  # Can also be set via env var: CONFLUENCE_TOKEN
  token: ""
crawl:
  # One or more seed page URLs or page IDs to start from
  seeds:
    - https://your-org.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title
  # Maximum link-follow depth from each seed (0 = seed pages only)
  max_depth: 3
  # Maximum concurrent API requests
  concurrency: 5
  # Requests per minute (Confluence Cloud limit is ~300/min per token)
  rate_limit_rpm: 250
output:
  # Directory to write Markdown files, attachments, and metadata
  dir: ./output
attachments:
  # Download attachments referenced by crawled pages
  download: true
  # Skip attachments larger than this size (0 = no limit)
  max_size_mb: 100
retry:
  # Maximum number of retries for transient API errors (429, 5xx)
  max_attempts: 5
  # Initial backoff in milliseconds (doubles with each retry + jitter)
  initial_backoff_ms: 1000
Now run a full crawl:
confluence2md --mode full
Note: full mode clears the configured output directory before crawling.
Once it completes, open output/ to inspect the generated Markdown files.
# Full crawl (crawls all pages reachable from seeds up to max depth)
confluence2md --mode full
# Incremental update (same seed traversal, selective page re-processing)
confluence2md --mode updates
# Validate config and Confluence API credentials without crawling
confluence2md validate
The tool looks for config.yaml in the current directory by default. Use --config to specify a different path:
confluence2md --config /path/to/config.yaml --mode full
confluence2md --config /path/to/config.yaml validate
After each run, a summary is printed to stdout:
=== Crawl Complete ===
Mode: full
Total pages crawled: 13
Pages written successfully: 13
Pages with errors: 0
Internal crawl links discovered (edge count): 14
Unique internal target pages linked: 12
External links skipped (host filter): 5
Pages with rewritten links: 4/13
Markdown links rewritten to local paths: 14/25
Pages with comments appended: 1
Total comments fetched: 2
Pages with comment fetch warnings: 0
Output directory: ./output
For updates mode, the summary additionally reports:
- Reachable pages
- Pages re-rendered
- Pages reused without full re-processing
- Re-render saves (count and percent)
- Managed files added/updated/deleted
- Attachments downloaded/reused
- Output commit status
- Checkpoint advanced status (successful-checkpoint advancement)
Note: current output commit behavior is direct-write (non-transactional).
Every page is saved as:
{title-slug}_{page-id}.md
The page ID is always included so renames (title changes) are detectable and the file can be consistently identified across runs. Attachments are saved under an attachments/ directory alongside the pages.
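As an illustration of the naming scheme, the sketch below builds a `{title-slug}_{page-id}.md` filename. The `pageFilename` function is hypothetical, and the slug rules (lowercasing, collapsing runs of non-alphanumeric characters to a hyphen, the "untitled" fallback) are assumptions; the tool's actual slug rules may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// pageFilename builds "{title-slug}_{page-id}.md". The slug rules here
// (lowercase, runs of non-alphanumerics collapsed to "-") are assumptions.
func pageFilename(title, pageID string) string {
	slug := nonAlnum.ReplaceAllString(strings.ToLower(title), "-")
	slug = strings.Trim(slug, "-")
	if slug == "" {
		slug = "untitled" // hypothetical fallback for titles with no usable characters
	}
	return slug + "_" + pageID + ".md"
}

func main() {
	// The same page ID appears in both names, so a title change is
	// detectable as a rename rather than a new page.
	fmt.Println(pageFilename("Release Notes: v2.0", "123456"))      // release-notes-v2-0_123456.md
	fmt.Println(pageFilename("Release Notes (archived)", "123456")) // release-notes-archived_123456.md
}
```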
How link rewriting works — two passes:
- Crawl pass: all pages are fetched and written to disk with their original Confluence URLs still intact. As each page is saved, its page ID and local filename are recorded in metadata.json. No links are rewritten yet.
- Rewrite pass: once the full crawled set is known, every page is scanned. For each link, if the target page ID exists in metadata.json (i.e. it was crawled), the URL is replaced with a relative local path. If not, whether it is a Confluence page that was out of scope, beyond max depth, or a completely different site, the original URL is left unchanged.
Because the rewrite pass only runs after crawling is complete, every decision is a simple lookup with no iteration or guesswork.
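The lookup at the heart of the rewrite pass can be sketched as follows. This is a simplified illustration under stated assumptions: `rewriteLinks` and both regular expressions are hypothetical, the sketch works on inline Markdown links only, and it identifies the target page from the `/pages/<id>/` segment seen in the seed URL example. A real implementation would also need to check the link's host.

```go
package main

import (
	"fmt"
	"regexp"
)

// mdLink matches inline Markdown links of the form [text](url).
var mdLink = regexp.MustCompile(`\[([^\]]*)\]\(([^)\s]+)\)`)

// pageIDInURL extracts the numeric ID from URLs shaped like
// .../pages/<id>/... (the shape used by the seed URL example).
var pageIDInURL = regexp.MustCompile(`/pages/(\d+)`)

// rewriteLinks replaces links whose target page ID is present in crawled
// (page ID -> local filename) with that filename. All other links are
// returned unchanged.
func rewriteLinks(markdown string, crawled map[string]string) string {
	return mdLink.ReplaceAllStringFunc(markdown, func(link string) string {
		parts := mdLink.FindStringSubmatch(link)
		text, url := parts[1], parts[2]
		m := pageIDInURL.FindStringSubmatch(url)
		if m == nil {
			return link // no page ID in the URL: leave as-is
		}
		local, ok := crawled[m[1]]
		if !ok {
			return link // page was not crawled: keep the original URL
		}
		return "[" + text + "](" + local + ")"
	})
}

func main() {
	crawled := map[string]string{"123456": "page-title_123456.md"}
	in := "See [the guide](https://your-org.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title) " +
		"and [an uncrawled page](https://your-org.atlassian.net/wiki/spaces/SPACE/pages/999/Other)."
	fmt.Println(rewriteLinks(in, crawled))
}
```

Because all page files sit in the same output directory, the bare filename already serves as a relative local path in this sketch.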
output/
├── metadata.json # all-pages index: metadata + link graph
├── {title-slug}_{page-id}.md # one file per page (comments appended at bottom)
└── attachments/
    └── {page-id}_{original-filename}
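To make the "metadata + link graph" idea concrete, the sketch below shows one possible shape for a metadata.json entry and how incoming links can be derived by inverting outgoing links. The `PageEntry` type, its JSON field names, and `fillIncoming` are all hypothetical; the real file's schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// PageEntry is a hypothetical shape for one page in metadata.json; the
// real field names may differ.
type PageEntry struct {
	Title    string   `json:"title"`
	File     string   `json:"file"`
	Outgoing []string `json:"outgoing_links"`
	Incoming []string `json:"incoming_links"`
}

// fillIncoming derives each page's incoming links by inverting the
// outgoing edges, so the graph can be walked in both directions.
func fillIncoming(pages map[string]*PageEntry) {
	for _, p := range pages {
		p.Incoming = nil
	}
	for id, p := range pages {
		for _, target := range p.Outgoing {
			if t, ok := pages[target]; ok {
				t.Incoming = append(t.Incoming, id)
			}
		}
	}
	for _, p := range pages {
		sort.Strings(p.Incoming) // deterministic output across runs
	}
}

func main() {
	pages := map[string]*PageEntry{
		"100": {Title: "Home", File: "home_100.md", Outgoing: []string{"200"}},
		"200": {Title: "Runbook", File: "runbook_200.md", Outgoing: []string{"100"}},
	}
	fillIncoming(pages)
	out, err := json.MarshalIndent(pages, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```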
Install Task, then:
task build # builds bin/confluence2md.exe
task test # runs all tests
task lint # runs golangci-lint
GitHub Releases publish platform binaries as compressed archives:
- Linux/macOS: .tar.gz
- Windows: .zip
Supported release targets:
- linux/amd64
- linux/arm64
- darwin/amd64
- darwin/arm64
- windows/amd64
- windows/arm64
Executable name inside archives:
- Linux/macOS: confluence2md
- Windows: confluence2md.exe
The diagram below shows how the full crawl mode works at a high level. For more details on specific parts of the implementation, see the linked docs:
- Operations and troubleshooting
- Markdown conversion internals
- Attachments retrieval internals
- Comments fetching internals
confluence2md/
├── .gitignore
├── LICENSE
├── README.md
├── CONTRIBUTING.md
├── Taskfile.yml
├── config.example.yaml
├── config.yaml # local runtime config (gitignored)
├── go.mod / go.sum
├── bin/ # compiled binaries (gitignored)
│
├── cmd/
│   └── crawler/
│       ├── main.go # CLI command wiring + thin run coordinator
│       ├── run_pipeline.go # run phases, per-page handlers, finalization, summary output
│       ├── setup.go # config summary, client/auth checks, seed resolution
│       ├── link_utils.go # markdown/link/attachment placeholder rewrites
│       ├── finalize.go # graph rebuild + rewrite + artifact reconciliation
│       ├── helpers.go # small shared helpers for run pipeline
│       └── main_test.go # command/finalization/reconciliation tests
│
└── internal/
    ├── config/
    │   └── config.go # struct, Load(), Validate()
    │
    ├── confluence/
    │   ├── client.go # Confluence API client methods
    │   ├── comments_client.go # comment fetch + author enrichment flow
    │   ├── http_helpers.go # authenticated request/response helpers
    │   ├── parsing.go # shared API parsing helpers
    │   └── models.go # local types mapped from API responses
    │
    ├── crawl/
    │   └── full.go # crawl session orchestration and traversal
    │
    ├── convert/
    │   ├── markdown.go # orchestrates page → markdown pipeline
    │   ├── parser.go # XML parser for Confluence storage format
    │   ├── parser_macros.go # macro renderers/handlers
    │   └── comments.go # formats comments into ## Comments section
    │
    ├── links/
    │   ├── rewriter.go # pass-2 link rewrite using metadata map
    │   └── extractor.go # extracts page IDs from storage format links
    │
    ├── store/
    │   ├── fs.go # page writes + metadata.json persistence
    │   └── attachments.go # attachment file download/persistence helpers
- Optional Git integration to commit changes after each crawl with a configurable message template.
- Support for crawling Confluence Server/Data Center instances (currently only Cloud API is supported).
MIT