`confluence2md` - Confluence to Markdown crawler and converter

Turn your Confluence space into Markdown files on your hard drive and use that local copy to:

- build a second brain
- feed AI/RAG pipelines
- version docs in Git
- browse offline in your preferred Markdown editor
- power local full-text search across docs
- support onboarding with a portable docs snapshot
- preserve runbooks for incident response and disaster recovery
- export knowledge for compliance and audit evidence trails
- reduce vendor lock-in by keeping docs in an open format

What you get:

- One local .md file per crawled page, with stable filenames that include page IDs.
- Links between crawled pages rewritten to relative local links.
- External or out-of-scope links preserved as original URLs.
- Page comments appended under a ## Comments section.
- A metadata.json index with page metadata plus incoming/outgoing link graph data.
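
To illustrate what "incoming/outgoing link graph data" can mean in practice, here is a minimal Go sketch that indexes a list of page-to-page links in both directions. The `Edge` and `LinkGraph` types and the `BuildLinkGraph` function are hypothetical names chosen for this example; the actual metadata.json schema is not documented here and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a hypothetical record of one link between two crawled pages.
type Edge struct {
	From, To string // Confluence page IDs
}

// LinkGraph is an assumed shape for bidirectional link data, keyed by page ID.
type LinkGraph struct {
	Outgoing map[string][]string
	Incoming map[string][]string
}

// BuildLinkGraph indexes every edge in both directions and drops duplicates.
func BuildLinkGraph(edges []Edge) LinkGraph {
	g := LinkGraph{Outgoing: map[string][]string{}, Incoming: map[string][]string{}}
	seen := map[Edge]bool{}
	for _, e := range edges {
		if seen[e] {
			continue
		}
		seen[e] = true
		g.Outgoing[e.From] = append(g.Outgoing[e.From], e.To)
		g.Incoming[e.To] = append(g.Incoming[e.To], e.From)
	}
	// Sort neighbor lists so the output is stable across runs.
	for _, m := range []map[string][]string{g.Outgoing, g.Incoming} {
		for id := range m {
			sort.Strings(m[id])
		}
	}
	return g
}

func main() {
	g := BuildLinkGraph([]Edge{{"100", "200"}, {"100", "300"}, {"200", "300"}, {"100", "200"}})
	fmt.Println("100 links to:", g.Outgoing["100"])
	fmt.Println("300 is linked from:", g.Incoming["300"])
}
```

With both maps available, a consumer can answer "what does this page link to" and "what links to this page" without rescanning the Markdown files.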

What It Does

- Starts from configured seed pages and traverses linked Confluence pages up to a configurable depth.
- Converts each page from Confluence storage format to Markdown and writes it to the local filesystem.
- Runs a second pass to rewrite links between crawled pages as relative local links.
- Leaves all other URLs unchanged, including links to uncrawled Confluence pages and external sites.
- Downloads page attachments and rewrites attachment references to point to the downloaded files.
- Appends page comments under a ## Comments section in each page file.
- Writes a single metadata.json with crawl metadata and a bidirectional link graph.
- Supports two run modes:
  - full: crawl all pages reachable from seeds up to max depth.
  - updates: run the same seed-based traversal as full mode, but re-process only dirty pages and reuse the artifacts of clean pages.
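
The seed-based traversal described above can be sketched as a depth-limited breadth-first search. This is an illustration only: the `Crawl` function is hypothetical, the `links` map stands in for fetching a page and extracting its linked page IDs, and the real crawler may order or parallelize its work differently.

```go
package main

import "fmt"

// Crawl visits pages breadth-first from the seeds and follows links up to
// maxDepth hops (0 = seed pages only). The links map is a hypothetical
// in-memory stand-in for "fetch page, extract linked page IDs".
func Crawl(seeds []string, maxDepth int, links map[string][]string) []string {
	type item struct {
		id    string
		depth int
	}
	visited := map[string]bool{}
	var queue []item
	for _, s := range seeds {
		if !visited[s] {
			visited[s] = true
			queue = append(queue, item{s, 0})
		}
	}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.id)
		if cur.depth == maxDepth {
			continue // depth limit reached: record the page, do not follow its links
		}
		for _, next := range links[cur.id] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	links := map[string][]string{"1": {"2", "3"}, "2": {"4"}, "4": {"5"}}
	fmt.Println(Crawl([]string{"1"}, 0, links)) // seed only
	fmt.Println(Crawl([]string{"1"}, 2, links)) // pages within two hops of the seed
}
```

The visited set ensures a page reachable through several paths is processed once, at the shallowest depth at which it was found.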

Download

Pre-built binaries are available on the Releases page.

Download the archive for your platform.
Extract the binary (confluence2md or confluence2md.exe).
Run it from the directory containing your config.yaml.

How To Use

Requirements

A valid Atlassian API token with read access to the target spaces

You can generate an Atlassian API token from your Atlassian account security page:

https://id.atlassian.com/manage-profile/security/api-tokens

If you run into authentication issues, see Operations and Troubleshooting.

Configuration

Copy config.example.yaml to config.yaml and fill in the required values.

confluence:
  # Your Atlassian account email
  username: you@example.com
  # Atlassian API token (https://id.atlassian.com/manage-profile/security/api-tokens)
  # Can also be set via env var: CONFLUENCE_TOKEN
  token: ""

crawl:
  # One or more seed page URLs or page IDs to start from
  seeds:
    - https://your-org.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title
  # Maximum link-follow depth from each seed (0 = seed pages only)
  max_depth: 3
  # Maximum concurrent API requests
  concurrency: 5
  # Requests per minute (Confluence Cloud limit is ~300/min per token)
  rate_limit_rpm: 250

output:
  # Directory to write Markdown files, attachments, and metadata
  dir: ./output

attachments:
  # Download attachments referenced by crawled pages
  download: true
  # Skip attachments larger than this size (0 = no limit)
  max_size_mb: 100

retry:
  # Maximum number of retries for transient API errors (429, 5xx)
  max_attempts: 5
  # Initial backoff in milliseconds (doubles with each retry + jitter)
  initial_backoff_ms: 1000
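
As an illustration of the retry settings above, the following Go sketch computes a delay that doubles with each retry and adds random jitter. The `BackoffMs` function is hypothetical, and the 20% jitter cap is an assumption made for this example; the config comments only say that jitter is added, not how much.

```go
package main

import (
	"fmt"
	"math/rand"
)

// BackoffMs returns the wait in milliseconds before a retry. attempt is
// 0-based, so attempt 0 waits roughly initialMs. The jitter callback must
// return a value in [0, n); rand.Int63n fits that contract.
// Assumption: jitter is capped at 20% of the doubled base delay.
func BackoffMs(initialMs int64, attempt int, jitter func(n int64) int64) int64 {
	base := initialMs << uint(attempt) // doubles with each retry
	return base + jitter(base/5+1)
}

func main() {
	// Mirrors max_attempts: 5 and initial_backoff_ms: 1000 from the config.
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("retry %d: wait %d ms\n", attempt+1, BackoffMs(1000, attempt, rand.Int63n))
	}
}
```

Passing the jitter source as a parameter keeps the function deterministic under test while still using `rand.Int63n` in normal runs.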

Quickstart

With config.yaml in place, run a full crawl:

confluence2md --mode full

Note: full mode clears the configured output directory before crawling.

Once it completes, open output/ to inspect the generated Markdown files.

CLI Usage

# Full crawl (crawls all pages reachable from seeds up to max depth)
confluence2md --mode full

# Incremental update (same seed traversal, selective page re-processing)
confluence2md --mode updates

# Validate config and Confluence API credentials without crawling
confluence2md validate

The tool looks for config.yaml in the current directory by default. Use --config to specify a different path:

confluence2md --config /path/to/config.yaml --mode full
confluence2md --config /path/to/config.yaml validate

After each run a summary is printed to stdout:

=== Crawl Complete ===
Mode: full
Total pages crawled: 13
Pages written successfully: 13
Pages with errors: 0
Internal crawl links discovered (edge count): 14
Unique internal target pages linked: 12
External links skipped (host filter): 5
Pages with rewritten links: 4/13
Markdown links rewritten to local paths: 14/25
Pages with comments appended: 1
Total comments fetched: 2
Pages with comment fetch warnings: 0
Output directory: ./output

For updates mode, the summary additionally reports:

- Reachable pages
- Pages re-rendered
- Pages reused without full re-processing
- Re-render saves (count and percent)
- Managed files added/updated/deleted
- Attachments downloaded/reused
- Output commit status
- Checkpoint status (whether the last-successful checkpoint advanced)

Note: output is currently committed by writing directly to the output directory (non-transactional), so an interrupted run may leave it partially updated.

How It Works

Filename Conventions

Every page is saved as:

{title-slug}_{page-id}.md

The page ID is always included so renames (title changes) are detectable and the file can be consistently identified across runs. Attachments are saved under an attachments/ directory alongside the pages.
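
A minimal Go sketch of the `{title-slug}_{page-id}.md` convention follows. The `PageFilename` function is hypothetical and the slug rules (lowercase, runs of non-alphanumeric characters collapsed to a hyphen, an "untitled" fallback) are assumptions; the tool's actual slugging may treat Unicode or punctuation differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters that are not lowercase ASCII letters or digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// PageFilename builds "{title-slug}_{page-id}.md".
// Assumed slug rules: lowercase the title, replace each run of other
// characters with "-", and trim leading and trailing hyphens.
func PageFilename(title, pageID string) string {
	slug := strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(title), "-"), "-")
	if slug == "" {
		slug = "untitled" // assumed fallback for titles with no ASCII letters or digits
	}
	return slug + "_" + pageID + ".md"
}

func main() {
	fmt.Println(PageFilename("Page Title", "123456"))
	fmt.Println(PageFilename("Q&A: Release Notes (2024)", "42"))
}
```

Because the page ID is a suffix, two files with different slugs but the same ID can be recognized as the same page before and after a rename.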

How link rewriting works (two passes):

- Crawl pass: all pages are fetched and written to disk with their original Confluence URLs still intact. As each page is saved, its page ID and local filename are recorded in metadata.json. No links are rewritten yet.
- Rewrite pass: once the full crawled set is known, every page is scanned. For each link, if the target page ID exists in metadata.json (i.e. it was crawled), the URL is replaced with a relative local path. Otherwise, whether the target is a Confluence page that was out of scope or beyond max depth, or a completely different site, the original URL is left unchanged.

Because the rewrite pass only runs after crawling is complete, every decision is a simple lookup with no iteration or guesswork.
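
The lookup-based rewrite pass can be sketched in Go as below. This is a simplified illustration, not the project's rewriter: `RewriteLinks` and the `crawled` map (page ID to local filename) are hypothetical, only inline `[text](url)` links are handled, and the page ID is taken from a `/pages/<id>` URL segment without the host check a real implementation would need.

```go
package main

import (
	"fmt"
	"regexp"
)

// mdLink matches inline Markdown links of the form [text](url).
var mdLink = regexp.MustCompile(`\[([^\]]*)\]\(([^)\s]+)\)`)

// pageIDInURL extracts the numeric ID from URLs like .../pages/123456/Page+Title.
var pageIDInURL = regexp.MustCompile(`/pages/(\d+)`)

// RewriteLinks replaces links to crawled pages with relative local paths.
// crawled maps page ID to local filename (a stand-in for the metadata.json
// lookup). Links whose target is not in the map are returned unchanged.
func RewriteLinks(markdown string, crawled map[string]string) string {
	return mdLink.ReplaceAllStringFunc(markdown, func(link string) string {
		m := mdLink.FindStringSubmatch(link)
		id := pageIDInURL.FindStringSubmatch(m[2])
		if id == nil {
			return link // not a Confluence page URL
		}
		file, ok := crawled[id[1]]
		if !ok {
			return link // page was not crawled: keep the original URL
		}
		return "[" + m[1] + "](./" + file + ")"
	})
}

func main() {
	crawled := map[string]string{"123456": "page-title_123456.md"}
	in := "See [home](https://your-org.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title) " +
		"and [other](https://your-org.atlassian.net/wiki/spaces/SPACE/pages/999/Other)."
	fmt.Println(RewriteLinks(in, crawled))
}
```

Since all page files sit in the same output directory, a `./filename` path is enough for the relative link in this sketch.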

Output Layout

output/
├── metadata.json                  # all-pages index: metadata + link graph
├── {title-slug}_{page-id}.md      # one file per page (comments appended at bottom)
└── attachments/
    └── {page-id}_{original-filename}

Building, Testing, and Internals

Building

Install Task, then:

task build        # builds bin/confluence2md.exe
task test         # runs all tests
task lint         # runs golangci-lint

Release Artifacts

GitHub Releases publish platform binaries as compressed archives:

Linux/macOS: .tar.gz
Windows: .zip

Supported release targets:

linux/amd64
linux/arm64
darwin/amd64
darwin/arm64
windows/amd64
windows/arm64

Executable name inside archives:

Linux/macOS: confluence2md
Windows: confluence2md.exe

Internals Documentation

The diagram below shows how the full crawl mode works at a high level. For more details on specific parts of the implementation, see the linked docs:

Project Structure

confluence2md/
├── .gitignore
├── LICENSE
├── README.md
├── CONTRIBUTING.md
├── Taskfile.yml
├── config.example.yaml
├── config.yaml                      # local runtime config (gitignored)
├── go.mod / go.sum
├── bin/                             # compiled binaries (gitignored)
│
├── cmd/
│   └── crawler/
│       ├── main.go                  # CLI command wiring + thin run coordinator
│       ├── run_pipeline.go          # run phases, per-page handlers, finalization, summary output
│       ├── setup.go                 # config summary, client/auth checks, seed resolution
│       ├── link_utils.go            # markdown/link/attachment placeholder rewrites
│       ├── finalize.go              # graph rebuild + rewrite + artifact reconciliation
│       ├── helpers.go               # small shared helpers for run pipeline
│       └── main_test.go             # command/finalization/reconciliation tests
│
└── internal/
    ├── config/
    │   └── config.go                # struct, Load(), Validate()
    │
    ├── confluence/
    │   ├── client.go                # Confluence API client methods
    │   ├── comments_client.go       # comment fetch + author enrichment flow
    │   ├── http_helpers.go          # authenticated request/response helpers
    │   ├── parsing.go               # shared API parsing helpers
    │   └── models.go                # local types mapped from API responses
    │
    ├── crawl/
    │   └── full.go                  # crawl session orchestration and traversal
    │
    ├── convert/
    │   ├── markdown.go              # orchestrates page → markdown pipeline
    │   ├── parser.go                # XML parser for Confluence storage format
    │   ├── parser_macros.go         # macro renderers/handlers
    │   └── comments.go              # formats comments into ## Comments section
    │
    ├── links/
    │   ├── rewriter.go              # pass-2 link rewrite using metadata map
    │   └── extractor.go             # extracts page IDs from storage format links
    │
    ├── store/
    │   ├── fs.go                    # page writes + metadata.json persistence
    │   └── attachments.go           # attachment file download/persistence helpers

Backlog and Roadmap

Optional Git integration to commit changes after each crawl with a configurable message template.
Support for crawling Confluence Server/Data Center instances (currently only Cloud API is supported).

License

MIT
