doc2md

PDF to Markdown. Word to Markdown. Excel, PowerPoint, YouTube transcripts — converted locally, zero LLM tokens spent.

Drop files in a folder. Run one command. Get .md files ready for graphify, Claude, or any other AI tool.

Install

# 1. Clone and copy to your Claude skills folder
git clone https://github.com/luckmattos/doc2md.git
mkdir -p ~/.claude/skills
cp -r doc2md ~/.claude/skills/doc2md

# 2. Install the dependency (quoted so shells like zsh don't expand the brackets)
pip install 'markitdown[all]'

Quickstart

Create two folders in your project:

.raw-docs/    ← drop your files here
docs/         ← converted .md files appear here

Then run:

/doc2md

That's it. Every supported file in .raw-docs/ becomes a .md in docs/. Run it again — only new or changed files are converted (MD5 cache).
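To illustrate how a content-hash cache can skip unchanged files, here is a minimal sketch that uses only `hashlib`, `json`, and `pathlib`. It is not taken from the doc2md source: the function names, the cache file location, and the JSON layout are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(src: Path, cache_file: Path) -> list[Path]:
    """Return files under src that are new or changed since the last call.

    Hypothetical cache format: a JSON object mapping file path -> MD5 hex digest.
    """
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    pending = []
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        digest = md5_of(path)
        if cache.get(str(path)) != digest:
            pending.append(path)
            # Simplification: the hash is recorded right away. A real tool
            # would probably record it only after a successful conversion.
            cache[str(path)] = digest
    cache_file.write_text(json.dumps(cache, indent=2))
    return pending
```

The first call returns every file, a second call with nothing changed returns an empty list, and editing a file makes it show up again. Keep the cache file outside the scanned folder, otherwise it would be hashed along with the documents.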

What you can convert

Files

| Extension | Type | Notes |
| --- | --- | --- |
| `.pdf` | PDF | Text extracted directly. Scanned PDFs (image-only) return a warning. |
| `.docx` | Word | Text, tables, and headings |
| `.pptx` | PowerPoint | Slide titles, body, and speaker notes |
| `.xlsx` / `.xls` | Excel | Spreadsheet data as Markdown tables |
| `.csv` | CSV | Tabular data as Markdown table |
| `.epub` | Ebook | Chapters and metadata |
| `.ipynb` | Jupyter Notebook | Code cells, outputs, and markdown blocks |
| `.msg` | Outlook Email | Sender, recipient, subject, and body |
| `.zip` | ZIP archive | Extracts and converts all supported files inside |
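The folder conversion can be pictured roughly as below. This is a sketch under stated assumptions, not the doc2md script itself: `SUPPORTED`, `markitdown_convert`, and `convert_folder` are hypothetical names, and the only library call used is markitdown's `MarkItDown().convert(...)`, whose result exposes the Markdown as `text_content`. The converter is passed in as a parameter so the folder logic can be tried without markitdown installed.

```python
from pathlib import Path
from typing import Callable

# Extensions copied from the table of supported file types.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".csv",
             ".epub", ".ipynb", ".msg", ".zip"}


def markitdown_convert(path: Path) -> str:
    """Convert one file to Markdown with markitdown (imported lazily)."""
    from markitdown import MarkItDown
    return MarkItDown().convert(str(path)).text_content


def convert_folder(src: Path, out: Path,
                   convert: Callable[[Path], str] = markitdown_convert) -> list[Path]:
    """Write one .md file into out for each supported file directly in src."""
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            target = out / (path.stem + ".md")
            target.write_text(convert(path), encoding="utf-8")
            written.append(target)
    return written
```

With markitdown installed, `convert_folder(Path(".raw-docs"), Path("docs"))` would mirror the quickstart. Note that this sketch names outputs by file stem, so `report.pdf` and `report.docx` would collide; a real tool needs a rule for that case.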

URLs — no download needed

| Source | What you get |
| --- | --- |
| YouTube (`youtube.com/watch?v=...`) | Full transcript + title + duration + metadata |
| Wikipedia (`wikipedia.org/wiki/...`) | Full article as clean Markdown |
| Any HTML page | Main content extracted, no menus or ads |

Not supported

| Format | Alternative |
| --- | --- |
| `.mp3` / `.wav` / `.m4a` / `.mp4` | Use OpenAI Whisper |
| `.png` / `.jpg` / `.jpeg` | graphify handles images with vision subagents |
| Scanned PDFs (image-only) | Run OCR first |

Commands

/doc2md                         # process .raw-docs/ → docs/ (default)
/doc2md ./folder                # convert a specific folder
/doc2md ./report.pdf            # single file
/doc2md --url https://youtube.com/watch?v=xxx
/doc2md ./folder --watch        # auto-convert new files every 3s
/doc2md ./folder --out ./other  # custom output folder
/doc2md ./file --force          # skip cache, re-convert
/doc2md ./folder --list         # show cache status

Why this saves money

When an AI agent reads a PDF directly, it processes the full binary — layout, fonts, metadata, structure — before reaching the actual content. doc2md extracts only the text, locally, with zero LLM calls.

Example: 50-page report, ~12,000 words

| Scenario | Tokens | Cost (Claude Sonnet) |
| --- | --- | --- |
| ❌ PDF read directly by agent | ~90,000 | ~$0.27 |
| ✅ Pre-processed with doc2md | ~9,000 | ~$0.027 |
| Savings | 90% fewer tokens | ~$0.24 per doc |

At 100 docs/month: ~$24 saved, zero extra effort.
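The arithmetic behind these figures can be checked in a few lines. The price of $3.00 per million input tokens is an assumption inferred from the table (about $0.27 for about 90,000 tokens); actual pricing depends on the model and may change.

```python
# Assumed price, inferred from the example: ~$0.27 for ~90,000 input tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD


def input_cost(tokens: int) -> float:
    """Cost in USD of sending this many input tokens at the assumed price."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


direct_tokens = 90_000       # PDF read directly by the agent (from the example)
preprocessed_tokens = 9_000  # same report after doc2md (from the example)

saved_per_doc = input_cost(direct_tokens) - input_cost(preprocessed_tokens)
saved_per_month = saved_per_doc * 100  # 100 docs/month

print(f"saved per doc:   ${saved_per_doc:.3f}")
print(f"saved per month: ${saved_per_month:.2f}")
```

The per-document saving comes out at roughly $0.24 and the monthly saving at roughly $24, in line with the numbers quoted above.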

Works great with graphify

Pre-process your docs before graphify runs — subagents read clean .md instead of raw PDFs.

/doc2md          # 0 LLM tokens — converts everything locally
/graphify docs/  # reads .md: 60-80% fewer tokens than reading PDFs directly

Full pipeline when you also have audio:

/whisper .raw-docs/meeting.mp3 --out docs/   # audio → .md (0 tokens)
/doc2md                                       # docs → .md (0 tokens)
/graphify docs/                               # all together (optimized tokens)

Your sources
    │
    ├── PDF, DOCX, XLSX, PPTX ──→ [ doc2md  ] ──→ .md
    ├── YouTube, Wikipedia ──────→ [ doc2md  ] ──→ .md
    └── Audio (.mp3, .wav) ──────→ [ whisper ] ──→ .md
                                                     │
                                               [ graphify ]
                                                     │
                                                  [ Claude ]
                              (receives structured text, not raw binaries)

Stack

markitdown by Microsoft — conversion engine
Python stdlib — hashlib, argparse, json, pathlib
Local MD5 cache — never re-converts unchanged files

No LLM. No cloud. No API key.

Keywords: pdf-to-markdown, docx-to-markdown, youtube-transcript, token-optimization, llm-pipeline, local, offline, claude-skill, graphify, whisper

Author

luckmattos - built with AI coding agents and tools. MIT license.
