jschof1/pdf2md

pdf2md

Convert any PDF into clean, well-structured Markdown — powered by AI.

License: MIT · Shell Script · AI: Gemini

pdf2md extracts text from PDFs and uses AI to produce properly formatted Markdown with headings, tables, bullet lists, bold text, blockquotes, and more. No setup beyond a free API key.


✨ What it does

Turns this PDF...

[Image: raw PDF text]

...into this Markdown:

[Image: clean Markdown output]

The AI understands document structure. It produces:

  • Proper heading hierarchy (h1 → h2 → h3)
  • Markdown tables for tabular data and comparisons
  • ✅ / ❌ for good/bad items
  • Blockquotes for callouts and severity labels
  • Bold for key terms and metrics
  • Numbered lists for steps, bullet lists for items

🚀 Install

One-line install:

curl -fsSL https://raw.githubusercontent.com/jschof1/pdf2md/main/install.sh | bash

Or manually:

git clone https://github.com/jschof1/pdf2md.git
cd pdf2md
sudo cp pdf2md /usr/local/bin/

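Prerequisites: python3 and curl. pdf2md installs pymupdf automatically, and AI mode needs a GEMINI_API_KEY (see the API Key section).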


🔑 API Key

pdf2md uses Google Gemini (free tier). Get your key:

  1. Go to https://aistudio.google.com/apikey
  2. Create a key
  3. Set it in your shell:
export GEMINI_API_KEY="your-key-here"

Add it to ~/.bashrc, ~/.zshrc, or your shell profile for persistence.
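
If you call pdf2md from your own scripts, it can help to check for the key before starting a conversion. The following is a minimal sketch, not part of pdf2md itself; the function name `require_gemini_key` is hypothetical.

```shell
# Hypothetical helper (not shipped with pdf2md): fail early when the key is missing.
require_gemini_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it or run pdf2md with --no-ai" >&2
    return 1
  fi
}
```

A wrapper script could then run `require_gemini_key && pdf2md report.pdf`, so the error appears before any PDF is read.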

Cost: Gemini Flash is free for reasonable usage. A 14-page report costs less than $0.01.


📖 Usage

# Convert a PDF (AI-formatted)
pdf2md report.pdf

# Custom output path
pdf2md report.pdf ./output/report.md

# Raw text extraction only (no API key needed)
pdf2md report.pdf --no-ai

# Smaller chunks for very dense documents
pdf2md report.pdf --chunk 3

# Use a different Gemini model
pdf2md report.pdf --model gemini-2.5-pro

# Suppress progress output
pdf2md report.pdf --quiet
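
# Convert every PDF in a folder (a sketch: pdf2md takes one input per call,
# so this loops over the files; `convert_all` is a hypothetical helper name)
```shell
# Hypothetical wrapper: run pdf2md once per PDF in the given directory.
convert_all() {
  local dir=$1 pdf
  for pdf in "$dir"/*.pdf; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$pdf" ] || continue
    # Report a failed file and keep going with the rest.
    pdf2md "$pdf" --quiet || echo "failed: $pdf" >&2
  done
}

# Example: convert_all ./reports
```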

All Options

pdf2md v1.0.0 — Convert PDF to Markdown (AI-enhanced)

Usage:
  pdf2md <input.pdf> [output.md] [options]

Arguments:
  input.pdf   Path to PDF file
  output.md   Output path (default: same name, .md extension)

Options:
  --no-ai       Basic text extraction only (no AI formatting)
  --model MODEL Gemini model to use (default: gemini-2.5-flash)
  --chunk N     Pages per AI chunk (default: 5)
  -q, --quiet   Suppress progress output
  -v, --version Print version
  -h, --help    Show this help

Environment:
  GEMINI_API_KEY  Required for AI mode. Get one free:
                  https://aistudio.google.com/apikey
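
The default output path ("same name, .md extension") can be pictured with shell parameter expansion. This is only an illustration of the rule stated in the help text, assuming a lowercase `.pdf` suffix and that the file is written next to the input; `default_output` is a hypothetical name, not a function from the script.

```shell
# Hypothetical illustration of the default output rule: swap .pdf for .md.
default_output() {
  local input=$1
  # ${input%.pdf} strips a trailing ".pdf" if present.
  echo "${input%.pdf}.md"
}

default_output report.pdf          # report.md
default_output reports/audit.pdf   # reports/audit.md
```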

⚙️ How it works

  1. Extract — Uses PyMuPDF to pull text from every page
  2. Chunk — Splits the document into page-based chunks (default: 5 pages) to stay within model output limits
  3. Format — Sends each chunk to Gemini with a detailed formatting prompt
  4. Join — Combines all chunks into a single Markdown file

Chunking means it handles documents of any length — 5 pages or 500.
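
To make the chunking step concrete, here is a small sketch that prints the page ranges a document would be split into. It assumes 1-based page numbers and a final chunk that may be shorter than the rest; `chunk_ranges` is a hypothetical helper, and the real script may split pages differently.

```shell
# Hypothetical helper: print page ranges for $1 total pages, $2 pages per chunk (default 5).
chunk_ranges() {
  local total=$1 size=${2:-5} start=1 end
  while [ "$start" -le "$total" ]; do
    end=$(( start + size - 1 ))
    # Clamp the last chunk to the final page.
    if [ "$end" -gt "$total" ]; then
      end=$total
    fi
    echo "${start}-${end}"
    start=$(( end + 1 ))
  done
}

chunk_ranges 12 5
# 1-5
# 6-10
# 11-12
```

With the default of 5 pages per chunk, a 500-page document would produce 100 ranges, each sent to the model as a separate request.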


📊 When to use pdf2md

| Scenario | pdf2md |
|---|---|
| Converting reports and audits to Markdown | ✅ |
| Making PDF content editable in a wiki or CMS | ✅ |
| Extracting structured data from PDFs | ✅ |
| Preparing PDF content for AI/RAG pipelines | ✅ |
| Converting scanned/image-based PDFs | ❌ (needs OCR first) |

🔧 Dependencies

| Dependency | Why | Installed how |
|---|---|---|
| python3 | Text extraction via PyMuPDF | System package |
| pymupdf | PDF parsing library | Auto-installed by pdf2md |
| curl | API calls to Gemini | Pre-installed on macOS/Linux |
| GEMINI_API_KEY | AI formatting | Free at Google AI Studio |

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Ideas for contributions:

  • Support for OpenAI / Anthropic / local models
  • OCR support for image-based PDFs
  • Batch processing (convert a folder of PDFs)
  • Config file support (~/.pdf2mdrc)
  • Progress bar

📝 License

MIT — use it however you like.


⭐ Star History

If pdf2md saved you time, consider giving it a star — it helps others find it.
