PyMuPDF PDF Parser - Clawdbot Skill

A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.

Features

Fast processing — Parses PDFs in ~1 second per page
Lightweight — Single pip dependency, no heavy models
Markdown output — Clean text extraction with page markers
JSON output — Simple structured text per page
Image extraction — Optional embedded image extraction
NixOS compatible — Includes notes for libstdc++ issues

Version

Current: v1.0.0

See CHANGELOG.md for version history.

Installation

Prerequisites

Python 3.8+
PyMuPDF: pip install pymupdf
Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git

# Then copy the pymupdf-pdf/ folder into your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/

# Install dependency
pip install pymupdf

NixOS users

If you hit libstdc++ import errors:

export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib

See pymupdf-pdf/references/pymupdf-notes.md for details.

Usage

Quick start

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf

Options

./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output

| Option | Default | Description |
| --- | --- | --- |
| `--format` | `md` | Output format: `md`, `json`, or `both` |
| `--outroot` | `./pymupdf-output` | Output root directory |
| `--images` | off | Extract embedded images |
| `--tables` | off | Extract line-based table approximation |
| `--lang` | `en` | Language hint (stored in JSON metadata) |
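
The options above map naturally onto a small argparse parser. The following is a minimal sketch, not the actual contents of scripts/pymupdf_parse.py: the function name `build_parser` and the positional argument name `pdf` are assumptions made for illustration, while the flags and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Parse a PDF with PyMuPDF.")
    parser.add_argument("pdf", help="Path to the input PDF")
    parser.add_argument(
        "--format",
        choices=["md", "json", "both"],
        default="md",
        help="Output format",
    )
    parser.add_argument(
        "--outroot",
        default="./pymupdf-output",
        help="Output root directory",
    )
    # Boolean switches are off unless the flag is passed.
    parser.add_argument("--images", action="store_true", help="Extract embedded images")
    parser.add_argument("--tables", action="store_true", help="Extract line-based tables")
    parser.add_argument("--lang", default="en", help="Language hint for JSON metadata")
    return parser


# Equivalent to: ./scripts/pymupdf_parse.py document.pdf --format both --images
args = build_parser().parse_args(["document.pdf", "--format", "both", "--images"])
print(args.format, args.images, args.outroot)
```

Restricting `--format` with `choices` makes the parser reject unsupported values before any PDF is opened.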

Output

Creates a per-document folder under the output root:

./pymupdf-output/
└── document-name/
    ├── output.md      # Markdown with page markers
    ├── output.json    # Simple JSON (~1KB, text per page)
    ├── images/        # Extracted images (if --images)
    └── tables.json    # Line-based tables (if --tables)
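
To make this layout concrete, here is a sketch of how the per-document folder and the two text outputs could be produced from a list of page strings. The helper names, the `## Page N` marker format, and the JSON field names are all assumptions for illustration; the actual script may use different ones.

```python
import json
from pathlib import Path
from typing import List


def document_dir(pdf_path: str, outroot: str = "./pymupdf-output") -> Path:
    """Return <outroot>/<pdf file name without extension>/."""
    return Path(outroot) / Path(pdf_path).stem


def render_markdown(pages: List[str]) -> str:
    """Join page texts, prefixing each with a page marker (format assumed)."""
    parts = [
        f"## Page {number}\n\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    ]
    return "\n\n".join(parts) + "\n"


def render_json(pages: List[str], lang: str = "en") -> str:
    """Serialize one entry per page; field names are hypothetical."""
    payload = {
        "lang": lang,
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(pages, start=1)
        ],
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


def write_outputs(
    pdf_path: str,
    pages: List[str],
    outroot: str = "./pymupdf-output",
    fmt: str = "md",
) -> Path:
    """Create the per-document folder and write the requested text outputs."""
    out_dir = document_dir(pdf_path, outroot)
    out_dir.mkdir(parents=True, exist_ok=True)
    if fmt in ("md", "both"):
        (out_dir / "output.md").write_text(render_markdown(pages), encoding="utf-8")
    if fmt in ("json", "both"):
        (out_dir / "output.json").write_text(render_json(pages), encoding="utf-8")
    return out_dir
```

Deriving the folder name from the PDF file name keeps the outputs of different documents from overwriting each other under the same output root.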

Output quality

PyMuPDF produces fast, minimal output:

Plain text extraction (only basic layout preservation)
Simple JSON with text per page
Optional image extraction

Best for: quick text extraction, batch processing, or any workflow where speed matters.
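
For readers curious what this kind of extraction looks like in PyMuPDF itself, the sketch below uses `fitz.open`, `Page.get_text`, `Page.get_images`, and `Document.extract_image`. It is an illustration under assumptions, not the skill's actual implementation: the function names, the image file naming scheme, and the `summarize` helper are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the plain text of each page, in page order."""
    import fitz  # PyMuPDF; imported lazily so summarize() works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]


def extract_images(pdf_path: str, out_dir: str) -> List[Path]:
    """Save embedded images to out_dir; the file naming is an assumption."""
    import fitz

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    with fitz.open(pdf_path) as doc:
        for page_no, page in enumerate(doc, start=1):
            for idx, img in enumerate(page.get_images(full=True), start=1):
                info = doc.extract_image(img[0])  # img[0] is the image xref
                path = target / f"page{page_no}-img{idx}.{info['ext']}"
                path.write_bytes(info["image"])
                saved.append(path)
    return saved


def summarize(pages: List[str]) -> Dict[str, object]:
    """Report basic stats, including pages with no extractable text."""
    return {
        "page_count": len(pages),
        "char_count": sum(len(text) for text in pages),
        # Pages that are scanned images have no text layer to extract.
        "empty_pages": [n for n, text in enumerate(pages, start=1) if not text.strip()],
    }
```

A call such as `summarize(extract_page_texts("/path/to/document.pdf"))` gives a quick check of whether plain text extraction is enough for a given file or whether a heavier tool is needed.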

Comparison with MinerU

| Aspect | PyMuPDF | MinerU |
| --- | --- | --- |
| Speed | Fast (~1s/page) | Slower (~15-30s/page) |
| JSON output | Minimal (~1KB, text only) | Rich (~50KB+, layout data) |
| Image extraction | Optional | Automatic |
| Layout preservation | Basic | Excellent |
| Dependencies | Light (pip install) | Heavy (~20GB models) |

Use PyMuPDF when: speed matters or you only need simple text extraction.
Use MinerU when: quality and structure matter more than speed.

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test changes with various PDF types before submitting.
