First public release of ParseRails — fast, light, cgo-free document parsing for Go.
ParseRails is the Go counterpart to run-llama/liteparse. It wraps Google's PDFium (the engine behind Chrome's PDF viewer) compiled to WebAssembly and run with wazero, so the default build needs no cgo and no system libraries.
Highlights
- Spatial text extraction: every word with its bounding box, page, and (opt-in) font size. `word` and `line` granularity.
- Page rendering: `RenderPage` rasterizes any page to an `image.Image`.
- Pluggable OCR with automatic fallback for scanned/image-only pages: bundled cgo-free Tesseract adapter (`ocr/tesseract`) + HTTP adapter (`ocr/httpocr`).
- Office formats: DOCX/PPTX/XLSX/ODT/RTF via headless LibreOffice.
- `ExtractText`: whole-page text fast path for RAG/search (no boxes), several times cheaper than full parsing.
- Files & batches: `ParseFile` (PDF or office) and concurrent `ParseFiles`.
- CLI: `go install github.com/promptrails/parserails/cmd/parserails@latest` (parse/render).
- Two backends: cgo-free WASM by default, or native `-tags parserails_cgo` (links libpdfium) for max throughput on controlled hosts.
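
To make the difference between word and line granularity concrete, here is a minimal, self-contained sketch of how positioned words could be merged into lines using only their bounding boxes. The `Box` and `Word` types, the `GroupLines` function, and the tolerance parameter are all hypothetical names invented for this illustration. They are not ParseRails' actual API, and the library's own grouping logic may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Box is a hypothetical axis-aligned bounding box in page units
// (origin top-left, Y grows downward). Not ParseRails' real type.
type Box struct{ X0, Y0, X1, Y1 float64 }

// Word is a hypothetical positioned word: text, page number, and box.
type Word struct {
	Text string
	Page int
	Box  Box
}

// centerY returns the vertical midpoint of a box.
func centerY(b Box) float64 { return (b.Y0 + b.Y1) / 2 }

// GroupLines merges words into lines. Words on the same page whose vertical
// centers lie within tol of the first word of the current line are treated
// as one line, then ordered left to right and joined with spaces.
func GroupLines(words []Word, tol float64) []string {
	// Work on a copy so the caller's slice order is untouched.
	sorted := append([]Word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Page != sorted[j].Page {
			return sorted[i].Page < sorted[j].Page
		}
		return centerY(sorted[i].Box) < centerY(sorted[j].Box)
	})

	var lines []string
	var cur []Word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		sort.SliceStable(cur, func(i, j int) bool { return cur[i].Box.X0 < cur[j].Box.X0 })
		parts := make([]string, len(cur))
		for i, w := range cur {
			parts[i] = w.Text
		}
		lines = append(lines, strings.Join(parts, " "))
		cur = nil
	}
	for _, w := range sorted {
		// Start a new line on a page change or a large vertical jump.
		if len(cur) > 0 && (w.Page != cur[0].Page || centerY(w.Box)-centerY(cur[0].Box) > tol) {
			flush()
		}
		cur = append(cur, w)
	}
	flush()
	return lines
}

func main() {
	words := []Word{
		{"world", 1, Box{60, 10, 100, 22}},
		{"Hello", 1, Box{10, 11, 55, 23}},
		{"Second", 1, Box{10, 40, 70, 52}},
		{"line", 1, Box{75, 40, 100, 52}},
	}
	for _, l := range GroupLines(words, 3) {
		fmt.Println(l)
	}
}
```

A fixed tolerance is the simplest possible rule. Real documents with mixed font sizes, columns, or rotated text generally need something more adaptive, which is one reason to let the library produce line-level output directly.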
Docs & examples
- Docs: https://promptrails.github.io/parserails/
- Runnable examples (each its own module + Dockerfile): `examples/extract-text`, `examples/parse-server`, `examples/native-cgo`
- Benchmarks vs `ledongthuc/pdf`, `pdfcpu`, `unipdf` under `benchmark/`
go get github.com/promptrails/parserails@v0.1.0
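
The batch-parsing idea listed under Highlights (a concurrent `ParseFiles`) follows a familiar Go pattern: a fixed pool of workers pulls file indices from a channel and writes each result into its own slot. The sketch below shows that pattern only. `ParseAll`, `Result`, and the stand-in `parseOne` are hypothetical names made up for this example; the real ParseRails signatures are not shown in these notes and may look different.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Result pairs an input path with its extracted text or an error.
// Hypothetical type for illustration only.
type Result struct {
	Path string
	Text string
	Err  error
}

// parseOne is a stand-in for a real per-file parser. It "parses" only
// paths ending in .pdf and reports an error for anything else.
func parseOne(path string) (string, error) {
	if !strings.HasSuffix(path, ".pdf") {
		return "", fmt.Errorf("unsupported file: %s", path)
	}
	return "text of " + path, nil
}

// ParseAll runs parse over paths using at most workers goroutines.
// Results come back in input order, and one failure does not stop the rest.
func ParseAll(paths []string, workers int, parse func(string) (string, error)) []Result {
	if workers < 1 {
		workers = 1
	}
	results := make([]Result, len(paths))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				text, err := parse(paths[i])
				// Each index is handled by exactly one worker, so no lock is needed.
				results[i] = Result{Path: paths[i], Text: text, Err: err}
			}
		}()
	}
	for i := range paths {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	for _, r := range ParseAll([]string{"a.pdf", "notes.txt", "b.pdf"}, 2, parseOne) {
		if r.Err != nil {
			fmt.Println(r.Path, "error:", r.Err)
			continue
		}
		fmt.Println(r.Path, "->", r.Text)
	}
}
```

Capping the worker count matters for document parsing in particular, since each in-flight file can hold a sizeable amount of memory. Collecting a per-file error rather than aborting keeps one bad document from failing the whole batch.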