ParseRails

Fast, light, cgo-free document parsing for Go. Spatial text extraction with bounding boxes, page screenshots, and pluggable OCR — no cloud, no LLM required.

ParseRails is the Go counterpart to run-llama/liteparse. It wraps the same engine liteparse uses — Google's PDFium — through klippa-app/go-pdfium, running it in pure-Go WebAssembly (wazero) mode. That means no CGO_ENABLED=1, no native libraries to install in your Docker image, and clean cross-compilation.

Why

Most Go PDF text extractors give you a flat string and lose layout. ParseRails keeps spatial structure — every word with its bounding box, page, and font — which is what downstream RAG, table reconstruction, and LLM-vision pipelines actually need.

Features

Capability	Engine	Status
PDF text + per-word bounding boxes	PDFium (WASM)	✅
Word / line granularity, opt-in font size	PDFium (WASM)	✅
Page screenshot rendering (`RenderPage`)	PDFium (WASM)	✅
Pluggable OCR + automatic fallback	interface	✅
Tesseract OCR adapter (cgo-free, subprocess)	`tesseract` CLI	✅
HTTP OCR adapter (remote OCR servers)	`net/http`	✅
Office formats (DOCX/PPTX/XLSX/...)	LibreOffice headless → PDF	✅
Concurrent batch parsing	—	✅
CLI (`go install`)	—	✅
`ExtractText` whole-page text fast path	PDFium	✅
Native cgo backend (`-tags parserails_cgo`)	libpdfium	✅

Install

As a library:

go get github.com/promptrails/parserails

As a command:

go install github.com/promptrails/parserails/cmd/parserails@latest

parserails parse  invoice.pdf          # extract text
parserails parse  --json report.docx   # JSON, office docs via LibreOffice
parserails render --dpi 150 doc.pdf    # render page 0 → doc-p0.png

No system dependencies for PDF. PDFium ships as a WASM module loaded at runtime via wazero. Office formats need libreoffice/soffice on PATH; OCR needs the tesseract binary (only when enabled).

Usage

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/promptrails/parserails"
)

func main() {
	p, err := parserails.New() // initializes a pooled PDFium WASM runtime
	if err != nil {
		panic(err)
	}
	defer p.Close()

	pdf, err := os.ReadFile("invoice.pdf") // read the whole document into memory
	if err != nil {
		panic(err)
	}

	doc, err := p.Parse(context.Background(), pdf)
	if err != nil {
		panic(err)
	}

	for _, w := range doc.Words() {
		fmt.Printf("p%d %q [%.1f %.1f %.1f %.1f]\n",
			w.Page, w.Text, w.X0, w.Y0, w.X1, w.Y1)
	}
}

Word

type Word struct {
	Text           string
	Page           int
	X0, Y0, X1, Y1 float64 // bounding box, PDF user-space coordinates
	FontSize       float64 // 0 unless New(WithFontInfo()) is used
}
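
Because every word carries its box, layout can be rebuilt downstream. The following is a minimal sketch of one way to group words into reading-order lines. It redeclares a local `Word` type that mirrors the struct above so it compiles without the library, and `GroupLines` and its `tol` parameter are hypothetical names, not part of ParseRails. It also assumes Y grows upward, as in default PDF user space, which this README does not state explicitly.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Word mirrors the fields of the ParseRails Word struct shown above.
// It is redeclared here only so the sketch compiles on its own.
type Word struct {
	Text           string
	Page           int
	X0, Y0, X1, Y1 float64
}

// centerY returns the vertical midpoint of a word's box.
func centerY(w Word) float64 { return (w.Y0 + w.Y1) / 2 }

// GroupLines (hypothetical helper, not part of ParseRails) clusters words
// into text lines. Words on the same page whose vertical centers are within
// tol of the first word of the current line are joined into that line.
// Assumption: Y grows upward, so a larger Y means higher on the page.
func GroupLines(words []Word, tol float64) []string {
	ws := append([]Word(nil), words...) // copy so the caller's slice is untouched
	sort.SliceStable(ws, func(i, j int) bool {
		if ws[i].Page != ws[j].Page {
			return ws[i].Page < ws[j].Page
		}
		return centerY(ws[i]) > centerY(ws[j])
	})

	var lines []string
	var cur []Word
	// flush emits the current line, ordered left to right, and resets it.
	flush := func() {
		if len(cur) == 0 {
			return
		}
		sort.SliceStable(cur, func(i, j int) bool { return cur[i].X0 < cur[j].X0 })
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		lines = append(lines, strings.Join(texts, " "))
		cur = nil
	}
	for _, w := range ws {
		if len(cur) > 0 {
			first := cur[0]
			if w.Page != first.Page || math.Abs(centerY(w)-centerY(first)) > tol {
				flush()
			}
		}
		cur = append(cur, w)
	}
	flush()
	return lines
}

func main() {
	words := []Word{
		{Text: "Total:", Page: 0, X0: 50, Y0: 700, X1: 90, Y1: 712},
		{Text: "Invoice", Page: 0, X0: 50, Y0: 740, X1: 100, Y1: 752},
		{Text: "42.00", Page: 0, X0: 100, Y0: 700.5, X1: 130, Y1: 712.5},
	}
	for _, line := range GroupLines(words, 2.0) {
		fmt.Println(line)
	}
}
```

A fixed tolerance is the simplest choice. A tolerance derived from the word height or from `FontSize` would likely cope better with mixed font sizes.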

Granularity

Word-level boxes are precise, but building them means extracting every character across the WASM boundary, which is the slow part. When per-line boxes are enough, line granularity runs at roughly 6× the speed:

p, _ := parserails.New(parserails.WithGranularity(parserails.GranularityLine))

OCR (pluggable)

Scanned/image-only pages have no extractable text. ParseRails renders those pages and falls back to an OCR backend you provide — automatically:

import "github.com/promptrails/parserails/ocr/tesseract"

p, _ := parserails.New(
	parserails.WithOCR(tesseract.New(tesseract.Config{Lang: "eng"})),
)

The bundled ocr/tesseract adapter shells out to the tesseract binary (no cgo, no libtesseract). Implement the one-method OCR interface for any other engine or a remote service. An HTTP adapter for remote OCR servers is also included (see Features).
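
To make the fallback idea concrete, here is a rough sketch of the decision it implies: use the embedded text when a page has any, otherwise render the page and pass the image to OCR. Everything named here is an assumption for illustration. `Recognizer`, `TextOrOCR`, `stubOCR`, and `blankPage` are hypothetical, and the real one-method OCR interface in ParseRails is not shown in this README and may take different parameters.

```go
package main

import (
	"fmt"
	"image"
	"strings"
)

// Recognizer is a hypothetical one-method OCR interface. The actual
// ParseRails interface is not shown in this README and may differ.
type Recognizer interface {
	Recognize(img image.Image) (string, error)
}

// stubOCR is a fake engine used only for this demonstration.
type stubOCR struct{ text string }

func (s stubOCR) Recognize(image.Image) (string, error) { return s.text, nil }

// TextOrOCR returns the embedded text when the page has any. Otherwise it
// renders the page and hands the image to the OCR engine. With no engine
// configured it returns the (empty) embedded text unchanged.
func TextOrOCR(embedded string, render func() image.Image, ocr Recognizer) (string, error) {
	if strings.TrimSpace(embedded) != "" || ocr == nil {
		return embedded, nil
	}
	return ocr.Recognize(render())
}

// blankPage stands in for a rendered page image.
func blankPage() image.Image { return image.NewGray(image.Rect(0, 0, 10, 10)) }

func main() {
	ocr := stubOCR{text: "scanned text"}
	digital, _ := TextOrOCR("born-digital text", blankPage, ocr)
	scanned, _ := TextOrOCR("  ", blankPage, ocr)
	fmt.Println(digital)
	fmt.Println(scanned)
}
```

Rendering is deferred behind a function so that pages with embedded text never pay for rasterization.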

Screenshots

img, _ := p.RenderPage(ctx, pdf, parserails.RenderRequest{Page: 0, DPI: 150})
png.Encode(out, img) // standard image.Image
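
When drawing word boxes onto a rendered page, the boxes need converting from PDF points to pixels. The sketch below assumes the default PDF user space (72 units per inch, origin at the bottom-left) and that the rendered image covers the whole unrotated page at the requested DPI. Neither assumption is confirmed by this README. `ToPixels` and its `pageHeightPt` parameter are hypothetical names.

```go
package main

import (
	"fmt"
	"image"
	"math"
)

// ToPixels (hypothetical helper, not part of ParseRails) maps a box given in
// PDF points to a pixel rectangle on an image rendered at dpi.
// Assumptions: 72 points per inch, PDF origin at the bottom-left, image
// origin at the top-left, and the image covers the full page.
func ToPixels(x0, y0, x1, y1, pageHeightPt, dpi float64) image.Rectangle {
	// Multiply before dividing so whole-inch values stay exact.
	px := func(pt float64) float64 { return pt * dpi / 72.0 }
	return image.Rect(
		int(math.Floor(px(x0))),
		int(math.Floor(px(pageHeightPt-y1))), // flip Y: top edge of the box
		int(math.Ceil(px(x1))),
		int(math.Ceil(px(pageHeightPt-y0))), // flip Y: bottom edge of the box
	)
}

func main() {
	// A 1-inch square, 1 inch from the left edge, touching the top of a
	// US Letter page (792 pt high), rendered at 150 DPI.
	r := ToPixels(72, 720, 144, 792, 792, 150)
	fmt.Println(r) // (150,0)-(300,150)
}
```

Rounding outward (floor for the minimum corner, ceil for the maximum) keeps the pixel rectangle from clipping glyph edges.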

Text-only fast path

When you need just the text (RAG ingestion, search indexing) and not boxes, use ExtractText — it uses PDFium's whole-page text API (one call per page) and is several times cheaper than Parse:

text, _ := p.ExtractText(ctx, pdf)        // plain string, no boxes
text, _ = p.ExtractFileText(ctx, "x.docx") // PDF or office doc

Backends: WASM (default) vs. native cgo

Same API, two build-time backends:

go build .                       # PDFium as WASM — cgo-free, portable (default)
go build -tags parserails_cgo .  # PDFium native — much faster/lighter, needs libpdfium

The default is cgo-free and needs no system libraries. The parserails_cgo backend links libpdfium for far higher throughput on controlled hosts (e.g. a Dockerized worker). parserails.Backend reports which is active. See docs/architecture.md.

Architecture

cgo-free by default — PDFium runs as WASM under wazero; opt into the native parserails_cgo backend for maximum throughput.
Pooled runtime — instances are reused across requests (PDF parsing is CPU-bound); one pool per process.
Layered — domain → parser → ocr → service, mirroring the PromptRails service conventions so it drops cleanly into a standalone parse service.
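
The pooled-runtime idea can be illustrated with a buffered channel used as a free list. This is a generic sketch of the pattern, not the ParseRails implementation: `instance`, `Pool`, `Acquire`, `TryAcquire`, and `Release` are hypothetical names, and the real pool wraps PDFium instances rather than the placeholder struct used here.

```go
package main

import (
	"context"
	"fmt"
)

// instance stands in for an expensive-to-create runtime (hypothetical).
type instance struct{ id int }

// Pool hands out a fixed set of instances. The buffered channel acts as the
// free list, so its capacity bounds how many callers run at once.
type Pool struct{ free chan *instance }

// NewPool creates size instances up front.
func NewPool(size int) *Pool {
	p := &Pool{free: make(chan *instance, size)}
	for i := 0; i < size; i++ {
		p.free <- &instance{id: i}
	}
	return p
}

// Acquire blocks until an instance is free or ctx is cancelled.
func (p *Pool) Acquire(ctx context.Context) (*instance, error) {
	select {
	case in := <-p.free:
		return in, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// TryAcquire returns immediately, reporting false when the pool is empty.
func (p *Pool) TryAcquire() (*instance, bool) {
	select {
	case in := <-p.free:
		return in, true
	default:
		return nil, false
	}
}

// Release returns an instance to the pool for reuse.
func (p *Pool) Release(in *instance) { p.free <- in }

func main() {
	pool := NewPool(2)
	ctx := context.Background()

	a, _ := pool.Acquire(ctx)
	b, _ := pool.Acquire(ctx)
	fmt.Println(a.id, b.id)

	_, ok := pool.TryAcquire()
	fmt.Println("third acquire succeeded:", ok)

	pool.Release(a)
	c, _ := pool.Acquire(ctx)
	fmt.Println(c.id)
}
```

Since parsing is CPU-bound, sizing such a pool near the number of available cores is a reasonable starting point.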

Benchmarks

ParseRails is benchmarked against ledongthuc/pdf, pdfcpu/pdfcpu, and unidoc/unipdf in benchmark/ — a separate Go module, so those (heavy, partly AGPL/commercial) dependencies never enter this module's graph.

cd benchmark && go test -bench=. -benchmem ./...

ParseRails is the only one of the four that returns per-word bounding boxes; the pure-Go readers are far faster but give flat text only. See benchmark/README.md for numbers and the (important) caveats on what each library actually measures.

Files & batches

doc, _ := p.ParseFile(ctx, "report.docx")           // PDF or office doc
results := p.ParseFiles(ctx, paths, 4)               // concurrent batch

ParseFile converts office documents to PDF via LibreOffice, then parses. ParseFiles runs a bounded-concurrency batch and captures per-file errors.
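
For readers curious what a bounded-concurrency batch with per-file errors looks like, here is a generic sketch using a semaphore channel. It is not the ParseFiles implementation. `Result`, `RunBatch`, and `fakeParse` are hypothetical names, and the shape of the real result type is not shown in this README.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Result pairs an input path with its outcome (hypothetical shape).
type Result struct {
	Path string
	Text string
	Err  error
}

// RunBatch (hypothetical) applies fn to every path with at most limit calls
// in flight. Results keep input order, and a failure is recorded on its own
// entry instead of aborting the batch.
func RunBatch(paths []string, limit int, fn func(string) (string, error)) []Result {
	if limit < 1 {
		limit = 1
	}
	results := make([]Result, len(paths))
	sem := make(chan struct{}, limit) // one token per running worker
	var wg sync.WaitGroup
	for i, path := range paths {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are busy
		go func(i int, path string) {
			defer wg.Done()
			defer func() { <-sem }() // hand the token back
			text, err := fn(path)
			results[i] = Result{Path: path, Text: text, Err: err}
		}(i, path)
	}
	wg.Wait()
	return results
}

// fakeParse stands in for a real parser so the sketch runs on its own.
func fakeParse(path string) (string, error) {
	if path == "broken.pdf" {
		return "", errors.New("not a PDF")
	}
	return "text of " + path, nil
}

func main() {
	paths := []string{"a.pdf", "broken.pdf", "b.docx"}
	for _, r := range RunBatch(paths, 2, fakeParse) {
		fmt.Println(r.Path, r.Err)
	}
}
```

Each goroutine writes only to its own index of the results slice, so no mutex is needed.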

Examples

Runnable, self-contained examples (each its own module + Dockerfile) live in examples/:

extract-text — the basics, fully static cgo-free binary
parse-server — HTTP service with OCR + office support
native-cgo — native PDFium backend for max throughput

See docs/examples.md.

Status

The PDF core and the full liteparse-style feature set (spatial text, rendering, OCR, office formats, CLI, batch) are implemented, with both a cgo-free WASM backend and a native parserails_cgo backend. See docs/roadmap.md for the detailed status and what's still being considered (e.g. native XLSX cell extraction without LibreOffice).
