v0.6.0: fix(rag): switch PDF text extraction to go-fitz (libmupdf), add timeout

mudler released this 06 May 22:47

v0.6.0

7a2086c

The previous PDF parser (github.com/dslipak/pdf v0.0.2) is a single-
maintainer pure-Go library that can hang indefinitely on certain xref
layouts, encrypted PDFs, scanned-image-only files, and a few other
structural quirks. With no timeout and no context, a single bad PDF
sits inside `chunkFile` → `fileToText` → `pdf.GetPlainText()` forever
and blocks the entire `/api/collections/<id>/upload` request until
something upstream fires its own timeout.

Reproduced live: a 9KB invoice PDF (`invoice-5368158.pdf`) made
LocalRecall hang at exactly the gap between "Storing files indexKeys"
and "Chunked file" log lines, never recovering. The downstream
embedding model itself was healthy throughout.

Switch the parser to github.com/gen2brain/go-fitz, which wraps
libmupdf, the MuPDF library maintained by Artifex. It is far more
robust against malformed PDFs and bundles its native libs, so the build
stays single-step (LocalRecall already uses cgo for vectorscale).

Belt-and-braces: extraction runs on a goroutine with a wall-clock
timeout (default 60s, override via LOCALRECALL_PDF_EXTRACT_TIMEOUT).
Even libmupdf can occasionally be slow, but a single adversarial PDF
can no longer poison the upload queue: it fails with a clear error and
the caller can skip/retry.
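
The goroutine-plus-timeout pattern described above can be sketched as
follows. This is a minimal illustration, not the code from the commit:
`extractWithTimeout`, `extractTimeout`, and `errExtractTimeout` are
hypothetical names, and the sketch assumes the environment variable holds
a Go duration string such as `90s` (the release note does not say which
format it accepts).

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// defaultExtractTimeout mirrors the 60s default named in the release note.
const defaultExtractTimeout = 60 * time.Second

// errExtractTimeout is a hypothetical sentinel error for this sketch.
var errExtractTimeout = errors.New("pdf extraction timed out")

// extractTimeout reads LOCALRECALL_PDF_EXTRACT_TIMEOUT.
// Assumption: the value is a Go duration string; anything else falls back
// to the default.
func extractTimeout() time.Duration {
	if v := os.Getenv("LOCALRECALL_PDF_EXTRACT_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return defaultExtractTimeout
}

type extractResult struct {
	text string
	err  error
}

// extractWithTimeout runs extract on its own goroutine and stops waiting
// once timeout has elapsed.
func extractWithTimeout(extract func() (string, error), timeout time.Duration) (string, error) {
	// Buffered so a late result does not block the worker goroutine forever.
	done := make(chan extractResult, 1)
	go func() {
		text, err := extract()
		done <- extractResult{text: text, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case r := <-done:
		return r.text, r.err
	case <-timer.C:
		return "", errExtractTimeout
	}
}

func main() {
	// The closure stands in for the real PDF extraction call.
	text, err := extractWithTimeout(func() (string, error) {
		return "page one\n\npage two", nil
	}, extractTimeout())
	fmt.Printf("%q %v\n", text, err)
}
```

In this sketch a timed-out extraction is abandoned rather than cancelled:
the worker goroutine keeps running until the underlying call returns, and
only the caller is unblocked.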

The extractor iterates pages and concatenates them with double-newline
separators (so chunkers that respect paragraph boundaries get a sane
signal). The existing UTF-8 / null-byte sanitization is preserved,
because libmupdf can still emit byte sequences that PostgreSQL rejects.
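
A rough sketch of the page loop and sanitization step, under stated
assumptions: `pageSource` is a hypothetical interface modeled on the
`NumPage` / `Text` methods a go-fitz document is expected to provide,
`fakeDoc` is an in-memory stand-in so the example runs without cgo, and
`extractPages` and `sanitize` are illustrative names rather than
LocalRecall's own. Trimming each page before joining is a choice made
here, not something the release note specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// pageSource is a hypothetical abstraction over a paged document.
type pageSource interface {
	NumPage() int
	Text(page int) (string, error)
}

// sanitize drops invalid UTF-8 sequences and NUL bytes, both of which
// PostgreSQL refuses to store in text columns.
func sanitize(s string) string {
	s = strings.ToValidUTF8(s, "")
	return strings.ReplaceAll(s, "\x00", "")
}

// extractPages reads every page and joins them with a blank line so that
// paragraph-aware chunkers see a boundary between pages.
func extractPages(doc pageSource) (string, error) {
	pages := make([]string, 0, doc.NumPage())
	for i := 0; i < doc.NumPage(); i++ {
		text, err := doc.Text(i)
		if err != nil {
			return "", fmt.Errorf("page %d: %w", i, err)
		}
		// Trim so the separator is exactly one blank line (sketch choice).
		pages = append(pages, strings.TrimSpace(text))
	}
	return sanitize(strings.Join(pages, "\n\n")), nil
}

// fakeDoc is an in-memory stand-in for a real PDF document.
type fakeDoc []string

func (d fakeDoc) NumPage() int               { return len(d) }
func (d fakeDoc) Text(i int) (string, error) { return d[i], nil }

func main() {
	out, err := extractPages(fakeDoc{"first page\n", "second\x00 page"})
	fmt.Printf("%q %v\n", out, err)
}
```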

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
