The previous PDF parser (github.com/dslipak/pdf v0.0.2) is a single-
maintainer pure-Go library that can hang indefinitely on certain xref
layouts, encrypted PDFs, scanned-image-only files, and a few other
structural quirks. With no timeout and no context, a single bad PDF
sits inside `chunkFile` → `fileToText` → `pdf.GetPlainText()` forever
and blocks the entire `/api/collections/<id>/upload` request until
something upstream fires its own timeout.
Reproduced live: a 9KB invoice PDF (`invoice-5368158.pdf`) made
LocalRecall hang at exactly the gap between "Storing files indexKeys"
and "Chunked file" log lines, never recovering. The downstream
embedding model itself was healthy throughout.
Switch the parser to github.com/gen2brain/go-fitz, which wraps
libmupdf, the C library from Artifex that powers the MuPDF viewer. It
is far more robust against malformed PDFs and bundles its native libs,
so the build stays single-step (LocalRecall already uses cgo for
vectorscale).
Belt-and-braces: extraction runs on a goroutine with a wall-clock
timeout (default 60s, override via LOCALRECALL_PDF_EXTRACT_TIMEOUT).
Even libmupdf can occasionally be slow, but a single adversarial PDF
can no longer poison the upload queue: it fails with a clear error and
the caller can skip/retry.
The extractor iterates over pages and concatenates their text with
double-newline separators (so chunkers that respect paragraph
boundaries get a sane signal). The existing UTF-8 / null-byte
sanitization is preserved, since libmupdf can still emit byte
sequences that PostgreSQL rejects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>