Summary
While validating Anytown's adoption of @happyvertical/pdf@0.62.28, the live (non-CI) Praeco discovery test hit a scanned minutes PDF for which extraction returned empty text, even after the reader logged that it was attempting OCR fallback.
This should be investigated upstream rather than papered over in Praeco/Anytown.
Repro Context
Repo: anytown/anytown.ai
Command:
pnpm --filter @happyvertical/praeco test
Important note: this command runs the live spec agents/praeco/src/discovery.spec.ts locally. In CI, that spec is skipped via CI=true.
Failing test:
src/discovery.spec.ts > Discovery - Basic Workflow > should fetch live Bentley agenda and scanned minutes text from WordPress download pages
Observed source page/document path:
https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/
Runtime log excerpt:
→ Fetching minutes from https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/
No direct text found, attempting OCR fallback...
✗ Failed to fetch minutes: PDF extraction produced no text for https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/?wpdmdl=21101&refresh=69ec58cbcc67c1777096907
Expected
For scanned PDFs where direct text extraction returns nothing, OCR fallback should either:
- return recognized text, or
- fail explicitly with the OCR/rendering/provider error that explains why OCR could not complete.
An empty result should not look like a successful OCR run unless the document truly has no recognizable text.
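To make the expectation concrete, here is a minimal sketch of the contract described above. Every name in it (OcrOutcome, PdfExtractionError, resolveExtraction) is hypothetical and chosen for illustration; it is not the @happyvertical/pdf API, and the real OCR step is presumably asynchronous. The idea is only that the OCR step reports one of three outcomes, and that a failure surfaces as an explicit error instead of collapsing into empty text.

```typescript
// Hypothetical shapes for illustration only; not the @happyvertical/pdf API.
type OcrOutcome =
  | { kind: "text"; text: string } // OCR recognized text
  | { kind: "empty" } // OCR completed, nothing recognizable on the pages
  | { kind: "failed"; error: Error }; // rendering or provider error

type ExtractionResult = {
  text: string;
  source: "direct" | "ocr" | "ocr-empty";
};

class PdfExtractionError extends Error {
  stage: "ocr";
  original?: Error;

  constructor(message: string, original?: Error) {
    super(message);
    this.name = "PdfExtractionError";
    this.stage = "ocr";
    this.original = original;
  }
}

// Decide what the caller sees, given the direct text and a (hypothetical) OCR step.
function resolveExtraction(
  directText: string,
  runOcr: () => OcrOutcome,
): ExtractionResult {
  if (directText.trim().length > 0) {
    return { text: directText, source: "direct" };
  }

  const outcome = runOcr();
  switch (outcome.kind) {
    case "text":
      if (outcome.text.trim().length === 0) {
        // "Success" with blank text is the ambiguous case this issue describes.
        throw new PdfExtractionError("OCR reported success but returned blank text");
      }
      return { text: outcome.text, source: "ocr" };
    case "empty":
      // Legitimately empty: the caller can tell OCR ran and found nothing.
      return { text: "", source: "ocr-empty" };
    case "failed":
      // Surface the underlying rendering/provider error to the caller.
      throw new PdfExtractionError(
        `OCR fallback failed: ${outcome.error.message}`,
        outcome.error,
      );
  }
}
```

With a shape like this, a caller such as Praeco could distinguish "the document has no recognizable text" (source is "ocr-empty") from "OCR could not complete" (a thrown error carrying the original cause), which is the distinction the runtime log above does not currently make.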
Notes
This was seen after the child-process extraction release in @happyvertical/pdf@0.62.28, while working on Anytown PR https://github.com/anytown/anytown.ai/pull/264.
The same live spec also failed separately in the local environment because of missing Playwright browser dependencies. This issue is specifically about the PDF/OCR empty-text result after the document was fetched.