Investigate OCR fallback returning empty text for scanned Bentley minutes PDF #70

@willgriffin

Summary

While validating Anytown's adoption of @happyvertical/pdf@0.62.28, the live Praeco discovery test (which does not run in CI) hit a scanned minutes PDF for which extraction returned empty text, even after the reader logged that it was attempting OCR fallback.

This should be investigated upstream rather than papered over in Praeco/Anytown.

Repro Context

Repo: anytown/anytown.ai
Command:

pnpm --filter @happyvertical/praeco test

Important note: this command runs the live spec agents/praeco/src/discovery.spec.ts locally. In CI, that spec is skipped via CI=true.

Failing test:

src/discovery.spec.ts > Discovery - Basic Workflow > should fetch live Bentley agenda and scanned minutes text from WordPress download pages

Observed source page/document path:

https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/

Runtime log excerpt:

→ Fetching minutes from https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/
No direct text found, attempting OCR fallback...
✗ Failed to fetch minutes: PDF extraction produced no text for https://townofbentley.ca/download/regular-council-meeting-march-24-2026-signed-minutes/?wpdmdl=21101&refresh=69ec58cbcc67c1777096907

Expected

For scanned PDFs where direct text extraction returns nothing, OCR fallback should either:

  • return recognized text, or
  • fail explicitly with the OCR/rendering/provider error that explains why OCR could not complete.

OCR should not appear to have completed successfully while yielding an empty extraction, unless the document truly contains no recognizable text.
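
The expected contract could be sketched roughly as follows. This is only an illustration of the behavior described above, not the actual @happyvertical/pdf implementation: `OcrOutcome`, `OcrFallbackError`, `resolveExtractedText`, and the stage names are all hypothetical, and the OCR step is passed in as a plain callback so the sketch stays self-contained.

```typescript
// Hypothetical types: none of these names come from @happyvertical/pdf.
type OcrOutcome =
  | { status: "ok"; text: string }
  | { status: "failed"; stage: "render" | "provider" | "recognition"; cause: string };

type OcrFailureReason = "ocr_failed" | "ocr_empty";

// Carries a machine-readable reason so callers can tell
// "OCR could not complete" apart from "OCR completed but found nothing".
class OcrFallbackError extends Error {
  reason: OcrFailureReason;

  constructor(reason: OcrFailureReason, source: string, detail: string) {
    super(`OCR fallback ${reason} for ${source}: ${detail}`);
    this.name = "OcrFallbackError";
    this.reason = reason;
  }
}

// Returns recognized text, or throws an explicit error.
// It never returns an empty string as if extraction had succeeded.
function resolveExtractedText(
  source: string,
  directText: string,
  runOcr: () => OcrOutcome, // assumed stand-in for the real OCR step
): string {
  // Direct extraction found text, so no fallback is needed.
  if (directText.trim().length > 0) {
    return directText;
  }

  const outcome = runOcr();

  // Surface the underlying rendering/provider/recognition failure.
  if (outcome.status === "failed") {
    throw new OcrFallbackError("ocr_failed", source, `${outcome.stage}: ${outcome.cause}`);
  }

  // OCR ran to completion but recognized nothing: report that distinctly.
  if (outcome.text.trim().length === 0) {
    throw new OcrFallbackError("ocr_empty", source, "OCR completed but recognized no text");
  }

  return outcome.text;
}
```

With a shape like this, a caller such as Praeco could log `reason` and the failing stage instead of the generic "PDF extraction produced no text" message, and could treat `ocr_empty` as a legitimate result for documents that really are blank.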

Notes

This was seen after the child-process extraction release in @happyvertical/pdf@0.62.28, while working on Anytown PR https://github.com/anytown/anytown.ai/pull/264.

The same live spec also failed locally for an unrelated reason: missing Playwright browser dependencies. This issue is specifically about the empty-text PDF/OCR result after the document was fetched.
