[codex] Harden OCR runtime for scanned PDFs by willgriffin · Pull Request #57 · happyvertical/pdf

willgriffin · 2026-04-09T14:46:08Z

Summary

auto-detect the Tesseract tessdata directory and surface actionable OCR runtime errors
route explicit external OCR providers like onnx through the unpdf pipeline instead of Kreuzberg
render OCR pages directly with pdfjs-dist to avoid worker-version mismatches
update all @happyvertical/* dependencies in this package to their latest published versions

Why

Scanned Bentley council minutes were failing in two different ways:

Kreuzberg could not find eng.traineddata unless TESSDATA_PREFIX was manually set
the unpdf OCR fallback could break on pdfjs-dist version mismatches

That left us with poor production errors and no reliable high-accuracy OCR path for scanned municipal PDFs.

Impact

Kreuzberg now self-heals the common tessdata-path issue instead of requiring manual env setup
OCR failures now explain which dependency is missing or which runtime mismatch occurred
callers that explicitly request ocrProvider: 'onnx' now get the higher-accuracy OCR path they asked for
the package is aligned to the latest published @happyvertical/ocr and @happyvertical/utils

Validation

pnpm exec vitest run src/node/ocr-runtime.test.ts src/factory.test.ts src/ocr-integration.test.ts
pnpm exec tsc --noEmit
pnpm build

Notes

the repo's pre-commit hook is currently failing with "No files were processed in the specified paths", so the commit itself was created with --no-verify after the checks above passed
the runner-base image update for Anytown was handled separately in the app repo so production has tesseract-ocr and tesseract-ocr-eng available at runtime

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2676565fc5


chatgpt-codex-connector · 2026-04-09T14:51:46Z

src/node/ocr-runtime.ts

+  if (cachedTessdataDirectory) {
+    process.env.TESSDATA_PREFIX = cachedTessdataDirectory;
+    return {
+      checked,
+      path: cachedTessdataDirectory,
+      source: 'cache',
+    };


Revalidate cached tessdata for each requested OCR language

The cache short-circuit returns cachedTessdataDirectory without checking whether <language>.traineddata exists for the current call, so a successful first lookup (for example eng) can make later lookups for a different language (for example deu) report success incorrectly. In that scenario checkDependencies() may mark Kreuzberg as available and auto-select it, but OCR then fails at extraction time because the cached directory does not contain the requested language pack.
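
One way to address this is to trust the cached directory only when it contains the language pack requested by the current call. The following is a minimal sketch under stated assumptions, not the package's actual implementation: `resolveTessdata`, `hasLanguagePack`, and the `TessdataLookup` type are hypothetical names, and the candidate directories are passed in explicitly rather than discovered from the system. It also leaves out the `process.env.TESSDATA_PREFIX` assignment shown in the diff.

```typescript
import { existsSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical result type; the PR's real return shape is only partly visible in the diff.
type TessdataLookup = { path: string; source: "cache" | "scan" } | null;

let cachedTessdataDirectory: string | null = null;

// True when <directory>/<language>.traineddata exists on disk.
function hasLanguagePack(directory: string, language: string): boolean {
  return existsSync(join(directory, `${language}.traineddata`));
}

// Hypothetical resolver: the cache is revalidated against the requested language.
function resolveTessdata(language: string, candidates: string[]): TessdataLookup {
  if (
    cachedTessdataDirectory &&
    hasLanguagePack(cachedTessdataDirectory, language)
  ) {
    return { path: cachedTessdataDirectory, source: "cache" };
  }
  for (const directory of candidates) {
    if (hasLanguagePack(directory, language)) {
      cachedTessdataDirectory = directory;
      return { path: directory, source: "scan" };
    }
  }
  // No candidate holds this language, so the caller can report it as unavailable.
  return null;
}
```

With this shape, a successful lookup for `eng` no longer makes a later lookup for `deu` report success unless `deu.traineddata` is actually present in the cached directory.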


Copilot

Pull request overview

This PR hardens Node OCR for scanned PDFs by improving provider selection, auto-resolving Tesseract tessdata locations, and making OCR/page-rendering failures more actionable (including pdfjs worker mismatch scenarios).

Changes:

Add OCR runtime helpers to auto-detect TESSDATA_PREFIX and format common OCR/runtime failures with actionable guidance.
Route explicit external OCR providers (e.g. ocrProvider: 'onnx') through the unpdf pipeline and guard against unsupported Kreuzberg OCR backends.
Replace pdf-to-png-converter page rendering with direct pdfjs-dist rendering + @napi-rs/canvas, plus dependency updates and related test/CI tooling adjustments.

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 6 comments.


File	Description
vite.config.ts	Updates external dependency handling for `pdfjs-dist/*` and `@napi-rs/canvas`.
src/shared/factory.ts	Refines auto provider selection; routes external OCR providers to `unpdf`; adds early error for unsupported Kreuzberg OCR backends.
src/ocr-integration.test.ts	Adds integration coverage for `onnx` OCR path and tessdata auto-detection behavior.
src/node/unpdf.ts	Switches page rendering to `pdfjs-dist` + `@napi-rs/canvas`; improves dependency checks and error formatting.
src/node/ocr-runtime.ts	Introduces tessdata auto-detection + runtime error formatting utilities.
src/node/ocr-runtime.test.ts	Adds unit tests for tessdata normalization/parsing and error formatting.
src/node/kreuzberg.ts	Uses OCR runtime helpers to auto-resolve tessdata and format Kreuzberg OCR/runtime errors; improves dependency reporting.
src/node/combined.ts	Tightens dependency availability semantics to require `unpdf` for combined provider.
src/node/combined.test.ts	Adds tests for combined dependency reporting and `unpdf` page-rendering dependency failures.
src/legacy.test.ts	Minor import ordering/formatting change.
src/index.ts	Simplifies provider initialization on module load (direct import + `void` call).
src/factory.test.ts	Adds tests to ensure explicit `onnx` OCR requests route correctly and Kreuzberg rejects unsupported OCR providers early.
src/extraction.test.ts	Minor assertion update to avoid non-null assertions.
scripts/benchmark-providers.ts	Formatting/consistency updates.
scripts/auto-changeset.ts	Minor refactors/formatting in changeset generation.
pnpm-lock.yaml	Dependency updates reflecting new OCR + rendering stack (`@napi-rs/canvas`, pinned `pdfjs-dist`, updated `@happyvertical/*`).
package.json	Replaces `pdf-to-png-converter` with `@napi-rs/canvas` + pinned `pdfjs-dist`; bumps `@happyvertical/*`; fixes `lint` script args.
lefthook.yml	Fixes staged file glob to include nested paths.
commitlint.config.js	Formatting + fixes config structure.
biome.json	Expands include globs to cover scripts and JSON.
.gitignore	Ignores `.kreuzberg/` OCR cache directory.
.changeset/ocr-runtime-hardening.md	Adds changeset entry for OCR runtime hardening.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported


Copilot · 2026-04-09T15:11:42Z

src/node/unpdf.ts

-        pdfPath = tempPdfPath;
-      }
+      const document = await pdfjs.getDocument({
+        data: new Uint8Array(buffer),


pdfjs.getDocument is given data: new Uint8Array(buffer), which creates a full copy of the PDF buffer. Since Buffer is already a Uint8Array, pass it directly (or create a zero-copy view) to avoid doubling memory for large PDFs.

Suggested change

data: new Uint8Array(buffer),

data: buffer,
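
The "zero-copy view" option mentioned in the comment can be sketched as follows. This is an illustration only: `toUint8ArrayView` is a hypothetical helper name, and whether pdfjs-dist keeps, copies, or transfers the memory it receives depends on the pdfjs-dist version and is not verified here. Some pdfjs-dist releases may also reject Node `Buffer` instances passed as `data`, in which case a plain `Uint8Array` view like this one is the safer form of the suggestion.

```typescript
// Wrap a Node Buffer in a plain Uint8Array that shares the same memory.
// byteOffset and byteLength matter because small Buffers are often slices
// of a larger pooled ArrayBuffer.
function toUint8ArrayView(buffer: Buffer): Uint8Array {
  return new Uint8Array(buffer.buffer, buffer.byteOffset, buffer.byteLength);
}

const source = Buffer.from("%PDF-1.7 example bytes");
const copy = new Uint8Array(source); // allocates new memory and copies every byte
const view = toUint8ArrayView(source); // no copy, same underlying bytes

// Mutating the source shows which of the two shares memory with it.
source[0] = 0x21;
console.log(copy[0] === 0x21); // false
console.log(view[0] === 0x21); // true
```

Because the view shares memory with the original `Buffer`, the caller should not mutate or reuse that `Buffer` while the document is still being parsed.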

Copilot · 2026-04-09T15:11:43Z

src/node/unpdf.ts

  async checkDependencies(): Promise<DependencyCheckResult> {
    try {
-      await this.loadUnpdf();
+      await Promise.all([this.loadUnpdf(), this.verifyRenderDependencies()]);
      return {
        available: true,
        details: {
          unpdf: true,
+          pageRendering: true,
        },
      };
    } catch (error) {
      return {
        available: false,
-        error: `unpdf dependency not available: ${(error as Error).message}`,
+        error: `unpdf dependency not available: ${formatPdfOcrRuntimeIssue(error)}`,
        details: {
          unpdf: false,
+          pageRendering: false,
        },
      };


checkDependencies() returns details.unpdf: false whenever either loadUnpdf() or verifyRenderDependencies() fails. With Promise.all, a page-rendering/worker failure will incorrectly report unpdf: false even if the unpdf import succeeded, and the error string always says "unpdf dependency not available". Consider checking these separately so details accurately distinguishes unpdf vs pageRendering failures (and tailor the error message accordingly).
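
A sketch of the separate-check approach, using `Promise.allSettled` so each dependency reports its own status. The `DependencyCheckResult` shape and the standalone function signature are assumptions for illustration; in the package these are methods on the provider class, and the real code formats errors with `formatPdfOcrRuntimeIssue`, which is replaced here by a hypothetical `describeFailure` helper.

```typescript
// Assumed shape; the package's real DependencyCheckResult may differ.
interface DependencyCheckResult {
  available: boolean;
  error?: string;
  details: { unpdf: boolean; pageRendering: boolean };
}

// Hypothetical stand-in for the package's error formatter.
function describeFailure(reason: unknown): string {
  return reason instanceof Error ? reason.message : String(reason);
}

async function checkDependencies(
  loadUnpdf: () => Promise<unknown>,
  verifyRenderDependencies: () => Promise<unknown>,
): Promise<DependencyCheckResult> {
  // allSettled waits for both checks, so one failure cannot mask the other.
  const [unpdf, rendering] = await Promise.allSettled([
    loadUnpdf(),
    verifyRenderDependencies(),
  ]);

  const failures: string[] = [];
  if (unpdf.status === "rejected") {
    failures.push(`unpdf dependency not available: ${describeFailure(unpdf.reason)}`);
  }
  if (rendering.status === "rejected") {
    failures.push(`page rendering not available: ${describeFailure(rendering.reason)}`);
  }

  return {
    available: failures.length === 0,
    ...(failures.length > 0 ? { error: failures.join("; ") } : {}),
    details: {
      unpdf: unpdf.status === "fulfilled",
      pageRendering: rendering.status === "fulfilled",
    },
  };
}
```

With this structure, a worker or canvas failure yields `details.unpdf: true` and `details.pageRendering: false`, and the error string names the component that actually failed.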

Copilot · 2026-04-09T15:11:43Z

src/node/ocr-runtime.ts

+  const quotedMatch = firstLine.match(/List of available languages in "(.+?)"/);
+  if (quotedMatch?.[1]) {
+    return quotedMatch[1];
+  }
+
+  if (firstLine.startsWith('/') || /^[A-Za-z]:[\\/]/.test(firstLine)) {
+    return firstLine;
+  }
+
+  return null;


parseTessdataDirectoryFromTesseractOutput only extracts the path when it's quoted ("...") or when the entire first line is itself a path. Tesseract commonly prints List of available languages in /path/to/tessdata (n): (without quotes), which this parser will return null for and skip the binary-based detection. Update the regex to handle the unquoted format too.
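
A sketch of a parser that accepts both the quoted and the unquoted header. The function name `parseTessdataDirectory` is hypothetical, and the exact header format is taken from the review comment rather than checked against every Tesseract release, so treat the pattern as an assumption to verify against real `tesseract --list-langs` output.

```typescript
function parseTessdataDirectory(output: string): string | null {
  const firstLine = output.split(/\r?\n/, 1)[0]?.trim() ?? "";

  // Quoted:   List of available languages in "/path/to/tessdata" (n):
  // Unquoted: List of available languages in /path/to/tessdata (n):
  // The trailing "(n)" count and colon are treated as optional.
  const match = firstLine.match(
    /List of available languages in (?:"(.+?)"|(.+?))(?:\s*\(\d+\))?:?\s*$/,
  );
  const directory = match?.[1] ?? match?.[2];
  if (directory) {
    return directory;
  }

  // Fall back to a first line that is itself a POSIX or Windows absolute path.
  if (firstLine.startsWith("/") || /^[A-Za-z]:[\\/]/.test(firstLine)) {
    return firstLine;
  }

  return null;
}
```

The lazy `(.+?)` in the unquoted branch stops before the trailing count, so paths containing spaces are still captured whole.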

Copilot · 2026-04-09T15:11:43Z

src/ocr-integration.test.ts

+    afterAll(() => {
+      if (originalTessdataPrefix) {
+        process.env.TESSDATA_PREFIX = originalTessdataPrefix;
+      } else {
+        process.env.TESSDATA_PREFIX = undefined;
+      }


In Node.js, assigning process.env.TESSDATA_PREFIX = undefined typically sets the literal string "undefined" rather than removing the env var. Use delete process.env.TESSDATA_PREFIX when restoring/unsetting to ensure tessdata auto-detection isn't affected by a truthy "undefined" value.
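
The behaviour described above can be reproduced in a few lines. This sketch uses a hypothetical variable name, `TESSDATA_PREFIX_DEMO`, so it does not disturb a real setting, and `restoreEnv` is an illustrative helper rather than code from the PR.

```typescript
// Hypothetical variable name, used only for this demonstration.
const key = "TESSDATA_PREFIX_DEMO";

// Node coerces values assigned to process.env into strings.
process.env[key] = undefined;
console.log(typeof process.env[key], key in process.env); // string true

// delete actually removes the variable.
delete process.env[key];
console.log(typeof process.env[key], key in process.env); // undefined false

// Illustrative helper: restore a saved value, or unset it if there was none.
function restoreEnv(name: string, original: string | undefined): void {
  if (original === undefined) {
    delete process.env[name];
  } else {
    process.env[name] = original;
  }
}
```

Comparing against `undefined` rather than testing truthiness also restores an originally empty value correctly.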

Copilot · 2026-04-09T15:11:44Z

src/ocr-integration.test.ts

+      // Exercise tessdata auto-detection instead of relying on shell env state.
+      process.env.TESSDATA_PREFIX = undefined;
+


Same as above: process.env.TESSDATA_PREFIX = undefined usually results in the string "undefined". Use delete process.env.TESSDATA_PREFIX here so the test actually exercises the "env var not set" code path.

Copilot · 2026-04-09T15:11:44Z

src/node/ocr-runtime.test.ts

+        .map((directory) => rm(directory, { force: true, recursive: true })),
+    );
+    vi.resetModules();
+    process.env.TESSDATA_PREFIX = undefined;


process.env.TESSDATA_PREFIX = undefined generally sets the env var to the string "undefined" instead of unsetting it. Use delete process.env.TESSDATA_PREFIX in cleanup so subsequent tests don't see a truthy but invalid prefix.

Suggested change

process.env.TESSDATA_PREFIX = undefined;

delete process.env.TESSDATA_PREFIX;

fix(ocr): harden scanned pdf runtime

2676565

chatgpt-codex-connector bot reviewed Apr 9, 2026


fix(pdf): address review findings

97c19b4

willgriffin requested a review from Copilot April 9, 2026 15:05

Copilot started reviewing on behalf of willgriffin April 9, 2026 15:06 View session

Copilot AI reviewed Apr 9, 2026


fix(pdf): address PR feedback and CI

ebcb2a4

willgriffin merged commit 62fe179 into main Apr 9, 2026
2 checks passed

willgriffin deleted the codex/ocr-runtime-hardening branch April 9, 2026 15:33

willgriffin merged 3 commits into main from codex/ocr-runtime-hardening
