feat(extract): send spatial layout annotations from OCR to LLM #724

Merged
cpcloud merged 4 commits into main from worktree-crystalline-humming-thompson
Mar 10, 2026

Collaborator

@cpcloud cpcloud commented Mar 9, 2026

Summary

  • Send compact spatial layout annotations from tesseract OCR to the LLM during document extraction, improving accuracy for invoices, forms, and tabular documents
  • Format: [left,top,width] per line (~2x token overhead vs plain text), with [left,top,width;conf] only for suspect lines below a configurable confidence threshold
  • Drop height from bounding boxes (nearly constant across lines, no useful signal)
  • New config: [extraction.ocr] subtable with tsv (default true) and confidence_threshold (default 70)
  • Toggle in extraction overlay: press t to switch spatial layout on/off and rerun LLM extraction
  • Env vars: MICASA_EXTRACTION_OCR_TSV, MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD

closes #699
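
To make the annotation format concrete, here is a minimal sketch of how word-level tesseract TSV could be folded into `[left,top,width]` lines. It is not the PR's `SpatialTextFromTSV`: the function name `spatialText`, the `lineBox` type, the choice to put the annotation before the line text, and the rounding of `conf` are all assumptions made for illustration. The only things taken from the summary are the bracket format, the dropped height, and the rule that `;conf` appears only below the threshold.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lineBox accumulates the words that tesseract assigned to one text line.
type lineBox struct {
	left, top, right int
	minConf          float64
	words            []string
}

// spatialText is a hypothetical stand-in for SpatialTextFromTSV. It emits
// "[left,top,width] text" per line and adds ";conf" only when the weakest
// word on the line scores below threshold. Height is never emitted.
func spatialText(tsv string, threshold int) string {
	boxes := map[string]*lineBox{}
	var order []string // line keys in first-seen order

	for _, row := range strings.Split(tsv, "\n") {
		f := strings.Split(strings.TrimRight(row, "\r"), "\t")
		if len(f) < 12 || f[0] != "5" { // level 5 rows are words
			continue
		}
		word := strings.TrimSpace(f[11])
		left, errL := strconv.Atoi(f[6])
		top, errT := strconv.Atoi(f[7])
		width, errW := strconv.Atoi(f[8])
		conf, errC := strconv.ParseFloat(f[10], 64)
		if word == "" || errL != nil || errT != nil || errW != nil || errC != nil {
			continue
		}

		key := strings.Join(f[1:5], "/") // page/block/paragraph/line
		b, ok := boxes[key]
		if !ok {
			b = &lineBox{left: left, top: top, right: left + width, minConf: conf}
			boxes[key] = b
			order = append(order, key)
		}
		if left < b.left {
			b.left = left
		}
		if top < b.top {
			b.top = top
		}
		if left+width > b.right {
			b.right = left + width
		}
		if conf < b.minConf {
			b.minConf = conf
		}
		b.words = append(b.words, word)
	}

	var out strings.Builder
	for _, key := range order {
		b := boxes[key]
		if b.minConf < float64(threshold) {
			fmt.Fprintf(&out, "[%d,%d,%d;%.0f] ", b.left, b.top, b.right-b.left, b.minConf)
		} else {
			fmt.Fprintf(&out, "[%d,%d,%d] ", b.left, b.top, b.right-b.left)
		}
		out.WriteString(strings.Join(b.words, " "))
		out.WriteByte('\n')
	}
	return out.String()
}

func main() {
	rows := []string{
		"level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
		"5\t1\t1\t1\t1\t1\t50\t100\t80\t20\t96\tInvoice",
		"5\t1\t1\t1\t1\t2\t140\t100\t60\t20\t91\t#1042",
		"5\t1\t1\t1\t2\t1\t50\t140\t70\t20\t42\tTota1",
		"5\t1\t1\t1\t2\t2\t400\t141\t90\t20\t88\t$118.00",
	}
	fmt.Print(spatialText(strings.Join(rows, "\n"), 70))
}
```

With the default threshold of 70, the first line comes out as `[50,100,150] Invoice #1042` and the second as `[50,140,440;42] Tota1 $118.00`, because its weakest word scored 42.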

codecov Bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 88.82979% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (7b9d7e5) to head (3ae2a35).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/app/extraction.go | 57.69% | 10 Missing and 1 partial ⚠️ |
| internal/extract/ocr.go | 95.04% | 3 Missing and 2 partials ⚠️ |
| cmd/micasa/main.go | 0.00% | 2 Missing ⚠️ |
| internal/app/types.go | 0.00% | 2 Missing ⚠️ |
| internal/extract/llmextract.go | 95.45% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| internal/app/model.go | 62.37% <100.00%> (+0.03%) ⬆️ |
| internal/config/config.go | 89.64% <100.00%> (+0.22%) ⬆️ |
| internal/extract/extractor.go | 100.00% <100.00%> (+7.01%) ⬆️ |
| internal/extract/pipeline.go | 96.00% <100.00%> (+0.10%) ⬆️ |
| internal/extract/llmextract.go | 98.27% <95.45%> (+0.29%) ⬆️ |
| cmd/micasa/main.go | 3.31% <0.00%> (-0.02%) ⬇️ |
| internal/app/types.go | 52.32% <0.00%> (-1.25%) ⬇️ |
| internal/extract/ocr.go | 86.53% <95.04%> (+2.49%) ⬆️ |
| internal/app/extraction.go | 70.76% <57.69%> (+0.05%) ⬆️ |

... and 2 files with indirect coverage changes

@cpcloud cpcloud force-pushed the worktree-crystalline-humming-thompson branch 3 times, most recently from 08a7825 to de2a5fa Compare March 9, 2026 18:09
@cpcloud cpcloud requested a review from Copilot March 9, 2026 21:31
Copilot AI left a comment

Pull request overview

Adds optional OCR spatial layout annotations (derived from tesseract TSV) to the LLM extraction prompt, with configuration + UI toggle support to improve extraction accuracy on invoices/forms/tables.

Changes:

  • Introduces SpatialTextFromTSV (compact [left,top,width] with optional ;conf) and threads TSV controls through the extraction prompt builder.
  • Adds config + env var plumbing for enabling TSV and controlling confidence threshold.
  • Adds a t key toggle in the extraction overlay to rerun the LLM step with layout on/off.
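
As an illustration of the config and env var plumbing described above, the following is a rough sketch rather than the code in `internal/config/config.go`. The `ConfidenceThresholdVal *int` field name and the two env var names come from this PR; the `TSV` field, the accessor names, `applyEnv`, and the injected `lookup` function are assumptions chosen to keep the example small and testable.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const (
	defaultTSV                 = true
	defaultConfidenceThreshold = 70
)

// OCR mirrors the [extraction.ocr] subtable. Nil pointers mean "not set",
// so the accessors can fall back to the documented defaults.
type OCR struct {
	TSV                    *bool // assumed field name
	ConfidenceThresholdVal *int
}

// TSVEnabled reports whether spatial layout annotations are sent.
func (o OCR) TSVEnabled() bool {
	if o.TSV == nil {
		return defaultTSV
	}
	return *o.TSV
}

// ConfidenceThreshold returns the cutoff below which ";conf" is annotated.
func (o OCR) ConfidenceThreshold() int {
	if o.ConfidenceThresholdVal == nil {
		return defaultConfidenceThreshold
	}
	return *o.ConfidenceThresholdVal
}

// validate rejects thresholds outside the 0-100 range.
func (o OCR) validate() error {
	if t := o.ConfidenceThreshold(); t < 0 || t > 100 {
		return fmt.Errorf("confidence_threshold must be 0-100, got %d", t)
	}
	return nil
}

// applyEnv overlays environment values on top of file config. lookup has
// the same shape as os.LookupEnv so tests can pass a map-backed function.
func (o *OCR) applyEnv(lookup func(string) (string, bool)) error {
	if v, ok := lookup("MICASA_EXTRACTION_OCR_TSV"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return fmt.Errorf("MICASA_EXTRACTION_OCR_TSV: %w", err)
		}
		o.TSV = &b
	}
	if v, ok := lookup("MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD"); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return fmt.Errorf("MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD: %w", err)
		}
		o.ConfidenceThresholdVal = &n
	}
	return o.validate()
}

func main() {
	var cfg OCR
	if err := cfg.applyEnv(os.LookupEnv); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("tsv:", cfg.TSVEnabled(), "threshold:", cfg.ConfidenceThreshold())
}
```

Pointer fields are used here so that an explicit `tsv = false` or `confidence_threshold = 0` can be told apart from an unset value.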

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| internal/extract/pipeline.go | Threads TSV/threshold settings into LLM prompt construction. |
| internal/extract/ocr.go | Adds TSV → compact spatial text conversion used for prompting. |
| internal/extract/ocr_test.go | Adds unit tests for spatial TSV formatting (bbox + confidence rules). |
| internal/extract/llmextract.go | Updates prompt construction to optionally include spatial OCR annotations and hint text. |
| internal/extract/llmextract_test.go | Adds tests ensuring spatial OCR is included/excluded correctly in prompts. |
| internal/config/config.go | Adds [extraction.ocr] config section + accessors for TSV and confidence threshold. |
| internal/config/config_test.go | Extends env var mapping tests and adds config/env parsing tests for new fields. |
| internal/app/types.go | Extends extraction config/state to carry TSV + threshold. |
| internal/app/model.go | Wires new extraction state fields from options into the model. |
| internal/app/extraction.go | Adds t toggle to rerun LLM extraction with layout annotations on/off. |
| internal/app/extraction_test.go | Adds tests for the new TSV toggle behavior and UI hints. |
| cmd/micasa/main.go | Passes new config values into Options.SetExtraction. |

Comment thread internal/extract/llmextract.go Outdated
Comment thread internal/extract/ocr.go Outdated
Comment thread internal/extract/ocr.go
Comment thread internal/config/config.go Outdated
Comment thread internal/config/config_test.go Outdated
cpcloud and others added 3 commits March 10, 2026 04:25
Send compact line-level bounding boxes from tesseract OCR to the LLM
during extraction, improving accuracy for invoices, forms, and tabular
documents. The format is [left,top,width] per line (~2x token overhead
vs plain text), with confidence scores shown only for suspect lines
(below a configurable threshold, default 70).

- Add SpatialTextFromTSV() that converts raw TSV to compact spatial format
- Drop height from bounding boxes (nearly constant, no signal)
- Threshold-based confidence: only annotate lines with minConf < threshold
- New config: ocr_tsv (default true), ocr_conf_threshold (default 70)
- Toggle in extraction overlay: press 't' to switch layout on/off on rerun
- Thread config through pipeline, prompt builder, and app plumbing

closes #699

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fall back to plain text when TSV-to-spatial conversion yields empty
- Validate confidence_threshold is 0-100 in config loading
- Detect page breaks in concatenated per-page TSV via block number decrease
- Fix doc comment to mention paragraph breaks
- Add tests for page break detection, spatial fallback, and threshold validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
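
The page-break rule from this commit can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `internal/extract/ocr.go`: it assumes each page was OCR'd separately, so `block_num` (the third TSV column) restarts on every page, and it treats any decrease in `block_num` between consecutive data rows as the start of a new page. The name `splitPages` is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitPages groups concatenated per-page TSV rows into pages. A new page
// starts whenever block_num is lower than on the previous data row.
// Header rows and malformed rows are skipped.
func splitPages(tsv string) [][]string {
	var pages [][]string
	prevBlock := -1
	for _, row := range strings.Split(tsv, "\n") {
		f := strings.Split(row, "\t")
		if len(f) < 12 {
			continue
		}
		block, err := strconv.Atoi(f[2])
		if err != nil { // header row: "block_num" is not a number
			continue
		}
		if len(pages) == 0 || block < prevBlock {
			pages = append(pages, nil)
		}
		pages[len(pages)-1] = append(pages[len(pages)-1], row)
		prevBlock = block
	}
	return pages
}

func main() {
	header := "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext"
	rows := []string{
		header,
		"5\t1\t1\t1\t1\t1\t50\t100\t80\t20\t96\tInvoice",
		"5\t1\t2\t1\t1\t1\t50\t300\t70\t20\t93\tTotal",
		header, // second page: page_num is 1 again, block_num restarts
		"5\t1\t1\t1\t1\t1\t50\t90\t60\t20\t95\tTerms",
	}
	for i, page := range splitPages(strings.Join(rows, "\n")) {
		fmt.Printf("page %d: %d rows\n", i+1, len(page))
	}
}
```

One limitation of this sketch: two consecutive pages whose rows never show a decrease in `block_num` would be merged into one page.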
…reshold

filterTSVByConfidence silently dropped low-confidence OCR words, losing
data. Remove it entirely -- OCR data should never be discarded.

Unify confidence_threshold to serve a single purpose: controlling when
confidence annotations appear in spatial layout output sent to the LLM.
Default 70.

- Remove filterTSVByConfidence and all call sites
- Remove ConfidenceThreshold from PDFOCRExtractor/ImageOCRExtractor
- Remove confidenceThreshold param from DefaultExtractors
- Restore ConfidenceThresholdVal *int on OCR struct for spatial display
- Add AGENTS.md rule: reply to every PR review comment on GitHub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 10, 2026 08:36
@cpcloud cpcloud force-pushed the worktree-crystalline-humming-thompson branch from 769dce0 to 8ce00e4 Compare March 10, 2026 08:36
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/extract/extractor.go:174

  • Removing confidence-based filtering from the OCR extractors means low-confidence OCR tokens will now always be included in TextSource.Text, which can regress search/display quality and increases prompt tokens when TSV is disabled/fallbacks occur. If you still want filtering, consider reintroducing it behind a separate config knob (distinct from the new spatial annotation threshold) and keep TSV data unfiltered for layout conversion.
func (e *PDFOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error) {
	if len(data) == 0 {
		return TextSource{}, nil
	}
	text, tsv, err := ocrPDF(ctx, data, e.MaxPages)
	if err != nil {
		return TextSource{}, err
	}
	return TextSource{
		Tool: "tesseract",
		Desc: "Text recognized from rasterized page images. Covers scanned pages that pdftotext misses, but may contain OCR errors.",
		Text: text,
		Data: tsv,
	}, nil

Comment thread internal/config/config.go Outdated
Comment thread internal/config/config.go
Remove the stale commented-out [extraction.ocr] section that still
described word-dropping behavior. Merge into the single active section
with enable, tsv, and confidence_threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpcloud cpcloud merged commit 73f87f3 into main Mar 10, 2026
25 checks passed
@cpcloud cpcloud deleted the worktree-crystalline-humming-thompson branch March 10, 2026 09:45
cpcloud added a commit that referenced this pull request Mar 19, 2026
## Summary

- Send compact spatial layout annotations from tesseract OCR to the LLM
during document extraction, improving accuracy for invoices, forms, and
tabular documents
- Format: `[left,top,width]` per line (~2x token overhead vs plain
text), with `[left,top,width;conf]` only for suspect lines below a
configurable confidence threshold
- Drop height from bounding boxes (nearly constant across lines, no
useful signal)
- New config: `[extraction.ocr]` subtable with `tsv` (default `true`)
and `confidence_threshold` (default `70`)
- Toggle in extraction overlay: press `t` to switch spatial layout
on/off and rerun LLM extraction
- Env vars: `MICASA_EXTRACTION_OCR_TSV`,
`MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD`

closes #699

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

feat(extract): send tesseract TSV output to LLM instead of plain text

2 participants