feat(extract): send spatial layout annotations from OCR to LLM by cpcloud · Pull Request #724 · micasa-dev/micasa

cpcloud · 2026-03-09T11:58:09Z

Summary

- Send compact spatial layout annotations from tesseract OCR to the LLM during document extraction, improving accuracy for invoices, forms, and tabular documents
- Format: `[left,top,width]` per line (~2x token overhead vs plain text), with `[left,top,width;conf]` only for suspect lines below a configurable confidence threshold
- Drop height from bounding boxes (nearly constant across lines, no useful signal)
- New config: `[extraction.ocr]` subtable with `tsv` (default `true`) and `confidence_threshold` (default `70`)
- Toggle in extraction overlay: press `t` to switch spatial layout on/off and rerun LLM extraction
- Env vars: `MICASA_EXTRACTION_OCR_TSV`, `MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD`

closes #699
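
To make the annotation format concrete, here is a minimal sketch of how word-level tesseract TSV rows could be folded into `[left,top,width]` lines. It is not the PR's `SpatialTextFromTSV`: the function name `spatialLines`, the `lineBox` type, the choice to put the text after the bracket, and the integer rendering of the confidence are all assumptions for illustration. It relies only on the standard tesseract TSV column order (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and does not handle concatenated multi-page input.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lineBox accumulates the union bounding box and the lowest word
// confidence for one OCR text line. Height is deliberately not tracked.
type lineBox struct {
	left, top, right int
	minConf          float64
	words            []string
}

// spatialLines is a hypothetical stand-in for the PR's SpatialTextFromTSV.
// It groups word rows (level 5) by page/block/paragraph/line and emits one
// annotated line each; conf is shown only when it falls below threshold.
func spatialLines(tsv string, threshold int) string {
	var order []string
	boxes := map[string]*lineBox{}
	for _, row := range strings.Split(tsv, "\n") {
		f := strings.Split(strings.TrimRight(row, "\r"), "\t")
		if len(f) < 12 || f[0] != "5" { // skips header and non-word rows
			continue
		}
		left, errL := strconv.Atoi(f[6])
		top, errT := strconv.Atoi(f[7])
		width, errW := strconv.Atoi(f[8])
		conf, errC := strconv.ParseFloat(f[10], 64)
		word := strings.TrimSpace(f[11])
		if errL != nil || errT != nil || errW != nil || errC != nil || word == "" {
			continue
		}
		key := strings.Join(f[1:5], "/") // page/block/par/line
		b, ok := boxes[key]
		if !ok {
			b = &lineBox{left: left, top: top, right: left + width, minConf: conf}
			boxes[key] = b
			order = append(order, key)
		}
		if left < b.left {
			b.left = left
		}
		if top < b.top {
			b.top = top
		}
		if left+width > b.right {
			b.right = left + width
		}
		if conf < b.minConf {
			b.minConf = conf
		}
		b.words = append(b.words, word)
	}
	var out strings.Builder
	for _, key := range order {
		b := boxes[key]
		if b.minConf < float64(threshold) {
			fmt.Fprintf(&out, "[%d,%d,%d;%d] ", b.left, b.top, b.right-b.left, int(b.minConf))
		} else {
			fmt.Fprintf(&out, "[%d,%d,%d] ", b.left, b.top, b.right-b.left)
		}
		out.WriteString(strings.Join(b.words, " "))
		out.WriteByte('\n')
	}
	return out.String()
}

func main() {
	tsv := "5\t1\t1\t1\t1\t1\t100\t50\t40\t12\t96\tInvoice\n" +
		"5\t1\t1\t1\t1\t2\t150\t50\t30\t12\t91\t#42\n" +
		"5\t1\t1\t1\t2\t1\t100\t80\t60\t12\t55\tT0tal\n"
	fmt.Print(spatialLines(tsv, 70))
}
```

With the default threshold of 70, only the second line in this sample (lowest word confidence 55) carries a `;conf` suffix.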

codecov · 2026-03-09T12:01:06Z

Codecov Report

❌ Patch coverage is 88.82979% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (7b9d7e5) to head (3ae2a35).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/app/extraction.go	57.69%	10 Missing and 1 partial ⚠️
internal/extract/ocr.go	95.04%	3 Missing and 2 partials ⚠️
cmd/micasa/main.go	0.00%	2 Missing ⚠️
internal/app/types.go	0.00%	2 Missing ⚠️
internal/extract/llmextract.go	95.45%	0 Missing and 1 partial ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
internal/app/model.go	`62.37% <100.00%> (+0.03%)`	⬆️
internal/config/config.go	`89.64% <100.00%> (+0.22%)`	⬆️
internal/extract/extractor.go	`100.00% <100.00%> (+7.01%)`	⬆️
internal/extract/pipeline.go	`96.00% <100.00%> (+0.10%)`	⬆️
internal/extract/llmextract.go	`98.27% <95.45%> (+0.29%)`	⬆️
cmd/micasa/main.go	`3.31% <0.00%> (-0.02%)`	⬇️
internal/app/types.go	`52.32% <0.00%> (-1.25%)`	⬇️
internal/extract/ocr.go	`86.53% <95.04%> (+2.49%)`	⬆️
internal/app/extraction.go	`70.76% <57.69%> (+0.05%)`	⬆️

... and 2 files with indirect coverage changes


Copilot

Pull request overview

Adds optional OCR spatial layout annotations (derived from tesseract TSV) to the LLM extraction prompt, with configuration + UI toggle support to improve extraction accuracy on invoices/forms/tables.

Changes:

- Introduces `SpatialTextFromTSV` (compact `[left,top,width]` with optional `;conf`) and threads TSV controls through the extraction prompt builder.
- Adds config + env var plumbing for enabling TSV and controlling confidence threshold.
- Adds a `t` key toggle in the extraction overlay to rerun the LLM step with layout on/off.
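
As a rough illustration of how the layout setting could steer prompt construction, the sketch below picks between spatial and plain OCR text. The helper name `ocrPromptSection` and the hint wording are hypothetical and are not taken from `llmextract.go`. The fallback to plain text when the spatial text is empty mirrors behavior that a follow-up commit in this PR describes.

```go
package main

import (
	"fmt"
	"strings"
)

// ocrPromptSection is a hypothetical helper, not the PR's prompt builder.
// plain is the ordinary OCR text, spatial is the already-converted layout
// text (possibly empty), and useLayout mirrors the tsv setting or t toggle.
func ocrPromptSection(plain, spatial string, useLayout bool) string {
	if useLayout && strings.TrimSpace(spatial) != "" {
		// Assumed hint text; the real prompt wording is not shown in the PR page.
		hint := "Each line is prefixed with [left,top,width]; " +
			"[left,top,width;conf] marks a low-confidence line.\n"
		return hint + spatial
	}
	// Layout disabled, or conversion produced nothing: use plain text.
	return plain
}

func main() {
	fmt.Print(ocrPromptSection("Invoice #42\n", "[100,50,80] Invoice #42\n", true))
	fmt.Print(ocrPromptSection("Invoice #42\n", "", true))
}
```

Keeping the choice in one function means a toggle only has to flip `useLayout` and rerun the LLM step; the OCR output itself does not need to be recomputed.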

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

File	Description
internal/extract/pipeline.go	Threads TSV/threshold settings into LLM prompt construction.
internal/extract/ocr.go	Adds TSV → compact spatial text conversion used for prompting.
internal/extract/ocr_test.go	Adds unit tests for spatial TSV formatting (bbox + confidence rules).
internal/extract/llmextract.go	Updates prompt construction to optionally include spatial OCR annotations and hint text.
internal/extract/llmextract_test.go	Adds tests ensuring spatial OCR is included/excluded correctly in prompts.
internal/config/config.go	Adds `[extraction.ocr]` config section + accessors for TSV and confidence threshold.
internal/config/config_test.go	Extends env var mapping tests and adds config/env parsing tests for new fields.
internal/app/types.go	Extends extraction config/state to carry TSV + threshold.
internal/app/model.go	Wires new extraction state fields from options into the model.
internal/app/extraction.go	Adds `t` toggle to rerun LLM extraction with layout annotations on/off.
internal/app/extraction_test.go	Adds tests for the new TSV toggle behavior and UI hints.
cmd/micasa/main.go	Passes new config values into `Options.SetExtraction`.


Send compact line-level bounding boxes from tesseract OCR to the LLM during extraction, improving accuracy for invoices, forms, and tabular documents. The format is [left,top,width] per line (~2x token overhead vs plain text), with confidence scores shown only for suspect lines (below a configurable threshold, default 70).

- Add SpatialTextFromTSV() that converts raw TSV to compact spatial format
- Drop height from bounding boxes (nearly constant, no signal)
- Threshold-based confidence: only annotate lines with minConf < threshold
- New config: ocr_tsv (default true), ocr_conf_threshold (default 70)
- Toggle in extraction overlay: press 't' to switch layout on/off on rerun
- Thread config through pipeline, prompt builder, and app plumbing

closes #699

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fall back to plain text when TSV-to-spatial conversion yields empty
- Validate confidence_threshold is 0-100 in config loading
- Detect page breaks in concatenated per-page TSV via block number decrease
- Fix doc comment to mention paragraph breaks
- Add tests for page break detection, spatial fallback, and threshold validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
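
The page-break rule in this commit ("block number decrease") can be sketched as follows. This is an illustrative guess at the idea, not the code in `ocr.go`: `pageStarts` is a hypothetical name, and it assumes each page was OCRed separately, so that block numbering restarts on every page and a drop in `block_num` (third TSV column) marks a new page.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageStarts is a hypothetical helper. It returns the 0-based index of
// every input line at which a new page begins, treating the first data
// row and any row whose block_num is lower than the previous data row's
// block_num as the start of a page.
func pageStarts(tsv string) []int {
	var starts []int
	prev := -1
	for i, row := range strings.Split(tsv, "\n") {
		f := strings.Split(row, "\t")
		if len(f) < 3 {
			continue // blank or truncated line
		}
		block, err := strconv.Atoi(f[2])
		if err != nil {
			continue // header row repeated by each page's TSV
		}
		if prev == -1 || block < prev {
			starts = append(starts, i)
		}
		prev = block
	}
	return starts
}

func main() {
	tsv := "level\tpage_num\tblock_num\n" +
		"1\t1\t0\n" +
		"2\t1\t1\n" +
		"2\t1\t2\n" +
		"level\tpage_num\tblock_num\n" +
		"1\t1\t0\n" +
		"2\t1\t1\n"
	fmt.Println(pageStarts(tsv))
}
```

One limitation of this heuristic as sketched: two consecutive pages whose rows all share the same block number would not be separated, since the number never decreases.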

…reshold

filterTSVByConfidence silently dropped low-confidence OCR words, losing data. Remove it entirely -- OCR data should never be discarded. Unify confidence_threshold to serve a single purpose: controlling when confidence annotations appear in spatial layout output sent to the LLM. Default 70.

- Remove filterTSVByConfidence and all call sites
- Remove ConfidenceThreshold from PDFOCRExtractor/ImageOCRExtractor
- Remove confidenceThreshold param from DefaultExtractors
- Restore ConfidenceThresholdVal *int on OCR struct for spatial display
- Add AGENTS.md rule: reply to every PR review comment on GitHub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/extract/extractor.go:174

Removing confidence-based filtering from the OCR extractors means low-confidence OCR tokens will now always be included in TextSource.Text, which can regress search/display quality and increases prompt tokens when TSV is disabled/fallbacks occur. If you still want filtering, consider reintroducing it behind a separate config knob (distinct from the new spatial annotation threshold) and keep TSV data unfiltered for layout conversion.

func (e *PDFOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error) {
	if len(data) == 0 {
		return TextSource{}, nil
	}
	text, tsv, err := ocrPDF(ctx, data, e.MaxPages)
	if err != nil {
		return TextSource{}, err
	}
	return TextSource{
		Tool: "tesseract",
		Desc: "Text recognized from rasterized page images. Covers scanned pages that pdftotext misses, but may contain OCR errors.",
		Text: text,
		Data: tsv,
	}, nil
}


Remove the stale commented-out [extraction.ocr] section that still described word-dropping behavior. Merge into the single active section with enable, tsv, and confidence_threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

## Summary

- Send compact spatial layout annotations from tesseract OCR to the LLM during document extraction, improving accuracy for invoices, forms, and tabular documents
- Format: `[left,top,width]` per line (~2x token overhead vs plain text), with `[left,top,width;conf]` only for suspect lines below a configurable confidence threshold
- Drop height from bounding boxes (nearly constant across lines, no useful signal)
- New config: `[extraction.ocr]` subtable with `tsv` (default `true`) and `confidence_threshold` (default `70`)
- Toggle in extraction overlay: press `t` to switch spatial layout on/off and rerun LLM extraction
- Env vars: `MICASA_EXTRACTION_OCR_TSV`, `MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD`

closes #699

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

cpcloud force-pushed the worktree-crystalline-humming-thompson branch 3 times, most recently from 08a7825 to de2a5fa, March 9, 2026 18:09

cpcloud requested a review from Copilot March 9, 2026 21:31

Copilot started reviewing on behalf of cpcloud March 9, 2026 21:31

Copilot AI reviewed Mar 9, 2026

Comment thread internal/extract/llmextract.go Outdated

Comment thread internal/extract/ocr.go Outdated

Comment thread internal/extract/ocr.go

Comment thread internal/config/config.go Outdated

Comment thread internal/config/config_test.go Outdated

cpcloud and others added 3 commits March 10, 2026 04:25

Copilot AI review requested due to automatic review settings March 10, 2026 08:36

cpcloud force-pushed the worktree-crystalline-humming-thompson branch from 769dce0 to 8ce00e4, March 10, 2026 08:36

Copilot started reviewing on behalf of cpcloud March 10, 2026 08:37

Copilot AI reviewed Mar 10, 2026

Comment thread internal/config/config.go Outdated

Comment thread internal/config/config.go

cpcloud merged commit 73f87f3 into main Mar 10, 2026
25 checks passed

cpcloud deleted the worktree-crystalline-humming-thompson branch March 10, 2026 09:45

BrewTestBot mentioned this pull request Mar 10, 2026

micasa 1.79.0 Homebrew/homebrew-core#271555

Merged

cpcloud merged 4 commits into main from worktree-crystalline-humming-thompson
