feat(extract): send spatial layout annotations from OCR to LLM#724
08a7825 to de2a5fa
Pull request overview
Adds optional OCR spatial layout annotations (derived from tesseract TSV) to the LLM extraction prompt, with configuration + UI toggle support to improve extraction accuracy on invoices/forms/tables.
Changes:
- Introduces `SpatialTextFromTSV` (compact `[left,top,width]` with optional `;conf`) and threads TSV controls through the extraction prompt builder.
- Adds config + env var plumbing for enabling TSV and controlling the confidence threshold.
- Adds a `t` key toggle in the extraction overlay to rerun the LLM step with layout on/off.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/extract/pipeline.go | Threads TSV/threshold settings into LLM prompt construction. |
| internal/extract/ocr.go | Adds TSV → compact spatial text conversion used for prompting. |
| internal/extract/ocr_test.go | Adds unit tests for spatial TSV formatting (bbox + confidence rules). |
| internal/extract/llmextract.go | Updates prompt construction to optionally include spatial OCR annotations and hint text. |
| internal/extract/llmextract_test.go | Adds tests ensuring spatial OCR is included/excluded correctly in prompts. |
| internal/config/config.go | Adds [extraction.ocr] config section + accessors for TSV and confidence threshold. |
| internal/config/config_test.go | Extends env var mapping tests and adds config/env parsing tests for new fields. |
| internal/app/types.go | Extends extraction config/state to carry TSV + threshold. |
| internal/app/model.go | Wires new extraction state fields from options into the model. |
| internal/app/extraction.go | Adds t toggle to rerun LLM extraction with layout annotations on/off. |
| internal/app/extraction_test.go | Adds tests for the new TSV toggle behavior and UI hints. |
| cmd/micasa/main.go | Passes new config values into Options.SetExtraction. |
Send compact line-level bounding boxes from tesseract OCR to the LLM during extraction, improving accuracy for invoices, forms, and tabular documents. The format is [left,top,width] per line (~2x token overhead vs plain text), with confidence scores shown only for suspect lines (below a configurable threshold, default 70).

- Add SpatialTextFromTSV() that converts raw TSV to compact spatial format
- Drop height from bounding boxes (nearly constant, no signal)
- Threshold-based confidence: only annotate lines with minConf < threshold
- New config: ocr_tsv (default true), ocr_conf_threshold (default 70)
- Toggle in extraction overlay: press 't' to switch layout on/off on rerun
- Thread config through pipeline, prompt builder, and app plumbing

closes #699

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
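For readers who want a concrete picture of the format described in this commit, here is a minimal Go sketch of the TSV-to-spatial conversion. It is not the PR's `SpatialTextFromTSV`: the name `spatialSketch`, the `spatialLine` type, and the exact output spacing are assumptions made for illustration. It relies only on tesseract's standard TSV column order (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and assumes a single page of input.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// spatialLine accumulates the words that tesseract assigned to one text line.
type spatialLine struct {
	left, top, right int
	minConf          float64
	words            []string
}

// spatialSketch is a hypothetical stand-in for SpatialTextFromTSV.
// It emits "[left,top,width] text" per line and switches to
// "[left,top,width;conf] text" when the line's lowest word confidence
// is below threshold. Height is deliberately left out.
func spatialSketch(tsv string, threshold float64) string {
	var order []string
	lines := map[string]*spatialLine{}
	for _, row := range strings.Split(tsv, "\n") {
		f := strings.SplitN(strings.TrimRight(row, "\r"), "\t", 12)
		if len(f) < 12 || f[0] != "5" { // level 5 rows are words
			continue
		}
		text := strings.TrimSpace(f[11])
		left, errL := strconv.Atoi(f[6])
		top, errT := strconv.Atoi(f[7])
		width, errW := strconv.Atoi(f[8])
		conf, errC := strconv.ParseFloat(f[10], 64)
		if text == "" || errL != nil || errT != nil || errW != nil || errC != nil {
			continue
		}
		key := strings.Join(f[1:5], "/") // page/block/paragraph/line
		ln, ok := lines[key]
		if !ok {
			ln = &spatialLine{left: left, top: top, right: left + width, minConf: conf}
			lines[key] = ln
			order = append(order, key)
		}
		if left < ln.left {
			ln.left = left
		}
		if top < ln.top {
			ln.top = top
		}
		if left+width > ln.right {
			ln.right = left + width
		}
		if conf < ln.minConf {
			ln.minConf = conf
		}
		ln.words = append(ln.words, text)
	}
	out := make([]string, 0, len(order))
	for _, key := range order {
		ln := lines[key]
		box := fmt.Sprintf("%d,%d,%d", ln.left, ln.top, ln.right-ln.left)
		if ln.minConf < threshold {
			box += fmt.Sprintf(";%.0f", ln.minConf)
		}
		out = append(out, "["+box+"] "+strings.Join(ln.words, " "))
	}
	return strings.Join(out, "\n")
}
```

With a threshold of 70, a line whose words all score above 70 gets only the bounding box, while a line containing a word at confidence 42 is emitted with `;42` appended inside the brackets.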
- Fall back to plain text when TSV-to-spatial conversion yields empty
- Validate confidence_threshold is 0-100 in config loading
- Detect page breaks in concatenated per-page TSV via block number decrease
- Fix doc comment to mention paragraph breaks
- Add tests for page break detection, spatial fallback, and threshold validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
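The page-break heuristic in this commit can be illustrated with a small Go sketch. This is an assumption-laden reading of "block number decrease", not the PR's code: `splitPagesSketch` is a hypothetical helper that looks only at word rows (level 5) and starts a new page whenever `block_num` drops below the previous row's value.

```go
package main

import (
	"strconv"
	"strings"
)

// splitPagesSketch groups the words of a concatenated per-page TSV by page.
// Hypothetical helper: a new page is assumed whenever block_num (column 3)
// is lower than on the previous word row.
func splitPagesSketch(tsv string) [][]string {
	var pages [][]string
	prevBlock := -1
	for _, row := range strings.Split(tsv, "\n") {
		f := strings.SplitN(strings.TrimRight(row, "\r"), "\t", 12)
		if len(f) < 12 || f[0] != "5" { // only word rows carry text
			continue
		}
		block, err := strconv.Atoi(f[2])
		if err != nil {
			continue
		}
		if len(pages) == 0 || block < prevBlock {
			pages = append(pages, nil) // first row, or block counter restarted
		}
		prevBlock = block
		pages[len(pages)-1] = append(pages[len(pages)-1], f[11])
	}
	return pages
}
```

One limitation of this sketch: two consecutive pages that each contain a single block both report block 1, so no decrease is visible and they would be merged. Whether and how the PR handles that case is not stated here.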
…reshold

filterTSVByConfidence silently dropped low-confidence OCR words, losing data. Remove it entirely -- OCR data should never be discarded. Unify confidence_threshold to serve a single purpose: controlling when confidence annotations appear in spatial layout output sent to the LLM. Default 70.

- Remove filterTSVByConfidence and all call sites
- Remove ConfidenceThreshold from PDFOCRExtractor/ImageOCRExtractor
- Remove confidenceThreshold param from DefaultExtractors
- Restore ConfidenceThresholdVal *int on OCR struct for spatial display
- Add AGENTS.md rule: reply to every PR review comment on GitHub

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
769dce0 to 8ce00e4
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
internal/extract/extractor.go:174
- Removing confidence-based filtering from the OCR extractors means low-confidence OCR tokens will now always be included in `TextSource.Text`, which can regress search/display quality and increase prompt tokens when TSV is disabled or fallbacks occur. If you still want filtering, consider reintroducing it behind a separate config knob (distinct from the new spatial annotation threshold) and keep TSV data unfiltered for layout conversion.
func (e *PDFOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error) {
	if len(data) == 0 {
		return TextSource{}, nil
	}
	// ocrPDF returns both the plain text and the raw TSV; neither is filtered by confidence.
	text, tsv, err := ocrPDF(ctx, data, e.MaxPages)
	if err != nil {
		return TextSource{}, err
	}
	return TextSource{
		Tool: "tesseract",
		Desc: "Text recognized from rasterized page images. Covers scanned pages that pdftotext misses, but may contain OCR errors.",
		Text: text,
		Data: tsv,
	}, nil
}
Remove the stale commented-out [extraction.ocr] section that still described word-dropping behavior. Merge into the single active section with enable, tsv, and confidence_threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

- Send compact spatial layout annotations from tesseract OCR to the LLM during document extraction, improving accuracy for invoices, forms, and tabular documents
- Format: `[left,top,width]` per line (~2x token overhead vs plain text), with `[left,top,width;conf]` only for suspect lines below a configurable confidence threshold
- Drop height from bounding boxes (nearly constant across lines, no useful signal)
- New config: `[extraction.ocr]` subtable with `tsv` (default `true`) and `confidence_threshold` (default `70`)
- Toggle in extraction overlay: press `t` to switch spatial layout on/off and rerun LLM extraction
- Env vars: `MICASA_EXTRACTION_OCR_TSV`, `MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD`

closes #699

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
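To make the two settings and their defaults concrete, here is a hedged Go sketch of reading them from the environment with the 0-100 range check mentioned in an earlier commit. The `ocrSettings` type and `ocrSettingsFromEnv` function are hypothetical, and parsing the boolean with `strconv.ParseBool` is an assumption; the PR's actual loader in internal/config/config.go may work differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// ocrSettings mirrors the two documented knobs. Hypothetical type.
type ocrSettings struct {
	TSV                 bool
	ConfidenceThreshold int
}

// ocrSettingsFromEnv starts from the documented defaults (tsv=true,
// confidence_threshold=70) and applies any overrides found via getenv.
// Passing getenv as a parameter keeps the function testable.
func ocrSettingsFromEnv(getenv func(string) string) (ocrSettings, error) {
	s := ocrSettings{TSV: true, ConfidenceThreshold: 70}
	if v := getenv("MICASA_EXTRACTION_OCR_TSV"); v != "" {
		b, err := strconv.ParseBool(v) // assumption: standard Go bool syntax
		if err != nil {
			return s, fmt.Errorf("MICASA_EXTRACTION_OCR_TSV: %w", err)
		}
		s.TSV = b
	}
	if v := getenv("MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return s, fmt.Errorf("MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD: %w", err)
		}
		if n < 0 || n > 100 {
			return s, fmt.Errorf("MICASA_EXTRACTION_OCR_CONFIDENCE_THRESHOLD: must be 0-100, got %d", n)
		}
		s.ConfidenceThreshold = n
	}
	return s, nil
}
```

In a real program this would be called as `ocrSettingsFromEnv(os.Getenv)`; with neither variable set it returns the defaults.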