[Bug]: Tesseract OCR fails to separate words, causing over-redaction #17

@karant-dev

Describe the bug

When scanning documents (e.g., IDs/Licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').

Because the fused text is detected as a single large 'word', a regex match on any part of it (e.g. the number '123') causes the entire string to be redacted.
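To make the failure mode concrete, here is a minimal sketch of token-level redaction. It is not the project's actual code: `redact_tokens`, the `\d+` pattern, and the token lists are all hypothetical stand-ins, assuming redaction is decided once per OCR word.

```python
import re

# Hypothetical sensitive-data pattern, for illustration only.
SENSITIVE = re.compile(r"\d+")

def redact_tokens(tokens, pattern=SENSITIVE):
    """Blank out every token in which the pattern matches anywhere."""
    return ["#" * len(t) if pattern.search(t) else t for t in tokens]

# What OCR should produce vs. what it produces when spaces are lost.
well_segmented = ["Address:", "123", "Main", "St"]
fused = ["Address:123MainSt"]

print(redact_tokens(well_segmented))  # only the "123" token is blanked
print(redact_tokens(fused))           # the whole fused token is blanked, label included
```

With correct segmentation the label survives; with the fused token there is no smaller unit to redact, so the label goes with it.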

Steps to reproduce

  1. Upload a document where text is close together (like a Driver's License).
  2. Observe that 'Address' and other non-sensitive labels get redacted along with the sensitive data.

Expected behavior

Only the sensitive substring should be redacted, or words should be correctly segmented.
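One possible way to redact only the sensitive substring is to estimate a sub-box for each regex match inside the word's bounding box. The sketch below assumes each OCR word comes with a `left`/`top`/`width`/`height` box (the same field names Tesseract's TSV output uses) and that characters are roughly equal in width, which is only an approximation for proportional fonts. `Box`, `match_boxes`, and the sample coordinates are hypothetical.

```python
import re
from typing import Iterator, NamedTuple

class Box(NamedTuple):
    left: int
    top: int
    width: int
    height: int

def match_boxes(text: str, box: Box, pattern: re.Pattern) -> Iterator[Box]:
    """Yield an approximate box for each regex match inside one OCR word.

    Assumes every character takes an equal share of the word's width.
    """
    if not text:
        return
    char_w = box.width / len(text)
    for m in pattern.finditer(text):
        left = box.left + round(m.start() * char_w)
        right = box.left + round(m.end() * char_w)
        yield Box(left, box.top, right - left, box.height)

# Made-up word box for the fused token from the report.
word = "Address:123MainSt"
word_box = Box(left=40, top=100, width=340, height=24)
for b in match_boxes(word, word_box, re.compile(r"\d+")):
    print(b)
```

Because the estimate can be off by a few pixels, padding the resulting box slightly would be the safer choice for redaction.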

Additional context

A code audit suggests Tesseract is grouping these into a single 'word' block. We might need to force character-level processing or improve segmentation.
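As a rough idea of what "better segmentation" could look like as a post-processing step, the sketch below re-inserts spaces at likely word boundaries in a fused token: after a colon, between letters and digits, and at lower-to-upper case changes. The `resegment` helper and its boundary rules are assumptions for illustration, not something Tesseract provides.

```python
import re

# Zero-width boundaries where a space was probably lost (heuristic).
_BOUNDARY = re.compile(
    r"(?<=:)(?=\S)"             # after a label colon
    r"|(?<=[A-Za-z])(?=\d)"     # letter followed by digit
    r"|(?<=\d)(?=[A-Za-z])"     # digit followed by letter
    r"|(?<=[a-z])(?=[A-Z])"     # lower-case followed by upper-case
)

def resegment(token: str) -> str:
    """Insert spaces at heuristic word boundaries inside a fused OCR token."""
    return _BOUNDARY.sub(" ", token)

print(resegment("Address:123MainSt"))
print(resegment("DOB:01/02/1990"))
```

This only fixes the text; each recovered piece would still need its own bounding box before redaction. The rules can also over-split legitimate values (for example a unit number like '4B'), so they would need tuning against real documents.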
