Describe the bug
When scanning documents (e.g., IDs/Licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').
Because the fused text is returned as a single massive 'word' with one bounding box, a regex that matches only part of that string (e.g. the number '123') causes the entire string to be redacted.
Steps to reproduce
- Upload a document with tightly spaced text (e.g., a driver's license).
- Observe that 'Address' and other non-sensitive labels get redacted along with the sensitive data.
Expected behavior
Only the sensitive substring should be redacted, or words should be correctly segmented.
Additional context
A code audit suggests Tesseract is grouping each label and its value into a single 'word' block. We might need to force character-level processing or improve segmentation.
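One possible direction, sketched below, is to keep the word-level OCR output but redact only the part of the box that corresponds to the regex match. This is a rough illustration, not the project's actual code: `Box` and `sub_boxes` are hypothetical names, the box fields assume the left/top/width/height layout of Tesseract's word-level TSV output, and the position estimate assumes every character in the word is about the same width, which is only approximate for proportional fonts.

```python
import math
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical word box, in pixels (left/top/width/height)."""
    left: int
    top: int
    width: int
    height: int


def sub_boxes(text: str, box: Box, pattern: "re.Pattern[str]") -> List[Box]:
    """Estimate one box per regex match inside a single OCR 'word'.

    Assumption: characters are roughly uniform in width, so a match's
    horizontal extent is proportional to its character offsets.
    """
    if not text:
        return []
    char_w = box.width / len(text)
    result = []
    for m in pattern.finditer(text):
        if m.start() == m.end():
            continue  # skip empty matches
        # Round outward so the estimated box never clips the match.
        left = box.left + int(m.start() * char_w)
        right = box.left + math.ceil(m.end() * char_w)
        result.append(Box(left, box.top, right - left, box.height))
    return result


word = "Address:123MainSt"
word_box = Box(left=100, top=50, width=170, height=20)
print(sub_boxes(word, word_box, re.compile(r"\d+")))
```

With this approach the 'Address:' label stays visible and only the estimated span of '123' is covered. Because the estimate can be off by a character or so, padding the box slightly may be safer when the goal is redaction.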