[Bug]: Tesseract OCR fails to separate words, causing over-redaction #17

### Describe the bug
When scanning documents (e.g., IDs/Licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').

Because each fused run of text is detected as a single large 'word', a regex match on any *part* of that string (e.g. the number '123') causes the **entire** string to be redacted.

### Steps to reproduce
1. Upload a document where text is close together (like a Driver's License).
2. Observe that 'Address' and other non-sensitive labels get redacted along with the sensitive data.

### Expected behavior
Only the sensitive substring should be redacted, or words should be correctly segmented.
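The first option (redacting only the matched substring) could be approximated without char-level OCR output. Below is a minimal sketch, not the project's actual code: the `Box` type, the `redaction_boxes` helper, and the sample coordinates are all hypothetical, and it assumes glyphs are roughly equal in width so that a match's horizontal extent can be estimated from its character offsets within the word box.

```python
import re
from typing import Iterator, NamedTuple


class Box(NamedTuple):
    """Pixel bounding box (hypothetical type, for illustration only)."""
    left: int
    top: int
    width: int
    height: int


def redaction_boxes(text: str, box: Box, pattern: "re.Pattern[str]") -> Iterator[Box]:
    """Yield an estimated sub-box for each regex match inside one OCR 'word'.

    Assumption: every character occupies the same horizontal space, so the
    match is located by linear interpolation across the word box.
    """
    if not text:
        return
    char_width = box.width / len(text)
    for match in pattern.finditer(text):
        if match.start() == match.end():
            continue  # ignore empty matches
        left = box.left + round(match.start() * char_width)
        right = box.left + round(match.end() * char_width)
        yield Box(left, box.top, right - left, box.height)


# Hypothetical OCR result: one fused 'word' covering the whole line.
word = "Address:123MainSt"
word_box = Box(left=100, top=40, width=340, height=20)

boxes = list(redaction_boxes(word, word_box, re.compile(r"\d+")))
print(boxes)
```

With proportional fonts the estimate will drift, so padding the box by a few pixels or switching to real per-character boxes (if the OCR wrapper exposes them) would likely be more robust.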

### Additional context
A code audit suggests Tesseract is grouping these into a single 'word' block. We might need to force character-level processing or improve segmentation.
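As one possible take on the "better segmentation" idea, fused tokens could be re-split after OCR with a boundary heuristic. The sketch below is an assumption about how this might look, not existing project code: `split_fused` is a hypothetical helper, and the boundary rules (after `:`/`;`/`,`, letter-to-digit, digit-to-letter, lowercase-to-uppercase) are guesses that fit the example in this issue.

```python
import re

# Zero-width boundaries where a space was probably dropped by the OCR.
_BOUNDARY = re.compile(
    r"(?<=[:;,])(?=\S)"        # right after label punctuation
    r"|(?<=[A-Za-z])(?=\d)"    # letter followed by digit
    r"|(?<=\d)(?=[A-Za-z])"    # digit followed by letter
    r"|(?<=[a-z])(?=[A-Z])"    # lowercase followed by uppercase
)


def split_fused(token: str) -> list:
    """Split a fused OCR token into likely words (heuristic sketch)."""
    # Splitting on zero-width matches requires Python 3.7+.
    return [part for part in _BOUNDARY.split(token) if part]


parts = split_fused("Address:123MainSt")
print(" ".join(parts))
```

Note that this heuristic would also split genuinely mixed identifiers (for example, a license number that combines letters and digits), so it probably needs an allow-list or should only be applied to tokens that already triggered a regex match.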
