Garbled order of OCR'ed contents #1035

Open
rkevk opened this issue Nov 16, 2022 · 2 comments

@rkevk

rkevk commented Nov 16, 2022

I've encountered a typewritten but high-resolution and clearly legible PDF whose OCR text layer ends up misplaced after it is generated. The file in question is this thesis. Page 3 is a good pure-text example of where this behavior occurs (I won't paste the page here because I'm unsure about copyright issues).

More explicitly, it seems that Tesseract identifies all the words and their order more or less correctly (as can be verified using --sidecar), but when the OCR contents are grafted back onto the PDF, the text or its bounding boxes get misplaced. For example, when selecting an OCR'ed phrase spanning more than one word, the words evidently do not follow the left-to-right, top-to-bottom order that they should: instead of two adjacent words, half the page is highlighted. This can also be verified by copy-pasting the entire text (i.e., CTRL-A, CTRL-C), which results in a garbled version of the original Tesseract/sidecar text.
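The comparison described above (sidecar text versus text copied out of the viewer) can be put into numbers. The following is only a minimal sketch, not part of OCRmyPDF: the function name `word_order_similarity` and the sample strings are made up for illustration, and it assumes both texts have already been saved as plain strings by whatever means.

```python
from difflib import SequenceMatcher


def word_order_similarity(sidecar_text: str, extracted_text: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical word sequences.

    Hypothetical helper: compares the words of the --sidecar output with the
    words obtained by copy-pasting (or otherwise extracting) the PDF text.
    """
    sidecar_words = sidecar_text.split()
    extracted_words = extracted_text.split()
    # autojunk=False keeps very frequent words from being ignored on long pages.
    matcher = SequenceMatcher(None, sidecar_words, extracted_words, autojunk=False)
    return matcher.ratio()


# Illustrative strings only, not taken from the thesis in question.
sidecar = "the quick brown fox jumps over the lazy dog"
garbled = "lazy dog the quick over the brown fox jumps"

print(word_order_similarity(sidecar, sidecar))
print(word_order_similarity(sidecar, garbled))
```

A ratio well below 1.0 for a page whose sidecar text looks fine would suggest that the words are right but their order in the extracted text layer is not, which is the symptom reported here.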

This occurred on both v13.4 and v13.5 with Tesseract 4.1.1 when running OCRmyPDF without any further options. Is there an option I should be trying here?

@jbarlow83
Collaborator

You can try --pdf-renderer hocr, which uses a different renderer to produce the PDF.
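For anyone scripting this, the suggestion can be sketched as a small Python wrapper around the command line. The only OCRmyPDF options used are the two mentioned in this thread (--pdf-renderer and --sidecar); the helper names and the file names are assumptions made up for the example.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, renderer=None, sidecar=None):
    """Assemble an ocrmypdf command line as a list of arguments.

    Hypothetical helper; options come first, then the input and output paths.
    """
    cmd = ["ocrmypdf"]
    if renderer is not None:
        cmd += ["--pdf-renderer", renderer]
    if sidecar is not None:
        cmd += ["--sidecar", sidecar]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_if_available(cmd):
    """Run the command only if the executable is on PATH; return its exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


# File names below are placeholders, not files from this issue.
command = build_ocrmypdf_command(
    "input.pdf", "output_hocr.pdf", renderer="hocr", sidecar="output.txt"
)
print(" ".join(command))
```

Calling `run_if_available(command)` would then produce a second PDF that can be opened next to the default-renderer output in the same viewer.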

Some PDF viewers also struggle with OCR placement.

@rkevk
Author

rkevk commented Nov 18, 2022

For the PDF viewer in question (Evince), hocr didn't make a difference.
However, it turns out that the Firefox PDF viewer does read the placement of the OCR contents correctly, regardless of which renderer I use (I'll note, though, that Evince also worked fine for most other OCR'ed documents). Feel free to close the issue if you feel the problem lies more with Evince than with OCRmyPDF.
