when pdf_renderer='hocr' mode the output PDF text is reversed? #642
Comments
You showed me a screenshot of the order you expected, I think from another OCR program, with the concern that OCRmyPDF presents the content stream information in the reverse of the order shown there. Yes, we shouldn't emit it exactly backwards; that is a good catch, and conveniently it can be fixed by removing a minus sign. Just to forewarn, you can't actually rely on the order in which text appears in a page content stream. The only reasonable way to interpret a content stream is to parse it and fully locate the position of all text. On top of that, the characters in bytestrings in a content stream are glyph IDs that map to a particular character in a particular font, so you can't even trust that the characters inside
Non-Adobe users still run into problems with this type of output PDF, so we need a fix that puts the PDF's content stream in sorted order. Please include this in the next release.
@gogineniravikumar This was fixed in v11.1.2.
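The maintainer's point about locating text by position, rather than trusting content stream order, can be illustrated with a minimal sketch. It assumes the text spans have already been extracted with their coordinates by some PDF parser; the `Span` type, the `reading_order` function, and the `line_tolerance` parameter are hypothetical names for illustration and are not part of OCRmyPDF. It also assumes the default PDF user space (origin at the bottom-left, y increasing upward) and ignores rotation and other transformations.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A run of text and the position of its origin (hypothetical type)."""
    x: float
    y: float
    text: str


def reading_order(spans: List[Span], line_tolerance: float = 2.0) -> List[str]:
    """Return lines of text ordered top-to-bottom, left-to-right.

    Assumes default PDF user space: larger y is higher on the page.
    Spans whose y differs from the first span of the current line by at
    most line_tolerance are treated as part of that line.
    """
    lines: List[List[Span]] = []
    # Sort by descending y so the topmost text comes first.
    for span in sorted(spans, key=lambda s: -s.y):
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, order spans left to right.
    return [
        " ".join(s.text for s in sorted(line, key=lambda s: s.x))
        for line in lines
    ]


# Spans listed in the order a content stream might emit them (bottom line first).
stream_order = [
    Span(72, 700, "last"), Span(110, 700, "line"),
    Span(72, 720, "first"), Span(112, 720.5, "line"),
]
print(reading_order(stream_order))
```

Because the result depends only on coordinates, it is the same whichever order the spans arrive in, which is why a reversed content stream does not matter to a reader that locates text this way.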
Describing the bug
When I generate an output PDF using ocrmypdf in pdf_renderer='hocr' mode, the text in the output PDF's content stream is reversed: the first line of the page comes last in the content stream, and the last line comes first. Because of this, when the output file is tagged, it is read in reverse order.
To Reproduce
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", pdf_renderer='hocr', keep_temporary_files=True)
vs
The correct output file can be generated with:
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", keep_temporary_files=True)
Example file
input file
error output file
Expected behavior
The text in the text layer should appear in the correct order.
Screenshots
![image](https://user-images.githubusercontent.com/50137469/94533740-33d8ea80-025d-11eb-86cf-156e54b34662.png)
output pdf content stream
System
ocrmypdf 11.1.0