With pdf_renderer='hocr', the output PDF text is reversed? #642

Closed
gogineniravikumar opened this issue Sep 29, 2020 · 3 comments

Comments

@gogineniravikumar

Describing the bug
When I generate an output PDF with ocrmypdf using pdf_renderer='hocr', the text in the output PDF's content stream is reversed: the first line of the page becomes the last line in the content stream, and the last line of the page is the first one you encounter. Because of this, when you tag the output file, it reads in reverse order.

To Reproduce
import ocrmypdf
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", pdf_renderer='hocr', keep_temporary_files=True)

versus the correct output file, which you can generate by omitting pdf_renderer:
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", keep_temporary_files=True)

Example file
input file
error output file

Expected behavior
The text in the text layer should be in the correct order, with the first line of the page appearing first in the content stream.

Screenshots
output pdf content stream
image

System

  • OS: Windows 10
  • OCRmyPDF Version: 11.1.0
@jbarlow83
Collaborator

You showed me a screenshot of the order you expected from another OCR program, I think, with the concern that OCRmyPDF is presenting the content stream information in reverse order from what is shown there.

Yes, we shouldn't do it exactly backwards and that is a good catch, and conveniently can be fixed by removing a minus sign.

Just to forewarn, you can't actually rely on the order in which text appears in a page content stream. The only reasonable way to interpret a content stream is to parse it and fully locate the position of all text. On top of that, the characters in bytestrings in a content stream are glyph IDs that map to a particular character in a particular font, so you can't even trust that the characters inside (here) Tj will be legible. You have to use a program like poppler pdftotext or pdfminer.six that understands the intricacies of PDF fonts. A PDF writer is free to write content in any order it pleases. In some cases, reordering the text can affect display, e.g. if text is written over top of other text. It sounds to me like you're just dealing with another program that is using the content stream for tagging order, and this may not matter, but it's a common issue people run into so I hand out reminders freely.
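
To illustrate the point about locating text by position rather than trusting stream order, here is a minimal sketch. It assumes the text fragments and their coordinates have already been extracted by a tool that understands PDF fonts; the Fragment type, the reading_order function, and the sample data are all hypothetical and not part of OCRmyPDF or any PDF library.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment with its position in default PDF user space."""
    x: float
    y: float
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Return text lines top-to-bottom, left-to-right, ignoring input order."""
    # Default PDF user space has its origin at the bottom-left corner,
    # so a larger y means the fragment is higher on the page.
    by_height = sorted(fragments, key=lambda f: -f.y)
    lines = []
    for frag in by_height:
        # Fragments whose y values are within the tolerance share a line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments listed in an arbitrary "content stream" order (made-up data).
stream_order = [
    Fragment(72, 650, "Last line"),
    Fragment(130, 700, "line"),
    Fragment(72, 700, "First"),
    Fragment(72, 675, "Middle line"),
]
result = reading_order(stream_order)
print(result)
```

The output depends only on the coordinates, so shuffling stream_order gives the same lines. Real pages with columns, rotation, or overlapping text need more care than this sketch provides.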

@gogineniravikumar
Author

gogineniravikumar commented Oct 15, 2020

Non-Adobe users still face problems with this type of output PDF, so we need a fix.
I found why it's happening, and it's a small fix. In hocrtransform.py, the line of code causing it is
for line in sorted(
    chain(
        self.hocr.iterfind(self._child_xpath('span', 'ocr_header')),
        self.hocr.iterfind(self._child_xpath('span', 'ocr_line')),
        self.hocr.iterfind(self._child_xpath('span', 'ocr_textfloat')),
    ),
    key=self.topdown_position,
):
You have to reverse the output of the sorted function:
for line in sorted(
    chain(
        self.hocr.iterfind(self._child_xpath('span', 'ocr_header')),
        self.hocr.iterfind(self._child_xpath('span', 'ocr_line')),
        self.hocr.iterfind(self._child_xpath('span', 'ocr_textfloat')),
    ),
    key=self.topdown_position,
    reverse=True,
):

Then the PDF's content stream is in sorted order. Please update this in the next release.
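
As a rough illustration of the two ways to flip a sort like this one (reversing the result, or changing the sign of the key as in the "minus sign" remark earlier in the thread), here is a small standalone sketch. The Line type and this topdown_position function are hypothetical stand-ins, not OCRmyPDF's actual code, and which direction is correct depends on how the real key is defined, which this thread does not show.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical OCR line; top is measured from the top edge, as in an hOCR bbox."""
    top: float
    text: str


def topdown_position(line):
    # Stand-in sort key: smaller value means higher on the page.
    return line.top


lines = [Line(300, "third"), Line(100, "first"), Line(200, "second")]

# Ascending key order: the top line comes first.
top_to_bottom = [l.text for l in sorted(lines, key=topdown_position)]

# Two equivalent ways to get the opposite order.
bottom_to_top = [l.text for l in sorted(lines, key=topdown_position, reverse=True)]
negated_key = [l.text for l in sorted(lines, key=lambda l: -topdown_position(l))]

# Pitfall: list.reverse() reverses in place and returns None,
# so it cannot be chained onto sorted(...) inside a for statement.
in_place_result = sorted(lines, key=topdown_position).reverse()

print(top_to_bottom, bottom_to_top, in_place_result)
```

Passing reverse=True, or wrapping the call in reversed(), keeps the expression iterable, whereas looping over the return value of .reverse() raises a TypeError.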

@jbarlow83
Collaborator

@gogineniravikumar This was fixed in v11.1.2.
