when pdf_renderer='hocr' mode the output PDF text is reversed? #642

gogineniravikumar · 2020-09-29T08:40:50Z

Describing the bug
When I generate an output PDF with ocrmypdf in pdf_renderer='hocr' mode, the text in the output PDF's content stream is reversed: the first line of the page comes last in the content stream, and the last line comes first. Because of this, when you tag the output file, it reads in reverse order.

To Reproduce
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", pdf_renderer='hocr', keep_temporary_files=True)

vs. the correct output file, which you can generate without pdf_renderer='hocr':
ocrmypdf.ocr("InputScanned.pdf", "outputwithtext.pdf", keep_temporary_files=True)

Example file
input file
error output file

Expected behavior
The text in the text layer should be in the correct order.

Screenshots
output pdf content stream

System

OS: Windows 10
OCRmyPDF Version: 11.1.0


jbarlow83 · 2020-09-29T09:43:20Z

You showed me a screenshot of the order you expected from another OCR program, I think, with the concern that OCRmyPDF is presenting the content stream information in reverse order from what is shown there.

Yes, we shouldn't do it exactly backwards and that is a good catch, and conveniently can be fixed by removing a minus sign.

Just to forewarn, you can't actually rely on the order in which text appears in a page content stream. The only reasonable way to interpret a content stream is to parse it and fully locate the position of all text. On top of that, the characters in bytestrings in a content stream are glyph IDs that map to a particular character in a particular font, so you can't even trust that the characters inside (here) Tj will be legible. You have to use a program like poppler pdftotext or pdfminer.six that understands the intricacies of PDF fonts. A PDF writer is free to write content in any order it pleases. In some cases, reordering the text can affect display, e.g. if text is written on top of other text. It sounds to me like you're just dealing with another program that is using the content stream for tagging order, and this may not matter, but it's a common issue people run into so I hand out reminders freely.
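
To illustrate the point about locating text by position instead of trusting stream order, here is a minimal, stdlib-only sketch. It assumes the text fragments have already been extracted with their coordinates by a tool that understands PDF fonts; the `TextFragment` record, the `reading_order` helper, and the `line_tolerance` parameter are hypothetical names invented for this example and are not part of OCRmyPDF, poppler, or pdfminer.six.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFragment:
    """Hypothetical record for a piece of text already located on the page."""
    text: str
    x: float  # left edge, in PDF units
    y: float  # baseline; the PDF y axis grows upward from the page bottom


def reading_order(fragments, line_tolerance=2.0):
    """Return fragments top-to-bottom, then left-to-right.

    Fragments whose baselines lie within line_tolerance of the first
    fragment of a line are treated as belonging to that line.
    """
    # Highest baseline first, because y grows upward in PDF space.
    by_height = sorted(fragments, key=lambda f: -f.y)
    lines = []
    for frag in by_height:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments by their left edge.
    return [f for line in lines for f in sorted(line, key=lambda f: f.x)]


# Fragments listed in an arbitrary "content stream" order.
stream_order = [
    TextFragment("last line", 72, 100),
    TextFragment("world", 120, 700),
    TextFragment("Hello", 72, 700.5),
    TextFragment("middle line", 72, 400),
]
ordered_text = [f.text for f in reading_order(stream_order)]
print(ordered_text)
```

The result depends only on the coordinates, so shuffling `stream_order` gives the same reading order. Real pages (multiple columns, rotated text, right-to-left scripts) need more than this simple grouping.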

gogineniravikumar · 2020-10-15T05:02:52Z

Output PDFs like this still cause problems for non-Adobe users, so we need a fix.
I found why it is happening, and it is a small fix. In hocrtransform.py, the line of code causing it is:
    for line in sorted(
        chain(
            self.hocr.iterfind(self._child_xpath('span', 'ocr_header')),
            self.hocr.iterfind(self._child_xpath('span', 'ocr_line')),
            self.hocr.iterfind(self._child_xpath('span', 'ocr_textfloat')),
        ),
        key=self.topdown_position,
    ):
You have to reverse the output of the sorted function:
    for line in sorted(
        chain(
            self.hocr.iterfind(self._child_xpath('span', 'ocr_header')),
            self.hocr.iterfind(self._child_xpath('span', 'ocr_line')),
            self.hocr.iterfind(self._child_xpath('span', 'ocr_textfloat')),
        ),
        key=self.topdown_position,
        reverse=True,  # list.reverse() returns None, so pass reverse=True to sorted() instead
    ):

Then the PDF's content stream is in sorted order. Please include this in the next release.
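
A small stdlib-only sketch of reversing a keyed sort in Python. The tuples and this standalone `topdown_position` function are hypothetical stand-ins for the hOCR line elements and the method named above, not OCRmyPDF's actual code. It also shows that `list.reverse()` works in place and returns `None`, so the usual way to get a reversed sort for a `for` loop is `sorted(..., reverse=True)`.

```python
# Hypothetical stand-ins for hOCR line elements: (text, top edge in pixels).
# In image coordinates, a smaller top value is higher on the page.
lines = [("second", 200), ("third", 300), ("first", 100)]


def topdown_position(line):
    """Hypothetical sort key: vertical position of the line's top edge."""
    return line[1]


top_to_bottom = sorted(lines, key=topdown_position)
bottom_to_top = sorted(lines, key=topdown_position, reverse=True)

# list.reverse() reverses in place and returns None, so its return value
# cannot be iterated over directly.
in_place_result = sorted(lines, key=topdown_position).reverse()

print([text for text, _ in top_to_bottom])
print([text for text, _ in bottom_to_top])
print(in_place_result)
```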

jbarlow83 · 2020-10-15T08:05:48Z

@gogineniravikumar This was fixed in v11.1.2.

jbarlow83 closed this as completed in 4eacb34 Sep 29, 2020
