Add HOCR output as a sidecar option #177

parkerhancock · 2017-07-27T20:35:58Z

I have many applications where the physical location of text on a page is significant, and an existing codebase built around the HOCR html format.

What would make this library completely killer is an option to produce a sidecar file of the hocr data from Tesseract. I know that Tesseract natively can produce HOCR data, so the change shouldn't be difficult. The only question is how to integrate that into the existing command line interface.

Maybe a new option for --sidecar-hocr?

Flipping through the codebase now to see if there's an easy option.

jbarlow83 · 2017-07-27T22:10:44Z

ocrmypdf has three PDF renderers.

One of them is called the hocr renderer and uses HOCR as an intermediate format. For your use case it might make the most sense to use the older hocr renderer, since you intend to use the hocr for other things.

So

ocrmypdf -k --pdf-renderer hocr

which will keep the temporary folder with all working files, including the hocr files per page. The main drawback of the hocr renderer is that its support for non-Latin scripts is poor.

If you'd prefer to force generation of hocr files using the new (and default) sandwich renderer (best PDF quality, requires Tesseract 3.05.01 or newer):

ocrmypdf -k --tesseract-config hocr <rest of your arguments>

I will think about adding an option for hocr sidecars that involves less hackery, but this should do it for now.
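For anyone scripting this, here is a minimal sketch that assembles either of the two invocations above as an argument list. The helper names `build_command` and `run_ocr` and the file names are made up for illustration; only the flags quoted in this comment are used, and nothing here is part of ocrmypdf itself.

```python
import subprocess


def build_command(input_pdf, output_pdf, use_hocr_renderer=False, extra_args=()):
    """Build the argv for one of the two invocations quoted in this thread.

    build_command is a hypothetical helper, not an ocrmypdf function.
    """
    cmd = ["ocrmypdf", "-k"]  # -k keeps the temporary working folder
    if use_hocr_renderer:
        # Older renderer that uses hocr as its intermediate format
        cmd += ["--pdf-renderer", "hocr"]
    else:
        # Default renderer, asking Tesseract to also emit hocr
        cmd += ["--tesseract-config", "hocr"]
    cmd += list(extra_args)
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def run_ocr(input_pdf, output_pdf, **kwargs):
    """Run ocrmypdf with the assembled arguments; raises if it exits non-zero."""
    return subprocess.run(build_command(input_pdf, output_pdf, **kwargs), check=True)
```

Calling `run_ocr("in.pdf", "out.pdf", use_hocr_renderer=True)` would correspond to the first command; leaving the flag off corresponds to the second.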

andrewjfreyer · 2019-08-20T18:29:24Z

Any further thoughts on adding additional sidecar features?

zweissman · 2020-04-21T15:33:01Z

Your suggestion to use ocrmypdf -k --tesseract-config hocr <rest of your arguments> works great along with the keep-temporary-files=true. The only issue that I am currently having is where to find the temp files. The path is output while the ocr is running, but that temp path changes from run to run. Is there any way to query the ocrmypdf object to get the temp path so I know where to look for the .hocr file?
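One possible workaround, sketched under assumptions: Python's `tempfile` module honors the `TMPDIR` environment variable, so if ocrmypdf creates its working folder through `tempfile` (an assumption, not confirmed in this thread), pointing `TMPDIR` at a directory you control makes the kept files easy to find afterwards. The helper names and the `.hocr` file extension below are assumptions for illustration.

```python
import os
from pathlib import Path


def env_with_tmpdir(tmp_parent):
    """Return a copy of the current environment with TMPDIR set to tmp_parent."""
    env = dict(os.environ)
    env["TMPDIR"] = str(tmp_parent)
    return env


def find_hocr_files(tmp_parent):
    """Return every .hocr file anywhere below tmp_parent, sorted by path.

    Assumes the kept working files use a .hocr extension.
    """
    return sorted(Path(tmp_parent).rglob("*.hocr"))
```

The environment from `env_with_tmpdir` would be passed as `env=` to `subprocess.run` when launching ocrmypdf with `-k`, and `find_hocr_files` called on the same directory once the run finishes.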

hendursaga · 2024-08-10T16:46:00Z

This probably is related, but I was wondering if there's a good way to store the hOCR on the side and then "apply" it to the PDF when needed. I want to retain the original PDF files without having to basically duplicate them, more than doubling storage costs.

jbarlow83 · 2024-08-11T10:28:36Z

You can use the API functions in ocrmypdf.api to save a hocr and apply it later.

hendursaga · 2024-08-11T19:58:48Z

@jbarlow83 Which API functions? Are there command-line flag(s) that could do the trick just yet?

jbarlow83 · 2024-08-12T07:20:49Z

OCRmyPDF/src/ocrmypdf/api.py, line 383 in 3a75b20:

def _pdf_to_hocr( # noqa: D417
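A rough sketch of the save-now, apply-later idea follows. Only `_pdf_to_hocr` is named in this thread; its positional arguments (input PDF, output folder) and the counterpart `_hocr_to_ocr_pdf` (work folder, output file) are assumptions about a private, underscore-prefixed API that may change between versions. The wrappers `save_hocr` and `apply_hocr` are hypothetical and accept the underlying callables as parameters so they can be swapped or tested without ocrmypdf installed.

```python
from pathlib import Path


def save_hocr(input_pdf, hocr_folder, pdf_to_hocr=None):
    """Write hocr working files for input_pdf into hocr_folder."""
    if pdf_to_hocr is None:
        # Private function referenced in this thread; signature is assumed
        from ocrmypdf.api import _pdf_to_hocr as pdf_to_hocr
    hocr_folder = Path(hocr_folder)
    hocr_folder.mkdir(parents=True, exist_ok=True)
    pdf_to_hocr(Path(input_pdf), hocr_folder)
    return hocr_folder


def apply_hocr(hocr_folder, output_pdf, hocr_to_pdf=None):
    """Build an OCR PDF from a folder previously filled by save_hocr."""
    if hocr_to_pdf is None:
        # Assumed counterpart; its name and signature are not confirmed here
        from ocrmypdf.api import _hocr_to_ocr_pdf as hocr_to_pdf
    output_pdf = Path(output_pdf)
    hocr_to_pdf(Path(hocr_folder), output_pdf)
    return output_pdf
```

Whether the saved folder is actually smaller than a second copy of the PDF depends on what the library writes there, so it is worth checking the folder size before relying on this for storage savings.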

jbarlow83 added the enhancement label Jul 27, 2017

jbarlow83 mentioned this issue Nov 15, 2017: Do you have the text layer to the PDF document on the API painting #198 (Closed)

adrianbroher mentioned this issue Nov 12, 2019: hocr import / export #453 (Closed)

francescocarzaniga mentioned this issue Jan 11, 2021: Use ocrmypdf for recognition and switch to PDF-only representation papermerge/papermerge-core#1 (Closed)

SpencerRP mentioned this issue Jul 13, 2021: OCR Confidence (add --sidecar-hocr) #273 (Closed)

jbarlow83 closed this as completed Aug 12, 2024
