Add HOCR output as a sidecar option #177

Closed
parkerhancock opened this issue Jul 27, 2017 · 7 comments

Comments

@parkerhancock

I have many applications where the physical location of text on a page is significant, and I have an existing codebase built around the hOCR HTML format.

What would make this library completely killer is an option to produce a sidecar file of the hocr data from Tesseract. I know that Tesseract natively can produce HOCR data, so the change shouldn't be difficult. The only question is how to integrate that into the existing command line interface.

Maybe a new option for --sidecar-hocr?

Flipping through the codebase now to see if there's an easy option.
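For readers unfamiliar with why the positional data matters: hOCR stores each recognized word in an element whose `title` attribute carries a `bbox x0 y0 x1 y1` entry. Below is a minimal sketch of pulling those boxes out with only the Python standard library. The class and function names (`HocrWordParser`, `parse_hocr_words`) and the sample markup are illustrative assumptions, not part of ocrmypdf or Tesseract, and the sketch assumes word elements are `span`s that contain no nested `span`.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word element.

    Hypothetical helper for illustration; assumes no span is nested
    inside a word span.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open word element.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Made-up sample in the general shape of hOCR output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 2550 3300'>
<span class='ocr_line' title='bbox 300 200 900 260'>
<span class='ocrx_word' title='bbox 300 200 520 260; x_wconf 96'>Hello</span>
<span class='ocrx_word' title='bbox 540 200 900 260; x_wconf 91'>world</span>
</span></div>"""

for word, bbox in parse_hocr_words(SAMPLE):
    print(word, bbox)
```

A sidecar file per page would let existing hOCR tooling like this keep working without touching the PDF itself.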

@jbarlow83
Collaborator

ocrmypdf has three PDF renderers.

One of them is called the hocr renderer and uses HOCR as an intermediate format. For your use case it might make the most sense to use the older hocr renderer, since you intend to use the hocr for other things.

So

ocrmypdf -k --pdf-renderer hocr

which will output a temporary folder with all working files, including the per-page hocr files. The main drawback of the hocr renderer is that its support for non-Latin scripts is poor.

If you'd prefer to force generation of hocr files using the new (and default) sandwich renderer (best PDF quality, requires Tesseract 3.05.01 or newer):

ocrmypdf -k --tesseract-config hocr <rest of your arguments>

I will think about adding an option for hocr sidecars that involves less hackery, but this should do it for now.

@andrewjfreyer

Any further thoughts on adding additional sidecar features?

@zweissman

Your suggestion to use ocrmypdf -k --tesseract-config hocr <rest of your arguments> works great along with keep-temporary-files=true. The only issue I am currently having is finding the temp files. The path is printed while the OCR is running, but it changes from run to run. Is there any way to query the ocrmypdf object for the temp path so I know where to look for the .hocr file?
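One possible workaround, sketched here rather than confirmed by the maintainers: point `TMPDIR` at a folder you control before running the command, then search that folder for `.hocr` files afterwards. This assumes ocrmypdf creates its working folder through Python's `tempfile` module, which consults `TMPDIR`; that is an assumption, not something stated in this thread. The function names `run_with_known_tmp` and `collect_hocr` are hypothetical helpers, not part of ocrmypdf.

```python
import os
import shutil
import subprocess
from pathlib import Path


def run_with_known_tmp(input_pdf, output_pdf, tmp_parent):
    """Run the command from this thread with TMPDIR set to tmp_parent.

    Assumption: ocrmypdf places its kept working folder under the
    directory that Python's tempfile module selects via TMPDIR.
    """
    tmp_parent = Path(tmp_parent)
    tmp_parent.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ, TMPDIR=str(tmp_parent))
    subprocess.run(
        ["ocrmypdf", "-k", "--tesseract-config", "hocr",
         str(input_pdf), str(output_pdf)],
        env=env,
        check=True,
    )


def collect_hocr(work_root, dest_dir):
    """Copy every *.hocr file found under work_root into dest_dir.

    Files with the same name overwrite each other, so use one work_root
    per run. Keep dest_dir outside work_root.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materializes the search results before any copying starts.
    for src in sorted(Path(work_root).rglob("*.hocr")):
        target = dest_dir / src.name
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Usage would look something like `run_with_known_tmp("in.pdf", "out.pdf", "work")` followed by `collect_hocr("work", "sidecar")`. If the working folder turns out not to follow `TMPDIR`, `collect_hocr` can still be pointed at whatever path is printed during the run.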

@hendursaga

This is probably related, but I was wondering if there's a good way to store the hOCR on the side and then "apply" it to the PDF when needed. I want to retain the original PDF files without basically duplicating them, which would more than double storage costs.

@jbarlow83
Collaborator

You can use the API functions in ocrmypdf.api to save a hocr and apply it later.

@hendursaga

@jbarlow83 Which API functions? Are there command-line flag(s) that could do the trick just yet?

@jbarlow83
Collaborator

def _pdf_to_hocr( # noqa: D417
