hOCR output + hOCR editor #30

Closed
porculator opened this issue Aug 18, 2015 · 8 comments

Comments

@porculator

Hi, it would be nice to be able to export to hOCR format. It's as simple as sending a parameter to tesseract.
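For illustration, here is a minimal sketch of what "sending a parameter" can look like when tesseract is driven from the command line. It only builds the argument list; the helper name `build_hocr_command` and the file names are made up for this example, and the exact output file extension depends on the tesseract version.

```python
import shlex


def build_hocr_command(image_path, output_base, lang="eng"):
    """Return the argv list for a tesseract run that produces hOCR output.

    Hypothetical helper for illustration only. The trailing "hocr" names a
    tesseract config file that switches the output format to hOCR.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


if __name__ == "__main__":
    # Print the command as it would be typed in a shell.
    print(shlex.join(build_hocr_command("page.png", "page")))
```

Passing the resulting list to `subprocess.run` would perform the actual recognition, assuming tesseract is installed and on the `PATH`.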

Also, a VERY useful thing would be to implement an hOCR editor that connects the text image with the OCR-ed text. This would be extremely useful!!! Something like a line-by-line display of the image, each line followed by a text box with the recognized text, which could then be manually corrected.
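A rough sketch of the data such an editor would need: hOCR marks each text line with class `ocr_line` and stores its image coordinates as `bbox x0 y0 x1 y1` in the `title` attribute, so the line image and its text can be paired. The class name `HocrLineParser` and the sample markup below are invented for this example, and the parser assumes well-formed XHTML-style input (every tag inside a line is closed).

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract the four bbox integers from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect a (bbox, text) pair for every element with class "ocr_line"."""

    def __init__(self):
        super().__init__()
        self.lines = []   # finished (bbox, text) pairs
        self._depth = 0   # tag nesting inside the current line, 0 = outside
        self._bbox = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._words = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The line element itself was just closed.
                self.lines.append((self._bbox, " ".join(self._words)))

    def handle_data(self, data):
        if self._depth:
            self._words.extend(data.split())


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 10 20 300 45; baseline 0 -5'>
  <span class='ocrx_word' title='bbox 10 20 80 45; x_wconf 93'>Hello</span>
  <span class='ocrx_word' title='bbox 90 20 160 45; x_wconf 90'>world</span>
 </span>
</div>
"""

if __name__ == "__main__":
    parser = HocrLineParser()
    parser.feed(SAMPLE)
    for bbox, text in parser.lines:
        print(bbox, text)
```

An editor could crop the page image to each `bbox` and show the crop above a text box pre-filled with the matching text.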

@manisandro
Owner

How would you integrate export to hOCR in the UI? Via right-click on individual selections?

I'll need to check how hOCR editors work, not familiar with them. And then again how to integrate it in the current UI.

@innir
Contributor

innir commented Aug 21, 2015

There is an hOCR editor available for Firefox: https://github.com/garrison/moz-hocr-edit.git

@manisandro
Owner

This seems to be a pretty big thing, so I can't promise that I'll find time to work on it too soon. Clearly, if anyone wants to work on it, I'd be happy to accept patches.

@innir
Contributor

innir commented Aug 27, 2015

As a first step, being able to save the hOCR output would help, I think. Adding an icon to the output pane to save it would not clutter the UI too much, I guess. I could try to look into that and make a PR if you're fine with the idea.

@manisandro
Owner

The way the code is organized now, the output pane only receives the plain text from tesseract, so at this stage it is too late to get the hOCR output. I have started experimenting with adding a combobox "Output mode" to the right of the recognize button, with options "Plain Text", "hOCR" and, some time in the future, also "PDF". This requires some slight modifications to the code: introducing an OutputEditor interface from which OutputEditorText, OutputEditorHOCR etc. will inherit. Each of these classes will then need a method processImage, taking the image data and a tesseract object as parameters, which calls the necessary tesseract API methods to obtain the desired output. I think this should work and be pretty user-friendly; progress, however, is somewhat slow due to time constraints.
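The design described above could be sketched roughly as follows. This is an illustration in Python, not the project's actual code (which is not shown in this thread): `FakeTesseract` is a stand-in so the sketch runs without the library, its method names are only loosely modeled on the tesseract API calls `SetImage`, `GetUTF8Text` and `GetHOCRText`, and the `EDITORS` table and `recognize` function are assumptions about how the combobox could select an editor.

```python
from abc import ABC, abstractmethod


class OutputEditor(ABC):
    """Common interface: each output mode knows how to query tesseract."""

    @abstractmethod
    def process_image(self, image, tess):
        """Recognize `image` with `tess` and return output in this mode's format."""


class OutputEditorText(OutputEditor):
    def process_image(self, image, tess):
        tess.set_image(image)
        return tess.get_utf8_text()


class OutputEditorHOCR(OutputEditor):
    def process_image(self, image, tess):
        tess.set_image(image)
        return tess.get_hocr_text(0)  # page number 0


class FakeTesseract:
    """Hypothetical stand-in for a real tesseract handle (returns canned output)."""

    def set_image(self, image):
        self._image = image

    def get_utf8_text(self):
        return "text of %d bytes" % len(self._image)

    def get_hocr_text(self, page):
        return "<div class='ocr_page' id='page_%d'></div>" % (page + 1)


# Maps the "Output mode" combobox entries to editor classes.
EDITORS = {"Plain Text": OutputEditorText, "hOCR": OutputEditorHOCR}


def recognize(mode, image, tess):
    """Dispatch to the editor selected in the combobox."""
    return EDITORS[mode]().process_image(image, tess)


if __name__ == "__main__":
    print(recognize("Plain Text", b"abc", FakeTesseract()))
    print(recognize("hOCR", b"abc", FakeTesseract()))
```

With this split, adding a "PDF" mode later would mean adding one more subclass and one more entry in the table, without touching the recognition code.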

@manisandro
Owner

I've added some initial work on this in 2e154aa (Qt interface only ATM), please give it a try.

@manisandro
Owner

I'm still struggling to find time to work on this lately, but I've pushed an initial version of a reworked hOCR editor in commit 050424f for those who want to give it a try. It supports generating PDFs with invisible overlay text, as well as PDFs with only the pictures from the original documents as raster parts.
The main open issue is how to figure out a decent font size. tesseract actually gives me a hint for the font size as well as a font family, so possibly it would be sufficient to just map the font family to a sans, serif or monospace variant, and then use the hinted font sizes, possibly giving the user the possibility to override this with a custom choice. If anyone feels like doing some work on this, please go ahead :)
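One possible shape for the mapping idea above, as a minimal sketch: the keyword lists, the function names and the 10 pt fallback are all assumptions chosen for illustration, not values taken from tesseract or from the project.

```python
# Checked in order: "monospace" first so that e.g. "DejaVu Sans Mono" is not
# classified as sans, and "sans" before "serif" so "Sans Serif" maps to sans.
GENERIC_KEYWORDS = [
    ("monospace", ("mono", "courier", "consolas", "typewriter")),
    ("sans", ("sans", "arial", "helvetica", "verdana")),
    ("serif", ("serif", "times", "georgia", "garamond")),
]


def map_font_family(hinted, default="serif"):
    """Map a hinted font family name to "sans", "serif" or "monospace"."""
    name = hinted.lower()
    for generic, keywords in GENERIC_KEYWORDS:
        if any(keyword in name for keyword in keywords):
            return generic
    return default


def choose_font_size(hinted_size, override=None, fallback=10.0):
    """Prefer the user's override, then the hinted size, then a fallback."""
    if override is not None:
        return float(override)
    if hinted_size and hinted_size > 0:
        return float(hinted_size)
    return fallback


if __name__ == "__main__":
    for family in ("Times_New_Roman_Bold", "Courier_New", "Arial", "Unknown"):
        print(family, "->", map_font_family(family))
    print(choose_font_size(12), choose_font_size(12, override=9), choose_font_size(0))
```

Substring matching is crude and will misclassify some families, which is one more reason to let the user override the result.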

@manisandro
Owner

A first implementation should be pretty usable now.


3 participants