hOCR output + hOCR editor #30
Comments
How would you integrate export to hOCR in the UI? Via right-click on individual selections? I'm not familiar with hOCR editors, so I'll need to check how they work, and then how to integrate one into the current UI.
There is an hOCR editor available for Firefox: https://github.com/garrison/moz-hocr-edit.git
This seems to be a pretty big task, so I can't promise that I'll find time to work on it soon. If anyone wants to work on it, I'd be happy to accept patches.
As a first step, being able to save the hOCR output would help, I think. Adding an icon to the output pane to save it would not clutter the UI too much, I guess. I could try to look into that and make a PR if you're fine with the idea.
The way the code is organized now, the output pane only receives the plain text from tesseract, so at that stage it is too late to get the hOCR output. I have started experimenting with adding an "Output mode" combobox to the right of the recognize button, with the options "Plain Text", "hOCR" and, some time in the future, also "PDF". This requires some slight modifications to the code by introducing a
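To illustrate the idea of choosing the output mode before recognition runs, rather than in the output pane, here is a minimal Python sketch. It is not the project's code: `OutputMode`, `RecognitionResult`, `recognize` and the `backends` mapping are hypothetical names made up for this example, and the "PDF" option is left out because it is only planned.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class OutputMode(Enum):
    """Hypothetical counterpart of the "Output mode" combobox entries."""
    PLAIN_TEXT = "Plain Text"
    HOCR = "hOCR"


@dataclass
class RecognitionResult:
    """What the output pane would receive: the payload plus the mode that produced it."""
    mode: OutputMode
    payload: str


def recognize(image: str, mode: OutputMode,
              backends: Dict[OutputMode, Callable[[str], str]]) -> RecognitionResult:
    # The mode is decided up front, so the matching backend (plain text or hOCR)
    # is called directly instead of converting plain text after the fact.
    return RecognitionResult(mode, backends[mode](image))
```

With this shape, the output pane can look at `result.mode` to decide how to present `result.payload`.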
I've added some initial work on this in 2e154aa (Qt interface only at the moment); please give it a try.
I've been struggling to find time to work on this lately, but I've pushed an initial version of a reworked hOCR editor in commit 050424f for those who want to give it a try. It supports generating PDFs with invisible overlay text, as well as generating PDFs with only the pictures from the original documents as raster parts.
A first implementation should be pretty usable now.
Hi, it would be nice to be able to export to hOCR format. It's as simple as sending a parameter to tesseract.
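For reference, the "parameter" in question is the `hocr` config name on the tesseract command line. Below is a minimal Python sketch of such a call. The helper names `build_hocr_command` and `run_hocr` are made up for this example, and the `.hocr` output extension is an assumption that holds for recent tesseract releases (older ones may write `.html` instead).

```python
import shutil
import subprocess
from pathlib import Path


def build_hocr_command(image_path: str, output_base: str, lang: str = "eng") -> list:
    """Return the argv for a tesseract run that produces hOCR instead of plain text."""
    # "hocr" is the name of a config file shipped with tesseract; passing it
    # as the trailing argument switches the output format to hOCR.
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_hocr(image_path: str, output_base: str, lang: str = "eng") -> Path:
    """Run tesseract and return the expected path of the generated hOCR file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_hocr_command(image_path, output_base, lang), check=True)
    # Assumption: recent tesseract versions write <output_base>.hocr.
    return Path(output_base + ".hocr")
```

Calling `run_hocr("page.png", "page")` would, under these assumptions, leave the result in `page.hocr`.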
Also, a VERY useful thing would be to implement an hOCR editor that connects the image of the text with the OCR'ed text. This would be extremely useful! Something like a line-by-line display of the image, with each line followed by a text box containing the recognized text, which could then be manually corrected.
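A line-by-line editor like this needs, for each line, the bounding box in the image and the recognized text. The following is a rough Python sketch of extracting both from hOCR using only the standard library. It assumes well-formed, XHTML-style hOCR in which lines are marked with `class='ocr_line'` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute. `HocrLine`, `HocrLineParser` and `parse_hocr_lines` are names invented for this example.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


@dataclass
class HocrLine:
    bbox: Optional[Tuple[int, ...]]  # (x0, y0, x1, y1) in image pixels, if present
    text: str                        # recognized text of the line


class HocrLineParser(HTMLParser):
    """Collects one HocrLine per element whose class list contains 'ocr_line'."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[HocrLine] = []
        self._depth = 0      # nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Nested element (e.g. a word span) inside the current line.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._parts = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Line element closed: join the word texts, normalizing whitespace.
                text = " ".join("".join(self._parts).split())
                self.lines.append(HocrLine(self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def parse_hocr_lines(hocr: str) -> List[HocrLine]:
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

Each returned `bbox` could be used to crop the corresponding strip from the page image and show it above an editable text field holding `text`.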