hOCR output + hOCR editor #30

porculator · 2015-08-18T03:04:42Z

Hi, it would be nice to be able to export to hOCR format. It's as simple as sending a parameter to tesseract.

Also, a VERY useful feature would be an hOCR editor that links the image of the text to the OCR-ed text. This would be extremely useful! Something like a line-by-line display of the image, each line followed by a text box containing the recognized text, which could then be corrected manually.
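A minimal sketch of what linking image lines to text could rest on, assuming the hOCR output carries the usual `title="bbox x0 y0 x1 y1; ..."` property on each line element. The names `BBox` and `parseBBox` are made up for illustration and are not part of any existing code; an editor could use the parsed box to crop the page image for the line and show the crop above the matching text box.

```cpp
#include <sstream>
#include <string>

// Pixel rectangle of one recognized line on the page image.
struct BBox {
    int x0, y0, x1, y1;
};

// Extracts the "bbox x0 y0 x1 y1" property from the value of an hOCR title
// attribute, e.g. "bbox 36 92 580 122; baseline 0 -5".
// Returns false if there is no complete bbox property; `out` is then untouched.
bool parseBBox(const std::string& title, BBox& out) {
    std::istringstream props(title);
    std::string prop;
    // Properties are separated by ';'.
    while (std::getline(props, prop, ';')) {
        std::istringstream fields(prop);
        std::string key;
        fields >> key;  // skips leading whitespace
        if (key != "bbox") {
            continue;
        }
        BBox box{};
        if (fields >> box.x0 >> box.y0 >> box.x1 >> box.y1) {
            out = box;
            return true;
        }
        return false;
    }
    return false;
}
```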

manisandro · 2015-08-18T06:03:46Z

How would you integrate export to hOCR in the UI? Via right-click on individual selections?

I'll need to check how hOCR editors work, as I'm not familiar with them, and then work out how to integrate one into the current UI.

innir · 2015-08-21T12:11:45Z

There is an hOCR editor available for Firefox: https://github.com/garrison/moz-hocr-edit.git

manisandro · 2015-08-21T23:47:31Z

This seems to be a pretty big task, so I can't promise that I'll find time to work on it anytime soon. Of course, if anyone wants to work on it, I'd be happy to accept patches.

innir · 2015-08-27T12:59:37Z

As a first step, being able to save the hOCR output would help, I think. Adding an icon to the output pane to save it would not clutter the UI too much, I guess. I could try to look into that and make a PR if you're fine with the idea.

manisandro · 2015-08-27T21:27:52Z

The way the code is organized now, the output pane only receives the plain text from tesseract, so at that stage it is too late to get the hOCR output. I have started experimenting with adding an "Output mode" combobox to the right of the recognize button, with the options "Plain Text", "hOCR" and, some time in the future, also "PDF". This requires some slight modifications to the code: introducing an OutputEditor interface from which OutputEditorText, OutputEditorHOCR etc. will inherit. Each of these classes will then need a processImage method that takes the image data and a tesseract object as parameters and calls the necessary tesseract API methods to obtain the desired output. I think this should work and be pretty user-friendly; progress, however, is somewhat slow due to time constraints.
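The class layout described above could look roughly like the following sketch. Only the names OutputEditor, OutputEditorText, OutputEditorHOCR and processImage come from the comment; `Recognizer`, `StubRecognizer`, `ImageData`, `OutputMode` and `makeOutputEditor` are hypothetical stand-ins so that the example compiles without tesseract. In the real application the recognizer role would presumably be played by `tesseract::TessBaseAPI`, whose `GetUTF8Text()` and `GetHOCRText(int page_number)` methods return the two kinds of output.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the tesseract object (assumption, see lead-in).
struct Recognizer {
    virtual ~Recognizer() = default;
    virtual std::string plainText() = 0;
    virtual std::string hocrText(int page) = 0;
};

// Canned recognizer, only here so the sketch can be exercised without OCR.
struct StubRecognizer : Recognizer {
    std::string plainText() override { return "Hello"; }
    std::string hocrText(int page) override {
        return "<div class='ocr_page' id='page_" + std::to_string(page) +
               "'>Hello</div>";
    }
};

// Hypothetical image type; the application would pass its own image class.
struct ImageData {
    int page = 0;
};

// Common interface for whatever sits in the output pane.
class OutputEditor {
public:
    virtual ~OutputEditor() = default;
    // Each editor asks the recognizer for the output it knows how to show.
    virtual void processImage(const ImageData& image, Recognizer& ocr) = 0;
    const std::string& content() const { return m_content; }

protected:
    std::string m_content;
};

class OutputEditorText : public OutputEditor {
public:
    void processImage(const ImageData&, Recognizer& ocr) override {
        m_content += ocr.plainText();
    }
};

class OutputEditorHOCR : public OutputEditor {
public:
    void processImage(const ImageData& image, Recognizer& ocr) override {
        m_content += ocr.hocrText(image.page);
    }
};

// Mirrors the entries of the proposed "Output mode" combobox.
enum class OutputMode { PlainText, HOCR };

std::unique_ptr<OutputEditor> makeOutputEditor(OutputMode mode) {
    switch (mode) {
    case OutputMode::PlainText:
        return std::make_unique<OutputEditorText>();
    case OutputMode::HOCR:
        return std::make_unique<OutputEditorHOCR>();
    }
    assert(false && "unhandled output mode");
    return nullptr;
}
```

With this split, the recognize button only needs to call `processImage` on whichever editor the combobox selected, and a later "PDF" mode would be one more subclass.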

manisandro · 2015-10-11T21:37:45Z

I've added some initial work on this in 2e154aa (Qt interface only at the moment). Please give it a try.

manisandro · 2016-02-04T11:41:08Z

I've been struggling to find time to work on this lately, but I've pushed an initial version of a reworked hOCR editor in commit 050424f for those who want to give it a try. It supports generating PDFs with invisible overlay text, as well as generating PDFs with only the pictures from the original documents as raster parts.
The main open issue is how to figure out a decent font size. tesseract actually gives me a hint for the font size as well as a font family, so it might be sufficient to map the font family to a sans, serif or monospace variant and then use the hinted font sizes, possibly letting the user override this with a custom choice. If anyone feels like doing some work on this, please go ahead :)
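One possible shape for the font mapping idea, as a rough sketch only. The keyword lists are guesses rather than anything tesseract defines, the names `FontClass`, `classifyFontFamily` and `chooseFontSize` are invented for illustration, and the 10 pt fallback is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <initializer_list>
#include <string>

enum class FontClass { Sans, Serif, Monospace };

// ASCII lower-casing, so that matching is case-insensitive.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

static bool containsAny(const std::string& haystack,
                        std::initializer_list<const char*> needles) {
    for (const char* needle : needles) {
        if (haystack.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Heuristic: maps a hinted family name to one of three generic classes.
// Keyword lists are illustrative assumptions and will misclassify some fonts.
FontClass classifyFontFamily(const std::string& family) {
    const std::string name = toLower(family);
    if (containsAny(name, {"mono", "courier", "consol", "typewriter"})) {
        return FontClass::Monospace;
    }
    // "sans" is checked before "serif" so that "sans-serif" maps to Sans.
    if (containsAny(name, {"sans", "arial", "helvetica", "verdana"})) {
        return FontClass::Sans;
    }
    if (containsAny(name, {"serif", "times", "georgia", "garamond"})) {
        return FontClass::Serif;
    }
    return FontClass::Sans;  // fallback for unknown families (assumption)
}

// Picks the size to render with: a positive user override wins, then a
// positive hinted size, then the fallback. Values <= 0 mean "not set".
double chooseFontSize(double hintedSize, double userOverride,
                      double fallback = 10.0) {
    if (userOverride > 0) {
        return userOverride;
    }
    if (hintedSize > 0) {
        return hintedSize;
    }
    return fallback;
}
```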

manisandro · 2016-04-15T17:43:04Z

A first implementation should be pretty usable now.

innir mentioned this issue Aug 21, 2015

Please add possibility to create searchable PDF with text overlay #27

Closed

manisandro added a commit that referenced this issue Oct 11, 2015

[Qt] Add support for different output editors, add Text and hOCR editors (#30)

2e154aa

d0b3rm4n mentioned this issue Oct 14, 2015

Slot Displayer::setCurrentPage gets called twice, when QSpinBox arrow is clicked #37

Closed

manisandro closed this as completed Apr 15, 2016

SantosSi mentioned this issue Nov 27, 2017

hOCR PDF export: prevent users from overwriting any input image PDF file #243

Closed

napasa mentioned this issue Dec 26, 2017

newest master code occur exception when export pdf #276

Closed

SantosSi mentioned this issue Dec 27, 2017

Qt5,Debian,libtesseract4: Crash on recognition #279

Closed

TeoColuccio mentioned this issue Apr 19, 2020

Glibmm-error, detected trace/breakpoint #445

Closed

hendrack mentioned this issue Mar 13, 2024

Segfault on Alpine (OpenCL, Tesseract issue?) #668

Closed
