hOCR per word basis? #52

ghost · 2015-07-26T16:58:54Z

Hi,

I really like your tool; its recognition seems to be better than Tesseract's in some cases. Tesseract, however, has more detailed hOCR output:

Each word gets wrapped in a span with class ocrx_word and has a bbox and x_wconf property.

The bbox property for each word lets the user write their own layout detection, while x_wconf allows omitting words that were probably not recognized correctly.
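To illustrate the kind of post-processing I mean, here is a minimal sketch that reads ocrx_word spans from Tesseract-style hOCR and drops low-confidence words. It uses only the Python standard library; the names `extract_words`, `parse_title`, and `SAMPLE`, as well as the sample markup and the threshold of 60, are made up for this example and are not part of ocropy or Tesseract.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Pull bbox and x_wconf out of an hOCR title attribute."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox" and len(fields) == 5:
            props["bbox"] = tuple(int(v) for v in fields[1:5])
        elif fields[0] == "x_wconf" and len(fields) == 2:
            props["x_wconf"] = float(fields[1])
    return props


class WordParser(HTMLParser):
    """Collects the text, bbox and x_wconf of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._current = {"text": ""}
            self._current.update(parse_title(attrs.get("title") or ""))

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr, min_conf=0.0):
    """Return words whose x_wconf is at least min_conf (missing counts as 0)."""
    parser = WordParser()
    parser.feed(hocr)
    parser.close()
    return [w for w in parser.words if w.get("x_wconf", 0.0) >= min_conf]


# Hypothetical sample markup in the shape described above.
SAMPLE = (
    '<span class="ocr_line" title="bbox 10 10 200 40">'
    '<span class="ocrx_word" title="bbox 10 10 80 40; x_wconf 93">Hello</span> '
    '<span class="ocrx_word" title="bbox 90 10 200 40; x_wconf 41">wor1d</span>'
    '</span>'
)

for word in extract_words(SAMPLE, min_conf=60):
    print(word["text"], word["bbox"], word["x_wconf"])
```

The remaining words and their boxes could then be fed into a custom layout analysis.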

Is this also possible with ocropy or is this planned?

Thank you.

mittagessen · 2015-08-01T09:35:00Z

kraken has word and character bounding boxes with character confidences and is, as far as I know, completely compatible with ocropus models.
Keep in mind, though, that the model does not really segment the line into words and characters (or rather grapheme clusters); it is trained to emit the correct labels in the right order (if I understand CTC correctly), so character cuts are often not quite correct even when the recognition result is. Along the same lines, ocrx_words are calculated "artificially" from the recognition result, as ocropus has no notion of words, while tesseract does some word-based postprocessing.
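As a rough illustration of what "artificially" means here, the sketch below groups per-character boxes into words by splitting on whitespace in the recognized text. This is not kraken's or ocropus's actual code; `words_from_chars`, the input tuple layout, and the choice of the minimum character confidence as the word confidence are all assumptions made for the example.

```python
def _merge(group):
    """Combine a run of (char, bbox, conf) tuples into one word record."""
    boxes = [box for _, box, _ in group]
    return {
        "text": "".join(ch for ch, _, _ in group),
        # The word box is the union of its character boxes.
        "bbox": (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        ),
        # Assumption: a word is only as reliable as its weakest character.
        "conf": min(conf for _, _, conf in group),
    }


def words_from_chars(chars):
    """Split recognized characters into words at whitespace.

    chars is a list of (char, (x0, y0, x1, y1), confidence) tuples in
    reading order. Whitespace characters only act as separators.
    """
    words = []
    current = []
    for ch, box, conf in chars:
        if ch.isspace():
            if current:
                words.append(_merge(current))
                current = []
        else:
            current.append((ch, box, conf))
    if current:
        words.append(_merge(current))
    return words


# Hypothetical recognizer output for the line "ab c".
chars = [
    ("a", (10, 12, 20, 30), 0.98),
    ("b", (21, 10, 32, 30), 0.75),
    (" ", (33, 10, 40, 30), 0.99),
    ("c", (41, 14, 50, 30), 0.90),
]

for word in words_from_chars(chars):
    print(word)
```

Because the character cuts themselves are only approximate, word boxes derived this way inherit that imprecision at their left and right edges.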

zuphilip · 2017-12-25T17:14:27Z

There is a PR #283 to add such a feature to ocropy.

zuphilip added the ❔ question label Oct 31, 2016

zuphilip added ✨ enhancement and removed ❔ question labels Dec 25, 2017
