hOCR per word basis? #52

Open
ghost opened this issue Jul 26, 2015 · 2 comments

ghost commented Jul 26, 2015

Hi,

I really like your tool; its recognition seems to be better than Tesseract's in some cases. Tesseract, however, has more detailed hOCR output:

Each word gets wrapped in a span with class ocrx_word and has bbox and x_wconf properties.

The bbox property for each word lets the user write their own implementation of layout detection, while x_wconf allows omitting words that were probably not recognized correctly.
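To make the use case concrete, here is a minimal sketch of reading ocrx_word spans and dropping low-confidence words. It uses only the Python standard library. The sample markup, the names `WordParser`, `parse_words` and `confident_words`, and the threshold of 60 are assumptions for illustration, not part of any tool's API. It also assumes that no other span elements are nested inside a word span.

```python
from html.parser import HTMLParser

# Hypothetical sample in the style of Tesseract's hOCR output.
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 210 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 31'>w0rld</span>"
    "</span>"
)


class WordParser(HTMLParser):
    """Collects text, bbox and x_wconf for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # The title attribute holds "name value..." fields separated by ';'.
            props = {}
            for field in (attrs.get("title") or "").split(";"):
                parts = field.split()
                if parts:
                    props[parts[0]] = parts[1:]
            conf = props.get("x_wconf")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(conf[0]) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes no nested span inside a word span (see the note above).
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_words(hocr):
    parser = WordParser()
    parser.feed(hocr)
    return parser.words


def confident_words(hocr, min_conf=60):
    """Keep only words whose x_wconf is at least min_conf."""
    return [w for w in parse_words(hocr)
            if w["conf"] is not None and w["conf"] >= min_conf]


if __name__ == "__main__":
    for word in confident_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

The bbox tuples returned by `parse_words` are what a custom layout detection would work from.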

Is this also possible with ocropy or is this planned?

Thank you.

@mittagessen

kraken has word and character bounding boxes with character confidences and is, as far as I know, completely compatible with ocropus models.
Keep in mind, though, that the model does not really segment the line into words and characters (or rather grapheme clusters). It is trained to emit the correct labels in the right order (if I understand CTC correctly), so character cuts are often not quite correct even when the recognition result is. Along the same lines, ocrx_words are calculated "artificially" from the recognition result, since ocropus has no notion of words, while Tesseract does some word-based postprocessing.
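As a rough illustration of what calculating words "artificially" could look like, the sketch below groups per-character boxes into word boxes by splitting the recognized text at whitespace. This is not taken from kraken or ocropy; the input format (a list of character and box pairs in reading order) and the names `union` and `words_from_chars` are assumptions made for the example.

```python
def union(boxes):
    """Smallest box (x0, y0, x1, y1) that encloses all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def words_from_chars(chars):
    """Group (character, box) pairs into (word, box) pairs.

    A word ends at any whitespace character; the boxes of the
    whitespace characters themselves are ignored.
    """
    words = []
    text, boxes = "", []
    for ch, box in chars:
        if ch.isspace():
            if boxes:
                words.append((text, union(boxes)))
            text, boxes = "", []
        else:
            text += ch
            boxes.append(box)
    if boxes:  # flush the last word of the line
        words.append((text, union(boxes)))
    return words


if __name__ == "__main__":
    # Hypothetical character cuts for the recognized line "ab c".
    line = [
        ("a", (0, 0, 10, 20)),
        ("b", (10, 2, 20, 20)),
        (" ", (20, 0, 25, 20)),
        ("c", (25, 5, 35, 18)),
    ]
    print(words_from_chars(line))
```

Because the word boxes are built from the character cuts, any imprecision in those cuts carries over to the word boundaries.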

Collaborator

zuphilip commented Dec 25, 2017

There is a PR #283 to add such a feature to ocropy.
