ocrx_line example #39

Open
wants to merge 1 commit into
base: master

Owner

@kba kba commented Oct 1, 2016

No description provided.


```html
...
<span class="ocrx_line">
```

Collaborator

@amitdo amitdo Oct 2, 2016

ocr_lines nested in ocrx_line? That doesn't look right to me.

Owner Author

It's ocr_line nested in ocrx_line, in this case a single heading split over two lines.

But I'll gladly make a better example if you have an idea. What I've seen in the wild are just replacements for ocr_line, e.g. https://github.com/jwilk/ocrodjvu/blob/master/lib/hocr.py.

Collaborator

> It's ocr_line nested in ocrx_line

Yeah, I fixed my original mistake...

Collaborator

Sadly, I don't know what the right way is in this case.

Contributor

ocrx_line is engine-specific line markup. It exists for those cases where your OCR engine outputs text lines that don't correspond to "normal" text lines.

The most common case is when you apply an engine that's not capable of column segmentation to a multi-column document and you want to prevent subsequent processing stages from assuming that the text lines they receive contain text in reading order.

Basically, if you use ocrx_line instead of ocr_line, you're (intentionally) breaking most subsequent processing, since most OCR output processing will look for ocr_line tags (and assume they are in reading order).
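To illustrate the point above, here is a minimal sketch of a downstream consumer that only looks for ocr_line. The hOCR fragment, the `LineCollector` class, and the `collect_lines` helper are all hypothetical names made up for this example; they are not taken from the spec or from any real tool. It assumes well-nested markup without void elements such as a bare `<br>`.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment: one normal line and one engine-specific line
# that mixes text from two columns.
HOCR = """
<div class="ocr_page">
  <span class="ocr_line">A normal text line</span>
  <span class="ocrx_line">left column text   right column text</span>
</div>
"""


class LineCollector(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0   # greater than 0 while inside a matching element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Nested element inside a matching line: track nesting only.
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.lines.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.lines[-1] += data


def collect_lines(hocr, wanted_class):
    """Return the stripped text of all elements with class `wanted_class`."""
    parser = LineCollector(wanted_class)
    parser.feed(hocr)
    parser.close()
    return [line.strip() for line in parser.lines]


# A typical consumer asks only for ocr_line and never sees the ocrx_line text.
reading_order_lines = collect_lines(HOCR, "ocr_line")
engine_specific_lines = collect_lines(HOCR, "ocrx_line")
```

A consumer written this way silently skips the ocrx_line content, which is the intended effect: text that is not known to be in reading order stays out of stages that assume it is.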

Collaborator

Tom, thanks for clarifying this for us.

@amitdo amitdo mentioned this pull request Oct 22, 2016
kba added a commit that referenced this pull request Nov 30, 2017
3 participants