ocr_line vs. ocrx_line #19

Open
kba opened this issue Sep 15, 2016 · 10 comments

@kba
Owner

kba commented Sep 15, 2016

@zdenop asked in 2012 on the hOCR mailing list, without receiving an answer:

I need clarification of ocr_line vs. ocrx_line

The hOCR spec defines ocrx_line as:

  • any kind of "line" returned by an OCR system that differs from the
    standard ocr_line above
  • might be some kind of "logical" line

hocr-tools provides this example of ocr_line[1]:

 <span class='ocr_line' title='bbox 461 648 2077 707'>Alice was beginning to get very tired of sitting by her sister on the bank,</span>

And tesseract-ocr (r729) produces this hOCR output:

  <span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
      <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
      <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
       ...
      <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
  </span>

Does tesseract-ocr's ocr_line meet the criteria of a "standard ocr_line", or
should it use ocrx_line?
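
To make the structure of the Tesseract output above concrete, here is a minimal Python sketch that reads the `bbox` values from the `title` attributes and checks that every word box lies inside its line box. It does not settle the spec question; it only illustrates that the output is an ordinary line containing its words. The names `parse_bbox`, `LineCollector`, and `words_inside_line` are made up for this illustration, and only the three words shown in the post are included (the elided ones are left out).

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title such as 'bbox 464 651 2074 704'."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class LineCollector(HTMLParser):
    """Collect ocr_line boxes and the ocrx_word boxes nested inside them."""

    def __init__(self):
        super().__init__()
        self.lines = []  # list of (line_bbox, [word_bbox, ...])
        self.depth = 0   # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        bbox = parse_bbox(attrs.get("title") or "")
        if "ocr_line" in classes:
            self.lines.append((bbox, []))
            self.depth = 1
        elif self.depth:
            self.depth += 1
            if "ocrx_word" in classes:
                self.lines[-1][1].append(bbox)

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1


def words_inside_line(line_bbox, word_bboxes):
    """True if every word box is contained in the line box."""
    lx0, ly0, lx1, ly1 = line_bbox
    return all(lx0 <= x0 and ly0 <= y0 and x1 <= lx1 and y1 <= ly1
               for x0, y0, x1, y1 in word_bboxes)


HOCR = """
<span class='ocr_line' id='line_2' title="bbox 464 651 2074 704">
  <span class='ocrx_word' id='word_5' title="bbox 464 651 569 688">Alice</span>
  <span class='ocrx_word' id='word_6' title="bbox 591 665 667 688">was</span>
  <span class='ocrx_word' id='word_19' title="bbox 1962 660 2074 704">bank,</span>
</span>
"""

collector = LineCollector()
collector.feed(HOCR)
line_bbox, word_bboxes = collector.lines[0]
print(line_bbox, len(word_bboxes), words_inside_line(line_bbox, word_bboxes))
```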

@zuphilip
Collaborator

Could the handling of catchwords (German: Kustode) be an example where one could try using an ocrx_line instead of, or alongside, the usual ocr_line elements?

@amitdo
Collaborator

amitdo commented Sep 15, 2016

Maybe one of them is something like this:

<p>line one in a paragraph.<br>
line two in a paragraph</p>

??

@amitdo
Collaborator

amitdo commented Sep 15, 2016

@amitdo
Collaborator

amitdo commented Sep 16, 2016

Oh, I now see that there is a property hardbreak for ocr_line.
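
For reference, a small Python sketch of how a consumer might read such a property from the `title` attribute. It assumes that properties are separated by semicolons, that `hardbreak` takes the value 0 or 1, and that a missing `hardbreak` is treated like 0; please check the spec before relying on any of this. The helper names and the sample `title` strings are hypothetical.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def is_hard_break(title):
    """True if the title carries 'hardbreak 1'."""
    # Assumption: a missing (or empty) hardbreak property means "hardbreak 0".
    values = parse_title(title).get("hardbreak") or ["0"]
    return values[0] == "1"


# Hypothetical title values, not taken from any real OCR output.
soft = "bbox 461 648 2077 707"
hard = "bbox 461 720 1500 779; hardbreak 1"
print(is_hard_break(soft), is_hard_break(hard))
```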

kba added a commit that referenced this issue Sep 28, 2016
@wanghaisheng
Collaborator

wanghaisheng commented Sep 29, 2016

Suppose you have a table. If the content does not fit on one line, you can start a new one, like this:

Command Description
git status List all new or modified files
blahblah.....

You can consider these two lines as one logical line instead of two separate ones.

@amitdo
Collaborator

amitdo commented Sep 29, 2016

You mean a row in a table?

@wanghaisheng
Collaborator

yes

kba added a commit that referenced this issue Sep 29, 2016
kba added a commit that referenced this issue Sep 29, 2016
@kba
Owner Author

kba commented Sep 29, 2016

Thanks for the examples, rowspan makes sense, as does line continuation.

Catchwords (the words printers put in the bottom margin as a helper or "checksum" indicating which page should come next) are also an interesting example that probably warrants a new issue.

I've created an example for ocrx_line, hopefully a sensible one.

kba added a commit that referenced this issue Oct 1, 2016
@amitdo
Collaborator

amitdo commented Oct 2, 2016

I just found this message:
https://groups.google.com/forum/#!topic/ocropus/-s33xn9fBGY

@amitdo
Collaborator

amitdo commented Oct 22, 2016

@tmbdev wrote in #39:

ocrx_line is engine-specific line markup. It exists for those cases where your OCR engine outputs text lines that don't correspond to "normal" text lines.

The most common case is if you apply an engine that's not capable of column segmentation to a multi-column document and you want to prevent subsequent processing stages from assuming that the text lines it gets contain text in reading order.

Basically, if you use ocrx_line instead of ocr_line, you're (intentionally) breaking most subsequent processing, since most OCR output processing will look for ocr_line tags (and assume they are in reading order).
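
The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hOCR fragment is invented (an engine without column segmentation emitting lines that span two columns), and `ClassTextCollector` and `lines_of` are hypothetical names, not part of hocr-tools or any other library.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every span carrying a given hOCR class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.texts = []
        self.depth = 0  # span nesting depth inside a matching element

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def lines_of(hocr, wanted_class):
    """Return the whitespace-normalised text of each matching element."""
    collector = ClassTextCollector(wanted_class)
    collector.feed(hocr)
    return [" ".join(text.split()) for text in collector.texts]


# Hypothetical output of an engine without column segmentation: each physical
# line runs across both columns, so it is tagged ocrx_line, not ocr_line.
HOCR = """
<span class='ocrx_line' title='bbox 100 100 1900 150'>left column, line 1 right column, line 1</span>
<span class='ocrx_line' title='bbox 100 160 1900 210'>left column, line 2 right column, line 2</span>
"""

print(lines_of(HOCR, "ocr_line"))   # what a consumer looking only for ocr_line sees
print(lines_of(HOCR, "ocrx_line"))  # what an engine-aware consumer sees
```

A consumer that only looks for `ocr_line` finds nothing here, so it cannot mistake the cross-column lines for text in reading order.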
