ocr_line vs. ocrx_line #19

kba · 2016-09-15T12:14:21Z

@zdenop asked this in 2012 on the hOCR mailing list and never got an answer:

I need clarification of ocr_line vs. ocrx_line

The hOCR spec defines ocrx_line as:

any kind of "line" returned by an OCR system that differs from the
standard ocr_line above

might be some kind of "logical" line

hocr-tools provides this example of ocr_line[1]:
 Alice was beginning to get very tired of sitting by her sister on the bank,
And tesseract-ocr (r729) produces this hOCR output:
 
 Alice
 was
 ...
 bank,
 
Does the tesseract-ocr ocr_line meet the criteria of a "standard ocr_line", or
should it use ocrx_line?


zuphilip · 2016-09-15T13:41:03Z

Could the handling of catchwords (German: Kustode) be an example where one could try to use an ocrx_line instead of, or alongside, the usual ocr_lines?

amitdo · 2016-09-15T15:51:38Z

Maybe one of them is something like this:

<p>line one in a paragraph.<br>
line two in a paragraph</p>

??

amitdo · 2016-09-15T16:02:49Z

http://word.tips.net/T000170_Understanding_Hard_and_Soft_Returns.html

amitdo · 2016-09-16T15:54:59Z

Oh, I now see that there is a property hardbreak for ocr_line.
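For readers who want to check that property in their own output, here is a minimal sketch of reading it from an ocr_line's title attribute. It assumes the usual hOCR convention of semicolon-separated properties in title, and it assumes that a missing or zero hardbreak means a soft (text-flow) break; please verify both against the spec. The helper names parse_title and is_hard_break are made up for this illustration.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property: [values]}.

    Simplification: values are split on whitespace, so quoted values
    that contain spaces or semicolons are not handled.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def is_hard_break(title):
    """Return True if the line declares a non-zero hardbreak property.

    Assumption: an absent hardbreak is treated the same as "hardbreak 0".
    """
    values = parse_title(title).get("hardbreak", [])
    return bool(values) and values[0] != "0"


if __name__ == "__main__":
    print(is_hard_break("bbox 0 0 100 20; hardbreak 1"))
    print(is_hard_break("bbox 0 0 100 20"))
```

With this, the `<br>` case above would map to an ocr_line carrying a non-zero hardbreak, while a line that merely wraps would carry none.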

wanghaisheng · 2016-09-29T16:28:20Z

Suppose you have a table. If a cell's content does not fit on one line, it can continue on a new one, like this:

Command	Description
git status	List all new or modified files
	blahblah.....

You can consider these two lines as one logical line instead of two separate ones.

amitdo · 2016-09-29T16:34:15Z

You mean a row in a table?

wanghaisheng · 2016-09-29T16:48:04Z

yes

kba · 2016-09-29T23:49:45Z

Thanks for the examples, rowspan makes sense, as does line continuation.

Catchwords (the words printers put in the bottom margin as a helper/"checksum" for which page should come next) are also an interesting example that probably warrants a new issue.

I've created an example for ocrx_line, hopefully a sensible one.

amitdo · 2016-10-02T22:11:25Z

I just found this message:
https://groups.google.com/forum/#!topic/ocropus/-s33xn9fBGY

amitdo · 2016-10-22T17:20:11Z

@tmbdev wrote in #39:

ocrx_line is engine-specific line markup. It exists for those cases where your OCR engine outputs text lines that don't correspond to "normal" text lines.

The most common case is if you apply an engine that's not capable of column segmentation to a multi-column document and you want to prevent subsequent processing stages from assuming that the text lines it gets contain text in reading order.

Basically, if you use ocrx_line instead of ocr_line, you're (intentionally) breaking most subsequent processing, since most OCR output processing will look for ocr_line tags (and assume they are in reading order).
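To illustrate the point about downstream processing, here is a rough sketch of a consumer that only looks for ocr_line elements, so anything marked ocrx_line is silently skipped. It uses only Python's standard html.parser; the LineCollector class, the collect helper, and the sample markup (including its bbox values) are invented for this example and are not taken from hocr-tools or any engine's real output.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LineCollector(HTMLParser):
    """Collect the text of elements carrying a given hOCR class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.lines = []
        self._depth = 0   # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Normalize whitespace inside the collected line.
            self.lines.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def collect(hocr, wanted_class):
    parser = LineCollector(wanted_class)
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hypothetical sample: the middle line spans two columns, so the
# (imaginary) engine marks it as ocrx_line rather than ocr_line.
SAMPLE = """
<div class="ocr_page">
  <span class="ocr_line" title="bbox 10 10 400 40">Alice was beginning</span>
  <span class="ocrx_line" title="bbox 10 50 800 80">left column text right column text</span>
  <span class="ocr_line" title="bbox 10 90 400 120">to get very tired</span>
</div>
"""

if __name__ == "__main__":
    print(collect(SAMPLE, "ocr_line"))
    print(collect(SAMPLE, "ocrx_line"))
```

A consumer written this way never sees the ocrx_line content unless it asks for that class explicitly, which is the intentional "breaking" described above.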

fix #19 fix #39

kba added a commit that referenced this issue Sep 28, 2016

Link to issue #19

832eb5b

kba added a commit that referenced this issue Sep 29, 2016

Example for ocrx_line, #19

c39af25

kba added a commit that referenced this issue Sep 29, 2016

Example for ocrx_line, #19

7e3a49e

kba added a commit that referenced this issue Oct 1, 2016

Example for ocrx_line, #19

b69b342

amitdo mentioned this issue Oct 20, 2016

Logical Tags/classes #66

Open

kba added a commit that referenced this issue Nov 30, 2017

Add note on ocrx_line by @tmbdev

e2bf67d

fix #19 fix #39

kba mentioned this issue Nov 30, 2017

Add note on ocrx_line by @tmbdev #105

Open

HiromuHota mentioned this issue Oct 20, 2020

ocrx_line is missing from the ocr-capabilities metadata field HazyResearch/pdftotree#94

Closed
