Example for ocrx_line, #19
kba committed Sep 29, 2016
1 parent 83792f9 commit 7e3a49e
Showing 4 changed files with 72 additions and 9 deletions.
30 changes: 27 additions & 3 deletions 1.2/index.bs
@@ -630,10 +630,34 @@ Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28)

### `ocrx_line`

-Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19)
-
-* any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+* any kind of "line" returned by an OCR system that differs from [[#ocr_line]]
* might be some kind of "logical" line
* examples include line continuations and rowspan in tables
+
+<div class="example">
+
+Consider the following snippet, containing a wide-spaced heading broken over
+two physical lines:
+
+<figure>
+<img
+width=600
+alt="Wide spaced two line heading"
+src="../images/akf-widespaced-heading.png"/>
+</figure>
+
+An OCR engine could produce the following output, indicating the two physical
+lines that form a single logical line:
+
+```html
+...
+<span class="ocrx_line">
+<span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
+<span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
+</span>
+...
+```
+</div>

### `ocrx_word`

21 changes: 18 additions & 3 deletions 1.2/index.html
@@ -2112,13 +2112,29 @@ <h4 class="heading settled" data-level="9.1.1" id="ocrx_block"><span class="secn
<p>engine-specific because the definition of a "block" depends on the engine</p>
</ul>
<h4 class="heading settled" data-level="9.1.2" id="ocrx_line"><span class="secno">9.1.2. </span><span class="content"><code>ocrx_line</code></span><a class="self-link" href="#ocrx_line"></a></h4>
-<p class="issue" id="issue-8ef34561"><a class="self-link" href="#issue-8ef34561"></a> <a href="https://github.com/kba/hocr-spec/issues/19">ocr_line vs ocrx_line</a></p>
<ul>
<li data-md="">
-<p>any kind of "line" returned by an OCR system that differs from the standard ocr_line above</p>
+<p>any kind of "line" returned by an OCR system that differs from <a href="#ocr_line">§6.1.4 ocr_line</a></p>
<li data-md="">
<p>might be some kind of "logical" line</p>
<li data-md="">
<p>examples include line continuations and rowspan in tables</p>
</ul>
+<div class="example" id="example-3548e1d4">
+<a class="self-link" href="#example-3548e1d4"></a>
+<p>Consider the following snippet, containing a wide-spaced heading broken over
+two physical lines:</p>
+<figure> <img alt="Wide spaced two line heading" src="../images/akf-widespaced-heading.png" width="600"> </figure>
+<p>An OCR engine could produce the following output, indicating the two physical
+lines that form a single logical line:</p>
+<pre class="language-html highlight">...
+<span class="p">&lt;</span><span class="nt">span</span> <span class="na">class</span><span class="o">=</span><span class="s">"ocrx_line"</span><span class="p">></span>
+<span class="p">&lt;</span><span class="nt">span</span> <span class="na">class</span><span class="o">=</span><span class="s">'ocr_line'</span> <span class="na">title</span><span class="o">=</span><span class="s">"bbox 16 16 860 47"</span><span class="p">></span>Aus den Gewinn- und Verlust-<span class="p">&lt;</span><span class="p">/</span><span class="nt">span</span><span class="p">></span>
+<span class="p">&lt;</span><span class="nt">span</span> <span class="na">class</span><span class="o">=</span><span class="s">'ocr_line'</span> <span class="na">title</span><span class="o">=</span><span class="s">"bbox 302 62 603 98"</span><span class="p">></span>rechnungen<span class="p">&lt;</span><span class="p">/</span><span class="nt">span</span><span class="p">></span>
+<span class="p">&lt;</span><span class="p">/</span><span class="nt">span</span><span class="p">></span>
+...
+</pre>
+</div>
<h4 class="heading settled" data-level="9.1.3" id="ocrx_word"><span class="secno">9.1.3. </span><span class="content"><code>ocrx_word</code></span><a class="self-link" href="#ocrx_word"></a></h4>
<ul>
<li data-md="">
@@ -2804,7 +2820,6 @@ <h2 class="no-num no-ref heading settled" id="issues-index"><span class="content
<div class="issue"> There is currently no way of indicating anchoring or flow-around
properties for floating elements; properties need to be defined for this.<a href="#issue-3f2f70ed"></a></div>
<div class="issue"> <a href="https://github.com/kba/hocr-spec/issues/28">ocr_carea vs ocrx_block</a><a href="#issue-66c198d9"></a></div>
-<div class="issue"> <a href="https://github.com/kba/hocr-spec/issues/19">ocr_line vs ocrx_line</a><a href="#issue-8ef34561"></a></div>
<div class="issue"> <a href="https://github.com/kba/hocr-spec/issues/9">Delete x_cost</a><a href="#issue-b35297dd"></a></div>
<div class="issue"> <a href="https://github.com/kba/hocr-spec/issues/2">XML namespace for hOCR HTML?</a><a href="#issue-f6d39356"></a></div>
<div class="issue"> <a href="https://github.com/kba/hocr-spec/issues/1">What DOCTYPE for hOCR HTML?</a><a href="#issue-a3899b99"></a></div>
30 changes: 27 additions & 3 deletions 1.2/spec.md
@@ -600,10 +600,34 @@ Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28)

### `ocrx_line`

-Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19)
-
-* any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+* any kind of "line" returned by an OCR system that differs from [[#ocr_line]]
* might be some kind of "logical" line
* examples include line continuations and rowspan in tables
+
+<div class="example">
+
+Consider the following snippet, containing a wide-spaced heading broken over
+two physical lines:
+
+<figure>
+<img
+width=600
+alt="Wide spaced two line heading"
+src="../images/akf-widespaced-heading.png"/>
+</figure>
+
+An OCR engine could produce the following output, indicating the two physical
+lines that form a single logical line:
+
+```html
+...
+<span class="ocrx_line">
+<span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
+<span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
+</span>
+...
+```
+</div>

### `ocrx_word`

Binary file added images/akf-widespaced-heading.png
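
Note on consuming the new example: the nesting added in this commit (two physical `ocr_line` spans inside one `ocrx_line`) could be read back into a single logical line roughly as follows. This is a minimal Python sketch, not part of the spec or of any hOCR library. The names `LogicalLineParser`, `parse_bbox`, `join_physical_lines` and `union_bbox` are hypothetical, and the hyphen handling is a naive assumption: any trailing hyphen on a physical line is treated as a line-break hyphen. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

# The example markup from this commit (surrounding "..." lines omitted).
HOCR = """
<span class="ocrx_line">
<span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
<span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
</span>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class LogicalLineParser(HTMLParser):
    """Hypothetical helper: group ocr_line spans under their ocrx_line parent."""

    def __init__(self):
        super().__init__()
        self.stack = []          # class attribute of every open <span>
        self.logical_lines = []  # one list of (bbox, text) per ocrx_line
        self.current = None      # [bbox, text] of the ocr_line being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        if cls == "ocrx_line":
            self.logical_lines.append([])
        elif cls == "ocr_line" and "ocrx_line" in self.stack:
            self.current = [parse_bbox(attrs.get("title") or ""), ""]
        self.stack.append(cls)

    def handle_data(self, data):
        if self.current is not None:
            self.current[1] += data

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        if self.stack.pop() == "ocr_line" and self.current is not None:
            self.logical_lines[-1].append(tuple(self.current))
            self.current = None


def join_physical_lines(lines):
    """Concatenate physical lines into the text of one logical line."""
    text = ""
    for _, part in lines:
        part = part.strip()
        if text.endswith("-"):
            text = text[:-1] + part  # naive: drop the line-break hyphen
        elif text:
            text += " " + part
        else:
            text = part
    return text


def union_bbox(lines):
    """Smallest box enclosing all physical lines, or None without bboxes."""
    boxes = [bbox for bbox, _ in lines if bbox is not None]
    if not boxes:
        return None
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


parser = LogicalLineParser()
parser.feed(HOCR)
parser.close()
for lines in parser.logical_lines:
    print(join_physical_lines(lines), union_bbox(lines))
# prints: Aus den Gewinn- und Verlustrechnungen (16, 16, 860, 98)
```

The hyphen rule is deliberately simplistic. A physical line that ends in a genuine hyphen (for example a break right after "Gewinn-") would be merged incorrectly, so a real consumer would likely need dictionary-based or language-aware de-hyphenation.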