From b69b3429d4542d854c9fff042195a18f987671f0 Mon Sep 17 00:00:00 2001
From: Konstantin Baierer
Date: Fri, 30 Sep 2016 01:41:37 +0200
Subject: [PATCH] Example for ocrx_line, #19

---
 1.2/index.bs                      |  30 +++++++++++++++++++++++++++---
 1.2/index.html                    |  21 ++++++++++++++++++---
 1.2/spec.md                       |  30 +++++++++++++++++++++++++++---
 images/akf-widespaced-heading.png | Bin 0 -> 5355 bytes
 4 files changed, 72 insertions(+), 9 deletions(-)
 create mode 100644 images/akf-widespaced-heading.png

diff --git a/1.2/index.bs b/1.2/index.bs
index 59ce6d4..a5f63da 100644
--- a/1.2/index.bs
+++ b/1.2/index.bs
@@ -656,10 +656,34 @@ Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28)

 ### `ocrx_line`

-Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19)
-
-  * any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+  * any kind of "line" returned by an OCR system that differs from [[#ocr_line]]
   * might be some kind of "logical" line
+  * examples include line continuations and rowspan in tables
+
+
+
+Consider the following snippet, containing a wide-spaced heading broken over
+two physical lines:
+
+
+  Wide spaced two line heading
+
+
+An OCR engine could produce the following output, indicating the two physical
+lines that form a single logical line:
+
+```html
+...
+<span class="ocrx_line">
+  <span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
+  <span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
+</span>
+...
+```
+

 ### `ocrx_word`

diff --git a/1.2/index.html b/1.2/index.html
index f83ecca..7e8ed50 100644
--- a/1.2/index.html
+++ b/1.2/index.html
@@ -2124,13 +2124,29 @@

9.1.2. ocrx_line

-

ocr_line vs ocrx_line

  • -

    any kind of "line" returned by an OCR system that differs from the standard ocr_line above

    +

    any kind of "line" returned by an OCR system that differs from §6.1.4 ocr_line

  • might be some kind of "logical" line

    +
  • +

    examples include line continuations and rowspan in tables

+
+ +

Consider the following snippet, containing a wide-spaced heading broken over +two physical lines:

+
Wide spaced two line heading
+

An OCR engine could produce the following output, indicating the two physical +lines that form a single logical line:

+
...
+<span class="ocrx_line">
+  <span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
+  <span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span> 
+</span>
+...
+
+
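To illustrate how a consumer might use the `ocrx_line` wrapper from the example above, here is a minimal sketch in Python that groups the nested `ocr_line` spans by their enclosing `ocrx_line` and joins them into one logical line. It is not part of the patch or the spec: the names `LogicalLineParser` and `join_physical_lines` are hypothetical, and dropping a trailing hyphen before joining is a naive assumption about line continuations, not behavior the spec defines.

```python
from html.parser import HTMLParser


class LogicalLineParser(HTMLParser):
    """Group the text of ocr_line spans by their enclosing ocrx_line span."""

    def __init__(self):
        super().__init__()
        self.stack = []          # kind of every currently open <span>
        self.logical_lines = []  # one list of physical-line strings per ocrx_line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocrx_line" in classes:
            kind = "ocrx_line"
            self.logical_lines.append([])
        elif "ocr_line" in classes:
            kind = "ocr_line"
            # Only start a physical line when we are inside a logical line.
            if "ocrx_line" in self.stack:
                self.logical_lines[-1].append("")
        else:
            kind = ""
        self.stack.append(kind)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        inside = "ocrx_line" in self.stack and "ocr_line" in self.stack
        if inside and self.logical_lines and self.logical_lines[-1]:
            self.logical_lines[-1][-1] += data


def join_physical_lines(parts):
    """Join physical lines; a trailing hyphen is treated as a continuation (assumption)."""
    text = ""
    for part in (p.strip() for p in parts):
        if text.endswith("-"):
            text = text[:-1] + part
        elif text:
            text += " " + part
        else:
            text = part
    return text


SAMPLE = """
<span class="ocrx_line">
  <span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
  <span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
</span>
"""

parser = LogicalLineParser()
parser.feed(SAMPLE)
parser.close()
result = [join_physical_lines(parts) for parts in parser.logical_lines]
print(result)
```

The sketch uses only the standard library so it stays self-contained. A real consumer would likely also read the `bbox` values from each `title` attribute, which this example ignores.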

9.1.3. ocrx_word

  • @@ -2816,7 +2832,6 @@

    There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.
-
diff --git a/1.2/spec.md b/1.2/spec.md
index 212c9d0..f983ffe 100644
--- a/1.2/spec.md
+++ b/1.2/spec.md
@@ -627,10 +627,34 @@ Issue: [ocr_carea vs ocrx_block](https://github.com/kba/hocr-spec/issues/28)

 ### `ocrx_line`

-Issue: [ocr_line vs ocrx_line](https://github.com/kba/hocr-spec/issues/19)
-
-  * any kind of "line" returned by an OCR system that differs from the standard ocr_line above
+  * any kind of "line" returned by an OCR system that differs from [[#ocr_line]]
   * might be some kind of "logical" line
+  * examples include line continuations and rowspan in tables
+
+
+
+Consider the following snippet, containing a wide-spaced heading broken over
+two physical lines:
+
+
+  Wide spaced two line heading
+
+
+An OCR engine could produce the following output, indicating the two physical
+lines that form a single logical line:
+
+```html
+...
+<span class="ocrx_line">
+  <span class='ocr_line' title="bbox 16 16 860 47">Aus den Gewinn- und Verlust-</span>
+  <span class='ocr_line' title="bbox 302 62 603 98">rechnungen</span>
+</span>
+...
+```
+
    ### `ocrx_word` diff --git a/images/akf-widespaced-heading.png b/images/akf-widespaced-heading.png new file mode 100644 index 0000000000000000000000000000000000000000..baaeea3c8bca811240a0dc6ab8f9c73fab98288f GIT binary patch literal 5355 zcmbtYXH*khlMb(nVkIIHI*JIQ7!i;T@+zQW2kAuR}l;?Sb%51KfZ+KJ#;YE)(GD=OSZ#``$x~sN*64 zz$uH{Hx0}JN9HNRRihed2d+zkcXz>A!G)SZNosf6n}i#>&irzg$BcwEBx&~>8puC@al zLnVr<>#l9gm}xv%d;?sH5J~K0D4hx2<-dxtoSuow%RLqzCP;Xs*H~^TzCtWb@Awr| zKfKv-`^6T{ilC1=p_!5Dyg6WhY@Cw9a4@YPB~9|QNEd#52=`d(@=RT5o{;hXUJq5p zoX}<$F&j1FP27CP^IttK*f2on=hKOVCy4cy#ogUbJ8qTRtX+@kU2Sf;c-MN+{Ruq_ z$j^9DS=#TChHJ4ILCE6QT2RvLQ)QQ%5(yC|Pl{`&zCk|P+L~mBnmyvx6xbr?)hJ=Y zBu$ieu;BW&+|hM9xm5rB92KJwm!GfMD3&zraJVHUSP*q6552|F1q&uQ&vRoNryIl^ zw=G%8tE6|jz8z{}>y*Oe=;#g79g%484A7+OM4^lAj&naICQn?1HQ-DyM@eojo4yN) z=#JvB2n~tIkCB;qB!NB-d>lOBa=8Fn=K_W5iPe4VoRk!zg>D~q9k%wX;M+Z>;AgR; z_<2Mf);eoA3Res9{Q764iR7R zCp3Njs5l9b%W@wWzAnmZK7FG$1Ul>3iO>*pqg#vOBZ5**N$~dJjt;r~OIc`7@+@Ie zUt~A!7!}VP6k{Sqzd^NYgbL+4w`&;MW>Z6&Yqbf%K}wEted}M1&Gk1@>`bK9i+j0| z)|WA<1nJVa9{ zvPna9Abtu+HVsRXiuDW7Pk@xua}3H@P8Krhj0qk?=N+C0{d!iy_IB*A{YKMdvL6I* z1zi`U=JjUjR%$tmKM)e?fk^40``o)O)>OL$$Q@L2(Dunf5zshXZZPvy)*O?KM)EzO z&se%;)OK{*AsN)i5fS3Tl)RykQ^u*xbTLfm^$UiB!rN`N@6MNy4%1-5Jc^V=;NO)= zovD#hYol(xg@8Zol%zjo#*aqE0g~0_v|JYZZ&{kgs#C&{C{{VgE5v;lpIiQ0I)Zj< z3*%!B;{uyUTejFED_^f_E%!(m*jnAAzv@)K{Oe30vrx&OkWiYQ*7=P4VjHIu(NZ{E zH}avHz?}BD(+NJ~0J&Me?rtk(tyZ>5mW_M5kG^ z#MivaB^y6U=v{CO$UIm-S*La|kj#)Y97VQsdKG+Dd)p?y#ow?O_;^CEQ*(}iNfreT zt_EdqO-patFTQzSJn)#%u)RQ;slE%TgkFJTQuMSoR=$XkhAamdH|;6?d{vrH+c-;-2(T?rzN$9ys6m; zXz}`5PMSyMc3=Mp$J}4?vd3uHsR3D=9}6<;jujFdi!ad!r5gOEYWs@TbUU@ z#vW>|Sa-0Sep@3EynkEeB@x3=c31*Y{QY-K1s;bEGupXt?uBhImVV$I1b_KvuDntq zjCXenAw#-=75ja-h-&v^qs?&**!Ge0m8kXMN_hldEdSD`WItKXrZ(5ZacxWleh3n| zVA(GS&!ol!0@crS)r4R8>wilodz|BaF`OS5zPOJW2~-9gf&|LbpI#S?R^Z>S{0|*X zl#i)(EAe*qoP^pUWn%L|-N=VY>8(qB%@2(A1Ua}A$V^q|;5BC94UxnG{F|a!72P*G z?(wKy>)JH5914ZC@pQ%FAPF^r`>pEd3WB-R$-!VA*Z6(vbZPyVHR~`4cmrBUPHfEf 
zP2kXrha{-r7r<;ff|xS81%LEnwR{tcsbE;nHN8@3Lx9@$-`cEp3p!ch=(ZXXI;3|J zD=iFuAqyWt1csfm)i#h!HK4CqVsye*I&xVKjo2>^?12ptX#%h z&VuRu-5cg4)4fZG2lgl{>lCvD?M$czqwn_(qMl55pHpQ&b3?y}oC)wuSFD+IjmS3j z-NAJ^`#;Z~QL(bJxGvaiNQBKlOF@kjwNYMPDb3suXl7ganX)z#j<4ue`vaFxL@0LB3tL~d-*~9z%5mEX8AH+!-a&}l^p17`!2WRu)4Dz z@q$v|GII#wYCrsAv-coUGTL!^St)0HY#TQG%~8Mc>w5D6HCW;9g=U4LZCmSpcuZxj z!*hG;4SfVcCicu3ovZOU=rCGiD88@6x={`RTPK84JMYN6&fKhF91mbD$sJESmgOaB zG2M1T8JUE7Yg}JXwE~fprzAY?eRig6!jyvg2SC8)CS^Dq%g0flDkU&OR)1ddH;ToX zd}FVF7`=0G2jO!}=q0H7=^&#za`pt|y5K%jh~=r~TaB{(r=iu;=2t&qrV)I5;z$|( z4GVYPfO^XnBpVU+MS(vN5moeUg|N@4<$A!WPM z1DrW0Z8|+(hn!aW{HQLR(msn$>$dwLQTxECrK!8^JIkaZDbyEwPC`VKdL@T^b44== zDJas?lruGv&ki~FUc}X0RHS*6NXQ)h4#%B-C|a<~bFh--`qHr1-r%w~B_-Dq=U#IB z%FFr;ubk!g)H>aPjjfrt(G}_mNT#`zsBE$eHHv5Njmg9&A)=17sJ6AOkTX!&u$Q1n zW23;k)>s*SsVlVZSk4qp>#hYMucsmjiu-b30Mew8?^cwB_xU*27|?~3C~9lc*z*e} zHf7r|nZM9_rma0-9_>iC#y#_}D3V)1^Gk`aAeK$+T%+%5^hubq+OfOI#5sxNsaFr+3^5gFrE<~k>6i+jvNxn(3j*1y z2WqIWOx$@PJ(k>(x}pQhi?_}fP_-^MIv98TR`sTTRoi0h>^ax|Y*@W<*ar6&90r|R zk#;%R(&Rd+Lo?N>44-PH!KF<5n_Sfvs-@{r)?L*wEql1V?10t8FW@8d%N8pQpZMpCpfe@* z*NX8z=8`8{0Jd6*SyHC?`Fz!%))Ps-s52jIs;nkq`t?dwG;()=W_<@fsGKdP?*A%y3dGTd$?$VXPQsq4( zAAsN|%Ia{%y3qN-L)Z6yFY;H_j{U!1(?1G&I5xmB3myk+Cet(4vfZ7YQp}1JuR8S6 zaUjIXGn~1AYN-UX0xew$Dz8FFc4lQ~#Hv8@zrVUDq*o88dOL9+*UnqIFtP^a68J2# zflI&hWu$=K)PcUAlRQw5{!ZWH&g^fKMccfwGNQSBuHyqbAEo>oiTP|4nMwi=l9!Ym zxjusxlxaJW0Rzl|)_2l-RaOVBbGFV}uPDK19V9jU!84xpP=Yz52I5!Xbi2FZ01Ii$I z+$zFA4%Bo0XXcaOZ0S?qU!T+7o55Ti1At8|IhS(M_1&MwNaM zD>itza?ivsS-%5ZJoNEYyU>lxibh~#ish1-=0;cQ@11RaaBSOz^*sv{YZ!6vQ+yPx zJ&qGH7PygB8$Mi@PgECT;S-6Ae#Q!?Ab`h$)V8-%ueem#NL$@Ij;4QJg8?Up@VuHrd%dCEnDl847XH^^CYaExp9 ziga4NV?r!>A;%4KcODN4EYPU```5g>$I4onQ*N$a#zn08g zFal*3AZu{9^c(}6pUgbjzOTo?>`8c1fP)g7tEYRss5^(>PSumjic5-EzZR_Qkz%@XWo%P6RX(73#QuGd zyIFUG$wUb%Psi&sxVENLxo%JV&6>&t6h^cD$wZ`DE zM_>I)N}lbuedVz8oYlcA+KM5>e714TJ z=|DU_v%F@kE`z|y-1l|7S;NB)x!fn~Rn{{#!7T~#>&nJUV)$@v%`ZlCR7a~3!x1dp 
zQ7&D7jnfT2Um_`#94u_$sv?YX@pFS#RYgmy&mJZR5V<59KTLvb+kmp4N8TW#gpIZJPgBI1?4{oWs`7bY>7w*GDtF<^7rZ`m$Yt|(QfVcx-uzqe z$1R6^sg>;4mv~rR_LGj$-;sUGGps42=BiFoOTpxOj!(KU>*tp+9UN>#u!ZwP|4Z;W zh5;t42K*NB*}#_k+{mKH)7WRz+U5><`?tqn&m4cHu2~4-ovUj$3m;S$+`)+UCjrQM)`>vd;wLt|22^@g{G)Q1Wz<