What DOCTYPE for hOCR HTML? #1

kba · 2016-04-09T00:50:51Z

No description provided.

amitdo · 2016-07-17T13:15:29Z

Tesseract uses the XHTML 1.0 DOCTYPE.
https://github.com/tesseract-ocr/tesseract/blob/3.04.01/api/renderer.cpp#L140
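
For anyone who wants to check which DOCTYPE a given hOCR file declares, here is a minimal sketch. The sample headers are assumptions for illustration (the XHTML one uses the Transitional variant; the exact declaration a given Tesseract version writes should be confirmed against the linked source), and `classify_doctype` is a hypothetical helper, not part of any hOCR tool.

```python
import re

# Assumed sample headers, for illustration only.
XHTML_SAMPLE = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"></html>'''

HTML5_SAMPLE = "<!DOCTYPE html>\n<html></html>"

# Captures everything between "<!DOCTYPE" and the closing ">".
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def classify_doctype(text):
    """Return 'html5', 'xhtml1.0', 'other', or 'none' for an hOCR document."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return "none"
    # Collapse line breaks and repeated spaces inside the declaration.
    body = " ".join(match.group(1).split())
    if body.lower() == "html":
        return "html5"
    public_id = re.search(r'PUBLIC\s+"([^"]+)"', body, re.IGNORECASE)
    if public_id and "XHTML 1.0" in public_id.group(1):
        return "xhtml1.0"
    return "other"


if __name__ == "__main__":
    print(classify_doctype(XHTML_SAMPLE))
    print(classify_doctype(HTML5_SAMPLE))
```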

kba · 2016-07-17T13:22:02Z

Thanks for the info, @amitdo. The current version of ocropy uses it as well.


stweil · 2019-07-18T12:37:44Z

See also the code in Tesseract Git master.

stweil · 2019-07-18T12:44:19Z

Even a very simple hOCR output from Tesseract has lots of errors according to the W3C validator. Is that acceptable?

Would doctype HTML5 be better for new software (Tesseract 5)?

kba · 2019-07-18T14:31:08Z

> Is that acceptable?

Many of those errors result from XHTML being based on XML and the XHTML DTD defining very strictly what is and isn't allowed. In my experience, those kinds of errors are irrelevant when using an HTML parser, so I find it acceptable because it doesn't prevent me from working with the data. Producing invalid data is suboptimal, though.
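
To illustrate the point about HTML parsers, here is a minimal sketch using Python's standard-library `html.parser`, which does not validate against a DTD. The sample markup, the `WordBoxParser` class, and the helper functions are hypothetical and only for illustration; the unclosed `<br>` stands in for the kind of markup an XML parser would reject.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment; the bare <br> is not well-formed XML.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 200 100'>
<span class='ocr_line' title='bbox 10 10 190 40'>
<span class='ocrx_word' title='bbox 10 10 60 40; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 70 10 130 40; x_wconf 91'>hOCR</span>
</span><br>
</div>"""


def parse_bbox(title):
    """Pull the 'bbox x0 y0 x1 y1' property out of an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collects text and bounding box for each element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._current = {"bbox": parse_bbox(attrs.get("title") or ""), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested <span> elements.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["bbox"])
```

The parser never consults the DOCTYPE, which is why DTD-level validity errors do not get in the way of extracting the data.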

> Would doctype HTML5 be better for new software (Tesseract 5)?

Definitely, HTML5 is much easier to work with. Had HTML5 been established, hOCR could have been so much simpler and more expressive...

However, I think focussing on ALTO support and to a lesser extent PAGE-XML support would be more important.

kba modified the milestone: Version 1.1 May 9, 2016

kba mentioned this issue Sep 26, 2016

correct MIME type for hOCR? #27

Open

kba pushed a commit that referenced this issue Sep 29, 2016

Merge pull request #1 from kba/master

1a32dc4

keep update

kba added a commit that referenced this issue Sep 29, 2016

Link to issue #2, #1

8127e4a

kba added a commit that referenced this issue Sep 29, 2016

Link to issue #2, #1

1e9577f
