What DOCTYPE for hOCR HTML? #1

Open
kba opened this issue Apr 9, 2016 · 5 comments

@kba
Owner

kba commented Apr 9, 2016

No description provided.

@kba kba modified the milestone: Version 1.1 May 9, 2016
@amitdo
Collaborator

amitdo commented Jul 17, 2016

@kba
Owner Author

kba commented Jul 17, 2016

Thanks for the info, @amitdo, so does the current version of ocropy.

kba pushed a commit that referenced this issue Sep 29, 2016
kba added a commit that referenced this issue Sep 29, 2016
kba added a commit that referenced this issue Sep 29, 2016
@stweil
Contributor

stweil commented Jul 18, 2019

See also the code in Tesseract Git master.

@stweil
Contributor

stweil commented Jul 18, 2019

Even a very simple hOCR output from Tesseract has lots of errors according to the W3C validator. Is that acceptable?

Would doctype HTML5 be better for new software (Tesseract 5)?

@kba
Owner Author

kba commented Jul 18, 2019

> Is that acceptable?

Many of those errors result from XHTML being based on XML, with the XHTML DTD defining very strictly what is allowed and what isn't. In my experience, those kinds of errors are irrelevant when using an HTML parser, so I find it acceptable because it doesn't prevent me from working with the data. Producing invalid data is suboptimal though.
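
To illustrate the point about HTML parsers, here is a minimal sketch using Python's standard `html.parser`. The sample markup, the `WordCollector` class, and the `parse_bbox` and `extract_words` helpers are all hypothetical names made up for this example; none of them come from Tesseract or ocropy. The sample deliberately contains things an XHTML validator would reject (an unquoted attribute, an unclosed `<p>`), and the word boxes can still be read.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, invented for illustration. It is not valid
# XHTML: the class attribute on <p> is unquoted and <p> is never closed.
SAMPLE_HOCR = """<html><body>
<div class='ocr_page' title='bbox 0 0 200 100'>
<p class=ocr_par>
<span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 70 20 130 40; x_wconf 91'>world</span>
</div>
</body></html>"""


def parse_bbox(title):
    """Return the bbox property of an hOCR title attribute as four ints."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    return None


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans are not nested, which holds for this sample.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

A strict XML parser would stop at the unquoted attribute, which is the trade-off being described: lenient parsing keeps the data usable, but it does not make the output valid.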

> Would doctype HTML5 be better for new software (Tesseract 5)?

Definitely, HTML5 is much easier to work with. Had HTML5 been established, hOCR could have been so much simpler and more expressive...

However, I think focussing on ALTO support and, to a lesser extent, PAGE-XML support would be more important.
