What DOCTYPE for hOCR HTML? #1
Comments
Tesseract uses the XHTML 1.0 DOCTYPE.
Thanks for the info, @amitdo. So does the current version of ocropy.
See also the code in Tesseract Git master.
Even a very simple hOCR output from Tesseract produces lots of errors in the W3C validator. Is that acceptable? Would the HTML5 doctype be better for new software (Tesseract 5)?
Many of those errors result from XHTML being based on XML and the XHTML DTD defining very strictly what is allowed and what isn't. In my experience, those kinds of errors are irrelevant when using an HTML parser, so I find it acceptable because it doesn't prevent me from working with the data. Producing invalid data is suboptimal though.
Definitely, HTML5 is much easier to work with. Had HTML5 been established at the time, hOCR could have been so much simpler and more expressive... However, I think focussing on ALTO support and, to a lesser extent, PAGE-XML support would be more important.
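To illustrate the point about HTML parsers above, here is a minimal sketch using Python's standard-library `html.parser`, which never validates against the DTD named in the DOCTYPE. The sample markup, the `WordCollector` class and the `extract_words` helper are made up for this example and are not taken from Tesseract or ocropy; real output will differ in detail.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR-like sample with an XHTML 1.0 DOCTYPE (not real Tesseract output).
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title></title></head>
 <body>
  <div class='ocr_page' title='bbox 0 0 640 480'>
   <span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 93'>Hello</span>
   <span class='ocrx_word' title='bbox 120 20 230 50; x_wconf 90'>world</span>
  </div>
 </body>
</html>
"""

# hOCR stores the bounding box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collects (text, bbox) pairs from ocrx_word spans; no DTD validation happens."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "span" and attr_map.get("class") == "ocrx_word":
            match = BBOX_RE.search(attr_map.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_word and text:
            self.words.append((text, self._bbox))


def extract_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from an hOCR string."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE_HOCR):
        print(word, bbox)
```

The parser treats the XML declaration and the DOCTYPE as opaque tokens, so whether the document would pass the W3C validator has no effect on getting the words and boxes out.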