Check handling of non ASCII characters in hOCR files #53
Comments
stweil@digi: |
Python 3 is a special problem, because hocr-lines (with fixed print statements) outputs wrong text even for normal ASCII characters when using Python 3. We have to decide these questions:
What does lxml do if it reads data without explicit encoding? Is it possible to specify the encoding of the input data? The encoding of the text output could default to UTF-8, with an optional command parameter specifying a different encoding. |
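The output side of this proposal could look roughly like the sketch below. It is not taken from hocr-lines: the option name `--encoding` and the helper `write_lines` are hypothetical, and the only point is that the text is encoded explicitly, with UTF-8 as the default.

```python
import argparse
import io


def build_parser():
    # "--encoding" is a hypothetical option name; the comment above only
    # proposes "an optional command parameter".
    parser = argparse.ArgumentParser(description="print text lines")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the text output (default: utf-8)")
    return parser


def write_lines(lines, stream, encoding):
    # Encode explicitly and write bytes, so the result does not depend on
    # the locale or on the default string handling of the Python version.
    for line in lines:
        stream.write((line + "\n").encode(encoding))


# Default: UTF-8 output.
default_args = build_parser().parse_args([])
buffer = io.BytesIO()
write_lines(["Größe"], buffer, default_args.encoding)
utf8_output = buffer.getvalue()

# Explicit override on the command line.
latin1_args = build_parser().parse_args(["--encoding", "iso-8859-1"])
buffer = io.BytesIO()
write_lines(["Größe"], buffer, latin1_args.encoding)
latin1_output = buffer.getvalue()
```

In a real script the stream would probably be `sys.stdout.buffer` on Python 3 rather than a `BytesIO` object.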
While hOCR files generated by Tesseract use XHTML 1.0 with explicit UTF-8 encoding, OCRopus seems to use UTF-8 encoding without explicitly saying so. This needs further investigation, maybe also a fix for OCRopus. |
Well, the crucial question would be: what does the hOCR specification say about the HTML format? But AFAIK it is not that clear there... |
I looked for an answer there and did not find one. The specification allows both XML and HTML formats. For XML, the situation is simple: explicit encoding is recommended, and UTF-8 encoding is the default (see w3schools.com). For HTML, explicit encoding is possible, but often not used (see w3schools.com for details). |
The best choice for both your questions is IMO UTF-8. If there is no encoding stated explicitly, then I would use UTF-8 as fallback. |
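A minimal sketch of that fallback rule, using only the standard library. The function names `declared_encoding` and `decode_hocr` are made up for illustration, and the two regular expressions are a rough approximation of an XML declaration and an HTML `<meta>` charset, not a full implementation of the HTML encoding-sniffing rules.

```python
import re

# Rough patterns for an explicit encoding declaration near the start of a file.
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)',
                          re.IGNORECASE)


def declared_encoding(data):
    """Return the encoding named in the document, or None if there is none."""
    head = data[:1024]
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return None


def decode_hocr(data, fallback="utf-8"):
    """Decode raw bytes, using the fallback when nothing is declared."""
    return data.decode(declared_encoding(data) or fallback)


explicit = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<html><body>Größe</body></html>').encode("iso-8859-1")
implicit = '<html><body>Größe</body></html>'.encode("utf-8")

explicit_text = decode_hocr(explicit)
implicit_text = decode_hocr(implicit)
```

Both calls recover the same text although the two inputs use different byte encodings. A declared encoding name that Python does not know would raise `LookupError`, which this sketch does not handle.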
I'd use UTF-8, too, but it looks like the standard says that ISO-8859-1 was the default before HTML 5. For HTML 5, UTF-8 is the default according to w3schools.com. |
There are ways to tell lxml which encoding to use and to handle conflicts between implicit (string level) and explicit (document-defined) encoding. I suspect though that most of the issues have more to do with how Python 2 uses strings. Strings ( Maybe we can start by gathering some test data to better define the expected output and pinpoint the problems? |
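As a starting point for such test data, the snippet below shows the kind of mismatch in question. It is written for Python 3, where bytes and text are separate types; the sample word is an arbitrary choice and is not taken from any hOCR file.

```python
# A word with two non-ASCII characters, as raw UTF-8 bytes.
raw = "Größe".encode("utf-8")

# Decoding with the right encoding gives back the five characters.
text = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 does not fail, but every
# two-byte UTF-8 sequence turns into two wrong characters.
wrong = raw.decode("iso-8859-1")

# The damage is reversible as long as the wrong text is not modified.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
```

Pairs of input bytes and expected text like these could serve as the expected output for each tool.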
The current OCRopus engine outputs UTF-8 and also explicitly states this (twice) in its hOCR file, e.g. https://github.com/zuphilip/ocr-fileformat-samples/blob/9fb01c76425c97c572a5824ac354666fddb8602d/samples/hocr/1.1/433934212_0017.html |
As PR #29 shows, there is a problem when `hocr-lines` gets lines which contain umlauts or other non-ASCII characters (UTF-8 encoded). Maybe more tools are affected.