Check handling of non ASCII characters in hOCR files #53

Open
stweil opened this issue Sep 3, 2016 · 9 comments

@stweil
Collaborator

stweil commented Sep 3, 2016

As PR #29 shows, there is a problem when hocr-lines gets lines which contain umlauts or other non-ASCII characters (UTF-8 encoded). More tools may be affected.

@stweil
Collaborator Author

stweil commented Sep 4, 2016

stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python3 ./hocr-lines test/hocr-lines/umlaut.html
b'\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f'
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ vi test/hocr-lines/umlaut.html
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python2 ./hocr-lines test/hocr-lines/umlaut.html
ÄÖÜäöüß
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ vi test/hocr-lines/umlaut.html
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python2 ./hocr-lines test/hocr-lines/umlaut.html
���äöü�

@stweil
Collaborator Author

stweil commented Sep 4, 2016

Python 3 is a special problem, because hocr-lines (with fixed print statements) outputs wrong text even for normal ASCII characters when using Python 3.
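
The `b'...'` output of the Python 3 run is consistent with a bytes object being printed directly: Python 3 prints the repr of bytes, so even plain ASCII text comes out wrapped in `b'...'`. A minimal sketch of that effect, assuming (this is not confirmed from the hocr-lines source) that the tool ends up holding the line text as UTF-8 encoded bytes:

```python
# Assumption: the line text is held as UTF-8 encoded bytes.
raw = b'\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f'

# In Python 3, str() (and therefore print()) of a bytes object
# yields its repr, even when the content is pure ASCII.
shown_ascii = str(b'abc')

# Decoding with the matching codec recovers the original text.
decoded = raw.decode('utf-8')

# Decoding with a different single-byte codec raises no error
# but produces mojibake: two characters per umlaut.
mojibake = raw.decode('latin-1')
```

Here `decoded` is the seven-character string from the test file, while `mojibake` has fourteen characters.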

We have to decide these questions:

  • Which encoding do we assume if the input data has no explicit encoding?
  • Which encoding do we want for the text output?

What does lxml do if it reads data without explicit encoding? Is it possible to specify the encoding of the input data?

The encoding of the text output could default to UTF-8, with an optional command parameter specifying a different encoding.
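
A rough sketch of what such an option could look like. The option name `--encoding` and the helpers `build_parser` and `write_line` are hypothetical and not part of hocr-lines; the idea is only that the text is encoded explicitly and written as bytes, so the result does not depend on the locale:

```python
import argparse
import io
import sys

def build_parser():
    # "--encoding" is a hypothetical option name, not an existing flag.
    parser = argparse.ArgumentParser(description="print hOCR line text")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the text output (default: utf-8)")
    return parser

def write_line(text, encoding, stream=None):
    # Encode explicitly and write bytes; characters that the target
    # encoding cannot represent are replaced with "?".
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(text.encode(encoding, errors="replace") + b"\n")

# Usage with an in-memory stream instead of stdout.
args = build_parser().parse_args(["--encoding", "iso-8859-1"])
buf = io.BytesIO()
write_line("ÄÖÜ", args.encoding, buf)
result = buf.getvalue()
```

With no option given, `build_parser().parse_args([])` yields `utf-8` as the encoding.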

@stweil
Collaborator Author

stweil commented Sep 4, 2016

While hOCR files generated by Tesseract use XHTML 1.0 with explicit UTF-8 encoding, OCRopus seems to use UTF-8 encoding without explicitly saying so. This needs further investigation, and maybe also a fix for OCRopus.

@zuphilip
Collaborator

zuphilip commented Sep 4, 2016

Well, the crucial question is what the hOCR specification says about the HTML format. But AFAIK it is not that clear there...

@stweil
Collaborator Author

stweil commented Sep 4, 2016

I looked for an answer there and did not find one. The specification allows both XML and HTML formats. For XML, the situation is simple: explicit encoding is recommended, and UTF-8 encoding is the default (see w3schools.com). For HTML, explicit encoding is possible, but often not used (see w3schools.com for details).

@zuphilip
Collaborator

zuphilip commented Sep 4, 2016

The best choice for both your questions is IMO UTF-8. If no encoding is stated explicitly, then I would use UTF-8 as fallback.
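
A minimal sketch of that fallback rule using only the standard library. The function name `decode_hocr` and the regular expressions are assumptions made for illustration; they only look for an XML declaration or a `<meta ... charset=...>` near the start of the file and are not a full HTML encoding sniffer:

```python
import re

# [\x22\x27] matches a double or single quote.
_XML_DECL = re.compile(
    br"<\?xml[^>]*encoding=[\x22\x27]([A-Za-z0-9._-]+)", re.IGNORECASE)
_META = re.compile(
    br"<meta[^>]*charset=[\x22\x27]?([A-Za-z0-9._-]+)", re.IGNORECASE)

def decode_hocr(data, fallback="utf-8"):
    """Decode hOCR bytes, using the declared encoding if there is one."""
    head = data[:1024]  # declarations are expected near the top
    match = _XML_DECL.search(head) or _META.search(head)
    encoding = match.group(1).decode("ascii") if match else fallback
    return data.decode(encoding), encoding

# No declaration: the UTF-8 fallback applies.
plain = "<html><body>ÄÖÜ</body></html>".encode("utf-8")
plain_text, plain_encoding = decode_hocr(plain)

# Explicit declaration: it takes precedence over the fallback.
latin = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
         '<html><body>ÄÖÜ</body></html>').encode("iso-8859-1")
latin_text, latin_encoding = decode_hocr(latin)
```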

@stweil
Collaborator Author

stweil commented Sep 4, 2016

I'd use UTF-8, too, but it looks like the standard says that ISO-8859-1 was the default before HTML 5. For HTML 5, UTF-8 is the default according to w3schools.com.

@kba
Contributor

kba commented Sep 4, 2016

There are ways to tell lxml which encoding to use and to handle conflicts between implicit (string level) and explicit (document-defined) encoding.
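
One such way, as a sketch: `lxml.etree.HTMLParser` accepts an `encoding` argument that is applied when parsing bytes. The helper name `parse_hocr_bytes` and the sample markup are made up for illustration, and the sample deliberately carries no encoding declaration:

```python
from lxml import etree

def parse_hocr_bytes(data, encoding="utf-8"):
    # Tell the parser which encoding to assume for the raw bytes.
    parser = etree.HTMLParser(encoding=encoding)
    return etree.fromstring(data, parser)

# UTF-8 bytes without any charset declaration in the document.
sample = ("<html><body><span class='ocr_line'>ÄÖÜäöüß</span>"
          "</body></html>").encode("utf-8")

root = parse_hocr_bytes(sample)
# string() returns the concatenated text content of the first match.
line_text = root.xpath("string(//span[@class='ocr_line'])")
```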

I suspect, though, that most of the issues have more to do with how Python 2 handles strings. Strings (str) and bytes are the same datatype, which causes characters outside the latin1 range to become garbled if they are not correctly decoded. Using unicode throughout for strings can solve many of these problems but will cause new ones when upgrading to Python 3.

Maybe we can start by gathering some test data to better define the expected output and pinpoint the problems?

@zuphilip
Collaborator

The current OCRopus engine outputs UTF-8 and also explicitly states this (twice) in its hOCR file, e.g. https://github.com/zuphilip/ocr-fileformat-samples/blob/9fb01c76425c97c572a5824ac354666fddb8602d/samples/hocr/1.1/433934212_0017.html
