Check handling of non ASCII characters in hOCR files #53

stweil · 2016-09-03T21:33:12Z

As PR #29 shows, there is a problem when hocr-lines gets lines which contain umlauts or other non-ASCII characters (UTF-8 encoded). Other tools may be affected as well.

stweil · 2016-09-04T06:31:58Z

stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python3 ./hocr-lines test/hocr-lines/umlaut.html
b'\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f'
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ vi test/hocr-lines/umlaut.html
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python2 ./hocr-lines test/hocr-lines/umlaut.html
ÄÖÜäöüß
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ vi test/hocr-lines/umlaut.html
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ python2 ./hocr-lines test/hocr-lines/umlaut.html
Ã�Ã�Ã�Ã¤Ã¶Ã¼Ã�
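The three outputs in this session are consistent with one byte sequence being shown in three ways: as a Python 3 bytes repr, correctly decoded as UTF-8, and wrongly decoded as Latin-1. The sketch below reproduces that in isolation. It is only an illustration of the suspected mechanism and does not call hocr-lines; the helper names are made up for this example.

```python
# The umlaut test string from the session above, written with escapes.
TEXT = "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df"


def as_python3_bytes_repr(text):
    """What print() shows in Python 3 when handed UTF-8 bytes instead of str."""
    return repr(text.encode("utf-8"))


def as_mojibake(text):
    """UTF-8 bytes misread as Latin-1: every character turns into two."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled):
    """Undo the misreading, assuming it was exactly one Latin-1 decode."""
    return garbled.encode("latin-1").decode("utf-8")
```

Each umlaut becomes "Ã" followed by a second character, and for the uppercase letters that second character is a C1 control code, which a terminal typically cannot display.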

stweil · 2016-09-04T06:39:14Z

Python 3 is a special problem, because hocr-lines (with fixed print statements) outputs wrong text even for normal ASCII characters when using Python 3.

We have to decide these questions:

Which encoding do we assume if the input data has no explicit encoding?
Which encoding do we want for the text output?

What does lxml do if it reads data without explicit encoding? Is it possible to specify the encoding of the input data?

The encoding of the text output could default to UTF-8, with an optional command parameter specifying a different encoding.
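A minimal sketch of that idea, using only the standard library. The `--encoding` option and the `write_lines` helper are hypothetical and are not part of hocr-lines today; they only show how a UTF-8 default with an override could look.

```python
import argparse
import io


def build_parser():
    # Hypothetical command line: --encoding is an assumed option name.
    parser = argparse.ArgumentParser(prog="hocr-lines")
    parser.add_argument("file", nargs="?", help="hOCR input file")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the text output (default: utf-8)")
    return parser


def write_lines(lines, stream, encoding="utf-8"):
    """Encode each text line explicitly and write it to a binary stream."""
    for line in lines:
        stream.write((line + "\n").encode(encoding))


if __name__ == "__main__":
    import sys
    args = build_parser().parse_args()
    # sys.stdout.buffer is the binary stream behind stdout in Python 3.
    write_lines(["\u00e4\u00f6\u00fc"], sys.stdout.buffer, args.encoding)
```

Writing bytes to a binary stream keeps the output independent of the terminal locale, which is one way to get the same result from different environments.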

stweil · 2016-09-04T09:44:01Z

While hOCR files generated by Tesseract use XHTML 1.0 with explicit UTF-8 encoding, OCRopus seems to use UTF-8 encoding without explicitly saying so. This needs further investigation, and maybe also a fix for OCRopus.

zuphilip · 2016-09-04T09:56:46Z

Well, the crucial question would be what the hOCR specification says about the HTML format. But AFAIK it is not that clear there...

stweil · 2016-09-04T10:04:00Z

I looked for an answer there and did not find one. The specification allows both XML and HTML formats. For XML, the situation is simple: explicit encoding is recommended, and UTF-8 encoding is the default (see w3schools.com). For HTML, explicit encoding is possible, but often not used (see w3schools.com for details).

zuphilip · 2016-09-04T10:16:01Z

The best choice for both your questions is IMO UTF-8. If there is no encoding stated explicitly, then I would use UTF-8 as fallback.
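A rough sketch of such a fallback, assuming the input is read as bytes. The function name and the simple regular expression are illustrative only; a real HTML or XML parser does more thorough encoding detection than this.

```python
import re

# Looks for encoding="..." (XML declaration) or charset=... (HTML meta tag).
_DECLARED = re.compile(
    rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)""",
    re.IGNORECASE,
)


def decode_hocr(data, fallback="utf-8"):
    """Decode hOCR bytes with the declared encoding, else with the fallback."""
    match = _DECLARED.search(data[:1024])  # declarations sit near the top
    name = match.group(1).decode("ascii") if match else fallback
    try:
        return data.decode(name)
    except LookupError:
        # Unknown encoding name in the file: use the fallback instead.
        return data.decode(fallback)
```

Input that declares nothing is treated as UTF-8, while a file that declares ISO-8859-1 is still read correctly.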

stweil · 2016-09-04T10:20:27Z

I'd use UTF-8, too, but it looks like the standard says that ISO-8859-1 was the default before HTML 5. For HTML 5, UTF-8 is the default according to w3schools.com.

kba · 2016-09-04T10:27:23Z

There are ways to tell lxml which encoding to use and to handle conflicts between implicit (string level) and explicit (document-defined) encoding.
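One such way is the `encoding` argument of `lxml.etree.HTMLParser`, which overrides the parser's own detection. The sketch below assumes lxml is installed and that input without a declaration should be read as UTF-8; the `line_texts` helper and the sample markup are made up for illustration and are not taken from hocr-lines.

```python
from lxml import etree

# hOCR-like bytes with UTF-8 umlauts and no encoding declaration.
SAMPLE = (b"<html><body><span class='ocr_line'>"
          b"\xc3\x84\xc3\x96\xc3\x9c</span></body></html>")


def line_texts(data, encoding="utf-8"):
    """Parse HTML bytes with a forced input encoding and return line texts."""
    parser = etree.HTMLParser(encoding=encoding)
    root = etree.fromstring(data, parser=parser)
    # Collect the text of every element whose class is exactly ocr_line.
    return ["".join(el.itertext())
            for el in root.xpath("//*[@class='ocr_line']")]
```

Because the parser is told the encoding, the returned strings do not depend on what the parser would have guessed for undeclared input.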

I suspect, though, that most of the issues have more to do with how Python 2 uses strings. Strings (str) and bytes are the same datatype, which causes characters outside the latin1 range to become garbled if they are not correctly decoded. Using unicode throughout for strings can solve many of these problems but will cause new ones when upgrading to Python 3.

Maybe we can start by gathering some test data to better define the expected output and pinpoint the problems?

zuphilip · 2016-09-13T08:51:18Z

The current Ocropus engine outputs UTF-8 and also explicitly states this (twice) in its hocr file, e.g. https://github.com/zuphilip/ocr-fileformat-samples/blob/9fb01c76425c97c572a5824ac354666fddb8602d/samples/hocr/1.1/433934212_0017.html

stweil mentioned this issue Sep 4, 2016

hocr-lines: Fix printing of lines with UTF-8 characters #29

Merged

zuphilip mentioned this issue Sep 17, 2016

Release v1.0.1 #65

Closed

zuphilip mentioned this issue Oct 9, 2016

Fix hocr-combine, add tests #81

Merged
