hocr-lines: Fix printing of lines with UTF-8 characters #29

stweil · 2016-08-05T18:39:22Z

Signed-off-by: Stefan Weil <sw@weilnetz.de>

zuphilip · 2016-08-08T18:25:01Z

What currently happens when the program tries to output non-ASCII characters? Does it stop with an error or output something else?

zuphilip · 2016-08-08T18:30:01Z

hocr-lines

@@ -20,4 +20,4 @@ doc = html.fromstring(stream.read())
 lines = doc.xpath("//*[@class='ocr_line']")

 for line in lines:
-    print get_text(line)
+    print get_text(line).encode("utf-8")


It also makes sense to look at the other occurrences of get_text: https://github.com/tmbdev/hocr-tools/search?utf8=%E2%9C%93&q=get_text (it may be better to run the search again)

stweil · 2016-09-03T21:29:13Z

Here are some results from the old code:

# Output to stdout: wrong encoding of umlaut.
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ ./hocr-lines test/hocr-check/ancestor/ok-line.html
Wenn ist das NunstÃ¼ck git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput!

# Output to file or device: UnicodeEncodeError.
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ ./hocr-lines test/hocr-check/ancestor/ok-line.html >/dev/null
Traceback (most recent call last):
  File "./hocr-lines", line 22, in <module>
    print get_text(line)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-19: ordinal not in range(128)

The new code worked for production files, but not for test/hocr-check/ancestor, which is strange. The results also differ between Python 2 and Python 3.
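
Both symptoms above are consistent with the UTF-8 bytes of the test file being decoded with a single-byte encoding. The following is a minimal sketch of that effect using only the standard library. Latin-1 as the fallback encoding is an assumption that matches the output shown; it is not confirmed in this thread.

```python
# Sketch: reproduce the mojibake without lxml (Latin-1 fallback is an assumption).
text = u"Nunst\u00fcck"            # intended text, with a real u-umlaut
raw = text.encode("utf-8")         # bytes as stored in a UTF-8 hOCR file
misread = raw.decode("latin-1")    # every UTF-8 byte becomes one character

# The umlaut is now two characters, U+00C3 and U+00BC.
assert misread == u"Nunst\u00c3\u00bcck"

# Encoding with the ASCII codec (what Python 2 uses when stdout is
# redirected) fails on those two characters.
try:
    misread.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc
```

In the full line, "Wenn ist das " adds 13 characters in front, so the two misread characters land at positions 18-19, which matches the traceback.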

I am closing this PR and will look for a better solution.

stweil · 2016-09-04T06:31:19Z

The PR works for production files, because those files start with a line which describes their character encoding:

<?xml version="1.0" encoding="UTF-8"?>

So the PR works fine for all hOCR files that explicitly declare UTF-8 encoding (other encodings still need to be tested). Even though it does not work with Python 3, I suggest using it as an intermediate solution and am therefore re-opening it.

> What happens currently when trying to output non-ASCII characters? Does the program stop with an error or output something else?

It stops with an error (UnicodeEncodeError), see example above.
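
Since the `.encode("utf-8")` change is reported above not to work with Python 3, one possible direction is to always write bytes. A minimal sketch, assuming a hypothetical helper `write_utf8_line` (not part of hocr-tools) and text that has already been decoded correctly:

```python
import sys


def write_utf8_line(text, stream=None):
    """Write one line of text as UTF-8 bytes on Python 2 and Python 3.

    Hypothetical helper for illustration; not taken from hocr-tools.
    """
    if stream is None:
        stream = sys.stdout
    # Python 3 text streams expose the underlying byte stream as .buffer;
    # Python 2's sys.stdout and plain byte streams accept bytes directly.
    binary = getattr(stream, "buffer", stream)
    binary.write(text.encode("utf-8") + b"\n")
    binary.flush()
```

Used as `write_utf8_line(get_text(line))` inside the loop, this would bypass the ASCII default that applies when stdout is redirected. It does not repair text that was already misread while parsing.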

zuphilip · 2016-09-04T09:11:39Z

> The PR works for production files, because those files start with a line which describes their character encoding:

I am not sure whether this is true for all hOCR files. The output of Tesseract seems to have this line, but this file, for example, is missing it: https://github.com/tmbdev/hocr-tools/blob/master/test/testdata/sample.html. We should also look at up-to-date output from Ocropus and maybe from other programs that produce hOCR.

I am fine with the PR as an intermediate solution (we should leave the issue open). However, I guess that the same problem also occurs in hocr-extract-images and maybe in other scripts that use get_text.

stweil · 2016-09-04T09:33:31Z

It seems to be true for hOCR output from Tesseract (our current "production" files). Let's discuss all other cases in issue #53.

hocr-lines: Fix printing of lines with UTF-8 characters

5a75b0d

Signed-off-by: Stefan Weil <sw@weilnetz.de>

zuphilip reviewed Aug 8, 2016

stweil closed this Sep 3, 2016

stweil mentioned this pull request Sep 3, 2016

Check handling of non ASCII characters in hOCR files #53

Open

stweil reopened this Sep 4, 2016

stweil merged commit ddc346d into ocropus:master Sep 4, 2016

stweil deleted the utf8 branch September 4, 2016 09:36
