hocr-lines: Fix printing of lines with UTF-8 characters #29

Merged
merged 1 commit into from
Sep 4, 2016

Conversation

@stweil stweil commented Aug 5, 2016

Signed-off-by: Stefan Weil <sw@weilnetz.de>

zuphilip commented Aug 8, 2016

What happens currently when trying to output non ASCII characters? Does the program stop with an error or output something else?

@@ -20,4 +20,4 @@ doc = html.fromstring(stream.read())
 lines = doc.xpath("//*[@class='ocr_line']")

 for line in lines:
-    print get_text(line)
+    print get_text(line).encode("utf-8")

It would make sense to check the other occurrences of get_text as well: https://github.com/tmbdev/hocr-tools/search?utf8=%E2%9C%93&q=get_text (it may be better to run the search again).


stweil commented Sep 3, 2016

Here are some results from the old code:

# Output to stdout: wrong encoding of umlaut.
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ ./hocr-lines test/hocr-check/ancestor/ok-line.html
Wenn ist das Nunstück git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput!

# Output to file or device: UnicodeEncodeError.
stweil@digi:~/src/github.com/tmbdev/hocr-tools$ ./hocr-lines test/hocr-check/ancestor/ok-line.html >/dev/null
Traceback (most recent call last):
  File "./hocr-lines", line 22, in <module>
    print get_text(line)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-19: ordinal not in range(128)

The new code worked for production files, but not for test/hocr-check/ancestor, which is strange. The results also differ between Python 2 and Python 3.

I am closing this PR and will look for a better solution.
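
One possible reading of "position 18-19" in the traceback above: the umlaut in "Nunstück" sits at index 18, and a single character reported as two positions suggests the UTF-8 bytes were decoded with a single-byte codec before printing. The sketch below reproduces that span in plain Python. The Latin-1 fallback is an assumption made for illustration, not something confirmed for lxml or for this test file.

```python
# Illustration only: Latin-1 as the wrong codec is an assumption.
line = u"Wenn ist das Nunst\u00fcck git und Slotermeyer?"

# The test file stores the umlaut as two UTF-8 bytes.
raw = line.encode("utf-8")

# Decoding those bytes with a single-byte codec yields two characters
# where the umlaut used to be.
misdecoded = raw.decode("latin-1")

# Encoding to ASCII is what Python 2's print falls back to when stdout
# is redirected and no encoding is known.
try:
    misdecoded.encode("ascii")
    error_span = None
except UnicodeEncodeError as exc:
    # exc.end is exclusive, so the last offending index is exc.end - 1.
    error_span = (exc.start, exc.end - 1)

print(error_span)  # (18, 19)
```

If this reading is right, calling `.encode("utf-8")` on such a string would avoid the exception but emit doubly encoded bytes, which would fit the observation that the change helps only for files that declare their encoding.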


stweil commented Sep 4, 2016

The PR works for production files, because those files start with a line which describes their character encoding:

<?xml version="1.0" encoding="UTF-8"?>

So the PR works fine for all hOCR files that explicitly declare UTF-8 encoding (tests for other encodings are still needed). Even though it does not work with Python 3, I suggest using it as an interim solution and am therefore reopening it.

> What happens currently when trying to output non ASCII characters? Does the program stop with an error or output something else?

It stops with an error (UnicodeEncodeError), see example above.
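
For the Python 2 versus Python 3 difference mentioned above, one option is to write the encoded bytes to the underlying byte stream instead of using `print`. This is a minimal sketch only; `write_line_utf8` is a hypothetical helper and not part of hocr-tools.

```python
import io
import sys


def write_line_utf8(text, stream=None):
    """Write one text line as UTF-8 bytes (hypothetical helper)."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write((text + u"\n").encode("utf-8"))


# Usage with an in-memory byte stream instead of the real stdout.
buf = io.BytesIO()
write_line_utf8(u"Nunst\u00fcck", buf)
```

Because the bytes never pass through the terminal's codec, the output should be the same whether stdout is a terminal, a file, or a pipe. It still assumes the text was decoded correctly when the file was parsed.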

@stweil stweil reopened this Sep 4, 2016

zuphilip commented Sep 4, 2016

> The PR works for production files, because those files start with a line which describes their character encoding:

I am not sure this is true for all hOCR files. The output of Tesseract seems to have this line, but this file, for example, is missing it: https://github.com/tmbdev/hocr-tools/blob/master/test/testdata/sample.html. We should also look at up-to-date output from Ocropus and maybe some other programs that produce hOCR.

I am fine with the PR as an interim solution (we should leave the issue open). However, I guess the same problem also occurs in hocr-extract-images and maybe in other scripts that use get_text.


stweil commented Sep 4, 2016

It seems to be true for hOCR output from Tesseract (our current "production" files). Let's discuss all other cases in issue #53.

@stweil stweil merged commit ddc346d into ocropus:master Sep 4, 2016
@stweil stweil deleted the utf8 branch September 4, 2016 09:36