hocr-lines: Fix printing of lines with UTF-8 characters #29
Signed-off-by: Stefan Weil <sw@weilnetz.de>
What currently happens when the script tries to output non-ASCII characters? Does the program stop with an error, or does it output something else?
@@ -20,4 +20,4 @@ doc = html.fromstring(stream.read())
 lines = doc.xpath("//*[@class='ocr_line']")
 
 for line in lines:
-    print get_text(line)
+    print get_text(line).encode("utf-8")
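For readers following the diff, here is a minimal sketch of what the added `.encode("utf-8")` call does to a text string. The helper `encode_line` and the sample word are hypothetical and not part of hocr-lines. The ASCII case is included only as an assumption about how an implicit conversion can fail when the output encoding is ASCII; it is not a confirmed reproduction of the reported problem.

```python
def encode_line(text, encoding="utf-8"):
    """Return the text as bytes in the given encoding (hypothetical helper)."""
    return text.encode(encoding)


sample = u"Stra\u00dfe"  # contains one non-ASCII character

# Explicit UTF-8 encoding always succeeds for valid text.
utf8_bytes = encode_line(sample)

# Encoding the same text as ASCII fails, which is what an implicit
# conversion would do if the output stream were assumed to be ASCII.
try:
    encode_line(sample, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```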
It also makes sense to look at the other occurrences of get_text:
https://github.com/tmbdev/hocr-tools/search?utf8=%E2%9C%93&q=get_text (running the search again may be better)
Here are some results from the old code:
The new code worked for production files, but not for …
I am closing this PR and will look for a better solution.
The PR works for production files, because those files start with a line which describes their character encoding:
So the PR works fine for all hOCR files that explicitly declare UTF-8 as their encoding (other encodings still need to be tested). Even though it does not work with Python 3, I suggest using it as an intermediate solution and therefore re-opening it.
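As a possible follow-up to the Python 3 concern, here is a minimal sketch of an output helper that writes UTF-8 bytes on both Python 2 and Python 3. The function `write_line` is hypothetical and not part of hocr-tools. It assumes that `get_text` returns a text (unicode) string and that writing raw bytes to standard output is acceptable.

```python
import io
import sys


def write_line(text, stream=None):
    """Write text plus a newline to a byte stream as UTF-8 (hypothetical helper)."""
    if stream is None:
        # Python 3 exposes the underlying byte stream as sys.stdout.buffer;
        # on Python 2, sys.stdout itself accepts byte strings.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(text.encode("utf-8") + b"\n")


# Usage with an in-memory byte stream instead of standard output.
demo = io.BytesIO()
write_line(u"Stra\u00dfe", demo)
```

In the loop from the diff, this would replace the print statement with a call such as `write_line(get_text(line))`, under the assumptions stated above.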
It stops with an error (…
I am not sure this is true for all hOCR files. The output of Tesseract seems to have this line, but this file, for example, is missing it: https://github.com/tmbdev/hocr-tools/blob/master/test/testdata/sample.html. We should actually look at up-to-date output from Ocropus and perhaps from other programs that produce hOCR. I am fine with the PR as an intermediate solution (we should leave the issue open). However, I guess that the same problem also occurs in …
It seems to be true for hOCR output from Tesseract (our current "production" files). Let's discuss all other cases in issue #53. |
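To check which hOCR files carry an explicit encoding, something like the following sketch could be used. The function `declared_encoding` is hypothetical, and the two patterns rest on an assumption: that the encoding is declared either in an XML declaration at the start of the file or in a `<meta>` element with a `charset` value. Files from other producers may declare it differently or not at all.

```python
import re

# Assumed forms of an encoding declaration (not confirmed for every producer).
XML_DECL = re.compile(br'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(br'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.I)


def declared_encoding(data):
    """Return the declared encoding in lowercase, or None if none is found."""
    head = data[:1024]  # declarations are expected near the start of the file
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


with_xml_decl = b'<?xml version="1.0" encoding="UTF-8"?>\n<html></html>'
with_meta = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head></html>'
without_any = b'<html><body></body></html>'
```

Running this over the test data and over current Tesseract and Ocropus output would show which files fall into the undeclared case.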