This repository has been archived by the owner on Oct 3, 2022. It is now read-only.

Tesseract: 3.02: Malformed hOCR document: character zones intermixed with non-character zones #8

Open
jwilk opened this issue Jan 15, 2014 · 2 comments

Comments

@jwilk
Member

jwilk commented Jan 15, 2014

Issue reported by anonymous at Bitbucket:

Thank you very much for ocrodjvu. I am using ocrodjvu with the options --engine=tesseract -l deu. Versions are:

tesseract: 3.02
ocrodjvu: 0.7.16

With the attached page I get the following exception:

/usr/share/ocrodjvu/lib/hocr.py:435: EncodingWarning: byte 0x10 in position 25317: control character
  contents = utils.sanitize_utf8(contents)
Exception while processing page 1:
Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 418, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 401, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 271, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 473, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 374, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 239, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 285, in _scan
    raise errors.MalformedHocr("character zones intermixed with non-character zones")
MalformedHocr: Malformed hOCR document: character zones intermixed with non-character zones

Attachment: t-p-086.pgm.djvu.zip
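For readers unfamiliar with the error, here is a minimal sketch of the kind of consistency check that could produce this message. It is not ocrodjvu's actual code: the `MalformedHocr` class below is a stand-in, and `check_children` and the string zone kinds (`'char'`, `'word'`, `'line'`) are hypothetical names chosen for illustration. The assumption is that a node's children must be either all character zones or all non-character zones.

```python
class MalformedHocr(Exception):
    """Stand-in for the exception seen in the traceback (hypothetical)."""


def check_children(kinds):
    """Reject a child list that mixes character and non-character zones.

    `kinds` is a list of zone-kind strings such as 'char', 'word' or 'line'.
    This representation is an assumption made for this sketch.
    """
    has_char = any(kind == 'char' for kind in kinds)
    has_other = any(kind != 'char' for kind in kinds)
    if has_char and has_other:
        raise MalformedHocr(
            'character zones intermixed with non-character zones'
        )
    return kinds


# All-character and all-word child lists pass the check.
check_children(['char', 'char', 'char'])
check_children(['word', 'word'])

# A mixed list is rejected.
try:
    check_children(['char', 'word', 'char'])
except MalformedHocr as exc:
    print('rejected:', exc)
```

Under this reading, the exception would mean that the hOCR emitted by Tesseract nested a character-level element next to a word-level or line-level sibling somewhere on the page.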

@jwilk
Member Author

jwilk commented Apr 21, 2014

I can't reproduce it here. :-(
Could you try upgrading Tesseract to 3.02.02 and see if it helps?
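To tell whether the installed version is older than 3.02.02, something like the following sketch could be used. It assumes the version banner has the form `tesseract 3.02.02`, which may not hold for every build; `parse_tesseract_version` is a hypothetical helper, not part of ocrodjvu or Tesseract.

```python
import re


def parse_tesseract_version(banner):
    """Turn a banner such as 'tesseract 3.02.02' into a tuple of ints.

    The banner format is an assumption; adjust the pattern if your
    build prints something different.
    """
    match = re.search(r'tesseract\s+v?(\d+(?:\.\d+)*)', banner)
    if match is None:
        raise ValueError('no version found in: %r' % (banner,))
    return tuple(int(part) for part in match.group(1).split('.'))


reported = parse_tesseract_version('tesseract 3.02')
suggested = parse_tesseract_version('tesseract 3.02.02')

# Tuples compare element by element, so (3, 2) sorts before (3, 2, 2).
print('upgrade needed:', reported < suggested)
```

The banner string itself would come from running the `tesseract` binary with its version option and reading whichever stream it prints to.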

@jwilk
Member Author

jwilk commented Nov 11, 2014

Ping?

@jwilk changed the title from "Crash with tesseract" to "Tesseract: 3.02: Malformed hOCR document: character zones intermixed with non-character zones" on Feb 11, 2019