AppleDict binary crash with Japanese characters: lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range #473

Closed
whabbot opened this issue Jun 3, 2023 · 17 comments

@whabbot

whabbot commented Jun 3, 2023

When attempting to convert the Sanseido Super Daijirin.dictionary file included in macOS (13.3.1) to CSV and SQL, the conversion fails with lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2. This seems similar to #275.

Versions:
Python 3.9.6 (error log included) and 3.11.3
pyglossary 4.6.1
lxml 4.9.2

Error log (full output attached):

Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 423, in convertEntryBytesToXml
    entryRoot = etree.fromstring(entryFull)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 826, in _write
    self._writeEntries(writerList, filename)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 760, in _writeEntries
    for entry in self:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 321, in _readersEntryGen
    yield from self._applyEntryFiltersGen(reader)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 335, in _applyEntryFiltersGen
    for entry in gen:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 623, in __iter__
    entry = self.createEntry(entryBytes, articleAddress)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 371, in createEntry
    entryRoot = self.convertEntryBytesToXml(entryBytes)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 426, in convertEntryBytesToXml
    f"len(buf)={len(self._buf)}, {entryFull=}",
                    ^^^^^^^^^
AttributeError: 'Reader' object has no attribute '_buf'

error_log.txt

@ilius
Owner

ilius commented Jun 4, 2023

No, it's different from #275.
It seems this glossary's XML includes a zero byte.

@ilius
Owner

ilius commented Jun 4, 2023

I pushed a commit.
Please try the latest code:

pip install git+https://github.com/ilius/pyglossary

@whabbot
Author

whabbot commented Jun 4, 2023

Thanks for your fast response! I ran that and it seemed to give the same error but with a slightly different message (full log attached below):

[ERROR] entryFull='<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="7144" d:title="𩺊" class="entry" lang="ja"><span class="hg x_xh0"><span d:prn="1" role="text" class="hw">あら <d:prn></d:prn></span><span class="pr t_アクセントG"><span class="xrg"><span class="xr"><a href="x-dictionary:r:fbm_AccentPatterns" title="アクセントの型"><span class="ph t_アクセント x_rr">2</span></a></span></span></span><span class="fg">【<span class="f"><span class="general-text">𩺊</span></span>】</span></span><span class="sg"><span class="se1 x_xd0"><span class="msDict x_xd1 t_core"><span d:def="1" role="text" class="df">スズキ目の海魚。全長1<span class="general-text">メートル</span>に達する。体形はスズキに似て,やや長く側扁し,口はとがって大きい。背は灰褐色で腹は白色。幼魚には口から尾に至る灰褐色の縦帯がある。冬が旬で美味。北海道以南からフィリピンまでのやや深海に分布。ホタ。スズキ。<d:def></d:def></span></span></span></span></d:entry>'
[ERROR] Exception while calling plugin's write function
Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 903, in _write
    self._writeEntries(writerList, filename)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 837, in _writeEntries
    for entry in self:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 393, in _readersEntryGen
    yield from self._applyEntryFiltersGen(reader)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 407, in _applyEntryFiltersGen
    for entry in gen:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 696, in __iter__
    entry = self.createEntry(entryBytes, articleAddress)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 399, in createEntry
    entryRoot = self.convertEntryBytesToXml(entryBytes)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 464, in convertEntryBytesToXml
    raise e
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 459, in convertEntryBytesToXml
    entryRoot = etree.fromstring(entryFull)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

error_log_2.txt

EDIT:
I also ran it with the Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary file and got the same error: error_log_3.txt

@soshial
Contributor

soshial commented Jun 4, 2023

I printed out entryBytes in convertEntryBytesToXml(), inside the exception handler, and it didn't contain any 0x0 bytes:

b'<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="7144" d:title="\xf0\xa9\xba\x8a" class="entry" lang="ja"><span class="hg x_xh0"><span d:prn="1" role="text" class="hw">\xe3\x81\x82\xe3\x82\x89 <d:prn></d:prn></span><span class="pr t_\xe3\x82\xa2\xe3\x82\xaf\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x88G"><span class="xrg"><span class="xr"><a href="x-dictionary:r:fbm_AccentPatterns" title="\xe3\x82\xa2\xe3\x82\xaf\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x88\xe3\x81\xae\xe5\x9e\x8b"><span class="ph t_\xe3\x82\xa2\xe3\x82\xaf\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x88 x_rr">2</span></a></span></span></span><span class="fg">\xe3\x80\x90<span class="f"><span class="general-text">\xf0\xa9\xba\x8a</span></span>\xe3\x80\x91</span></span><span class="sg"><span class="se1 x_xd0"><span class="msDict x_xd1 t_core"><span d:def="1" role="text" class="df">\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe7\x9b\xae\xe3\x81\xae\xe6\xb5\xb7\xe9\xad\x9a\xe3\x80\x82\xe5\x85\xa8\xe9\x95\xb71<span class="general-text">\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88\xe3\x83\xab</span>\xe3\x81\xab\xe9\x81\x94\xe3\x81\x99\xe3\x82\x8b\xe3\x80\x82\xe4\xbd\x93\xe5\xbd\xa2\xe3\x81\xaf\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe3\x81\xab\xe4\xbc\xbc\xe3\x81\xa6\xef\xbc\x8c\xe3\x82\x84\xe3\x82\x84\xe9\x95\xb7\xe3\x81\x8f\xe5\x81\xb4\xe6\x89\x81\xe3\x81\x97\xef\xbc\x8c\xe5\x8f\xa3\xe3\x81\xaf\xe3\x81\xa8\xe3\x81\x8c\xe3\x81\xa3\xe3\x81\xa6\xe5\xa4\xa7\xe3\x81\x8d\xe3\x81\x84\xe3\x80\x82\xe8\x83\x8c\xe3\x81\xaf\xe7\x81\xb0\xe8\xa4\x90\xe8\x89\xb2\xe3\x81\xa7\xe8\x85\xb9\xe3\x81\xaf\xe7\x99\xbd\xe8\x89\xb2\xe3\x80\x82\xe5\xb9\xbc\xe9\xad\x9a\xe3\x81\xab\xe3\x81\xaf\xe5\x8f\xa3\xe3\x81\x8b\xe3\x82\x89\xe5\xb0\xbe\xe3\x81\xab\xe8\x87\xb3\xe3\x82\x8b\xe7\x81\xb0\xe8\xa4\x90\xe8\x89\xb2\xe3\x81\xae\xe7\xb8\xa6\xe5\xb8\xaf\xe3\x81\x8c\xe3\x81\x82\xe3\x82\x8b\xe3\x80\x82\xe5\x86\xac\xe3\x81\x8c\xe6\x97\xac\xe3\x81\xa7\xe7\xbe\x8e\xe5\x91\xb3\xe3\x80\x82\xe5\x8c\x97\xe6\xb5\xb7\xe9\x81\x93\xe4\xbb\xa5\xe5\x8d\x97\xe3\x81\x8b\xe3\x82\x89\xe3\x83\x95\xe3\x82\xa3\xe3\x83\xaa\xe3\x83\x94\xe3\x83\xb3\xe3\x81\xbe\xe3\x81\xa7\xe3\x81\xae\xe3\x82\x84\xe3\x82\x84\xe6\xb7\xb1\xe6\xb5\xb7\xe3\x81\xab\xe5\x88\x86\xe5\xb8\x83\xe3\x80\x82\xe3\x83\x9b\xe3\x82\xbf\xe3\x80\x82\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe3\x80\x82<d:def></d:def></span></span></span></span></d:entry>\n'

Looks perfectly normal. Hmm... why does it fire this exception?

@whabbot
Author

whabbot commented Jun 4, 2023

Looking into it a bit more, I think it may be an issue with etree.fromstring and the size of the Unicode characters.

With Sanseido Super Daijirin.dictionary, the failed entry contains the character 𩺊, which takes 4 bytes in UTF-8 (\U00029e8a), and with Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary, the failed entry contains 𠮟, which is also a 4-byte character (\U00020b9f).
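
The observation can be checked with the standard library alone: both characters lie outside the Basic Multilingual Plane, so each takes four bytes in UTF-8 and a surrogate pair in UTF-16. This is a minimal sketch; the helper name describe is made up for illustration.

```python
def describe(ch: str) -> dict:
    """Summarise how a single character is represented in common encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "outside_bmp": ord(ch) > 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # utf-16-le is used so the 2-byte BOM is not counted
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


# The two characters from the failing entries, plus a BMP character for contrast.
for ch in ("\U00029e8a", "\U00020b9f", "\u3042"):
    print(describe(ch))
```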

Running just this in Python gives the same error:

>>> from lxml import etree
>>> etree.fromstring('<root>𩺊</root>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

@soshial
Contributor

soshial commented Jun 4, 2023

There seems to be a bug in lxml, but creating the tree succeeds with entryRoot = etree.parse(BytesIO(entryBytes)). However, I then get another error: AttributeError: 'lxml.etree._ElementTree' object has no attribute 'nsmap'
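
That AttributeError is expected: etree.parse returns a tree object rather than an element, and in lxml nsmap lives on the element. A minimal sketch, assuming the entry is available as UTF-8 bytes; the sample is the shortened entry used elsewhere in this thread, and the import fallback is only an assumption so the snippet runs without lxml installed:

```python
from io import BytesIO

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption for illustration: the stdlib parser offers the same two calls.
    import xml.etree.ElementTree as etree

entry_bytes = '<entry id="7144" title="test">\U00029e8a</entry>\n'.encode("utf-8")

tree = etree.parse(BytesIO(entry_bytes))  # a tree wrapper, not an element
root = tree.getroot()                     # the root element itself

# With lxml, namespace prefixes are exposed as root.nsmap, not tree.nsmap.
print(type(tree).__name__, type(root).__name__, root.get("id"))
```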

@ilius
Owner

ilius commented Jun 4, 2023

etree.parse is not the right function.

It works for me with lxml 4.9.2:

>>> etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'))
'<entry id="7144" title="test">𩺊</entry>'

Maybe because my lxml is installed from Debian's repository.
@whabbot How did you install lxml? Did you install it with pip, and did pip compile it from source?

Can you try this in Python's console?

from lxml import etree

etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'.encode('utf-16')))

@soshial
Contributor

soshial commented Jun 4, 2023

My versions, on macOS:

  1. lxml 4.9.2
  2. Python 3.9.6 (default, Mar 10 2023, 20:16:38) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin

etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n') raises the error.

etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'.encode(encoding='utf-16')) raises no error.
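
Since bytes input parses while str input fails on this setup, one possible workaround is to always hand the parser bytes. A minimal sketch, not necessarily what the project ended up doing; parse_entry is a hypothetical helper, and the import fallback only exists so the snippet runs without lxml:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption for illustration: the stdlib parser also has fromstring().
    import xml.etree.ElementTree as etree


def parse_entry(entry):
    """Parse one entry, always passing bytes to the XML parser."""
    if isinstance(entry, str):
        # UTF-8 is the XML default when there is no encoding declaration.
        entry = entry.encode("utf-8")
    return etree.fromstring(entry)


root = parse_entry('<entry id="7144" title="test">\U00029e8a</entry>\n')
print(root.get("id"), len(root.text))
```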

@ilius
Owner

ilius commented Jun 4, 2023

I pushed to branch issue-473
Please try again with this:

pip install git+https://github.com/ilius/pyglossary@issue-473

@whabbot
Author

whabbot commented Jun 4, 2023

I installed using pip install git+https://github.com/ilius/pyglossary@issue-473 and now Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary converts without issue.

Sanseido Super Daijirin.dictionary completes without crashing and it seems most if not all entries are there, but there are a lot of errors (>500) being printed to the console, which all look like:

[ERROR] bad unicode in '\ud842\udf9fり付けけり', keyData={'priority': 0, 'parentalControl': 0, 'keyword': '\ud842\udf9fり付けけり', 'headword': '\ud842\udf9fり付ける', 'entryTitle': '', 'anchor': ''}, 
error: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
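
The two escapes in that message, \ud842\udf9f, are the UTF-16 surrogate pair for 𠮟 (U+20B9F) stored as two separate code points, which the UTF-8 codec refuses to encode. One way to merge such pairs, sketched with the standard library only (the function name is made up, and this is not necessarily how the fix in the branch works):

```python
def merge_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs that ended up as separate code points."""
    # "surrogatepass" writes the lone surrogates out as plain 16-bit units;
    # decoding those units as UTF-16 then joins each pair into one character.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


broken = "\ud842\udf9f\u308a\u4ed8\u3051\u308b"  # the headword from the log
fixed = merge_surrogates(broken)

print(len(broken), len(fixed))
fixed.encode("utf-8")  # encodes without raising UnicodeEncodeError
```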

Full output: output.txt

@ilius
Owner

ilius commented Jun 4, 2023

I pushed a commit.
Please try again.

@whabbot
Author

whabbot commented Jun 5, 2023

That works without any errors! Thank you.

@ilius
Owner

ilius commented Jun 5, 2023

I merged into master.
Thanks.

@ilius ilius closed this as completed Jun 5, 2023
@soshial
Contributor

soshial commented Jun 5, 2023

I have created a somewhat more elegant solution: soshial@81c8ecc. Words in the keyTextData file are actually encoded as UTF-16 byte strings.
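
To illustrate why the encoding matters: if the keyword bytes are UTF-16, decoding the whole buffer in one step yields proper characters, whereas converting each 16-bit unit with chr() yields the lone surrogates seen in the earlier errors. A minimal sketch; little-endian byte order and the sample bytes are assumptions for illustration, not taken from the plugin or the linked commit:

```python
import struct

# Stand-in for keyword bytes read from the file (assumed UTF-16, little-endian).
raw = "\U00020b9f\u308a\u4ed8\u3051\u308b".encode("utf-16-le")

# Naive: treat every 16-bit unit as its own character (leaves lone surrogates).
units = struct.unpack(f"<{len(raw) // 2}H", raw)
naive = "".join(chr(u) for u in units)

# Better: let the codec interpret the whole buffer as UTF-16.
decoded = raw.decode("utf-16-le")

print(len(naive), len(decoded))
```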

@ilius
Owner

ilius commented Jun 5, 2023

@soshial Please rebase and add a pull request.

soshial added a commit to soshial/pyglossary that referenced this issue Jun 5, 2023
@soshial
Contributor

soshial commented Jun 5, 2023

Done. I tested on different types of AppleDict and it looks okay, but I didn't find tests for AppleDict format variants.

@ilius
Owner

ilius commented Jun 10, 2023

Please always open a new issue unless you are sure it's the same bug.
It's hard to find and follow closed issues.
