AppleDict binary crash with Japanese characters: lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range #473

whabbot · 2023-06-03T17:04:13Z

When attempting to convert the Sanseido Super Daijirin.dictionary file included in macOS (13.3.1) to CSV and SQL, conversion fails with lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2. This seems similar to #275.

Versions:
Python 3.9.6 (error log included) and 3.11.3
pyglossary 4.6.1
lxml 4.9.2

Error log (full output attached):

Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 423, in convertEntryBytesToXml
    entryRoot = etree.fromstring(entryFull)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 826, in _write
    self._writeEntries(writerList, filename)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 760, in _writeEntries
    for entry in self:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 321, in _readersEntryGen
    yield from self._applyEntryFiltersGen(reader)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 335, in _applyEntryFiltersGen
    for entry in gen:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 623, in __iter__
    entry = self.createEntry(entryBytes, articleAddress)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 371, in createEntry
    entryRoot = self.convertEntryBytesToXml(entryBytes)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 426, in convertEntryBytesToXml
    f"len(buf)={len(self._buf)}, {entryFull=}",
                    ^^^^^^^^^
AttributeError: 'Reader' object has no attribute '_buf'

error_log.txt

ilius · 2023-06-04T09:55:34Z

No, it's different from #275.
It seems this glossary's XML includes a zero byte.

ilius · 2023-06-04T11:56:40Z

I pushed a commit.
Please try the latest code:

pip install git+https://github.com/ilius/pyglossary

whabbot · 2023-06-04T12:07:00Z

Thanks for your fast response! I ran that and it seemed to give the same error but with a slightly different message (full log attached below):

[ERROR] entryFull='<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="7144" d:title="𩺊" class="entry" lang="ja"><span class="hg x_xh0"><span d:prn="1" role="text" class="hw">あら <d:prn></d:prn></span><span class="pr t_アクセントG"><span class="xrg"><span class="xr"><a href="x-dictionary:r:fbm_AccentPatterns" title="アクセントの型"><span class="ph t_アクセント x_rr">2</span></a></span></span></span><span class="fg">【<span class="f"><span class="general-text">𩺊</span></span>】</span></span><span class="sg"><span class="se1 x_xd0"><span class="msDict x_xd1 t_core"><span d:def="1" role="text" class="df">スズキ目の海魚。全長1<span class="general-text">メートル</span>に達する。体形はスズキに似て，やや長く側扁し，口はとがって大きい。背は灰褐色で腹は白色。幼魚には口から尾に至る灰褐色の縦帯がある。冬が旬で美味。北海道以南からフィリピンまでのやや深海に分布。ホタ。スズキ。<d:def></d:def></span></span></span></span></d:entry>'
[ERROR] Exception while calling plugin's write function
Traceback (most recent call last):
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 903, in _write
    self._writeEntries(writerList, filename)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 837, in _writeEntries
    for entry in self:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 393, in _readersEntryGen
    yield from self._applyEntryFiltersGen(reader)
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/glossary_v2.py", line 407, in _applyEntryFiltersGen
    for entry in gen:
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 696, in __iter__
    entry = self.createEntry(entryBytes, articleAddress)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 399, in createEntry
    entryRoot = self.convertEntryBytesToXml(entryBytes)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 464, in convertEntryBytesToXml
    raise e
  File "/Users/user/projects/apple_dict/pyglossary/.env/lib/python3.11/site-packages/pyglossary/plugins/appledict_bin/__init__.py", line 459, in convertEntryBytesToXml
    entryRoot = etree.fromstring(entryFull)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

error_log_2.txt

EDIT:
I also ran it with the Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary file and got the same error: error_log_3.txt

soshial · 2023-06-04T14:39:20Z

I printed entryBytes inside the exception handler in convertEntryBytesToXml(), and it didn't contain any 0x0 bytes:

b'<d:entry xmlns:d="http://www.apple.com/DTDs/DictionaryService-1.0.rng" id="7144" d:title="\xf0\xa9\xba\x8a" class="entry" lang="ja">\xe3\x81\x82\xe3\x82\x89 <d:prn></d:prn><a href="x-dictionary:r:fbm_AccentPatterns" title="\xe3\x82\xa2\xe3\x82\xaf\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x88\xe3\x81\xae\xe5\x9e\x8b">2</a>\xe3\x80\x90\xf0\xa9\xba\x8a\xe3\x80\x91\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe7\x9b\xae\xe3\x81\xae\xe6\xb5\xb7\xe9\xad\x9a\xe3\x80\x82\xe5\x85\xa8\xe9\x95\xb71\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\x88\xe3\x83\xab\xe3\x81\xab\xe9\x81\x94\xe3\x81\x99\xe3\x82\x8b\xe3\x80\x82\xe4\xbd\x93\xe5\xbd\xa2\xe3\x81\xaf\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe3\x81\xab\xe4\xbc\xbc\xe3\x81\xa6\xef\xbc\x8c\xe3\x82\x84\xe3\x82\x84\xe9\x95\xb7\xe3\x81\x8f\xe5\x81\xb4\xe6\x89\x81\xe3\x81\x97\xef\xbc\x8c\xe5\x8f\xa3\xe3\x81\xaf\xe3\x81\xa8\xe3\x81\x8c\xe3\x81\xa3\xe3\x81\xa6\xe5\xa4\xa7\xe3\x81\x8d\xe3\x81\x84\xe3\x80\x82\xe8\x83\x8c\xe3\x81\xaf\xe7\x81\xb0\xe8\xa4\x90\xe8\x89\xb2\xe3\x81\xa7\xe8\x85\xb9\xe3\x81\xaf\xe7\x99\xbd\xe8\x89\xb2\xe3\x80\x82\xe5\xb9\xbc\xe9\xad\x9a\xe3\x81\xab\xe3\x81\xaf\xe5\x8f\xa3\xe3\x81\x8b\xe3\x82\x89\xe5\xb0\xbe\xe3\x81\xab\xe8\x87\xb3\xe3\x82\x8b\xe7\x81\xb0\xe8\xa4\x90\xe8\x89\xb2\xe3\x81\xae\xe7\xb8\xa6\xe5\xb8\xaf\xe3\x81\x8c\xe3\x81\x82\xe3\x82\x8b\xe3\x80\x82\xe5\x86\xac\xe3\x81\x8c\xe6\x97\xac\xe3\x81\xa7\xe7\xbe\x8e\xe5\x91\xb3\xe3\x80\x82\xe5\x8c\x97\xe6\xb5\xb7\xe9\x81\x93\xe4\xbb\xa5\xe5\x8d\x97\xe3\x81\x8b\xe3\x82\x89\xe3\x83\x95\xe3\x82\xa3\xe3\x83\xaa\xe3\x83\x94\xe3\x83\xb3\xe3\x81\xbe\xe3\x81\xa7\xe3\x81\xae\xe3\x82\x84\xe3\x82\x84\xe6\xb7\xb1\xe6\xb5\xb7\xe3\x81\xab\xe5\x88\x86\xe5\xb8\x83\xe3\x80\x82\xe3\x83\x9b\xe3\x82\xbf\xe3\x80\x82\xe3\x82\xb9\xe3\x82\xba\xe3\x82\xad\xe3\x80\x82<d:def></d:def></d:entry>\n'

It looks perfectly normal. Hmm... so why is this exception raised?
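The check described above can be reproduced with a short sketch. The entry string here is a hypothetical, shortened stand-in for the bytes printed above. The second half is only one possible reading of "Char 0x0 ... column 2" and is not confirmed anywhere in this thread: wider encodings of the same text put a zero byte directly after the leading "<".

```python
# Hypothetical reduced entry; the real one is the bytes literal printed above.
entry_bytes = '<d:entry d:title="\U00029e8a">\u3042\u3089</d:entry>\n'.encode("utf-8")

# UTF-8 only produces a zero byte for the character U+0000 itself.
has_nul_utf8 = b"\x00" in entry_bytes

# Wider encodings pad ASCII characters such as "<" with zero bytes.
entry_text = entry_bytes.decode("utf-8")
utf16_bytes = entry_text.encode("utf-16-le")
utf32_bytes = entry_text.encode("utf-32-le")

print(has_nul_utf8)     # False
print(utf16_bytes[:2])  # b'<\x00'
print(utf32_bytes[:4])  # b'<\x00\x00\x00'
```

If a parser were handed one of the wide buffers but read it as a single-byte encoding, the second byte it saw would be 0x0. Whether that is what happens inside lxml here is an assumption.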

whabbot · 2023-06-04T14:58:56Z

Looking into it a bit more, I think it may be an issue with etree.fromstring and characters outside the Basic Multilingual Plane.

With Sanseido Super Daijirin.dictionary, the failed entry contains the character 𩺊 (\U00029e8a), which takes 4 bytes in UTF-8, and with Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary the failed entry contains 𠮟 (\U00020b9f), which is also a 4-byte character.
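A minimal sketch to confirm the sizes of the two characters named above. The helper name describe is made up for illustration; the code points are the ones quoted in this comment.

```python
def describe(ch: str) -> tuple[int, int, int]:
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    code_point = ord(ch)
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return code_point, utf8_bytes, utf16_units


for ch in ("\U00029e8a", "\U00020b9f"):
    code_point, utf8_bytes, utf16_units = describe(ch)
    print(f"U+{code_point:04X} utf8_bytes={utf8_bytes} utf16_units={utf16_units}")
# U+29E8A utf8_bytes=4 utf16_units=2
# U+20B9F utf8_bytes=4 utf16_units=2
```

Both characters lie above U+FFFF, so UTF-16 needs a surrogate pair (two code units) for each of them.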

Running just this in Python gives the same error:

>>> from lxml import etree
>>> etree.fromstring('<root>𩺊</root>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

soshial · 2023-06-04T16:19:47Z

This looks like a bug in lxml. Parsing succeeds with entryRoot = etree.parse(BytesIO(entryBytes)), but then I get another error: AttributeError: 'lxml.etree._ElementTree' object has no attribute 'nsmap'
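The AttributeError is consistent with etree.parse returning a tree wrapper rather than the root element. A minimal sketch of the difference follows, using a shortened hypothetical entry. The fallback import is only there so the sketch also runs without lxml; nsmap exists only on lxml elements.

```python
from io import BytesIO

try:
    from lxml import etree  # the library used in this thread
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical shortened entry, encoded as UTF-8 bytes.
entry_bytes = '<entry id="7144" title="test">\U00029e8a</entry>\n'.encode("utf-8")

tree = etree.parse(BytesIO(entry_bytes))  # a tree wrapper, not an element
root = tree.getroot()                     # the root element itself

print(root.tag)        # entry
print(root.get("id"))  # 7144
# nsmap is an lxml-only element attribute; the stdlib fallback has none.
print(getattr(root, "nsmap", None))
```

Code that expects the result of etree.fromstring (an element) would need tree.getroot() when switching to etree.parse.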

ilius · 2023-06-04T19:26:58Z

etree.parse is not the right function.

It works for me with lxml 4.9.2:

>>> etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'))
'<entry id="7144" title="test">𩺊</entry>'

Maybe because my lxml is installed from Debian's repository.
@whabbot How did you install lxml? Did you run pip and it compiled it?

Can you try this in Python's console?

from lxml import etree

etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'.encode('utf-16')))

soshial · 2023-06-04T19:58:44Z

My versions (on macOS):

lxml 4.9.2
Python 3.9.6 (default, Mar 10 2023, 20:16:38) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin

etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n') raises the error.

etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'.encode(encoding='utf-16')) raises no error.
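The observation above suggests a workaround: hand the parser encoded bytes instead of a str. The sketch below is an illustration under that assumption, not necessarily what the later commits do. parse_entry is a hypothetical helper, and UTF-8 as the default encoding is an assumption (the thread itself only reports UTF-16 working).

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def parse_entry(xml_text: str, encoding: str = "utf-8"):
    """Parse XML text by passing encoded bytes to the parser instead of a str.

    The thread reports that encoding as 'utf-16' avoids the error;
    'utf-8' as the default is an assumption of this sketch.
    """
    return etree.fromstring(xml_text.encode(encoding))


root = parse_entry('<entry id="7144" title="test">\U00029e8a</entry>\n')
print(root.text == "\U00029e8a")  # True
```

With bytes input the parser decodes the document itself, so the in-memory representation of the Python str no longer matters.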


ilius · 2023-06-04T20:48:42Z

I pushed to the issue-473 branch.
Please try again with this:

pip install git+https://github.com/ilius/pyglossary@issue-473

whabbot · 2023-06-04T22:57:47Z

I installed using pip install git+https://github.com/ilius/pyglossary@issue-473 and now Sanseido The WISDOM English-Japanese Japanese-English Dictionary.dictionary converts without issue.

Sanseido Super Daijirin.dictionary completes without crashing and it seems most if not all entries are there, but there are a lot of errors (>500) being printed to the console, which all look like:

[ERROR] bad unicode in '\ud842\udf9fり付けけり', keyData={'priority': 0, 'parentalControl': 0, 'keyword': '\ud842\udf9fり付けけり', 'headword': '\ud842\udf9fり付ける', 'entryTitle': '', 'anchor': ''}, 
error: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

Full output: output.txt
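In the message above, '\ud842\udf9f' is the UTF-16 surrogate pair for 𠮟 (U+20B9F) stored as two separate characters, which UTF-8 refuses to encode. A minimal sketch of one way to join such pairs back into real code points; join_surrogates is a hypothetical helper, not a function from pyglossary.

```python
def join_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs left in a str into real code points.

    An unpaired surrogate would make the final decode raise UnicodeDecodeError.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


broken = "\ud842\udf9f\u308a\u4ed8\u3051\u308b"  # headword from the log above
fixed = join_surrogates(broken)

print(fixed == "\U00020b9f\u308a\u4ed8\u3051\u308b")  # True
fixed.encode("utf-8")  # no longer raises UnicodeEncodeError
```

This repairs a string after the fact; decoding the original bytes as UTF-16 in the first place avoids the problem entirely.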

ilius · 2023-06-04T23:57:24Z

I pushed a commit.
Please try again.

whabbot · 2023-06-05T00:47:04Z

That works without any errors! Thank you.

ilius · 2023-06-05T08:33:55Z

I merged into master.
Thanks.

soshial · 2023-06-05T08:47:23Z

I have created a slightly more elegant solution: soshial@81c8ecc. Words in the KeyText.data file are actually encoded as UTF-16 byte strings.
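To illustrate the point about UTF-16, here is a sketch contrasting two ways of turning such bytes into text. Both function names are hypothetical, little-endian byte order is an assumption, and the per-unit variant is shown only as one way the earlier "bad unicode" strings could arise, not as a quote of the plugin's code.

```python
def decode_per_unit(raw: bytes) -> str:
    """Naive approach: treat every 2-byte unit as its own character."""
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    return "".join(chr(unit) for unit in units)


def decode_utf16(raw: bytes) -> str:
    """Decode the whole buffer as UTF-16 (little-endian assumed here)."""
    return raw.decode("utf-16-le")


# Sample key text: U+20B9F followed by a BMP character.
raw = "\U00020b9f\u308b".encode("utf-16-le")

print(decode_per_unit(raw) == "\ud842\udf9f\u308b")  # True: surrogates stay split
print(decode_utf16(raw) == "\U00020b9f\u308b")       # True: one real code point
```

Only the second result can be encoded to UTF-8 afterwards.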

ilius · 2023-06-05T10:28:27Z

@soshial Please rebase and add a pull request.

soshial · 2023-06-05T11:37:57Z

Done. I tested on different types of AppleDict and it looks okay, but I didn't find tests for AppleDict format variants.

ilius · 2023-06-10T16:34:48Z

Please always open a new issue unless you are sure it's the same bug.
It's hard to find and follow closed issues.

ilius added a commit that referenced this issue Jun 4, 2023

appledict binary: remove possible zero byte from entryBytes, #473

1bfc973

ilius added a commit that referenced this issue Jun 4, 2023

Revert "appledict binary: remove possible zero byte from entryBytes, #473"

109db7e

This reverts commit 1bfc973.

ilius added a commit that referenced this issue Jun 4, 2023

appledict binary: fix possible XMLSyntaxError with wide unicode chars, #473

76d2516

ilius added a commit that referenced this issue Jun 4, 2023

appledict binary: fix wide unicode chars in readKeyTextData, #473

a330437

ilius added a commit that referenced this issue Jun 5, 2023

appledict binary: fix possible XMLSyntaxError with wide unicode chars, #473

6f871b7

ilius added a commit that referenced this issue Jun 5, 2023

appledict binary: fix wide unicode chars in readKeyTextData, #473

a7bbbce

ilius closed this as completed Jun 5, 2023

soshial added a commit to soshial/pyglossary that referenced this issue Jun 5, 2023

AppleDict-bin: support UTF16 4-byte symbols ilius#473

cbc871d

ilius pushed a commit that referenced this issue Jun 5, 2023

AppleDict-bin: support UTF16 4-byte symbols #473

e64af6a

ilius added the Bug label Jun 9, 2023
