AppleDict binary crash with Japanese characters: lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range #473
Comments
No, it's different than #275. |
I pushed a commit.
|
Thanks for your fast response! I ran that and it seemed to give the same error but with a slightly different message (full log attached below):
EDIT: |
I printed out
Looks perfectly normal. Hmm... why does it fire this exception? |
Looking into it a bit more, I think it may be an issue with With Running just this in Python gives the same error:
|
There is some lxml library bug, but etree creation is done successfully with |
It works for me with
>>> etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'))
'<entry id="7144" title="test">𩺊</entry>'
Maybe because my lxml is installed from Debian's repository. Can you try this in Python's console?
from lxml import etree
etree.tounicode(etree.fromstring('<entry id="7144" title="test">𩺊</entry>\n'.encode('utf-16'))) |
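For context on what the console test above passes to lxml, here is a minimal stdlib-only sketch that inspects the bytes produced by `.encode('utf-16')`. It does not call lxml and does not reproduce the crash; the variable names are illustrative only.

```python
import codecs

xml_text = '<entry id="7144" title="test">𩺊</entry>\n'
char = "𩺊"

# Python's "utf-16" codec prepends a byte-order mark (BOM) in the
# platform's native byte order.
encoded = xml_text.encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The test character lies outside the Basic Multilingual Plane, so UTF-16
# stores it as a surrogate pair (two 16-bit code units, four bytes).
is_astral = ord(char) > 0xFFFF
utf16_units = len(char.encode("utf-16-le")) // 2

# ASCII characters take two bytes in UTF-16, one of which is 0x00.
lt_bytes = "<".encode("utf-16-le")

# Decoding with the same codec restores the original text.
round_trip = encoded.decode("utf-16")

print(has_bom, is_astral, utf16_units, lt_bytes, round_trip == xml_text)
```

The 0x00 bytes matter only if UTF-16 data is later interpreted with a single-byte or UTF-8 codec; whether that happens anywhere in this code path is not established by the sketch.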
My versions are macOS:
|
I pushed to branch
pip install git+https://github.com/ilius/pyglossary@issue-473 |
I installed using
Full output: output.txt |
I pushed a commit. |
That works without any errors! Thank you. |
I merged into master. |
I have created a slightly more elegant solution: soshial@81c8ecc. Words in the KeyText.data file are actually encoded as UTF-16 byte strings. |
@soshial Please rebase and add a pull request. |
Done. I tested on different types of AppleDict and it looks okay, but I didn't find tests for AppleDict format variants. |
Please always open a new issue unless you are sure it's the same bug. |
When attempting to convert the Sanseido Super Daijirin.dictionary file included in macOS (13.3.1) to CSV and SQL, there is an error: lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2. This seems similar to #275.
Versions:
Python 3.9.6 (error log included) and 3.11.3
pyglossary 4.6.1
lxml 4.9.2
Error log (full output attached):
error_log.txt