Invalid xmlChar value 55357, line 49, column 101 #1

Open
kmanwar89 opened this issue Jan 13, 2018 · 4 comments

Comments

@kmanwar89

kmanwar89 commented Jan 13, 2018

Hi there,

Ran your script and received the following error text as output:

XXXXXXX@ubuntu:~/Desktop/Untitled Folder/smsxml2html-master$ python smsxml2html.py -o ~/Desktop -n 11111111111 sms-20171119175633.xml
Traceback (most recent call last):
  File "smsxml2html.py", line 241, in <module>
    main()
  File "smsxml2html.py", line 223, in main
    tree = etree.parse(input, parser = lxml_parser)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:85131)
  File "src/lxml/parser.pxi", line 1782, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:124005)
  File "src/lxml/parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:124374)
  File "src/lxml/parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:123169)
  File "src/lxml/parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:117533)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 55357, line 49, column 101

Any help would be appreciated. I ran this on an Ubuntu 16.04.1 VM and received the same output whether using Python 2.X or 3.X.

@KermMartian
Owner

Thanks! It sounds like your input XML contains https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%98%84 (the 😄 character), and I'm not handling that properly. Can you please view the XML in a Unicode-aware text editor, check line 49, column 101, and confirm that that's what it contains?
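If no Unicode-aware editor is at hand, a few lines of Python can show what sits at the reported position. This is only a sketch: the helper name `char_at` and the sample text are made up for illustration, and the column reported by the parser may be off by a few characters from a character-based count, which is why some surrounding context is returned as well.

```python
def char_at(text, line, column, context=10):
    """Return the character at a 1-based (line, column) plus nearby context."""
    row = text.splitlines()[line - 1]
    idx = column - 1
    start = max(0, idx - context)
    return row[idx], row[start:idx + context]


# Hypothetical two-line sample; a real run would read the backup file instead,
# e.g. open(path, encoding='utf-8').read()
sample = "first line\nab&#55357;cd"
char, around = char_at(sample, 2, 3)
print(repr(char), repr(around))
```

For the file from the report, the call would be `char_at(text, 49, 101)`.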

@T2Fr
Contributor

T2Fr commented Aug 14, 2018

Seems to be working with lxml_parser = XMLParser(huge_tree = True, recover = True)

@akobel

akobel commented Aug 22, 2018

Had a pile of poo here, literally: 💩 (which, apparently like anything including surrogate pairs, causes trouble). It was encoded by &#55357;&#56489;. Other than that, some more smileys.

The following piece of code made it work for me:

import re
from io import BytesIO
from lxml import etree
from lxml.etree import XMLParser

lxml_parser = XMLParser(huge_tree = True)

# `input` is the path of the XML file, as in main()
payload = open (input, 'rb').read().decode ('utf-8')
# turn decimal character references into the characters they stand for
chrifnotspecial = lambda dec: '&#%d;' % dec if dec in [ 10, 13, 35, 38, 59, 60, 62 ] else chr (dec) # don't convert `\n\r#&;<>`
payload = re.sub (r'&#(\d+);', lambda x: chrifnotspecial (int (x.group(1))), payload)
# No idea why 'INFORMATION SEPARATOR's ended up in some messages,
# but I decide that I don't need them, and they make the parser barf out...
for dec in [ 28, 29, 30, 31 ]:
    payload = payload.replace (chr (dec), '')

# combine surrogate pairs
payload = payload.encode ('utf-16', 'surrogatepass').decode ('utf-16')
payload = payload.encode ('utf-8')

tree = etree.parse (BytesIO (payload), parser = lxml_parser)
root = tree.getroot()

Obviously, this should replace the existing code in main() for reading the input file.

This assumes that you are using Python 3 (required for chr() with inputs > 255; unichr() of Python 2 should do the job as well, but I didn't test). smsxml2html is almost Python3-compatible except for two minor parts: You have to replace the two occurrences of iteritems with items, and msg.text.encode('utf8') with msg.text (or msg.text.strip(), possibly, if you preserve whitespace, but want to drop superfluous spaces at beginning and end of a message). If encoding='utf-8' is given as an additional argument for open(output_path, 'w'), I guess that this should even be fully backwards compatible.

I'm not an expert on Unicode representation by any means. IIUC, the encoding used in those XML files is more UTF-16-ish than UTF-8-ish, and "normal" (more clever) means to convert (like passing the XML to BeautifulSoup) fail because an entity-wise conversion yields invalid results in some intermediate stage (since &#55357; does not translate to a character, but a "surrogate code point", which is more like a modifier for the next character). I might well be totally wrong here; almost my entire wisdom is based on a StackOverflow post concerning &#55357; and some pile of poo reference.
Anyway, my understanding of what I do is: convert each entity independently and blindly, then convert to UTF-16 while leaving those surrogate pairs alone, then read and interpret them, and then encode again to the more well-received (at least to me) UTF-8.
By the way, no clue why apparently I need to go via the BytesIO, but this works. Using etree.fromstring() instead of etree.parse() did not (although, AFAICS, this should do the same after removing the encoding tag in line 1 of the XML?)...
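The surrogate-combining step can be tried in isolation with nothing but the standard library. The sketch below assumes Python 3; the helper names `combine_surrogates` and `surrogate_pair_to_code_point` are made up for illustration and are not part of smsxml2html.

```python
def combine_surrogates(text):
    """Merge UTF-16 surrogate pairs into single code points.

    'surrogatepass' lets the lone surrogates through on encoding;
    the plain UTF-16 decoder then reads each pair as one character.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


def surrogate_pair_to_code_point(high, low):
    """The arithmetic behind the merge: each surrogate carries 10 bits."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Converting &#55357;&#56489; entity by entity yields two lone surrogates
raw = chr(55357) + chr(56489)
merged = combine_surrogates(raw)

print(len(raw), len(merged))                             # 2 1
print(hex(surrogate_pair_to_code_point(55357, 56489)))   # 0x1f4a9
```

U+1F4A9 is the pile of poo, so the two entities together describe one character, which is why converting them one at a time cannot give a valid result on its own.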

Caveat: this is a pretty brute-force-ish hands-on approach. Worked for me; YMMV.
I found some evidence that, apparently, funny characters in XML attributes are not really covered by the XML standard, although it seems that XML 1.1 relaxed it somewhat. In any case, the files produced by SMS Backup & Restore seem to not strictly obey the standard in all cases.
This is pretty much "best-effort recovery", with lowest effort for me.

Note that I preserve, among others, the &#10; encoding for line breaks; a literal newline inside an attribute value would be silently converted to a normal space by the lxml parser. I like to have white-space: pre-wrap; in the CSS for .month_convos td; I appreciate if my conversation partners spend the effort to type line breaks, so who am I to drop them in the archives?
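The difference between a literal newline and &#10; in an attribute can be seen with a small experiment. This sketch uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made to keep it dependency-free; both follow the XML rule for attribute-value normalization), and the `sms` element with its `body` attribute is only an illustrative sample.

```python
import xml.etree.ElementTree as ET

# Literal newline inside an attribute value: the parser normalizes it to a space.
literal = ET.fromstring('<sms body="line one\nline two"/>')

# The same newline written as a character reference: it survives as "\n".
escaped = ET.fromstring('<sms body="line one&#10;line two"/>')

print(repr(literal.get('body')))
print(repr(escaped.get('body')))
```

So expanding &#10; to a real newline before parsing would lose the line break, while leaving the reference in place keeps it.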

@akobel

akobel commented Aug 22, 2018

By the way, @T2Fr: recover = True "Seems to be working" - but, unfortunately, at the expense of ignoring the character. These days, that can mean "discarding the message", which would be 💩☹... 😄
