
lxml doesn’t like control characters #96

Open · SimonSapin opened this issue Jul 22, 2013 · 25 comments

@SimonSapin (Contributor)

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.

Each of these triggers the exception below:

html5lib.parse('<p>&#1;', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="&#1;">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

DataLossWarning: Text cannot contain U+000C

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
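
For comparison, a quick sketch of what libxml2's own HTML parser does when driven through lxml.html (based on the behaviour described above, the control character should simply disappear from the parsed text):

import lxml.html

# libxml2's HTML parser does not raise here; it drops the control character.
doc = lxml.html.fromstring('<p>a\x01b</p>')
print(repr(doc.text_content()))  # expected: 'ab'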

@EmilStenstrom

Here's a workaround for anyone who needs to get things working before this bug is fixed. Just run this code over the HTML before sending it to html5lib:

# Python 2 version (unichr and ur"" literals); a Python 3 port appears further down.
import re

def remove_control_characters(html):
    def str_to_int(s, default, base=10):
        # Decode a numeric reference to its character if it fits in the BMP,
        # otherwise leave the original reference text untouched.
        if int(s, base) < 0x10000:
            return unichr(int(s, base))
        return default
    # Decode decimal and hexadecimal character references, then strip the C0
    # control characters (and DEL) that lxml refuses to accept.
    html = re.sub(ur"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(ur"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    html = re.sub(ur"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
    return html

lastorset pushed a commit to lastorset/html5lib-python that referenced this issue Jun 6, 2014
@lastorset

Pull request: #162

@bradleyayers

@SimonSapin these characters are not valid HTML, see https://en.wikipedia.org/wiki/Character_encodings_in_HTML#Illegal_characters

@SimonSapin (Contributor, Author)

@bradleyayers To support that claim, Wikipedia links to a document named "SGML Declaration of HTML 4", published in 1999. The relevant specification is https://whatwg.org/html. Also, what do you mean exactly by "valid"? Conformance requirements are different for authors and implementations.

@bradleyayers

By "valid" I mean "able to be represented in HTML". I'm saying it's not possible to represent U+0001 in HTML.

It can't be represented by numeric character references (see https://html.spec.whatwg.org/multipage/syntax.html#character-references):

The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.

Nor can it be represented by encoding it literally in the input stream (see https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream):

Any occurrences of any characters in the ranges U+0001 to U+0008, …snip… are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).

@gsnedders (Member)

Regarding the reference for numeric character references, note the start of that section:

This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").

The actual parsing of a character reference (https://html.spec.whatwg.org/multipage/syntax.html#tokenizing-character-references) says:

Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.

Follow the cross-reference for parse error:

Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

If you follow the error handling, note those characters are never replaced by anything else, and hence they end up in the DOM. The same is true for the ranges in the pre-processing.
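
A small sketch that makes this visible with the default xml.etree-based treebuilder (which has no XML-level validation, so nothing rejects the character):

import html5lib

# The tokenizer records a parse error for U+0001 but, per the error handling
# above, does not replace it, so it ends up in the tree untouched.
doc = html5lib.parse('<p>\x01</p>')
p = doc.find('.//{http://www.w3.org/1999/xhtml}p')
print(repr(p.text))  # expected: '\x01'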

@EmilStenstrom

@gsnedders Do I understand this correctly that this is not a bug in html5lib (which just lets these characters through according to spec), but in lxml which does not expect those characters?

@SimonSapin (Contributor, Author)

@EmilStenstrom I’d say that’s debatable. https://html.spec.whatwg.org/multipage/#coercing-an-html-dom-into-an-infoset describes how to map to a more restricted XML API. The problem is raising exceptions rather than doing this coercion.

@EmilStenstrom

@SimonSapin Doesn't the exception come from lxml rather than html5lib?

@SimonSapin (Contributor, Author)

It does, but html5lib should munge the data to avoid triggering this exception.

@willkg (Contributor) commented Nov 10, 2017

I've got part of a PR done. As I understand it, we need to replace bad characters as follows:

  1. in element names -> "Uxxxx"
  2. in attribute names -> "Uxxxx"
  3. in attribute values -> "\uFFFD"
  4. in text (character data) -> "\uFFFD"
  5. in comments -> "\uFFFD"

The code for doing this should be in html5lib/_ihatexml:InfosetFilter.

After that, the following should all work:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<pU1 id=foo></pU1>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p iU1d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

You can run that through python -m doctest.

Is that the behavior we're looking for to match correct HTML parsing?

@SimonSapin (Contributor, Author)

@willkg That looks correct, except that U + hexadecimal should be zero-padded to six digits according to https://html.spec.whatwg.org/multipage/parsing.html#coercing-an-html-dom-into-an-infoset, for example foo<bar becomes fooU00003Cbar.
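
A tiny sketch of that rule (not html5lib code, just the coercion step from the spec): a character that is not allowed in an XML name is replaced by "U" followed by its code point as six uppercase hex digits.

def escape_name_char(ch):
    # "Coercing an HTML DOM into an infoset": U + six-digit uppercase hex code point
    return "U%06X" % ord(ch)

print("foo" + escape_name_char("<") + "bar")  # fooU00003Cbar
print("p" + escape_name_char("\x01"))         # pU000001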

@willkg (Contributor) commented Nov 10, 2017

OK, so the fixed version of the doctest looks like this:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<pU00001 id=foo></pU00001>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p iU000001d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

For some reason, when I run that, the element item doesn't pass. Looks like something is converting U000001 back to \x001 after InfosetFilter has fixed things. Here's the output I'm getting:

$ python -m doctest code_96.py 
html5lib/_ihatexml.py:231: DataLossWarning: Text cannot contain U+000001
  warnings.warn("Text cannot contain U+%06X" % ord(badchar), DataLossWarning)
**********************************************************************
File "code_96.py", line 7, in code_96
Failed example:
    serialize(parse('<p� id="foo">', treebuilder='lxml'), tree='lxml')
Expected:
    u'<pU00001 id=foo></pU00001>'
Got:
    u'<p\x001 id=foo></p\x001>'
**********************************************************************
1 items had failures:
   1 of   6 in code_96
***Test Failed*** 1 failures.

I'll have to look into that more.

@willkg (Contributor) commented Nov 11, 2017

Aha! In InfosetFilter.fromXmlName, it runs the element name through replacementRegexp which converts the U00001 to \x001.

for item in set(self.replacementRegexp.findall(name)):

Assuming that's correct, we should be running attribute names through fromXmlName, too. Then we end up with this:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<p\x001 id=foo></pU00001>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p i\x001d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

@gsnedders I'm a bit fuzzy on the appropriate semantics for HTML -> XML -> HTML. Does that look correct?

@gsnedders (Member)

So there are many more complex cases, primarily those outside of the BMP, especially once you start worrying about narrow/wide Python builds.

There's also the big difference between XML 1.0 4th Edition and 5th Edition, which applies depending on the version of libxml2 that lxml is linked against (or compiled against? not sure which); it's described in E09 of https://www.w3.org/XML/xml-V10-4e-errata#E09 (IMO, this change should never have happened, but that ship has sailed). This notably changes the definition of Name.
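
For concreteness, the 5th Edition productions (quoted from https://www.w3.org/TR/xml/) are far more permissive than the 4th Edition ones, which enumerated allowed letters range by range:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

So the set of names that need "Uxxxxxx" escaping depends on which edition the installed libxml2 enforces.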

These challenges are why I've never actually fixed this, because while there are easy fixes for the easy cases, the underlying problem is much wider.

gsnedders self-assigned this Nov 13, 2017
@gsnedders (Member)

I'm trying to take another stab at this right now, fixing this generally.

@willkg (Contributor) commented Nov 13, 2017

@gsnedders If it helps, I threw my WIP in a branch here: https://github.com/willkg/html5lib-python/tree/96-control-characters

@willkg (Contributor) commented Nov 27, 2017

I'm going to bump this out of the 1.0 milestone.

@gsnedders If you can get to this before December 1st, I'm game for re-adding it.

willkg removed this from the 1.0 milestone Nov 27, 2017
@lpla commented May 7, 2020

Here's a workaround for anyone who needs to get things working before this bug is fixed. […]

In case anyone needs to use @EmilStenstrom's code in Python 3, I just ported it:

import re

def remove_control_characters(html):
    # This port operates on bytes: the patterns and replacements are byte
    # strings, so pass in the raw (undecoded) document.
    def str_to_int(s, default, base=10):
        if int(s, base) < 0x10000:
            return chr(int(s, base)).encode()
        return default
    html = re.sub(br"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(br"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    html = re.sub(br"[\x00-\x08\x0b\x0e-\x1f\x7f]", b"", html)
    return html
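
A quick usage sketch for the port above (hypothetical input): it works on bytes, so run it on the raw document before handing it to html5lib.

raw = b'<p id="&#1;">\x01ok</p>'
print(remove_control_characters(raw))  # b'<p id="">ok</p>'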

@EmilStenstrom commented May 7, 2020

@lpla Ours has evolved after slowly correcting errors while parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the current version we are using, compatible with both Python 2 (narrow and wide builds) and Python 3, and with type hints:

import re
import typing as t  # for the type comment below

def remove_control_characters(html):
    # type: (t.Text) -> t.Text
    """
    Strip invalid XML characters that `lxml` cannot parse.
    """
    # See: https://github.com/html5lib/html5lib-python/issues/96
    #
    # The XML 1.0 spec defines the valid character range as:
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    #
    # We can instead match the invalid characters by inverting that range into:
    # InvalidChar ::= #xb | #xc | #xFFFE | #xFFFF | [#x0-#x8] | [#xe-#x1F] | [#xD800-#xDFFF]
    #
    # Sources:
    # https://www.w3.org/TR/REC-xml/#charsets,
    # https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
    def strip_illegal_xml_characters(s, default, base=10):
        # Compare the "invalid XML character range" numerically
        n = int(s, base)
        if n in (0xb, 0xc, 0xFFFE, 0xFFFF) or 0x0 <= n <= 0x8 or 0xe <= n <= 0x1F or 0xD800 <= n <= 0xDFFF:
            return ""
        return default

    # We encode all non-ascii characters to XML char-refs, so for example "💖" becomes "&#128150;"
    # Otherwise we'd remove emojis by mistake on narrow-unicode builds of Python
    html = html.encode("ascii", "xmlcharrefreplace").decode("utf-8")
    html = re.sub(r"&#(\d+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0)), html)
    html = re.sub(r"&#[xX]([0-9a-fA-F]+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0), base=16), html)
    html = ILLEGAL_XML_CHARS_RE.sub("", html)
    return html


# A regex matching the "invalid XML character range"
ILLEGAL_XML_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]")
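
A usage sketch with a hypothetical input string, assuming the helpers above are defined as shown:

import html5lib

dirty = '<p id="&#1;">\x01hello \U0001F496</p>'
clean = remove_control_characters(dirty)
# clean == '<p id="">hello &#128150;</p>': the control character and the
# &#1; reference are stripped, the emoji survives as a character reference.
html5lib.parse(clean, treebuilder="lxml")  # no ValueError any more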

@lpla commented May 7, 2020

Any license for this code?

@EmilStenstrom

Any license for this code?

I hereby release it as public domain.

@gsnedders (Member)

At this point, the only Python release we support narrow builds on is 2.7; all versions of Py3 we support are always wide. This, to be fair, makes this a lot easier to fix, so we should probably take a stab at this soon.

@SimonSapin (Contributor, Author)

How do narrow vs. wide builds affect lxml/libxml2 being peculiar about control characters?

@gsnedders (Member) commented Jun 17, 2020

@SimonSapin They don't (the string is converted to UTF-8 before being passed to libxml2, IIRC), but they do affect our ability to detect which strings will trigger it, because on a narrow build we can't just iterate through a string and compare each value against the XML production, either ourselves or with a regex. The remaining complexity is whether libxml2 is enforcing XML 1.0 4e or 5e.
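
To illustrate what a narrow build does to that kind of check (illustration only; modern Pythons are all wide builds):

# On a narrow (UTF-16) Python 2 build an astral character is stored as a
# surrogate pair, so iterating code unit by code unit sees two surrogates
# rather than one valid code point:
#
#   >>> len(u'\U0001F496')   # narrow build
#   2
#   >>> len(u'\U0001F496')   # wide build / Python 3
#   1
#
# A character class such as [\uD800-\uDFFF] therefore matches perfectly
# valid astral characters on narrow builds, which is why EmilStenstrom's
# snippet above routes everything through xmlcharrefreplace before applying
# its regexes.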
