lxml serializer goes "past" individual elements #338

cjerdonek · 2017-04-30T10:12:38Z

I noticed that when serializing an individual element, the lxml serializer continues "past" the element. This causes inconsistent behavior between the "etree" and "lxml" tree types.

A minimal way to reproduce:

def render(html, tree_type):
    root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                             treebuilder=tree_type)
    # Get the <b> element.
    element = root[0][0]
    print(html5lib.serialize(element, tree=tree_type))

html = '<div><b>foo</b></div>'

render(html, 'etree')
render(html, 'lxml')

Outputs:

<b>foo</b>
<b>foo</b></div>

The text was updated successfully, but these errors were encountered:

cjerdonek · 2017-04-30T10:22:42Z

Also, lxml.etree and lxml.html do behave the right way:

tree_type = 'lxml'
html = '<div><b>foo</b></div>'
root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                         treebuilder=tree_type)
# Get the <b> element.
element = root[0][0]
print(html5lib.serialize(element, tree=tree_type))
print(lxml.etree.tostring(element, encoding='unicode'))
print(lxml.html.tostring(element, encoding='unicode'))

Outputs:

<b>foo</b></div>
<b>foo</b>
<b>foo</b>

cjerdonek · 2017-05-01T19:05:31Z

I'm guessing this issue has to do with the fact that etree_lxml.TreeWalker.__init__() wraps tree in a different object before passing it to base.NonRecursiveTreeWalker.__init__():

class TreeWalker(base.NonRecursiveTreeWalker):
    def __init__(self, tree):
        # pylint:disable=redefined-variable-type
        if isinstance(tree, list):
            self.fragmentChildren = set(tree)
            tree = FragmentRoot(tree)
        else:
            self.fragmentChildren = set()
            tree = Root(tree)
        base.NonRecursiveTreeWalker.__init__(self, tree)
        self.filter = _ihatexml.InfosetFilter()

NonRecursiveTreeWalker does an identity check on self.tree to decide whether to terminate:

if self.tree is currentNode:
    currentNode = None
    break

So the wrapping could be tripping up that logic.

indivisible · 2017-05-02T00:53:20Z

You're right, removing the Root wrapper line solves this problem and some others too.

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

gsnedders · 2017-05-02T15:10:16Z

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

IIRC, it's meant to ensure that the DOCTYPE gets included when we have a whole tree. We probably should just special-case getting an lxml.etree._ElementTree object for that.

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

As far as I can tell (from what we have tests for), it works fine for whole trees and for whole fragments. What it doesn't work for is arbitrary elements.

cjerdonek · 2017-05-02T16:37:24Z

For fragments, I'm getting that lxml.etree elements do work when passing treeType 'lxml' to getTreeWalker() used for the serializer (which is the case this issue was originally about), but it doesn't work when passing 'etree'. So it's the opposite of this issue.

Here's code to reproduce:

def render(element, treeType, implementation=None):
    walker_cls = html5lib.getTreeWalker(treeType, implementation=implementation)
    walker = walker_cls(element)
    serializer = HTMLSerializer()
    print(serializer.render(walker))

elements = html5lib.parseFragment('<i>a</i><i>b</i>', treebuilder='lxml')

# Works.
render(elements, 'lxml')
# Both of these fail with the exception below.
try:
    render(elements, 'etree')
except Exception as err:
    print(err)
render(elements, 'etree', implementation=lxml.etree)

which outputs:

<i>a</i><i>b</i>
'list' object has no attribute 'getroot'
Traceback (most recent call last):
  File "test-html5lib.py", line 28, in <module>
    render(elements, 'etree', implementation=lxml.etree)
  File "test-html5lib.py", line 20, in render
    print(serializer.render(walker))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 323, in render
    return "".join(list(self.serialize(treewalker)))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 209, in serialize
    for token in treewalker:
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
    for previous, token, next in self.slider():
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
    for token in self.source:
  File "/.../python3.6/site-packages/html5lib/treewalkers/base.py", line 94, in __iter__
    details = self.getNodeDetails(currentNode)
  File "/.../python3.6/site-packages/html5lib/treewalkers/etree.py", line 48, in getNodeDetails
    node = node.getroot()
AttributeError: 'list' object has no attribute 'getroot'

Should I file a different issue?

indivisible · 2017-05-02T23:27:48Z

Actually now that I've looked at things a bit better wouldn't it be better to scrap the lxml specific stuff and replace it with the etree-API stuff? If I use TreeBuilder = html5lib.treebuilders.etree.getETreeBuilder(lxml.etree)['TreeBuilder'] and likewise for the TreeWalker everything starts working as expected.

Edit: actually better make that html5lib.treewalkers.getTreeWalker('etree', lxml.etree)

cjerdonek · 2017-05-02T23:37:53Z

Perhaps, but note that (at least as it stands now) the etree serializer has some of its own issues, like the one I mention in the comment before this one, and also this issue: #341

cjerdonek · 2017-05-02T23:40:15Z

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

indivisible · 2017-05-03T00:04:18Z

#341 doesn't look like a bug: the Entity class in lxml.etree is an lxml exclusive extension to the API.

cjerdonek · 2017-05-03T00:11:43Z

Okay, well then all the more reason not to scrap the lxml serializer. Something should be capable of serializing lxml.etree.Entity elements.

gsnedders · 2017-05-03T17:27:55Z

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

AFAIK, it was mostly down to API differences around things like Entity and maybe getroot and docinfo and the like. It might simply be unneeded on the tree walker side, though is definitely needed on the tree builder side (where lxml enforces element names etc. match the XML productions, though maybe we have even crazier stuff to undo that which IMO we shouldn't?). If it's practical, I'm not opposed to simply killing the lxml tree walker.

cjerdonek · 2017-05-03T17:54:47Z

If it's practical, I'm not opposed to simply killing the lxml tree walker.

I think it's important to keep around for the reason I mentioned in my comment just above -- namely, otherwise, there would be no way to serialize trees containing lxml.etree.Entity objects, for example (since #341 was closed).

indivisible · 2017-05-03T18:02:22Z

I've been looking into it, and the problems I've ran into so far are:

Comments: lxml does not support comments with '--' in them, and comments ending with '-'.
doctype: the etree TreeBuilder tries to put the doctype in an Element. In lxml it's stored in DocInfo, and can't be added to tree after it's created.

Still, some small changes enable the etree builder/walker to work. Adding Entity support to it wouldn't be hard either.

gsnedders added the serializer label Apr 30, 2017

cjerdonek mentioned this issue May 1, 2017

etree serializer fails when serializing lxml.html element containing a comment #340

Closed

cjerdonek mentioned this issue May 7, 2017

etree lxml.etree walker can't serialize full documents #345

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lxml serializer goes "past" individual elements #338

lxml serializer goes "past" individual elements #338

cjerdonek commented Apr 30, 2017

cjerdonek commented Apr 30, 2017

cjerdonek commented May 1, 2017

indivisible commented May 2, 2017

gsnedders commented May 2, 2017

cjerdonek commented May 2, 2017

indivisible commented May 2, 2017 •

edited

cjerdonek commented May 2, 2017

cjerdonek commented May 2, 2017

indivisible commented May 3, 2017

cjerdonek commented May 3, 2017 •

edited

gsnedders commented May 3, 2017

cjerdonek commented May 3, 2017

indivisible commented May 3, 2017

lxml serializer goes "past" individual elements #338

lxml serializer goes "past" individual elements #338

Comments

cjerdonek commented Apr 30, 2017

cjerdonek commented Apr 30, 2017

cjerdonek commented May 1, 2017

indivisible commented May 2, 2017

gsnedders commented May 2, 2017

cjerdonek commented May 2, 2017

indivisible commented May 2, 2017 • edited

cjerdonek commented May 2, 2017

cjerdonek commented May 2, 2017

indivisible commented May 3, 2017

cjerdonek commented May 3, 2017 • edited

gsnedders commented May 3, 2017

cjerdonek commented May 3, 2017

indivisible commented May 3, 2017

indivisible commented May 2, 2017 •

edited

cjerdonek commented May 3, 2017 •

edited