Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lxml serializer goes "past" individual elements #338

Open
cjerdonek opened this issue Apr 30, 2017 · 13 comments
Open

lxml serializer goes "past" individual elements #338

cjerdonek opened this issue Apr 30, 2017 · 13 comments

Comments

@cjerdonek
Copy link

I noticed that when serializing an individual element, the lxml serializer continues "past" the element. This causes inconsistent behavior between the "etree" and "lxml" tree types.

A minimal way to reproduce:

def render(html, tree_type):
    root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                             treebuilder=tree_type)
    # Get the <b> element.
    element = root[0][0]
    print(html5lib.serialize(element, tree=tree_type))

html = '<div><b>foo</b></div>'

render(html, 'etree')
render(html, 'lxml')

Outputs:

<b>foo</b>
<b>foo</b></div>
@cjerdonek
Copy link
Author

Also, lxml.etree and lxml.html do behave the right way:

tree_type = 'lxml'
html = '<div><b>foo</b></div>'
root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                         treebuilder=tree_type)
# Get the <b> element.
element = root[0][0]
print(html5lib.serialize(element, tree=tree_type))
print(lxml.etree.tostring(element, encoding='unicode'))
print(lxml.html.tostring(element, encoding='unicode'))

Outputs:

<b>foo</b></div>
<b>foo</b>
<b>foo</b>

@cjerdonek
Copy link
Author

I'm guessing this issue has to do with the fact that etree_lxml.TreeWalker.__init__() wraps tree in a different object before passing it to base.NonRecursiveTreeWalker.__init__():

class TreeWalker(base.NonRecursiveTreeWalker):
    def __init__(self, tree):
        # pylint:disable=redefined-variable-type
        if isinstance(tree, list):
            self.fragmentChildren = set(tree)
            tree = FragmentRoot(tree)
        else:
            self.fragmentChildren = set()
            tree = Root(tree)
        base.NonRecursiveTreeWalker.__init__(self, tree)
        self.filter = _ihatexml.InfosetFilter()

NonRecursiveTreeWalker does an identity check on self.tree to decide whether to terminate:

if self.tree is currentNode:
    currentNode = None
    break

So the wrapping could be tripping up that logic.

@indivisible
Copy link

You're right, removing the Root wrapper line solves this problem and some others too.

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

@gsnedders
Copy link
Member

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

IIRC, it's meant to ensure that the DOCTYPE gets included when we have a whole tree. We probably should just special-case getting an lxml.etree._ElementTree object for that.

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

As far as I can tell (from what we have tests for), it works fine for whole trees and for whole fragments. What it doesn't work for is arbitrary elements.

@cjerdonek
Copy link
Author

For fragments, I'm getting that lxml.etree elements do work when passing treeType 'lxml' to getTreeWalker() used for the serializer (which is the case this issue was originally about), but it doesn't work when passing 'etree'. So it's the opposite of this issue.

Here's code to reproduce:

def render(element, treeType, implementation=None):
    walker_cls = html5lib.getTreeWalker(treeType, implementation=implementation)
    walker = walker_cls(element)
    serializer = HTMLSerializer()
    print(serializer.render(walker))

elements = html5lib.parseFragment('<i>a</i><i>b</i>', treebuilder='lxml')

# Works.
render(elements, 'lxml')
# Both of these fail with the exception below.
try:
    render(elements, 'etree')
except Exception as err:
    print(err)
render(elements, 'etree', implementation=lxml.etree)

which outputs:

<i>a</i><i>b</i>
'list' object has no attribute 'getroot'
Traceback (most recent call last):
  File "test-html5lib.py", line 28, in <module>
    render(elements, 'etree', implementation=lxml.etree)
  File "test-html5lib.py", line 20, in render
    print(serializer.render(walker))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 323, in render
    return "".join(list(self.serialize(treewalker)))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 209, in serialize
    for token in treewalker:
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
    for previous, token, next in self.slider():
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
    for token in self.source:
  File "/.../python3.6/site-packages/html5lib/treewalkers/base.py", line 94, in __iter__
    details = self.getNodeDetails(currentNode)
  File "/.../python3.6/site-packages/html5lib/treewalkers/etree.py", line 48, in getNodeDetails
    node = node.getroot()
AttributeError: 'list' object has no attribute 'getroot'

Should I file a different issue?

@indivisible
Copy link

indivisible commented May 2, 2017

Actually now that I've looked at things a bit better wouldn't it be better to scrap the lxml specific stuff and replace it with the etree-API stuff? If I use TreeBuilder = html5lib.treebuilders.etree.getETreeBuilder(lxml.etree)['TreeBuilder'] and likewise for the TreeWalker everything starts working as expected.

Edit: actually better make that html5lib.treewalkers.getTreeWalker('etree', lxml.etree)

@cjerdonek
Copy link
Author

Perhaps, but note that (at least as it stands now) the etree serializer has some of its own issues, like the one I mention in the comment before this one, and also this issue: #341

@cjerdonek
Copy link
Author

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

@indivisible
Copy link

#341 doesn't look like a bug: the Entity class in lxml.etree is an lxml exclusive extension to the API.

@cjerdonek
Copy link
Author

cjerdonek commented May 3, 2017

Okay, well then all the more reason not to scrap the lxml serializer. Something should be capable of serializing lxml.etree.Entity elements.

@gsnedders
Copy link
Member

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

AFAIK, it was mostly down to API differences around things like Entity and maybe getroot and docinfo and the like. It might simply be unneeded on the tree walker side, though is definitely needed on the tree builder side (where lxml enforces element names etc. match the XML productions, though maybe we have even crazier stuff to undo that which IMO we shouldn't?). If it's practical, I'm not opposed to simply killing the lxml tree walker.

@cjerdonek
Copy link
Author

If it's practical, I'm not opposed to simply killing the lxml tree walker.

I think it's important to keep around for the reason I mentioned in my comment just above -- namely, otherwise, there would be no way to serialize trees containing lxml.etree.Entity objects, for example (since #341 was closed).

@indivisible
Copy link

I've been looking into it, and the problems I've ran into so far are:

  1. Comments: lxml does not support comments with '--' in them, and comments ending with '-'.

  2. doctype: the etree TreeBuilder tries to put the doctype in an Element. In lxml it's stored in DocInfo, and can't be added to tree after it's created.

Still, some small changes enable the etree builder/walker to work. Adding Entity support to it wouldn't be hard either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants