Fix inheritance in lxml.html #340

xmo-odoo · 2022-03-07T13:25:33Z

As the old comment / FIXME from 8132c75 notes, the mixin should come first for the inheritance to be correct (the left-most class is the first in the MRO, at least if no diamond inheritance is involved).

Also fix the odd super in HtmlMixin likely stemming from the incorrect MRO.

Fixes the inheritance order of all HTML* base classes though it probably doesn't matter for other than HtmlElement.

xmo-odoo · 2022-03-07T13:31:07Z

Also looking at HtmlElementClassLookup there seems to be some oddities with the * tags. The docstring explains that the lookup can be initialised with classes and/or mixins, and that mixins associated with * should be mixed into all element classes.

I've not tested it but if my reading is correct the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

But I don't know what the exact contract is supposed to be, or if that matters.

xmo-odoo · 2022-03-07T13:34:24Z

Also didn't find a contributing guide so not sure how forward- or back- ports might work (if they're done / supported for minor issues).

scoder · 2022-03-07T14:50:47Z

Thanks. That looks correct. Can't say why it wasn't done like this at the time.

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

Not sure what exactly you're referring to here. Do you mean elements created by instantiating element classes (rather than parsing)? That behaviour is documented on the element classes page.

Regarding backports, lxml tends to be mostly backwards compatible, so that rarely happens.

xmo-odoo · 2022-03-07T15:01:22Z

the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

Not sure what exactly you're referring to here. Do you mean elements created by instantiating element classes (rather than parsing)? That behaviour is documented on the element classes page.

No I mean that if we set up a parser with the lookup

HtmlElementClassLookup(mixins=[
    ('*', A),
    ('foo', B),
])

Then I think parsed foo elements will not get the A mixin, and neither will any element parsed which is not part of the _default_element_classes, at least according to my reading (as noted I haven't tested it yet, I was just reading through the code). I find that a bit odd but I don't know what the expectations of the lxml maintainers would be (and also I don't know how "public" HtmlElementClassLookup is supposed to be, it's unprefixed but it's not listed in __all__).

scoder · 2022-03-07T15:11:33Z

I also read it this way, that it's exclusive. It either matches a specific class or the fallback class. While it wouldn't necessarily have to be this way, one advantage is that it makes it possible to exclude certain tags from inheriting the default mixin. If the default was always applied, that would be difficult.

xmo-odoo · 2022-03-08T07:25:08Z

I also read it this way, that it's exclusive. It either matches a specific class or the fallback class. While it wouldn't necessarily have to be this way, one advantage is that it makes it possible to exclude certain tags from inheriting the default mixin. If the default was always applied, that would be difficult.

That interpretation could make some sense but it seems somewhat inconsistent as it seems to behave differently if the class being augmented is passed through classes or is one of the defaults, in that case both mixins will apply.

On the other hand the mixins and classes features don't seem used by the module itself when instantiating its parsers so if the object is not considered part of the public API (is just an internal feature) an other option would be to deprecate those things and leave HtmlElementClassLookup as just a registry of "specialised" classes (for form elements)?

Also coming back to

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Thanks. That looks correct. Can't say why it wasn't done like this at the time.

Maybe time or couldn't be arsed, that happens.

scoder · 2022-03-08T10:42:35Z

On the other hand the mixins and classes features don't seem used by the module itself when instantiating its parsers so if the object is not considered part of the public API (is just an internal feature) an other option would be to deprecate those things and leave HtmlElementClassLookup as just a registry of "specialised" classes (for form elements)?

Well, if it's not used much, then ISTM that we don't have a problem. :) At least not a problem that couldn't be solved with documentation rather than deprecation.

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Or rather that Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree, and that XML attribute setting on elements does what it should (there might be tests for that, at least). I mean, the mere fact that your changes broke no tests shows that there were tests missing.

I actually wonder if the set() method is still needed. It seems redundant now. Maybe it's just there to change the docstring? Might be, then we could still keep it.

xmo-odoo · 2022-03-08T12:06:11Z

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Or rather that Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree, and that XML attribute setting on elements does what it should (there might be tests for that, at least). I mean, the mere fact that your changes broke no tests shows that there were tests missing.

It might also be that the change doesn't break anything because it manipulates the MRO in ways which were not intentionally leveraged (and the old behaviour had to be worked around hence the attributes being re-set). But I see your point.

I actually wonder if the set() method is still needed. It seems redundant now. Maybe it's just there to change the docstring? Might be, then we could still keep it.

It is still useful for the docstring, as normally set doesn't accept a None, and also to add a default value so on an HtmlElement one can call e.g. e.set('disabled') and that's equivalent to <e disabled>, which is not supported by the XML API. Though I should probably check if there are tests because it looks like this behaviour is more magical than I'd expected, it's not just that (somehow) the element-creation accepts and casts None to empty strings for html elements.

scoder · 2022-03-08T17:44:16Z

Element.set(name, None) is actually allowed for HTML documents, whether you use lxml.html or just the HTML parser in lxml.etree. I added that to the docstring in lxml.etree. So it's only the default argument value that lxml.html adds, which seems helpful for HTML but not for lxml.etree.

xmo-odoo · 2022-03-09T07:03:41Z

Element.set(name, None) is actually allowed for HTML documents, whether you use lxml.html or just the HTML parser in lxml.etree.

Right, but without the override that's not visible through the API doc, because you get the information from _Element/ElementBase, which doesn't really know because it's a special-case based on the parser/document associated with the specific element (if I understand correctly).

I added that to the docstring in lxml.etree.

Which one? Or is it in an unreleased revision? For the currently published stuff all I see is

    set(self, key, value)
    
    Sets an element attribute.

xmo-odoo · 2022-03-09T11:00:47Z

@scoder I've one more question: I'm trying to write a test to ensure

Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree

however there seems to be a snag: resolve_entities does not exist on the HTMLParser and apparently it's an opt-out flag in libxml2, so I can't find a way to actually get an HtmlEntity out of a parse.

Also more of a propriety / cleanliness question: remove_pis and remove_flags are False by default, which is what I want, should I assume they remain that or should I use an explicit HTMLParser with those flags explicitely set to false, to ignore / discard changes to the default configuration?

scoder · 2022-03-09T11:19:39Z

I added that to the docstring in lxml.etree.

Which one? Or is it in an unreleased revision?

Yes, a fresh commit in 3bd8db7

As the old comment / FIXME from 8132c75 notes, the mixin should come first for the inheritance to be correct (the left-most class is the first in the MRO, at least if no diamond inheritance is involved). Also fix the odd `super` in `HtmlMixin` likely stemming from the incorrect MRO. Fixes the inheritance order of all `HTML*` base classes though it probably doesn't matter for other than `HtmlElement`.

xmo-odoo · 2022-03-09T13:28:30Z

Added a pair of tests to check for mixin features (the only one which seems really applicable is base_url so I checked that), as well as a few tests around the behaviour of HtmlElement.set as I couldn't find any invocation entirely leaving out the value (I also added an explicit None and an empty string for good measure).

Also updated the tox file to specifically allow make, otherwise some of the envs complain. It still fails locally on 3.8 and 3.10, possibly because everything's installed using pyenv (though that makes it unclear why it'd succeed for 3.5, 3.6, 3.7, and 3.9 which are also installed using pyenv).

src/lxml/html/tests/test_basic.py

scoder · 2022-05-17T08:22:40Z

Thanks!

scoder reviewed Mar 9, 2022

View reviewed changes

src/lxml/html/tests/test_basic.py Outdated Show resolved Hide resolved

Give test a more descriptive name.

e93881b

scoder merged commit a90d0ee into lxml:master May 17, 2022

xmo-odoo deleted the html-fix-inheritance branch May 17, 2022 08:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inheritance in lxml.html #340

Fix inheritance in lxml.html #340

xmo-odoo commented Mar 7, 2022

xmo-odoo commented Mar 7, 2022 •

edited

xmo-odoo commented Mar 7, 2022

scoder commented Mar 7, 2022

xmo-odoo commented Mar 7, 2022 •

edited

scoder commented Mar 7, 2022

xmo-odoo commented Mar 8, 2022

scoder commented Mar 8, 2022

xmo-odoo commented Mar 8, 2022

scoder commented Mar 8, 2022

xmo-odoo commented Mar 9, 2022 •

edited

xmo-odoo commented Mar 9, 2022

scoder commented Mar 9, 2022

xmo-odoo commented Mar 9, 2022

scoder commented May 17, 2022

Fix inheritance in lxml.html #340

Fix inheritance in lxml.html #340

Conversation

xmo-odoo commented Mar 7, 2022

xmo-odoo commented Mar 7, 2022 • edited

xmo-odoo commented Mar 7, 2022

scoder commented Mar 7, 2022

xmo-odoo commented Mar 7, 2022 • edited

scoder commented Mar 7, 2022

xmo-odoo commented Mar 8, 2022

scoder commented Mar 8, 2022

xmo-odoo commented Mar 8, 2022

scoder commented Mar 8, 2022

xmo-odoo commented Mar 9, 2022 • edited

xmo-odoo commented Mar 9, 2022

scoder commented Mar 9, 2022

xmo-odoo commented Mar 9, 2022

scoder commented May 17, 2022

xmo-odoo commented Mar 7, 2022 •

edited

xmo-odoo commented Mar 7, 2022 •

edited

xmo-odoo commented Mar 9, 2022 •

edited