Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inheritance in lxml.html #340

Merged
merged 2 commits into from May 17, 2022
Merged

Fix inheritance in lxml.html #340

merged 2 commits into from May 17, 2022

Conversation

xmo-odoo
Copy link
Contributor

@xmo-odoo xmo-odoo commented Mar 7, 2022

As the old comment / FIXME from 8132c75 notes, the mixin should come first for the inheritance to be correct (the left-most class is the first in the MRO, at least if no diamond inheritance is involved).

Also fix the odd super in HtmlMixin likely stemming from the incorrect MRO.

Fixes the inheritance order of all HTML* base classes though it probably doesn't matter for other than HtmlElement.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 7, 2022

Also looking at HtmlElementClassLookup there seems to be some oddities with the * tags. The docstring explains that the lookup can be initialised with classes and/or mixins, and that mixins associated with * should be mixed into all element classes.

I've not tested it but if my reading is correct the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

But I don't know what the exact contract is supposed to be, or if that matters.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 7, 2022

Also didn't find a contributing guide so not sure how forward- or back- ports might work (if they're done / supported for minor issues).

@scoder
Copy link
Member

scoder commented Mar 7, 2022

Thanks. That looks correct. Can't say why it wasn't done like this at the time.

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

Not sure what exactly you're referring to here. Do you mean elements created by instantiating element classes (rather than parsing)? That behaviour is documented on the element classes page.

Regarding backports, lxml tends to be mostly backwards compatible, so that rarely happens.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 7, 2022

the *-mixins will only apply to the classes (or default fallbacks), if some element classes are created solely through mixins (which seems feasible) those won't get *-mixins, and the same applies to HtmlElement fallbacks (so *-mixin can't be used to apply changes to all elements from that lookup object).

Not sure what exactly you're referring to here. Do you mean elements created by instantiating element classes (rather than parsing)? That behaviour is documented on the element classes page.

No I mean that if we set up a parser with the lookup

HtmlElementClassLookup(mixins=[
    ('*', A),
    ('foo', B),
])

Then I think parsed foo elements will not get the A mixin, and neither will any element parsed which is not part of the _default_element_classes, at least according to my reading (as noted I haven't tested it yet, I was just reading through the code). I find that a bit odd but I don't know what the expectations of the lxml maintainers would be (and also I don't know how "public" HtmlElementClassLookup is supposed to be, it's unprefixed but it's not listed in __all__).

@scoder
Copy link
Member

scoder commented Mar 7, 2022

I also read it this way, that it's exclusive. It either matches a specific class or the fallback class. While it wouldn't necessarily have to be this way, one advantage is that it makes it possible to exclude certain tags from inheriting the default mixin. If the default was always applied, that would be difficult.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 8, 2022

I also read it this way, that it's exclusive. It either matches a specific class or the fallback class. While it wouldn't necessarily have to be this way, one advantage is that it makes it possible to exclude certain tags from inheriting the default mixin. If the default was always applied, that would be difficult.

That interpretation could make some sense but it seems somewhat inconsistent as it seems to behave differently if the class being augmented is passed through classes or is one of the defaults, in that case both mixins will apply.

On the other hand the mixins and classes features don't seem used by the module itself when instantiating its parsers so if the object is not considered part of the public API (is just an internal feature) an other option would be to deprecate those things and leave HtmlElementClassLookup as just a registry of "specialised" classes (for form elements)?

Also coming back to

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Thanks. That looks correct. Can't say why it wasn't done like this at the time.

Maybe time or couldn't be arsed, that happens.

@scoder
Copy link
Member

scoder commented Mar 8, 2022

On the other hand the mixins and classes features don't seem used by the module itself when instantiating its parsers so if the object is not considered part of the public API (is just an internal feature) an other option would be to deprecate those things and leave HtmlElementClassLookup as just a registry of "specialised" classes (for form elements)?

Well, if it's not used much, then ISTM that we don't have a problem. :) At least not a problem that couldn't be solved with documentation rather than deprecation.

It's probably also missing tests for comments, PIs and entities, that's why it didn't show so far. Could you please add some?

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Or rather that Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree, and that XML attribute setting on elements does what it should (there might be tests for that, at least). I mean, the mere fact that your changes broke no tests shows that there were tests missing.

I actually wonder if the set() method is still needed. It seems redundant now. Maybe it's just there to change the docstring? Might be, then we could still keep it.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 8, 2022

Sure but there doesn't seem to be much to test aside from testing that Python's inheritance works the way it's specced?

Or rather that Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree, and that XML attribute setting on elements does what it should (there might be tests for that, at least). I mean, the mere fact that your changes broke no tests shows that there were tests missing.

It might also be that the change doesn't break anything because it manipulates the MRO in ways which were not intentionally leveraged (and the old behaviour had to be worked around hence the attributes being re-set). But I see your point.

I actually wonder if the set() method is still needed. It seems redundant now. Maybe it's just there to change the docstring? Might be, then we could still keep it.

It is still useful for the docstring, as normally set doesn't accept a None, and also to add a default value so on an HtmlElement one can call e.g. e.set('disabled') and that's equivalent to <e disabled>, which is not supported by the XML API. Though I should probably check if there are tests because it looks like this behaviour is more magical than I'd expected, it's not just that (somehow) the element-creation accepts and casts None to empty strings for html elements.

@scoder
Copy link
Member

scoder commented Mar 8, 2022

Element.set(name, None) is actually allowed for HTML documents, whether you use lxml.html or just the HTML parser in lxml.etree. I added that to the docstring in lxml.etree. So it's only the default argument value that lxml.html adds, which seems helpful for HTML but not for lxml.etree.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 9, 2022

Element.set(name, None) is actually allowed for HTML documents, whether you use lxml.html or just the HTML parser in lxml.etree.

Right, but without the override that's not visible through the API doc, because you get the information from _Element/ElementBase, which doesn't really know because it's a special-case based on the parser/document associated with the specific element (if I understand correctly).

I added that to the docstring in lxml.etree.

Which one? Or is it in an unreleased revision? For the currently published stuff all I see is

    set(self, key, value)
    
    Sets an element attribute.

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 9, 2022

@scoder I've one more question: I'm trying to write a test to ensure

Comment, PI and Entity classes are correctly inserted and still behave as expected when found in the tree

however there seems to be a snag: resolve_entities does not exist on the HTMLParser and apparently it's an opt-out flag in libxml2, so I can't find a way to actually get an HtmlEntity out of a parse.

Also more of a propriety / cleanliness question: remove_pis and remove_flags are False by default, which is what I want, should I assume they remain that or should I use an explicit HTMLParser with those flags explicitely set to false, to ignore / discard changes to the default configuration?

@scoder
Copy link
Member

scoder commented Mar 9, 2022

I added that to the docstring in lxml.etree.

Which one? Or is it in an unreleased revision?

Yes, a fresh commit in 3bd8db7

As the old comment / FIXME from
8132c75 notes, the mixin should come
first for the inheritance to be correct (the left-most class is the
first in the MRO, at least if no diamond inheritance is involved).

Also fix the odd `super` in `HtmlMixin` likely stemming from the
incorrect MRO.

Fixes the inheritance order of all `HTML*` base classes though it
probably doesn't matter for other than `HtmlElement`.
@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Mar 9, 2022

Added a pair of tests to check for mixin features (the only one which seems really applicable is base_url so I checked that), as well as a few tests around the behaviour of HtmlElement.set as I couldn't find any invocation entirely leaving out the value (I also added an explicit None and an empty string for good measure).

Also updated the tox file to specifically allow make, otherwise some of the envs complain. It still fails locally on 3.8 and 3.10, possibly because everything's installed using pyenv (though that makes it unclear why it'd succeed for 3.5, 3.6, 3.7, and 3.9 which are also installed using pyenv).

@scoder scoder merged commit a90d0ee into lxml:master May 17, 2022
@scoder
Copy link
Member

scoder commented May 17, 2022

Thanks!

@xmo-odoo xmo-odoo deleted the html-fix-inheritance branch May 17, 2022 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants