Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Merged
merged 9 commits into from Aug 20, 2016

Conversation

Pelleplutt
Copy link
Contributor

When using a subtree of a document do not simply copy all of the
namespaces from all of the parents down. Only copy those that we
actually use within the subtree. This as copying all namespaces will
bloat the subtree with information it should not have.

This might seem harmless to do in the average case, but it will cause problems when serializing the XML, specifically C14N serialization which will according to specification retain all ns declarations on the root level element. So if this tostring() execution then will insert all parent namespace declarations into the now new root element we will unnecessarily bloat the ns declarations on this new toplevel element.

Having this said I am not confident this is the best code for doing this, feel free to point me in the direction of better code if you will.

When using a subtree of a document do not simply copy all of the
namespaces from all of the parents down. Only copy those that we
actually use within the subtree. This as copying all namespaces will
bloat the subtree with information it should not have.
@scoder
Copy link
Member

scoder commented Dec 14, 2015

I generally like the change (and it's well written), but sadly, there are cases where _copyParentNamespaces() must really create an exact copy of all namespaces for correctness. There is a failing C14N test due to this change (although the failure is most likely not C14N related). I don't know if it's possible to distinguish between cases where this helps and those where it renders the output incorrect.

@Pelleplutt
Copy link
Contributor Author

Sorry for my n00b-ness on this, I honestly tried running the test suite myself and failed. How do you actually run the tests locally?

@Pelleplutt
Copy link
Contributor Author

I implemented your improvements. Regarding the loops; I tried to follow what seemed to be the local convention for the looping, but I guess I found old bad examples on how it was done. Thanks for the pointer.

Add control ability of the sleek ns copying to _copyParentNamespaces()
so we can turn it off or on depending on the context. This as this is
not always desired behavior. Leave it active only on c14n serialization
for now.
@Pelleplutt
Copy link
Contributor Author

I made a suggestion for how to retain the original behavior in all cases except for the c14n serialization (as this is where it really makes a difference). Not a big fan of pushing boolean flags to functions like this (as it quickly grows out of hand...). But it does the trick.

I guess one option would be to make a wrapper for _plainFakeRootDoc() (like done for _fakeRootDoc()) to limit the impact, but as it was only used in the raw form in two locations I left this out.

@Pelleplutt
Copy link
Contributor Author

Any feedback on this?

cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node) nogil:
u"""Copy the namespaces of all ancestors of c_from_node to c_to_node.
cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node, bint applied_ns_only) nogil:
u"""Copy the namespaces of all ancestors of c_from_node to c_to_node that are used by c_to_node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... used by c_from_node, it seems.

@scoder
Copy link
Member

scoder commented Jan 22, 2016

To run the tests, I just say python test.py -vv -p. Or call py.test src/lxml/tests if you have pytest. (Some of the HTML tests still seem to fail when run with pytest, thus the selected subdirectory.)

@scoder
Copy link
Member

scoder commented Jan 22, 2016

Hmm, I got confused and didn't understand that your intention was to make this change specifically for C14N. I'm sure there is code out there that uses C14N serialisation to build persistent hash IDs etc. and it's an important use case, I'd say. This change would suddenly lead to different output for existing data. I see less of a problem with normal XML serialisation.

That suggests that this should be treated as a new feature that should be optional, in order to keep backwards compatibility by default. I'm ok with always enabling it for normal serialisation (there is no advantage in having redundant namespace declarations there), but the C14N case would need an explicit flag option that users would have to enable in order to get the new behaviour.

Sorry for not noticing this earlier.

@Pelleplutt
Copy link
Contributor Author

Yes, the point is specifically to add it for C14N as it seems to be in direct violation of the specification when serialising the XML.

As I interpret https://www.w3.org/TR/xml-c14n section 2.4 this should be done. See second paragraph in this section.

This means that when we create a new child and simply copy all of the namespace declarations from parent declarations we are in fact introducing declarations that should not be present on that node in the current context. Hence my change. This is also the way other implementations seems to behave (as I am doing this to solve digest mismatch issues where this case exists).

@scoder
Copy link
Member

scoder commented Jan 23, 2016

The paragraph you are referencing says:

"""
The processing of an element node E MUST be modified slightly when an XPath node-set is given as input and the element's parent is omitted from the node-set. The method for processing the attribute axis of an element E in the node-set is enhanced. All element nodes along E's ancestor axis are examined for nearest occurrences of attributes in the xml namespace, such as xml:lang and xml:space (whether or not they are in the node-set). From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set. The result of visiting the attribute axis is computed by processing the attribute nodes in this merged attribute list.
"""

There is no hint here that unused namespace prefix mappings should be removed. Instead, it says that only those from the ancestors should not be added that the element E declares itself, i.e. all other prefix declarations should be included, whether they are used in E's subtree or not.

Note also that this paragraph only refers to the xml namespace, not arbitrary namespace-prefix mappings.

@Pelleplutt
Copy link
Contributor Author

Perhaps we are reading this differently. I interpret that section as if only xml namespaces present in the node set being copied should be added.

"From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set."

For me this sound like

  1. Do not add duplicates "remove any that are in E's attribute axis (whether or not they are in the node-set)"
  2. Add the attributes, i.e. in this context namespace declarations, that are in the node set. " merge this attribute list with the nodes of E's attribute axis that are in the node-set"

But I will admit that parsing those sections of text are not trivial and I am not a native English speaker. Looking at other implementations does not give me a conclusion about this either, Perl's XML::LibXML behaves as lxml currently does (copies all the parent namespaces). Java (org.apache.xml.security.c14n.Canonicalizer) does not seem to help at all but you rather have to add the namespaces manually when doing this. Whatever implementation the bank I am testing against used it does however behave like I suggest, copy only the namespaces actually used.

But this is your code scoder and you can decide how you want it to behave. I am happy if you choose to include the code at all. So your call here.

@scoder scoder merged commit d71219b into lxml:master Aug 20, 2016
scoder added a commit that referenced this pull request Aug 20, 2016
This reverts commit d71219b, reversing
changes made to 714b95c.
scoder added a commit that referenced this pull request Aug 20, 2016
scoder added a commit that referenced this pull request Aug 20, 2016
@scoder
Copy link
Member

scoder commented Aug 20, 2016

I've cleaned up your changes here, but then reverted the merge after some reconsideration:
https://github.com/scoder/lxml/tree/exclude_unused_ns_from_c14n

I'm still having difficulties to understand why this cannot be left to users. They can call cleanup_namespaces() on their tree directly before serialisation to get the same result. Changing the default behaviour here is annoying because in order to allow users who depend on it to get back the old behaviour, we'd have to add a new legacy option, and people would have to start using that option from a certain lxml version on but avoid passing it in old versions that don't accept it.

Also, your change does not handle file serialisation but only string serialisation, which adds an inconsistency.

I do see your interest in making this the default and I agree with it. But ISTM that it's not worth breaking existing code that relies on the current behavour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants