Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Pelleplutt · 2015-12-14T15:21:07Z

When using a subtree of a document do not simply copy all of the
namespaces from all of the parents down. Only copy those that we
actually use within the subtree. This as copying all namespaces will
bloat the subtree with information it should not have.

This might seem harmless to do in the average case, but it will cause problems when serializing the XML, specifically C14N serialization which will according to specification retain all ns declarations on the root level element. So if this tostring() execution then will insert all parent namespace declarations into the now new root element we will unnecessarily bloat the ns declarations on this new toplevel element.

Having this said I am not confident this is the best code for doing this, feel free to point me in the direction of better code if you will.

When using a subtree of a document do not simply copy all of the namespaces from all of the parents down. Only copy those that we actually use within the subtree. This as copying all namespaces will bloat the subtree with information it should not have.

scoder · 2015-12-14T19:25:26Z

I generally like the change (and it's well written), but sadly, there are cases where _copyParentNamespaces() must really create an exact copy of all namespaces for correctness. There is a failing C14N test due to this change (although the failure is most likely not C14N related). I don't know if it's possible to distinguish between cases where this helps and those where it renders the output incorrect.

Pelleplutt · 2015-12-15T07:46:52Z

Sorry for my n00b-ness on this, I honestly tried running the test suite myself and failed. How do you actually run the tests locally?

Pelleplutt · 2015-12-15T08:43:45Z

I implemented your improvements. Regarding the loops; I tried to follow what seemed to be the local convention for the looping, but I guess I found old bad examples on how it was done. Thanks for the pointer.

Add control ability of the sleek ns copying to _copyParentNamespaces() so we can turn it off or on depending on the context. This as this is not always desired behavior. Leave it active only on c14n serialization for now.

Pelleplutt · 2015-12-15T08:49:31Z

I made a suggestion for how to retain the original behavior in all cases except for the c14n serialization (as this is where it really makes a difference). Not a big fan of pushing boolean flags to functions like this (as it quickly grows out of hand...). But it does the trick.

I guess one option would be to make a wrapper for _plainFakeRootDoc() (like done for _fakeRootDoc()) to limit the impact, but as it was only used in the raw form in two locations I left this out.

Pelleplutt · 2016-01-20T15:06:21Z

Any feedback on this?

scoder · 2016-01-22T10:11:53Z

src/lxml/proxy.pxi

-cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node) nogil:
-    u"""Copy the namespaces of all ancestors of c_from_node to c_to_node.
+cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node, bint applied_ns_only) nogil:
+    u"""Copy the namespaces of all ancestors of c_from_node to c_to_node that are used by c_to_node.


... used by c_from_node, it seems.

scoder · 2016-01-22T10:14:46Z

To run the tests, I just say python test.py -vv -p. Or call py.test src/lxml/tests if you have pytest. (Some of the HTML tests still seem to fail when run with pytest, thus the selected subdirectory.)

scoder · 2016-01-22T10:25:54Z

Hmm, I got confused and didn't understand that your intention was to make this change specifically for C14N. I'm sure there is code out there that uses C14N serialisation to build persistent hash IDs etc. and it's an important use case, I'd say. This change would suddenly lead to different output for existing data. I see less of a problem with normal XML serialisation.

That suggests that this should be treated as a new feature that should be optional, in order to keep backwards compatibility by default. I'm ok with always enabling it for normal serialisation (there is no advantage in having redundant namespace declarations there), but the C14N case would need an explicit flag option that users would have to enable in order to get the new behaviour.

Sorry for not noticing this earlier.

Pelleplutt · 2016-01-22T10:43:16Z

Yes, the point is specifically to add it for C14N as it seems to be in direct violation of the specification when serialising the XML.

As I interpret https://www.w3.org/TR/xml-c14n section 2.4 this should be done. See second paragraph in this section.

This means that when we create a new child and simply copy all of the namespace declarations from parent declarations we are in fact introducing declarations that should not be present on that node in the current context. Hence my change. This is also the way other implementations seems to behave (as I am doing this to solve digest mismatch issues where this case exists).

scoder · 2016-01-23T15:16:16Z

The paragraph you are referencing says:

"""
The processing of an element node E MUST be modified slightly when an XPath node-set is given as input and the element's parent is omitted from the node-set. The method for processing the attribute axis of an element E in the node-set is enhanced. All element nodes along E's ancestor axis are examined for nearest occurrences of attributes in the xml namespace, such as xml:lang and xml:space (whether or not they are in the node-set). From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set. The result of visiting the attribute axis is computed by processing the attribute nodes in this merged attribute list.
"""

There is no hint here that unused namespace prefix mappings should be removed. Instead, it says that only those from the ancestors should not be added that the element E declares itself, i.e. all other prefix declarations should be included, whether they are used in E's subtree or not.

Note also that this paragraph only refers to the xml namespace, not arbitrary namespace-prefix mappings.

Pelleplutt · 2016-01-25T09:14:13Z

Perhaps we are reading this differently. I interpret that section as if only xml namespaces present in the node set being copied should be added.

"From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set."

For me this sound like

Do not add duplicates "remove any that are in E's attribute axis (whether or not they are in the node-set)"
Add the attributes, i.e. in this context namespace declarations, that are in the node set. " merge this attribute list with the nodes of E's attribute axis that are in the node-set"

But I will admit that parsing those sections of text are not trivial and I am not a native English speaker. Looking at other implementations does not give me a conclusion about this either, Perl's XML::LibXML behaves as lxml currently does (copies all the parent namespaces). Java (org.apache.xml.security.c14n.Canonicalizer) does not seem to help at all but you rather have to add the namespaces manually when doing this. Whatever implementation the bank I am testing against used it does however behave like I suggest, copy only the namespaces actually used.

But this is your code scoder and you can decide how you want it to behave. I am happy if you choose to include the code at all. So your call here.

This reverts commit d71219b, reversing changes made to 714b95c.

This reverts commit 6ea2dab.

…r""" This reverts commit 8c19b66.

scoder · 2016-08-20T11:50:13Z

I've cleaned up your changes here, but then reverted the merge after some reconsideration:
https://github.com/scoder/lxml/tree/exclude_unused_ns_from_c14n

I'm still having difficulties to understand why this cannot be left to users. They can call cleanup_namespaces() on their tree directly before serialisation to get the same result. Changing the default behaviour here is annoying because in order to allow users who depend on it to get back the old behaviour, we'd have to add a new legacy option, and people would have to start using that option from a certain lxml version on but avoid passing it in old versions that don't accept it.

Also, your change does not handle file serialisation but only string serialisation, which adds an inconsistency.

I do see your interest in making this the default and I agree with it. But ISTM that it's not worth breaking existing code that relies on the current behavour.

perltrustly added 2 commits December 15, 2015 09:05

Use a more intuitive looping form.

d9ff0ab

Limit the use of the GIL to be held more locally.

50355f9

Limit the impact of the sparse namespace copy to c14n.

d79f3b6

Add control ability of the sleek ns copying to _copyParentNamespaces() so we can turn it off or on depending on the context. This as this is not always desired behavior. Leave it active only on c14n serialization for now.

scoder reviewed Jan 22, 2016
View reviewed changes

Pelleplutt added 5 commits January 22, 2016 14:21

Rename applied_ns_only->used_only.

355fc3d

Default used_only to True as per suggestion by scoder.

4f5ffb2

Rename apply_ns->should_copy. Better name

2d0686f

Use keyword arguments to get rid of the silly one digit bools all over.

cfab5eb

Adapt the testcases to the new c14n output.

9892e20

scoder merged commit d71219b into lxml:master Aug 20, 2016

scoder added a commit that referenced this pull request Aug 20, 2016

Revert "Merge pull request #184 from Pelleplutt/master"

6ea2dab

This reverts commit d71219b, reversing changes made to 714b95c.

scoder added a commit that referenced this pull request Aug 20, 2016

Revert "Revert "Merge pull request #184 from Pelleplutt/master""

8c19b66

This reverts commit 6ea2dab.

scoder added a commit that referenced this pull request Aug 20, 2016

Revert "Revert "Revert "Merge pull request #184 from Pelleplutt/maste…

0549e49

…r""" This reverts commit 8c19b66.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Pelleplutt commented Dec 14, 2015

scoder commented Dec 14, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Jan 20, 2016

scoder Jan 22, 2016

scoder commented Jan 22, 2016

scoder commented Jan 22, 2016

Pelleplutt commented Jan 22, 2016

scoder commented Jan 23, 2016

Pelleplutt commented Jan 25, 2016

scoder commented Aug 20, 2016

Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Do not blindly copy all of the namespaces when tostring():ing a subtree. #184

Conversation

Pelleplutt commented Dec 14, 2015

scoder commented Dec 14, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Dec 15, 2015

Pelleplutt commented Jan 20, 2016

scoder Jan 22, 2016

Choose a reason for hiding this comment

scoder commented Jan 22, 2016

scoder commented Jan 22, 2016

Pelleplutt commented Jan 22, 2016

scoder commented Jan 23, 2016

Pelleplutt commented Jan 25, 2016

scoder commented Aug 20, 2016