-
-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not blindly copy all of the namespaces when tostring():ing a subtree. #184
Conversation
When using a subtree of a document do not simply copy all of the namespaces from all of the parents down. Only copy those that we actually use within the subtree. This as copying all namespaces will bloat the subtree with information it should not have.
I generally like the change (and it's well written), but sadly, there are cases where |
Sorry for my n00b-ness on this, I honestly tried running the test suite myself and failed. How do you actually run the tests locally? |
I implemented your improvements. Regarding the loops; I tried to follow what seemed to be the local convention for the looping, but I guess I found old bad examples on how it was done. Thanks for the pointer. |
Add control ability of the sleek ns copying to _copyParentNamespaces() so we can turn it off or on depending on the context. This as this is not always desired behavior. Leave it active only on c14n serialization for now.
I made a suggestion for how to retain the original behavior in all cases except for the c14n serialization (as this is where it really makes a difference). Not a big fan of pushing boolean flags to functions like this (as it quickly grows out of hand...). But it does the trick. I guess one option would be to make a wrapper for _plainFakeRootDoc() (like done for _fakeRootDoc()) to limit the impact, but as it was only used in the raw form in two locations I left this out. |
Any feedback on this? |
cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node) nogil: | ||
u"""Copy the namespaces of all ancestors of c_from_node to c_to_node. | ||
cdef void _copyParentNamespaces(xmlNode* c_from_node, xmlNode* c_to_node, bint applied_ns_only) nogil: | ||
u"""Copy the namespaces of all ancestors of c_from_node to c_to_node that are used by c_to_node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... used by c_from_node
, it seems.
To run the tests, I just say |
Hmm, I got confused and didn't understand that your intention was to make this change specifically for C14N. I'm sure there is code out there that uses C14N serialisation to build persistent hash IDs etc. and it's an important use case, I'd say. This change would suddenly lead to different output for existing data. I see less of a problem with normal XML serialisation. That suggests that this should be treated as a new feature that should be optional, in order to keep backwards compatibility by default. I'm ok with always enabling it for normal serialisation (there is no advantage in having redundant namespace declarations there), but the C14N case would need an explicit flag option that users would have to enable in order to get the new behaviour. Sorry for not noticing this earlier. |
Yes, the point is specifically to add it for C14N as it seems to be in direct violation of the specification when serialising the XML. As I interpret https://www.w3.org/TR/xml-c14n section 2.4 this should be done. See second paragraph in this section. This means that when we create a new child and simply copy all of the namespace declarations from parent declarations we are in fact introducing declarations that should not be present on that node in the current context. Hence my change. This is also the way other implementations seems to behave (as I am doing this to solve digest mismatch issues where this case exists). |
The paragraph you are referencing says: """ There is no hint here that unused namespace prefix mappings should be removed. Instead, it says that only those from the ancestors should not be added that the element E declares itself, i.e. all other prefix declarations should be included, whether they are used in E's subtree or not. Note also that this paragraph only refers to the xml namespace, not arbitrary namespace-prefix mappings. |
Perhaps we are reading this differently. I interpret that section as if only xml namespaces present in the node set being copied should be added. "From this list of attributes, remove any that are in E's attribute axis (whether or not they are in the node-set). Then, lexicographically merge this attribute list with the nodes of E's attribute axis that are in the node-set." For me this sound like
But I will admit that parsing those sections of text are not trivial and I am not a native English speaker. Looking at other implementations does not give me a conclusion about this either, Perl's XML::LibXML behaves as lxml currently does (copies all the parent namespaces). Java (org.apache.xml.security.c14n.Canonicalizer) does not seem to help at all but you rather have to add the namespaces manually when doing this. Whatever implementation the bank I am testing against used it does however behave like I suggest, copy only the namespaces actually used. But this is your code scoder and you can decide how you want it to behave. I am happy if you choose to include the code at all. So your call here. |
…r""" This reverts commit 8c19b66.
I've cleaned up your changes here, but then reverted the merge after some reconsideration: I'm still having difficulties to understand why this cannot be left to users. They can call Also, your change does not handle file serialisation but only string serialisation, which adds an inconsistency. I do see your interest in making this the default and I agree with it. But ISTM that it's not worth breaking existing code that relies on the current behavour. |
When using a subtree of a document do not simply copy all of the
namespaces from all of the parents down. Only copy those that we
actually use within the subtree. This as copying all namespaces will
bloat the subtree with information it should not have.
This might seem harmless to do in the average case, but it will cause problems when serializing the XML, specifically C14N serialization which will according to specification retain all ns declarations on the root level element. So if this tostring() execution then will insert all parent namespace declarations into the now new root element we will unnecessarily bloat the ns declarations on this new toplevel element.
Having this said I am not confident this is the best code for doing this, feel free to point me in the direction of better code if you will.