Add support for writing CDATA escaped strings in lxml.etree.xmlfile.write() #458
It's worth noting that the types-lxml package will also need to update the argument types for [...]
Thanks. The tests look good. The implementation should call libxml2 directly, though. Use this:
bstring = (<CDATA>content)._utf8_data
tree.xmlOutputBufferWrite(self._c_out, 9, "<![CDATA[")
tree.xmlOutputBufferWrite(self._c_out, len(bstring), _cstr(bstring))
tree.xmlOutputBufferWrite(self._c_out, 3, "]]>")
Ok, this required a bit more work. I pushed an implementation to the master branch, let's see what your tests say. They are probably incomplete, given the new (non-trivial) implementation. Maybe you can extend them some more? Since the section splitting is done by lxml now, there are cases like "]] at the string end" etc. that need to be covered.
Originally that was how I coded this solution, but while reading the MDN docs about CDATA sections (because I was unsure whether CDATA sections were valid within HTML), I saw that it notes:
And I realized that this code wouldn't properly escape the CDATA end marker and would output invalid XML. That's why I wrote the [...] But if the lxml project is trying to move away from delegating too much to libxml, that's my bad for not noticing.
Certainly not your fault, and there isn't a general move like that. The thing is rather that we don't have a node here that libxml2 could serialise, and creating a node just to write it out and throw it away seems wasteful. The function you called is actually a rather high-level function in lxml that does a lot more than just the simple serialisation, so that's wasteful as well. Thanks for providing the code and the tests, that made it easy to find a good implementation.
While refactoring a project for my workplace that uses lxml (specifically the incremental xmlfile class) to create XML export files from data streams on various REST APIs, I was adding namespace support to my code when I ran into the same issue as this user on Stack Overflow: "lmxl incremental XML serialisation repeats namespaces".
So I converted my code from writing Element instances directly to using the Python with syntax, which preserves parent node namespaces without repeating them on the child nodes (repeating them would bloat the export file).
Example from:
Example to:
But I found out that this breaks some of my existing export files that use CDATA escaping: lxml.etree.xmlfile.write() supports strings and Element nodes containing CDATA text, but doesn't accept CDATA instances as direct arguments and raises TypeError when one is provided.
Working code, but duplicates namespaces:
New code without duplicating namespaces, but raises TypeError:
So this PR allows lxml.etree.xmlfile.write() to write CDATA instances, maintaining parity with how Element.text works. I've also included unit tests that check that the CDATA is output and that the encapsulated text string(s) are escaped correctly.
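The original snippets from this description are not preserved here, so the following is only a minimal sketch of the two approaches it contrasts, not the author's code. The namespace URI, the "ex" prefix, the element names and both function names are assumptions made up for illustration. Passing cdata=True assumes an lxml build that includes the change from this PR; on builds without it, write() raises TypeError as described above.

```python
import io

from lxml import etree

# Hypothetical namespace and names, chosen only for this illustration.
NS = "http://example.com/export"
NSMAP = {"ex": NS}
ROOT = "{%s}root" % NS
ITEM = "{%s}item" % NS


def export_with_elements(texts):
    """Write standalone Element objects; each one declares the namespace."""
    buf = io.BytesIO()
    with etree.xmlfile(buf, encoding="utf-8") as xf:
        with xf.element(ROOT, nsmap=NSMAP):
            for text in texts:
                item = etree.Element(ITEM, nsmap=NSMAP)
                item.text = etree.CDATA(text)  # CDATA via Element.text
                xf.write(item)
    return buf.getvalue()


def export_with_contexts(texts, cdata=False):
    """Open children with xf.element() so they reuse the parent's prefix."""
    buf = io.BytesIO()
    with etree.xmlfile(buf, encoding="utf-8") as xf:
        with xf.element(ROOT, nsmap=NSMAP):
            for text in texts:
                with xf.element(ITEM):
                    # cdata=True relies on write() accepting CDATA instances,
                    # which is what this PR adds.
                    xf.write(etree.CDATA(text) if cdata else text)
    return buf.getvalue()


if __name__ == "__main__":
    print(export_with_elements(["a & b", "c"]).decode("utf-8"))
    print(export_with_contexts(["a & b", "c"]).decode("utf-8"))
```

In the first function the namespace declaration appears on every item, while in the second it is declared once on the root element and the children only carry the prefix.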