Skip to content

Add support for writing CDATA escaped strings in lxml.etree.xmlfile.write()#458

Merged
scoder merged 5 commits into
lxml:masterfrom
LanetheGreat:xmlfile-cdata-support
Mar 31, 2025
Merged

Add support for writing CDATA escaped strings in lxml.etree.xmlfile.write()#458
scoder merged 5 commits into
lxml:masterfrom
LanetheGreat:xmlfile-cdata-support

Conversation

@LanetheGreat
Copy link
Copy Markdown
Contributor

@LanetheGreat LanetheGreat commented Mar 27, 2025

While refactoring a project for my workplace that utilizes lxml (specifically the incremental xmlfile class) to create XML export files from data streams on various REST API's, I was adding support for namespaces to my code when I ran into the same issue as this user on StackOverflow "lmxl incremental XML serialisation repeats namespaces"

So I converted my code from writing Element instances directly to using the Python with syntax to preserve parent node namespaces without repeating them on the child nodes (which would bloat the export file size):
Example from:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                elem = etree.Element('{http://base.google.com/ns/1.0}field1')
                elem.text = 'data 1'
                xf.write(elem)
                
                elem = etree.Element('{http://base.google.com/ns/1.0}field2')
                elem.text = 'data 2'
                xf.write(elem)

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><ns0:field1 xmlns:ns0="http://base.google.com/ns/1.0">data 1</ns0:field1><ns0:field2 xmlns:ns0="http://base.google.com/ns/1.0">data 2</ns0:field2></item></channel></rss>'

Example to:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                with xf.element('{http://base.google.com/ns/1.0}field1'):
                    xf.write('data 1')
                with xf.element('{http://base.google.com/ns/1.0}field2'):
                    xf.write('data 2')

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><g:field1>data 1</g:field1><g:field2>data 2</g:field2></item></channel></rss>'

But I found out that breaks some of my existing export files that use CDATA escaping because lxml.etree.xmlfile.write() supports strings and Element nodes containing CDATA text, but doesn't support CDATA instances as direct arguments and raises TypeError when one is provided.

Working code, but duplicates namespaces:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                elem = etree.Element('{http://base.google.com/ns/1.0}field1')
                elem.text = etree.CDATA('data 1')
                xf.write(elem)
                
                elem = etree.Element('{http://base.google.com/ns/1.0}field2')
                elem.text = etree.CDATA('data 2')
                xf.write(elem)

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><ns0:field1 xmlns:ns0="http://base.google.com/ns/1.0"><![CDATA[data 1]]></ns0:field1><ns0:field2 xmlns:ns0="http://base.google.com/ns/1.0"><![CDATA[data 2]]></ns0:field2></item></channel></rss>'

New code without duplicating namespaces, but raises TypeError:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                with xf.element('{http://base.google.com/ns/1.0}field1'):
                    xf.write(etree.CDATA('data 1'))
                with xf.element('{http://base.google.com/ns/1.0}field2'):
                    xf.write(etree.CDATA('data 2'))

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><g:field1><![CDATA[data 1]]></g:field1><g:field2><![CDATA[data 2]]></g:field2></item></channel></rss>'

So this PR allows lxml.etree.xmlfile.write() to write CDATA instances to maintain parity with how Element.text works. I've also included unit tests that also check to make sure the CDATA is output and correctly escapes the encapsulated text string(s).

@LanetheGreat LanetheGreat changed the title Add support for writing CDATA escaped strings in lxml.xmlfile.write() Add support for writing CDATA escaped strings in lxml.etree.xmlfile.write() Mar 27, 2025
@LanetheGreat
Copy link
Copy Markdown
Contributor Author

It's also worth noting that the types-lxml package will also need to update the argument types for lxml.etree.xmlfile.write() if this PR gets merged in.

@scoder
Copy link
Copy Markdown
Member

scoder commented Mar 28, 2025

Thanks. The tests look good. The implementation should call libxml2 directly, though. Use this:

                bstring = (<CDATA>content)._utf8_data
                tree.xmlOutputBufferWrite(self._c_out, 9, "<![CDATA[")
                tree.xmlOutputBufferWrite(self._c_out, len(bstring), _cstr(bstring))
                tree.xmlOutputBufferWrite(self._c_out, 3, "]]>")

Comment thread src/lxml/serializer.pxi Outdated
@scoder
Copy link
Copy Markdown
Member

scoder commented Mar 28, 2025

Ok, this required a bit more work. I pushed an implementation to the master branch, let's see what your tests say. They are probably incomplete, given the new (non-trivial) implementation. Maybe you can extend them some more? Since the section splitting is done by lxml now, there are cases like "]] at the string end" etc. that need to be covered.

@LanetheGreat
Copy link
Copy Markdown
Contributor Author

Thanks. The tests look good. The implementation should call libxml2 directly, though. Use this:

                bstring = (<CDATA>content)._utf8_data
                tree.xmlOutputBufferWrite(self._c_out, 9, "<![CDATA[")
                tree.xmlOutputBufferWrite(self._c_out, len(bstring), _cstr(bstring))
                tree.xmlOutputBufferWrite(self._c_out, 3, "]]>")

Originally that was how I coded this solution, but while reading on the docs from MDN about CDATA sections (because I unsure if CDATA sections were valid within HTML) it notes:

The only sequence which is not allowed within a CDATA section is the closing sequence of a CDATA section itself, ]]>.

And realized that code wouldn't properly escape the ending CDATA tag and output invalid XML. That's why I wrote the test_cdata_escaping() test to guard against miscoding that, and opted to use _writeNodeToBuffer() since it was already in use by the same code segment and eventually delegates writing the CDATA section down to tree.xmlNodeDumpOutput() from libxml/tree.h which properly escapes the tail tag and splits it into 2 CDATA sections to preserve the text.

But if the lxml project is trying to move away from delegating too much to libxml, that's my bad for not noticing.

@scoder
Copy link
Copy Markdown
Member

scoder commented Mar 31, 2025

But if the lxml project is trying to move away from delegating too much to libxml

Certainly not your fault, and there isn't a general move like that. The thing is rather that we don't have a node here that libxml2 could serialise, and creating a node just to write it out and throw it away seems wasteful. The function you called is actually a rather high-level function in lxml that does a lot more than just the simple serialisation, so that's wasteful as well.

Thanks for providing the code and the tests, that made it easy to find a good implementation.

@scoder scoder merged commit 7a502cf into lxml:master Mar 31, 2025
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants