Add support for writing CDATA escaped strings in `lxml.etree.xmlfile.write()` by LanetheGreat · Pull Request #458 · lxml/lxml

LanetheGreat · 2025-03-27T20:42:07Z

While refactoring a project for my workplace that utilizes lxml (specifically the incremental xmlfile class) to create XML export files from data streams on various REST API's, I was adding support for namespaces to my code when I ran into the same issue as this user on StackOverflow "lmxl incremental XML serialisation repeats namespaces"

So I converted my code from writing Element instances directly to using the Python with syntax to preserve parent node namespaces without repeating them on the child nodes (which would bloat the export file size):
Example from:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                elem = etree.Element('{http://base.google.com/ns/1.0}field1')
                elem.text = 'data 1'
                xf.write(elem)
                
                elem = etree.Element('{http://base.google.com/ns/1.0}field2')
                elem.text = 'data 2'
                xf.write(elem)

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><ns0:field1 xmlns:ns0="http://base.google.com/ns/1.0">data 1</ns0:field1><ns0:field2 xmlns:ns0="http://base.google.com/ns/1.0">data 2</ns0:field2></item></channel></rss>'

Example to:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                with xf.element('{http://base.google.com/ns/1.0}field1'):
                    xf.write('data 1')
                with xf.element('{http://base.google.com/ns/1.0}field2'):
                    xf.write('data 2')

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><g:field1>data 1</g:field1><g:field2>data 2</g:field2></item></channel></rss>'

But I found out that breaks some of my existing export files that use CDATA escaping because lxml.etree.xmlfile.write() supports strings and Element nodes containing CDATA text, but doesn't support CDATA instances as direct arguments and raises TypeError when one is provided.

Working code, but duplicates namespaces:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                elem = etree.Element('{http://base.google.com/ns/1.0}field1')
                elem.text = etree.CDATA('data 1')
                xf.write(elem)
                
                elem = etree.Element('{http://base.google.com/ns/1.0}field2')
                elem.text = etree.CDATA('data 2')
                xf.write(elem)

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><ns0:field1 xmlns:ns0="http://base.google.com/ns/1.0"><![CDATA[data 1]]></ns0:field1><ns0:field2 xmlns:ns0="http://base.google.com/ns/1.0"><![CDATA[data 2]]></ns0:field2></item></channel></rss>'

New code without duplicating namespaces, but raises TypeError:

from io import BytesIO
from lxml import etree

file = BytesIO()
with etree.xmlfile(file) as xf:
    with xf.element('rss', nsmap={'g': 'http://base.google.com/ns/1.0'}):
        with xf.element('channel'):
            with xf.element('item'):
                with xf.element('{http://base.google.com/ns/1.0}field1'):
                    xf.write(etree.CDATA('data 1'))
                with xf.element('{http://base.google.com/ns/1.0}field2'):
                    xf.write(etree.CDATA('data 2'))

assert file.getvalue() == b'<rss xmlns:g="http://base.google.com/ns/1.0"><channel><item><g:field1><![CDATA[data 1]]></g:field1><g:field2><![CDATA[data 2]]></g:field2></item></channel></rss>'

So this PR allows lxml.etree.xmlfile.write() to write CDATA instances to maintain parity with how Element.text works. I've also included unit tests that also check to make sure the CDATA is output and correctly escapes the encapsulated text string(s).

LanetheGreat · 2025-03-27T21:05:00Z

It's also worth noting that the types-lxml package will also need to update the argument types for lxml.etree.xmlfile.write() if this PR gets merged in.

scoder · 2025-03-28T07:19:06Z

Thanks. The tests look good. The implementation should call libxml2 directly, though. Use this:

                bstring = (<CDATA>content)._utf8_data
                tree.xmlOutputBufferWrite(self._c_out, 9, "<![CDATA[")
                tree.xmlOutputBufferWrite(self._c_out, len(bstring), _cstr(bstring))
                tree.xmlOutputBufferWrite(self._c_out, 3, "]]>")

scoder · 2025-03-28T11:05:58Z

Ok, this required a bit more work. I pushed an implementation to the master branch, let's see what your tests say. They are probably incomplete, given the new (non-trivial) implementation. Maybe you can extend them some more? Since the section splitting is done by lxml now, there are cases like "]] at the string end" etc. that need to be covered.

LanetheGreat · 2025-03-30T19:08:35Z

Thanks. The tests look good. The implementation should call libxml2 directly, though. Use this:

                bstring = (<CDATA>content)._utf8_data
                tree.xmlOutputBufferWrite(self._c_out, 9, "<![CDATA[")
                tree.xmlOutputBufferWrite(self._c_out, len(bstring), _cstr(bstring))
                tree.xmlOutputBufferWrite(self._c_out, 3, "]]>")

Originally that was how I coded this solution, but while reading on the docs from MDN about CDATA sections (because I unsure if CDATA sections were valid within HTML) it notes:

The only sequence which is not allowed within a CDATA section is the closing sequence of a CDATA section itself, ]]>.

And realized that code wouldn't properly escape the ending CDATA tag and output invalid XML. That's why I wrote the test_cdata_escaping() test to guard against miscoding that, and opted to use _writeNodeToBuffer() since it was already in use by the same code segment and eventually delegates writing the CDATA section down to tree.xmlNodeDumpOutput() from libxml/tree.h which properly escapes the tail tag and splits it into 2 CDATA sections to preserve the text.

But if the lxml project is trying to move away from delegating too much to libxml, that's my bad for not noticing.

scoder · 2025-03-31T08:13:17Z

But if the lxml project is trying to move away from delegating too much to libxml

Certainly not your fault, and there isn't a general move like that. The thing is rather that we don't have a node here that libxml2 could serialise, and creating a node just to write it out and throw it away seems wasteful. The function you called is actually a rather high-level function in lxml that does a lot more than just the simple serialisation, so that's wasteful as well.

Thanks for providing the code and the tests, that made it easy to find a good implementation.

Add support for writing CDATA escaped strings in lxml.xmlfile.write()

2c0a2ed

LanetheGreat changed the title ~~Add support for writing CDATA escaped strings in lxml.xmlfile.write()~~ Add support for writing CDATA escaped strings in lxml.etree.xmlfile.write() Mar 27, 2025

scoder reviewed Mar 28, 2025

View reviewed changes

Comment thread src/lxml/serializer.pxi Outdated

scoder added 2 commits March 28, 2025 08:20

Use direct libxml2 calls.

1951dcd

Merge branch 'master' into xmlfile-cdata-support

9e394c9

Merge branch 'master' into xmlfile-cdata-support

df68fdf

Merge branch 'master' into xmlfile-cdata-support

7bafe5f

scoder merged commit 7a502cf into lxml:master Mar 31, 2025
48 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for writing CDATA escaped strings in `lxml.etree.xmlfile.write()`#458

Add support for writing CDATA escaped strings in `lxml.etree.xmlfile.write()`#458
scoder merged 5 commits into
lxml:masterfrom
LanetheGreat:xmlfile-cdata-support

LanetheGreat commented Mar 27, 2025 •

edited

Loading

Uh oh!

LanetheGreat commented Mar 27, 2025

Uh oh!

scoder commented Mar 28, 2025

Uh oh!

Uh oh!

scoder commented Mar 28, 2025

Uh oh!

LanetheGreat commented Mar 30, 2025

Uh oh!

scoder commented Mar 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

LanetheGreat commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LanetheGreat commented Mar 27, 2025

Uh oh!

scoder commented Mar 28, 2025

Uh oh!

Uh oh!

scoder commented Mar 28, 2025

Uh oh!

LanetheGreat commented Mar 30, 2025

Uh oh!

scoder commented Mar 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LanetheGreat commented Mar 27, 2025 •

edited

Loading