Exposing libxml2 functionalities
* See whether XInclude support can mimic ElementTree's API.
* Test XML entities, also in an ElementTree context.
In general
* Test namespaces in more depth.
* Will namespace nodes of unknown namespaces be added (and never freed)?
Top level
* ProcessingInstruction
* _setroot(), even though this is not strictly a public method.
* expose prefix support?
* RELAX NG compact notation (rnc versus rng) support. May consider
integrating this:
Notes on implementing iterparse
"iterparse" will be (or will return) an iterable object, let's call it
IterParse for clarity. A class is basically the only way of implementing
iterators in Pyrex. For the internal SAX part, IterParse will likely work a
lot like lxml.sax.ElementTreeContentHandler.
We'd need a custom wrapper around the default libxml2 SAX handler to intercept
the parse events /after/ libxml2 has processed them. This means implementing C
helper functions for the SAX events. See xmlSAXVersion (SAX2.c) on how to
retrieve the SAX2 default parser structure.
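The interception idea can be illustrated in pure Python. This is only a sketch: it uses the standard library's xml.sax module as a stand-in for the libxml2 SAX layer, and the names DefaultHandler and InterceptingHandler are hypothetical, not part of lxml or libxml2. The wrapper first delegates to the default handler and only then records its own event, so it sees the state the default processing left behind.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class DefaultHandler(ContentHandler):
    """Stand-in for the default handler: tracks the open elements."""

    def __init__(self):
        super().__init__()
        self.path = []

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()


class InterceptingHandler(ContentHandler):
    """Hypothetical wrapper: delegate first, record the event afterwards."""

    def __init__(self, default):
        super().__init__()
        self._default = default
        self.events = []

    def startElement(self, name, attrs):
        self._default.startElement(name, attrs)  # default processing first
        self.events.append(("start", name, len(self._default.path)))

    def endElement(self, name):
        self._default.endElement(name)  # default processing first
        self.events.append(("end", name, len(self._default.path)))


handler = InterceptingHandler(DefaultHandler())
xml.sax.parseString(b"<root><a/></root>", handler)
```

The third item of each recorded event is the depth of open elements as seen after the default handler ran, which is the ordering the note above asks for.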
IterParse should pass chunks into the parser and buffer the events it
receives. When its __next__() method is called, it returns one event or passes
new chunks until there is an event to return. This is needed as IterParse has
to convert between libxml2 push (SAX) and Python pull (iter).
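A minimal sketch of the push-to-pull conversion described above, written in plain Python rather than Pyrex. It assumes the standard library's expat binding (xml.parsers.expat) as a stand-in for libxml2's push parser. The chunk_size parameter and the event tuples are illustrative choices, not a proposed API.

```python
import io
import xml.parsers.expat
from collections import deque


class IterParse:
    """Pull-style iterator over events produced by a push parser."""

    def __init__(self, source, chunk_size=16):
        self._source = source
        self._chunk_size = chunk_size
        self._events = deque()  # events buffered between __next__() calls
        self._finished = False
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._on_start
        self._parser.EndElementHandler = self._on_end

    def _on_start(self, name, attrs):
        self._events.append(("start", name))

    def _on_end(self, name):
        self._events.append(("end", name))

    def __iter__(self):
        return self

    def __next__(self):
        # Pass chunks into the parser only until an event is available.
        while not self._events:
            if self._finished:
                raise StopIteration
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)  # signal end of input
                self._finished = True
        return self._events.popleft()


events = list(IterParse(io.BytesIO(b"<root><a/><b>text</b></root>")))
```

Each call to __next__() either drains one buffered event or feeds further chunks until the parser callbacks have produced one, so the caller never sees the push interface.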
As for the input to the libxml2 parser, there are two possible ways: one is to
pass data chunks in through xmlParseChunk and the other is to use
xmlCreateIOParserCtxt and implement xmlInputReadCallback (xmlio.h) to have
libxml2 request data by itself. However, xmlParseChunk allows us to control
how far libxml2 parses in advance, so this is preferable.
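The effect of chunk-wise feeding can be seen with the standard library's XMLPullParser, used here only as an analogue of the xmlParseChunk approach (it is backed by expat, not libxml2). The caller decides how much input the parser gets, so no event can appear for data that has not been fed yet.

```python
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("start", "end"))

# Feed only the first part of the document and collect what is available.
parser.feed("<root><a>first</a>")
early = [(event, elem.tag) for event, elem in parser.read_events()]

# Feed the rest, close the parser, and collect the remaining events.
parser.feed("<b>second</b></root>")
parser.close()
late = [(event, elem.tag) for event, elem in parser.read_events()]
```

After the first feed() the "end" event for root cannot be present, because its closing tag has not been passed in. Exactly which events land in early rather than late may depend on the underlying parser's internal buffering.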
Python events (start, end, start-ns, end-ns) are created as follows:
* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call
(passed in arguments "prefix"/"URI" and the char* array "namespaces"). They
must be stored on a stack to build the respective "end-ns" events.
* "start" is somewhat tricky, as it would be a bad idea to allow modifications
of the XML structure during that iterator cycle. Maybe it's enough to document
that, but there may be ways to crash lxml with certain tree operations. Note
also that care has to be taken to prevent Python from garbage collecting the
element before the "end" event. The best way to do that is to store a Python
reference to that element on a stack.
* "end" is simple then: pop the element from the stack and return it.
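The stack handling for the four event types could be sketched as follows. This is a pure Python illustration under stated assumptions: EventBuilder and its methods are hypothetical names, plain strings stand in for element objects, and the callbacks are driven by hand instead of by libxml2. The "end-ns" event carries None, following the convention of ElementTree's iterparse.

```python
class EventBuilder:
    """Hypothetical helper that turns SAX-style callbacks into events."""

    def __init__(self):
        self.events = []
        self._elements = []   # holds a reference to each open element
        self._ns_counts = []  # namespaces declared by each open element

    def start_element(self, element, namespaces=()):
        # "start-ns" events come from the namespaces passed with the call.
        for prefix, uri in namespaces:
            self.events.append(("start-ns", (prefix, uri)))
        self._ns_counts.append(len(namespaces))
        # Keeping the element on a stack keeps it alive until its "end".
        self._elements.append(element)
        self.events.append(("start", element))

    def end_element(self):
        # "end" pops the element; the matching "end-ns" events follow.
        element = self._elements.pop()
        self.events.append(("end", element))
        for _ in range(self._ns_counts.pop()):
            self.events.append(("end-ns", None))


builder = EventBuilder()
builder.start_element("root", [("x", "http://example.com/x")])
builder.start_element("child")
builder.end_element()
builder.end_element()
```

Storing only a count per element is enough here because "end-ns" events carry no payload. A real implementation that needs the prefix at "end-ns" time would push the declarations themselves.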