=====================
APIs specific to lxml
=====================
lxml tries to follow established APIs wherever possible. Sometimes, however,
the need to expose a feature in an easy way led to the invention of a new API.
lxml.etree
----------
lxml.etree tries to follow the `ElementTree API`_ wherever it can. There are,
however, some incompatibilities (see `compatibility`_). The extensions are
documented here.
.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
.. _`compatibility`: compatibility.html
If you need to know which version of lxml is installed, you can access the
``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple. Note,
however, that it did not exist before version 1.0, so you will get an
AttributeError in older versions. The versions of libxml2 and libxslt are
available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
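A minimal sketch of such a version check, written for Python 3 and a current lxml release rather than the Python 2 session used elsewhere in this document. The ``getattr`` fallback covers the pre-1.0 case mentioned above; the variable names are illustrative only.

```python
from lxml import etree

# LXML_VERSION does not exist before lxml 1.0, so fall back to None
# instead of letting an AttributeError escape.
lxml_version = getattr(etree, "LXML_VERSION", None)

# Versions of the underlying C libraries, also reported as tuples.
libxml_version = etree.LIBXML_VERSION
libxslt_version = etree.LIBXSLT_VERSION

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
print("libxslt:", libxslt_version)
```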
The following examples usually assume this to be executed first::
>>> from lxml import etree
>>> from StringIO import StringIO
Parsers
-------
One of the differences from ElementTree is the parser. There is support for
both XML and (broken) HTML. Both are based on libxml2 and therefore only
support options that are backed by the library. Parsers take a number of
keyword arguments.
The following is an example for namespace cleanup during parsing, first with
the default parser, then with a parametrized one::
>>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
>>> et = etree.parse(StringIO(xml))
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b xmlns="test"/></a>
>>> parser = etree.XMLParser(ns_clean=True)
>>> et = etree.parse(StringIO(xml), parser)
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b/></a>
HTML parsing is similarly simple. The parsers have a ``recover`` keyword
argument that the HTMLParser sets by default. It lets libxml2 try its best to
return something usable without raising an exception. Note that this
functionality depends entirely on libxml2. You should use libxml2 version
2.6.21 or newer to take advantage of this feature::
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> parser = etree.HTMLParser()
>>> et = etree.parse(StringIO(broken_html), parser)
>>> print etree.tostring(et.getroot())
<html><head><title>test</title></head><body><h1>page title</h1></body></html>
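The ``recover`` argument can also be set on the XML parser. The following is a sketch for Python 3 and a current lxml release, with an input string invented for illustration. What libxml2 manages to recover depends on the library version, so only the root tag is inspected here.

```python
from io import BytesIO
from lxml import etree

broken_xml = b"<a><b>text</a>"  # <b> is never closed

# The default XML parser is strict and rejects the mismatched tag.
try:
    etree.parse(BytesIO(broken_xml))
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# With recover=True, libxml2 tries to build a usable tree anyway.
parser = etree.XMLParser(recover=True)
root = etree.parse(BytesIO(broken_xml), parser).getroot()

print(strict_ok, root.tag)
```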
lxml has an HTML function, similar to the XML shortcut known from
ElementTree::
>>> html = etree.HTML(broken_html)
>>> print etree.tostring(html)
<html><head><title>test</title></head><body><h1>page title</h1></body></html>
The use of the libxml2 parsers makes some additional information available at
the API level. Currently, ElementTree objects can access the DOCTYPE
information provided by a parsed document, as well as the XML version and the
original encoding::
>>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
>>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
>>> xhtml = xml_header + doctype_string + '<html><body></body></html>'
>>> et = etree.parse(StringIO(xhtml))
>>> docinfo = et.docinfo
>>> print docinfo.public_id
-//W3C//DTD XHTML 1.0 Transitional//EN
>>> print docinfo.system_url
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
>>> docinfo.doctype == doctype_string
True
>>> print docinfo.xml_version
1.0
>>> print docinfo.encoding
ascii
Error handling on exceptions
----------------------------
Libxml2 provides error messages for failures, be it during parsing, XPath
evaluation or schema validation. Whenever an exception is raised, you can
retrieve the errors that occurred and "might have" led to the problem::
>>> etree.clearErrorLog()
>>> broken_xml = '<a>'
>>> try:
...     etree.parse(StringIO(broken_xml))
... except etree.XMLSyntaxError, e:
...     pass # just put the exception into e
>>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
>>> print log
<string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
This might look a little cryptic at first, but it is the information that
libxml2 gives you. At least the message at the end should give you a hint
what went wrong, and you can see that the fatal error (FATAL) happened while
parsing (PARSER) line 1 of a string (<string>, or the filename if available).
Here, PARSER is the so-called error domain; see lxml.etree.ErrorDomains for
the available domains. You can get it from a log entry like this::
>>> entry = log[0]
>>> print entry.domain_name, entry.type_name, entry.filename
PARSER ERR_TAG_NOT_FINISHED <string>
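As a sketch of how the log entries might be walked in a program (Python 3 syntax, current lxml assumed), the loop below prints a few attributes of each entry. Besides the ``domain_name``, ``type_name`` and ``filename`` attributes shown above, it relies on ``line``, ``level_name`` and ``message``, which current lxml log entries provide. The exact error type and message text depend on the libxml2 version.

```python
from lxml import etree

try:
    etree.fromstring(b"<a>")
except etree.XMLSyntaxError as exc:
    # Copy the entries so they stay available after the except block.
    entries = list(exc.error_log)

for entry in entries:
    print(entry.line, entry.level_name, entry.domain_name,
          entry.type_name, entry.message)
```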
Python unicode strings
----------------------
lxml.etree has broader support for Python unicode strings than the ElementTree
library. First of all, where ElementTree would raise an exception, the
parsers in lxml.etree can handle unicode strings straight away::
>>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
>>> uxml
u'<test> \uf8d1 + \uf8d2 </test>'
>>> root = etree.XML(uxml)
This requires, however, that unicode strings do not specify a conflicting
encoding themselves and thus lie about their real encoding::
>>> try:
...     broken = etree.XML(u'<?xml encoding="ASCII"?>\n' + uxml)
... except etree.XMLSyntaxError:
...     print "This is not well-formed XML!"
This is not well-formed XML!
To serialize the result, you would normally use the ``tostring`` module
function, which serializes to plain ASCII by default or a number of other
encodings if asked for::
>>> etree.tostring(root)
'<test> &#63697; + &#63698; </test>'
>>> etree.tostring(root, 'UTF-8', xml_declaration=False)
'<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'
As an extension, lxml.etree has a new ``lxml.etree.tounicode()`` function that
you can call on XML tree objects to retrieve a Python unicode representation::
>>> etree.tounicode(root)
u'<test> \uf8d1 + \uf8d2 </test>'
>>> el = etree.Element("test")
>>> etree.tounicode(el)
u'<test/>'
>>> subel = etree.SubElement(el, "subtest")
>>> etree.tounicode(el)
u'<test><subtest/></test>'
>>> et = etree.ElementTree(el)
>>> etree.tounicode(et)
u'<test><subtest/></test>'
If you want to save the result to a file or pass it over the network, you
should use ``write()`` or ``tostring()`` with an encoding argument (typically
UTF-8) to serialize the XML. The main reason is that unicode strings returned
by ``tounicode()`` never have an XML declaration and therefore do not specify
an encoding. In contrast, the ``tostring()`` function automatically adds a
declaration as needed that reflects the encoding of the returned string. This
makes it possible for other parsers to correctly parse the XML byte stream.
Note that using ``tostring()`` with UTF-8 is also typically faster.
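The distinction can be sketched as follows for Python 3 and a current lxml release. This is an assumption about newer versions, not part of the API described above: there, ``tostring()`` with ``encoding="unicode"`` is the usual way to get a text string, while a byte encoding plus ``xml_declaration=True`` yields self-describing bytes.

```python
from io import BytesIO
from lxml import etree

root = etree.XML("<test> \uf8d1 + \uf8d2 </test>")

# Text form: a str without an XML declaration, for in-memory use only.
as_text = etree.tostring(root, encoding="unicode")

# Byte form: UTF-8 with a declaration that names the encoding.
as_bytes = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

# write() produces the same kind of output on a file-like object.
buffer = BytesIO()
etree.ElementTree(root).write(buffer, encoding="UTF-8", xml_declaration=True)

print(ascii(as_text))
print(as_bytes)
```"
assert as_bytes.startswith(b"<?xml")
assert b"UTF-8" in as_bytes
assert buffer.getvalue().startswith(b"<?xml")
</test>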
xpath method on ElementTree, Element
------------------------------------
lxml.etree extends the ElementTree and Element interfaces with an xpath
method. For ElementTree, the xpath method performs a global XPath query
against the document. When xpath is used on an element, the XPath expression
is evaluated with the element as the XPath context node.
You call the xpath() method with the XPath expression to use. Optionally, you
can provide a second argument, which should be a dictionary mapping the
namespace prefixes used in the XPath expression to namespace URIs.
The return values of xpath vary, depending on the XPath expression used:
* True or False, when the XPath expression has a boolean result
* a float, when the XPath expression has a numeric result (integer or float)
* a (unicode) string, when the XPath expression has a string result.
* a list of items, when the XPath expression has a list as result. The
  items may include element nodes and strings. When the nodeset would
  contain text nodes or attributes, the node result is also a string
  (the text node content or attribute value). When the nodeset would
  contain a comment, the result contains a string as well, inside
  ``<!--`` and ``-->`` markers.
Example::
>>> f = StringIO('<foo><bar></bar></foo>')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
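The return types listed above can be observed with a few small expressions. This sketch uses Python 3 and a current lxml release. The sample document is made up for illustration, and comment results are left out because their representation may differ between lxml versions.

```python
from lxml import etree

root = etree.fromstring(b'<foo a="1"><bar>x</bar><bar>y</bar></foo>')

is_two = root.xpath("count(bar) = 2")   # boolean result
count = root.xpath("count(bar)")        # numeric result, always a float
first = root.xpath("string(bar)")       # string result
elements = root.xpath("bar")            # list of elements
texts = root.xpath("bar/text()")        # list of strings (text nodes)
attrs = root.xpath("@a")                # list of strings (attribute values)

print(is_two, count, first, [el.tag for el in elements], texts, attrs)
```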
Example of using namespace prefixes::
>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
... xmlns:b="http://codespeak.net/ns/test2">
... <b:bar>Text</b:bar>
... </a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/t:foo/b:bar', {'t': 'http://codespeak.net/ns/test1',
... 'b': 'http://codespeak.net/ns/test2'})
>>> len(r)
1
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'
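A sketch of the same query for Python 3 and a current lxml release, where the prefix mapping is passed as the ``namespaces`` keyword argument (an assumption about newer versions; the example above passes it positionally). The prefixes ``x`` and ``y`` are arbitrary and need not match those used in the document.

```python
from lxml import etree

root = etree.XML(b'''\
<a:foo xmlns:a="http://codespeak.net/ns/test1"
       xmlns:b="http://codespeak.net/ns/test2">
  <b:bar>Text</b:bar>
</a:foo>''')

# The prefixes in this mapping are chosen freely; only the URIs must
# match those declared in the document.
ns = {"x": "http://codespeak.net/ns/test1",
      "y": "http://codespeak.net/ns/test2"}

matches = root.xpath("/x:foo/y:bar", namespaces=ns)
print([el.tag for el in matches], matches[0].text)
```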
XSLT
----
lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
given an ElementTree object to construct an XSLT transformer::
>>> f = StringIO('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:template match="/">
... <foo><xsl:value-of select="/a/b/text()" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> xslt_doc = etree.parse(f)
>>> transform = etree.XSLT(xslt_doc)
You can then run the transformation on an ElementTree document by simply
calling it, and this results in another ElementTree object::
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result = transform(doc)
The result object can be accessed like a normal ElementTree document::
>>> result.getroot().text
'Text'
but, as opposed to normal ElementTree objects, can also be turned into an (XML
or text) string by applying the str() function::
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
The result is always a plain string, encoded as requested by the
``xsl:output`` element in the stylesheet. If you want a Python unicode string
instead, you should set this encoding to ``UTF-8`` (unless the `ASCII` default
is sufficient). This allows you to call the builtin ``unicode()`` function on
the result::
>>> unicode(result)
u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
You can use other encodings at the cost of multiple recoding. Encodings that
are not supported by Python will result in an error::
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:output encoding="UCS4"/>
... <xsl:template match="/">
... <foo><xsl:value-of select="/a/b/text()" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc)
>>> unicode(result)
Traceback (most recent call last):
[...]
LookupError: unknown encoding: UCS4
It is possible to pass parameters, in the form of XPath expressions, to the
XSLT template::
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:template match="/">
... <foo><xsl:value-of select="$a" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
The parameters are passed as keyword parameters to the transform call. First
let's try passing in a simple string expression::
>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
Let's try a non-string XPath expression now::
>>> result = transform(doc, a="/a/b/text()")
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
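Because parameter values are XPath expressions, plain strings need their own quotes. The sketch below (Python 3, current lxml assumed) repeats both cases and adds ``etree.XSLT.strparam()``, a helper in newer lxml releases that is not described in this document and that wraps an arbitrary string as a parameter. Unlike the example above, the stylesheet here declares the parameter explicitly with ``xsl:param``.

```python
from lxml import etree

stylesheet = etree.XML(b'''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="a" select="'default'"/>
  <xsl:template match="/">
    <foo><xsl:value-of select="$a"/></foo>
  </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(stylesheet)
doc = etree.XML(b"<a><b>Text</b></a>")

# A quoted value is an XPath string literal.
quoted = transform(doc, a="'A'")

# An unquoted value is evaluated as an XPath expression on the input.
by_path = transform(doc, a="/a/b/text()")

# strparam() avoids manual quoting, e.g. for strings containing quotes.
wrapped = transform(doc, a=etree.XSLT.strparam("it's"))

print(quoted.getroot().text, by_path.getroot().text, wrapped.getroot().text)
```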
There's also a convenience method on the tree object for doing XSL
transformations. This is less efficient if you want to apply the same XSL
transformation to multiple documents, but is shorter to write for one-shot
operations, as you do not have to instantiate a stylesheet yourself::
>>> result = doc.xslt(xslt_tree, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
By default, XSLT supports all extension functions from libxslt and libexslt as
well as Python regular expressions through EXSLT. Note that some extensions
enable style sheets to read and write files on the local file system. See the
`document loader documentation`_ on how to deal with this.
.. _`document loader documentation`: resolvers.html
RelaxNG
-------
lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can
be given an ElementTree object to construct a Relax NG validator::
>>> f = StringIO('''\
... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
... <zeroOrMore>
... <element name="b">
... <text />
... </element>
... </zeroOrMore>
... </element>
... ''')
>>> relaxng_doc = etree.parse(f)
>>> relaxng = etree.RelaxNG(relaxng_doc)
You can then validate some ElementTree document against the schema. You'll get
back True if the document is valid against the Relax NG schema, and False if
not::
>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> relaxng.validate(doc)
1
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> relaxng.validate(doc2)
0
Calling the schema object has the same effect as calling its validate
method. This is sometimes used in conditional statements::
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not relaxng(doc2):
...     print "invalid!"
invalid!
If you prefer getting an exception when validating, you can use the
``assert_`` or ``assertValid`` methods::
>>> relaxng.assertValid(doc2)
Traceback (most recent call last):
[...]
DocumentInvalid: Document does not comply with schema
>>> relaxng.assert_(doc2)
Traceback (most recent call last):
[...]
AssertionError: Document does not comply with schema
Starting with version 0.9, lxml has a simple API to report the errors
generated by libxml2. If you want to find out why the validation failed in the
second case, you can look up the error log of the validation process and check
it for relevant messages::
>>> log = relaxng.error_log
>>> print log.filter_from_errors()
<string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there
You can see that the error (ERROR) happened during RelaxNG validation
(RELAXNGV). The message then tells you what went wrong. Note that this error
log is local to the RelaxNG object. It will only contain log entries that
appeared during the validation. The DocumentInvalid exception raised by the
``assertValid`` method above provides access to the global error log (like all
other lxml exceptions).
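Putting the pieces together, a validation routine might look like the sketch below (Python 3, current lxml assumed). The schema is the one used above. The entries are copied out of the validator's local log right after validating, since a later validation run may replace them.

```python
from lxml import etree

schema_root = etree.XML(b'''\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b"><text/></element>
  </zeroOrMore>
</element>''')
relaxng = etree.RelaxNG(schema_root)

invalid_doc = etree.XML(b"<a><c/></a>")
is_valid = relaxng.validate(invalid_doc)

# Copy the entries from the validator's own log; a later validation
# run may replace them.
entries = list(relaxng.error_log)
for entry in entries:
    print(entry.line, entry.domain_name, entry.type_name, entry.message)

# assertValid() raises an exception instead of returning a flag.
try:
    relaxng.assertValid(invalid_doc)
    raised = False
except etree.DocumentInvalid:
    raised = True

print(is_valid, raised)
```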
Similar to XSLT, there's also a less efficient but easier shortcut method to
do one-shot RelaxNG validation::
>>> doc.relaxng(relaxng_doc)
1
>>> doc2.relaxng(relaxng_doc)
0
XMLSchema
---------
lxml.etree also has XML Schema (XSD) support, using the class
lxml.etree.XMLSchema. This support is very similar to the Relax NG
support. The class can be given an ElementTree object to construct an
XMLSchema validator::
>>> f = StringIO('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="AType"/>
... <xsd:complexType name="AType">
... <xsd:sequence>
... <xsd:element name="b" type="xsd:string" />
... </xsd:sequence>
... </xsd:complexType>
... </xsd:schema>
... ''')
>>> xmlschema_doc = etree.parse(f)
>>> xmlschema = etree.XMLSchema(xmlschema_doc)
You can then validate some ElementTree document with this. As with
RelaxNG, you'll get back True if the document is valid against the XML
schema, and False if not::
>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> xmlschema.validate(doc)
1
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> xmlschema.validate(doc2)
0
Calling the schema object has the same effect as calling its validate
method. This is sometimes used in conditional statements::
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not xmlschema(doc2):
...     print "invalid!"
invalid!
If you prefer getting an exception when validating, you can use the
``assert_`` or ``assertValid`` methods::
>>> xmlschema.assertValid(doc2)
Traceback (most recent call last):
[...]
DocumentInvalid: Document does not comply with schema
>>> xmlschema.assert_(doc2)
Traceback (most recent call last):
[...]
AssertionError: Document does not comply with schema
Error reporting works as it does for the RelaxNG class::
>>> log = xmlschema.error_log
>>> errors = log.filter_from_errors()
>>> print errors[0].domain_name
SCHEMASV
>>> print errors[0].type_name
SCHEMAV_ELEMENT_CONTENT
If you were to print this log entry, you would get something like the following::
<string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).
Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut
method to do XML Schema validation::
>>> doc.xmlschema(xmlschema_doc)
1
>>> doc2.xmlschema(xmlschema_doc)
0
xinclude
--------
Simple XInclude support exists. You can make xinclude statements in a
document be processed by calling the xinclude() method on a tree::
>>> data = StringIO('''\
... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
... <foo/>
... <xi:include href="doc/test.xml" />
... </doc>''')
>>> tree = etree.parse(data)
>>> tree.xinclude()
>>> etree.tostring(tree.getroot())
'<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'
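The example above expects a file ``doc/test.xml`` to exist. A self-contained sketch (Python 3, current lxml assumed) can create both files in a temporary directory first. The file names ``main.xml`` and ``part.xml`` are made up for illustration.

```python
import os
import tempfile
from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    # The document to be included.
    with open(os.path.join(tmp, "part.xml"), "wb") as f:
        f.write(b"<a/>")

    # The main document refers to it with a relative href.
    main_path = os.path.join(tmp, "main.xml")
    with open(main_path, "wb") as f:
        f.write(b'<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
                b'<foo/><xi:include href="part.xml"/></doc>')

    tree = etree.parse(main_path)
    tree.xinclude()  # replaces xi:include elements in place
    tags = [child.tag for child in tree.getroot()]

print(tags)
```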
write_c14n on ElementTree
-------------------------
The lxml.etree.ElementTree class has a method write_c14n, which takes
one argument: a file object. This file object will receive a UTF-8
representation of the canonicalized form of the XML, following the W3C
C14N recommendation. For example::
>>> f = StringIO('<a><b/></a>')
>>> tree = etree.parse(f)
>>> f2 = StringIO()
>>> tree.write_c14n(f2)
>>> f2.getvalue()
'<a><b></b></a>'
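Canonicalization also normalizes details such as attribute order. The sketch below (Python 3, current lxml assumed, input invented for illustration) writes to a ``BytesIO`` object, since the output is a byte stream.

```python
from io import BytesIO
from lxml import etree

# Attributes are deliberately given out of alphabetical order.
tree = etree.ElementTree(etree.XML(b'<a z="1" b="2"><b/></a>'))

buffer = BytesIO()
tree.write_c14n(buffer)
canonical = buffer.getvalue()

# C14N sorts the attributes and expands empty elements.
print(canonical)
```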