=====================
APIs specific to lxml
=====================

lxml tries to follow established APIs wherever possible. Sometimes, however,
the need to expose a feature in an easy way led to the invention of a new API.

.. contents::
..
   1 lxml.etree
   2 Other Element APIs
   3 Trees and Documents
   4 Iteration
   5 Parsers
   6 iterparse and iterwalk
   7 Error handling on exceptions
   8 Python unicode strings
   9 XPath
   10 XSLT
   11 RelaxNG
   12 XMLSchema
   13 xinclude
   14 write_c14n on ElementTree


lxml.etree
----------

lxml.etree tries to follow the `ElementTree API`_ wherever it can. There are,
however, some incompatibilities (see `compatibility`_). The extensions are
documented here.

.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
.. _`compatibility`: compatibility.html

If you need to know which version of lxml is installed, you can access the
``lxml.etree.LXML_VERSION`` attribute to retrieve a version tuple. Note,
however, that it did not exist before version 1.0, so you will get an
AttributeError in older versions. The versions of libxml2 and libxslt are
available through the attributes ``LIBXML_VERSION`` and ``LIBXSLT_VERSION``.
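
A minimal sketch of how these version tuples might be read, written for
Python 3. The helper name ``describe_versions`` is made up for this
illustration, and the ``getattr`` fallback only covers the pre-1.0 case
mentioned above.

```python
from lxml import etree


def describe_versions():
    """Return the lxml, libxml2 and libxslt versions as dotted strings."""
    versions = {
        # LXML_VERSION is missing before lxml 1.0, so fall back to None.
        "lxml": getattr(etree, "LXML_VERSION", None),
        "libxml2": etree.LIBXML_VERSION,
        "libxslt": etree.LIBXSLT_VERSION,
    }
    return {
        name: ".".join(str(part) for part in value) if value else "unknown"
        for name, value in versions.items()
    }


print(describe_versions())
```

The exact values printed depend on the installed libraries.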

The following examples usually assume this to be executed first::

  >>> from lxml import etree
  >>> from StringIO import StringIO


Other Element APIs
------------------

While lxml.etree itself uses the ElementTree API, it is possible to replace
the Element implementation by `custom element subclasses`_. This has been
used to implement well-known XML APIs on top of lxml. The ``lxml.elements``
package contains examples. Currently, there is a data-binding implementation
called `objectify`_, which is similar to the `Amara bindery`_ tool.

Additionally, the `lxml.elements.classlookup`_ module provides a number of
different schemes to customize the mapping between libxml2 nodes and the
Element classes used by lxml.etree.

.. _`custom element subclasses`: namespace_extensions.html
.. _`objectify`: objectify.html
.. _`lxml.elements.classlookup`: elements.html#lxml.elements.classlookup
.. _`Amara bindery`: http://uche.ogbuji.net/tech/4suite/amara/


Trees and Documents
-------------------

Compared to the original ElementTree API, lxml.etree has an extended tree
model. It knows about parents and siblings of elements::

  >>> root = etree.Element("root")
  >>> a = etree.SubElement(root, "a")
  >>> b = etree.SubElement(root, "b")
  >>> c = etree.SubElement(root, "c")
  >>> d = etree.SubElement(root, "d")
  >>> e = etree.SubElement(d, "e")
  >>> b.getparent() == root
  True
  >>> print b.getnext().tag
  c
  >>> print c.getprevious().tag
  b

Elements always live within a document context in lxml. This implies that
there is also a notion of an absolute document root. You can retrieve an
ElementTree for the root node of a document from any of its elements::

  >>> tree = d.getroottree()
  >>> print tree.getroot().tag
  root

Note that this is different from wrapping an Element in an ElementTree. You
can use ElementTrees to create XML trees with an explicit root node::

  >>> tree = etree.ElementTree(d)
  >>> print tree.getroot().tag
  d
  >>> print etree.tostring(tree)
  <d><e/></d>

All operations that you run on such an ElementTree (like XPath, XSLT, etc.)
will understand the explicitly chosen root as root node of a document. They
will not see any elements outside the ElementTree. However, ElementTrees do
not modify their Elements::

  >>> element = tree.getroot()
  >>> print element.tag
  d
  >>> print element.getparent().tag
  root
  >>> print element.getroottree().getroot().tag
  root

The rule is that all operations that are applied to Elements use either the
Element itself as reference point, or the absolute root of the document that
contains this Element (e.g. for absolute XPath expressions). All operations
on an ElementTree use its explicit root node as reference.
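
The rule above can be sketched with absolute XPath expressions. This is an
illustration in Python 3 syntax; the variable names are chosen for the
example only.

```python
from lxml import etree

root = etree.Element("root")
d = etree.SubElement(root, "d")
e = etree.SubElement(d, "e")

# An ElementTree wrapped around <d> treats <d> as the document root.
wrapped = etree.ElementTree(d)
in_wrapped = wrapped.xpath("/d/e")        # absolute path starts at <d>
outside_wrapped = wrapped.xpath("/root")  # <root> is not visible from here

# The Element itself resolves absolute paths against the real document.
from_element = d.xpath("/root/d/e")
```

Both ``in_wrapped`` and ``from_element`` contain the same ``e`` element; only
the reference point of the absolute path differs.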


Iteration
---------

The ElementTree API makes Elements iterable to support iteration over their
children. Using the tree defined above, we get::

  >>> [ el.tag for el in root ]
  ['a', 'b', 'c', 'd']

Tree traversal is commonly based on the ``element.getiterator()`` method::

  >>> [ el.tag for el in root.getiterator() ]
  ['root', 'a', 'b', 'c', 'd', 'e']

lxml.etree also supports this, but additionally features an extended API for
iteration over the children, following/preceding siblings, ancestors and
descendants of an element, as defined by the respective XPath axis::

  >>> [ el.tag for el in root.iterchildren() ]
  ['a', 'b', 'c', 'd']
  >>> [ el.tag for el in root.iterchildren(reversed=True) ]
  ['d', 'c', 'b', 'a']
  >>> [ el.tag for el in b.itersiblings() ]
  ['c', 'd']
  >>> [ el.tag for el in c.itersiblings(preceding=True) ]
  ['b', 'a']
  >>> [ el.tag for el in e.iterancestors() ]
  ['d', 'root']
  >>> [ el.tag for el in root.iterdescendants() ]
  ['a', 'b', 'c', 'd', 'e']

Note how ``element.iterdescendants()`` does not include the element itself, as
opposed to ``element.getiterator()``. The latter effectively implements the
'descendant-or-self' axis in XPath.

All of these iterators support an additional ``tag`` keyword argument that
filters the generated elements by tag name::

  >>> [ el.tag for el in root.iterchildren(tag='a') ]
  ['a']
  >>> [ el.tag for el in d.iterchildren(tag='a') ]
  []
  >>> [ el.tag for el in root.iterdescendants(tag='d') ]
  ['d']
  >>> [ el.tag for el in root.getiterator(tag='d') ]
  ['d']

See also the section on the utility functions ``iterparse()`` and
``iterwalk()`` below.


Parsers
-------

One of the differences from ElementTree is the parser. lxml.etree supports
both XML and (broken) HTML. Both parsers are based on libxml2 and therefore
only support options that are backed by the library. Parsers take a number of
keyword arguments.
The following is an example for namespace cleanup during parsing, first with
the default parser, then with a parametrized one::

  >>> xml = '<a xmlns="test"><b xmlns="test"/></a>'

  >>> et = etree.parse(StringIO(xml))
  >>> print etree.tostring(et.getroot())
  <a xmlns="test"><b xmlns="test"/></a>

  >>> parser = etree.XMLParser(ns_clean=True)
  >>> et = etree.parse(StringIO(xml), parser)
  >>> print etree.tostring(et.getroot())
  <a xmlns="test"><b/></a>

HTML parsing is similarly simple. The parsers have a ``recover`` keyword
argument that the HTMLParser sets by default. It lets libxml2 try its best to
return something usable without raising an exception. Note that this
functionality depends entirely on libxml2. You should use libxml2 version
2.6.21 or newer to take advantage of this feature::

  >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

  >>> parser = etree.HTMLParser()
  >>> et = etree.parse(StringIO(broken_html), parser)

  >>> print etree.tostring(et.getroot())
  <html><head><title>test</title></head><body><h1>page title</h1></body></html>

lxml.etree has an ``HTML()`` function, similar to the ``XML()`` shortcut known from
ElementTree::

  >>> html = etree.HTML(broken_html)
  >>> print etree.tostring(html)
  <html><head><title>test</title></head><body><h1>page title</h1></body></html>

The use of the libxml2 parsers makes some additional information available at
the API level. Currently, ElementTree objects can access the DOCTYPE
information provided by a parsed document, as well as the XML version and the
original encoding::

  >>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
  >>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  >>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
  >>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
  >>> xhtml = xml_header + doctype_string + '<html><body></body></html>'

  >>> tree = etree.parse(StringIO(xhtml))
  >>> docinfo = tree.docinfo
  >>> print docinfo.public_id
  -//W3C//DTD XHTML 1.0 Transitional//EN
  >>> print docinfo.system_url
  http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
  >>> docinfo.doctype == doctype_string
  True

  >>> print docinfo.xml_version
  1.0
  >>> print docinfo.encoding
  ascii
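
The ``recover`` keyword mentioned above for HTML can also be passed to the XML
parser. The following is a small sketch in Python 3 syntax. How much of a
broken document can be recovered depends on libxml2, so only the root tag is
inspected here.

```python
from lxml import etree

broken_xml = b"<a><b>text</a>"  # <b> is never closed

# The default XML parser is strict and rejects the input.
strict_failed = False
try:
    etree.fromstring(broken_xml)
except etree.XMLSyntaxError:
    strict_failed = True

# A recovering parser returns whatever libxml2 could salvage.
recovering_parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, recovering_parser)
print(strict_failed, root.tag)
```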


iterparse and iterwalk
----------------------

As known from ElementTree, the ``iterparse()`` utility function returns an
iterator that generates parser events for an XML file (or file-like object),
while building the tree. The values are tuples ``(event-type, object)``. The
event types are 'start', 'end', 'start-ns' and 'end-ns'.

The 'start' and 'end' events represent opening and closing elements and are
accompanied by the respective element. By default, only 'end' events are
generated::

  >>> xml = '''\
  ... <root>
  ... <element key='value'>text</element>
  ... <element>text</element>tail
  ... <empty-element xmlns="testns" />
  ... </root>
  ... '''

  >>> context = etree.iterparse(StringIO(xml))
  >>> for action, elem in context:
  ...     print action, elem.tag
  end element
  end element
  end {testns}empty-element
  end root

The resulting tree is available through the ``root`` property of the iterator::

  >>> context.root.tag
  'root'

The other types can be activated with the ``events`` keyword argument::

  >>> events = ("start", "end")
  >>> context = etree.iterparse(StringIO(xml), events=events)
  >>> for action, elem in context:
  ...     print action, elem.tag
  start root
  start element
  end element
  start element
  end element
  start {testns}empty-element
  end {testns}empty-element
  end root

You can modify the element and its descendants when handling the 'end' event.
To save memory, for example, you can remove subtrees that are no longer
needed::

  >>> context = etree.iterparse(StringIO(xml))
  >>> for action, elem in context:
  ...     print len(elem),
  ...     elem.clear()
  0 0 0 3
  >>> context.root.getchildren()
  []

**WARNING**: During the 'start' event, the descendants and following siblings
are not yet available and should not be accessed. During the 'end' event, the
element and its descendants can be freely modified, but its following siblings
should not be accessed. During either of the two events, you **must not**
modify or move the ancestors (parents) of the current element. You should
also avoid moving or discarding the element itself. The golden rule is: do
not touch anything that will have to be touched again by the parser later on.

If you have elements with a long list of children in your XML file and want to
save more memory during parsing, you can clean up the preceding siblings of
the current element::

  >>> for event, element in etree.iterparse(StringIO(xml)):
  ...     # ... do something with the element
  ...     element.clear()                        # clean up children
  ...     if element.getprevious() is not None:  # clean up preceding siblings
  ...         del element.getparent()[0]

You can use ``while`` instead of ``if`` if you skipped siblings using the
``tag`` keyword argument. The more selective your tag is, however, the more
thought you will have to put into finding the right way to clean up the
elements that were skipped. Therefore, it is sometimes easier to traverse all
elements and do the tag selection by hand in the event handler code.
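
The ``while`` variant described above might look like the following sketch,
written for Python 3. The generator name ``iter_texts`` and the sample data
are assumptions made for this illustration. It also assumes that the selected
tag is never the root element, so ``getparent()`` always returns an element.

```python
from io import BytesIO
from lxml import etree


def iter_texts(source, tag):
    """Yield the text of each ``tag`` element while keeping memory use low."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.text
        elem.clear()  # drop the element's own content
        # Siblings skipped by the tag filter were never cleaned up,
        # so loop until nothing precedes the current element.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


data = BytesIO(b"<root><skip/><item>1</item><skip/><item>2</item></root>")
texts = list(iter_texts(data, "item"))
print(texts)
```

The text is read before ``clear()`` is called, because clearing an element
also removes its text.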

The 'start-ns' and 'end-ns' events notify about namespace declarations and
generate tuples ``(prefix, URI)``::

  >>> events = ("start-ns", "end-ns")
  >>> context = etree.iterparse(StringIO(xml), events=events)
  >>> for action, obj in context:
  ...     print action, obj
  start-ns ('', 'testns')
  end-ns None

It is common practice to use a list as namespace stack and pop the last entry
on the 'end-ns' event.
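
A sketch of such a namespace stack, in Python 3 syntax. The sample document
and the ``in_scope`` dictionary are assumptions made for this example. It
records which namespace URIs are in scope when each element starts.

```python
from io import BytesIO
from lxml import etree

xml = b'<root xmlns="testns"><x:child xmlns:x="other"/><plain/></root>'

ns_stack = []  # (prefix, URI) pairs currently in scope
in_scope = {}  # tag -> namespace URIs visible when the element starts

events = ("start-ns", "end-ns", "start")
for event, obj in etree.iterparse(BytesIO(xml), events=events):
    if event == "start-ns":
        ns_stack.append(obj)  # obj is a (prefix, URI) tuple
    elif event == "end-ns":
        ns_stack.pop()        # the most recent declaration ends
    else:
        in_scope[obj.tag] = [uri for _prefix, uri in ns_stack]

print(in_scope)
```

The stack is empty again once parsing has finished, because every 'start-ns'
event is matched by an 'end-ns' event.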

lxml.etree supports two extensions compared to ElementTree. It accepts a
``tag`` keyword argument just like ``element.getiterator(tag)``. This
restricts events to a specific tag or namespace::

  >>> context = etree.iterparse(StringIO(xml), tag="element")
  >>> for action, elem in context:
  ...     print action, elem.tag
  end element
  end element

  >>> events = ("start", "end")
  >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
  >>> for action, elem in context:
  ...     print action, elem.tag
  start {testns}empty-element
  end {testns}empty-element

The second extension is the ``iterwalk()`` function. It behaves exactly like
``iterparse()``, but works on Elements and ElementTrees::

  >>> root = context.root
  >>> context = etree.iterwalk(root, events=events, tag="element")
  >>> for action, elem in context:
  ...     print action, elem.tag
  start element
  end element
  start element
  end element


Error handling on exceptions
----------------------------

Libxml2 provides error messages for failures, be it during parsing, XPath
evaluation or schema validation. Whenever an exception is raised, you can
retrieve the errors that occurred and might have led to the problem::

  >>> etree.clearErrorLog()
  >>> broken_xml = '<a>'
  >>> try:
  ...     etree.parse(StringIO(broken_xml))
  ... except etree.XMLSyntaxError, e:
  ...     pass # just put the exception into e
  >>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
  >>> print log
  <string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1

This might look a little cryptic at first, but it is the information that
libxml2 gives you. At least the message at the end should give you a hint
what went wrong and you can see that the fatal error (FATAL) happened during
parsing (PARSER) line 1 of a string (<string>, or filename if available).
Here, PARSER is the so-called error domain (see ``lxml.etree.ErrorDomains``).
You can read it from a log entry like this::

  >>> entry = log[0]
  >>> print entry.domain_name, entry.type_name, entry.filename
  PARSER ERR_TAG_NOT_FINISHED <string>

There is also a convenience attribute ``last_error`` that returns the last
error or fatal error that occurred::

  >>> entry = e.error_log.last_error
  >>> print entry.domain_name, entry.type_name, entry.filename
  PARSER ERR_TAG_NOT_FINISHED <string>
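
The log attached to an exception can also be iterated entry by entry. The
following sketch uses Python 3 syntax; the helper name ``parse_errors`` is
hypothetical. The messages themselves come from libxml2 and may differ
between versions.

```python
from lxml import etree


def parse_errors(data):
    """Return (line, level, domain, type, message) tuples for a failed parse."""
    try:
        etree.fromstring(data)
    except etree.XMLSyntaxError as exc:
        return [
            (entry.line, entry.level_name, entry.domain_name,
             entry.type_name, entry.message)
            for entry in exc.error_log
        ]
    return []  # the input parsed without errors


errors = parse_errors(b"<a>")
for error in errors:
    print(error)
```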

Alternatively, lxml.etree supports logging libxml2 messages to the Python
stdlib logging module. This is done through the ``etree.PyErrorLog`` class.
It disables the error reporting from exceptions and forwards log messages to a
Python logger. To use it, see the descriptions of the function
``etree.useGlobalPythonLog`` and the class ``etree.PyErrorLog`` for help.
Note that this does not affect the local error logs of XSLT, XMLSchema,
etc. which are described in their respective sections below.


Python unicode strings
----------------------

lxml.etree has broader support for Python unicode strings than the ElementTree
library. First of all, where ElementTree would raise an exception, the
parsers in lxml.etree can handle unicode strings straight away. This is most
helpful for XML snippets embedded in source code using the ``XML()``
function::

  >>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
  >>> uxml
  u'<test> \uf8d1 + \uf8d2 </test>'
  >>> root = etree.XML(uxml)

This requires, however, that unicode strings do not specify a conflicting
encoding themselves and thus lie about their real encoding::

  >>> etree.XML(u'<?xml version="1.0" encoding="ASCII"?>\n' + uxml)
  Traceback (most recent call last):
    ...
  ValueError: Unicode strings with encoding declaration are not supported.

Similarly, you will get errors when you try the same with HTML data in a
unicode string that specifies a charset in a meta tag of the header. You
should generally avoid converting XML/HTML data to unicode before passing it
into the parsers. It is both slower and error-prone.

To serialize the result, you would normally use the ``tostring`` module
function, which serializes to plain ASCII by default or a number of other
encodings if asked for::

  >>> etree.tostring(root)
  '<test> &#63697; + &#63698; </test>'

  >>> etree.tostring(root, 'UTF-8', xml_declaration=False)
  '<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'

As an extension, lxml.etree has a new ``tounicode()`` function that you can
call on XML tree objects to retrieve a Python unicode representation::

  >>> etree.tounicode(root)
  u'<test> \uf8d1 + \uf8d2 </test>'

  >>> el = etree.Element("test")
  >>> etree.tounicode(el)
  u'<test/>'

  >>> subel = etree.SubElement(el, "subtest")
  >>> etree.tounicode(el)
  u'<test><subtest/></test>'

  >>> et = etree.ElementTree(el)
  >>> etree.tounicode(et)
  u'<test><subtest/></test>'

The result of ``tounicode()`` can be treated like any other Python unicode
string and then passed back into the parsers. However, if you want to save
the result to a file or pass it over the network, you should use ``write()``
or ``tostring()`` with an encoding argument (typically UTF-8) to serialize the
XML. The main reason is that unicode strings returned by ``tounicode()``
never have an XML declaration and therefore do not specify their encoding.
These strings are most likely not parsable by other XML libraries.

In contrast, the ``tostring()`` function automatically adds a declaration as
needed that reflects the encoding of the returned string. This makes it
possible for other parsers to correctly parse the XML byte stream. Note that
using ``tostring()`` with UTF-8 is also considerably faster in most cases.
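
The difference between the two serializations can be sketched as follows.
This is Python 3 syntax, where ``tostring()`` returns ``bytes`` and
``tounicode()`` returns ``str``. The ``xml_declaration=True`` argument is
passed explicitly so that the declaration is always written.

```python
from lxml import etree

root = etree.fromstring("<test> \uf8d1 + \uf8d2 </test>")

# Bytes for files and sockets: request an encoding and a declaration.
as_bytes = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

# Text for in-process use: no declaration and no encoding information.
as_text = etree.tounicode(root)

# Another parser can decode the byte string thanks to its declaration.
reparsed = etree.fromstring(as_bytes)
print(as_bytes.splitlines()[0])
```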


XPath
-----

lxml.etree supports the simple path syntax of the ``findall()`` etc. methods
on ElementTree and Element, as known from the original ElementTree library.
As an extension, these classes also provide an ``xpath()`` method that
supports expressions in the complete XPath syntax.

There are also specialized XPath evaluator classes that are more efficient for
frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
comparison`_ to learn when to use which. Their semantics when used on
Elements and ElementTrees are the same as for the ``xpath()`` method described
here.

.. _`performance comparison`: performance.html#xpath
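
As a rough sketch of the two evaluator classes (Python 3 syntax, sample
documents invented for the example): ``XPath`` compiles an expression once
and can then be called on many elements, while ``XPathEvaluator`` is bound to
one element or tree and accepts many expressions.

```python
from lxml import etree

docs = [
    etree.fromstring(b"<foo><bar/></foo>"),
    etree.fromstring(b"<foo><bar/><bar/></foo>"),
]

# One compiled expression, applied to several documents.
find_bars = etree.XPath("//bar")
counts = [len(find_bars(doc)) for doc in docs]

# One evaluator bound to an element, used for several expressions.
evaluator = etree.XPathEvaluator(docs[1])
first_bar = evaluator("bar[1]")
bar_count = evaluator("count(bar)")
print(counts, bar_count)
```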

For ElementTree, the ``xpath()`` method performs a global XPath query against the
document (if absolute) or against the root node (if relative)::

  >>> f = StringIO('<foo><bar></bar></foo>')
  >>> tree = etree.parse(f)

  >>> r = tree.xpath('/foo/bar')
  >>> len(r)
  1
  >>> r[0].tag
  'bar'

  >>> r = tree.xpath('bar')
  >>> r[0].tag
  'bar'

When ``xpath()`` is used on an element, the XPath expression is evaluated
against the element (if relative) or against the root tree (if absolute)::

  >>> root = tree.getroot()
  >>> r = root.xpath('bar')
  >>> r[0].tag
  'bar'

  >>> bar = root[0]
  >>> r = bar.xpath('/foo/bar')
  >>> r[0].tag
  'bar'

  >>> tree = bar.getroottree()
  >>> r = tree.xpath('/foo/bar')
  >>> r[0].tag
  'bar'

Optionally, you can provide a ``namespaces`` keyword argument, which should be
a dictionary mapping the namespace prefixes used in the XPath expression to
namespace URIs::

  >>> f = StringIO('''\
  ... <a:foo xmlns:a="http://codespeak.net/ns/test1"
  ... xmlns:b="http://codespeak.net/ns/test2">
  ... <b:bar>Text</b:bar>
  ... </a:foo>
  ... ''')
  >>> doc = etree.parse(f)
  >>> r = doc.xpath('/t:foo/b:bar',
  ...               namespaces={'t': 'http://codespeak.net/ns/test1',
  ...                           'b': 'http://codespeak.net/ns/test2'})
  >>> len(r)
  1
  >>> r[0].tag
  '{http://codespeak.net/ns/test2}bar'
  >>> r[0].text
  'Text'

There is also an optional ``extensions`` argument which is used to define
`extension functions`_ in Python that are local to this evaluation.

.. _`extension functions`: extensions.html

The return values of XPath evaluations vary, depending on the XPath expression
used:

* True or False, when the XPath expression has a boolean result

* a float, when the XPath expression has a numeric result (integer or float)

* a (unicode) string, when the XPath expression has a string result.

* a list of items, when the XPath expression has a list as result. The items
  may include elements, strings and tuples. Text nodes and attributes in the
  result are returned as strings (the text node content or attribute value).
  Comments are also returned as strings, enclosed by the usual ``<!--`` and
  ``-->`` markers. Namespace declarations are returned as tuples of strings:
  ``(prefix, URI)``.
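
The result types listed above can be illustrated with a small sketch in
Python 3 syntax. The sample document is made up for the example.

```python
from lxml import etree

root = etree.fromstring(b'<foo a="1"><bar>Text</bar><bar/></foo>')

as_bool = root.xpath("boolean(//bar)")   # boolean result
as_number = root.xpath("count(//bar)")   # numeric result, always a float
as_string = root.xpath("string(//bar)")  # string result
elements = root.xpath("//bar")           # list of elements
texts = root.xpath("//bar/text()")       # text nodes come back as strings
attributes = root.xpath("/foo/@a")       # attribute values as strings

print(as_bool, as_number, as_string, len(elements), texts, attributes)
```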

A related convenience method of ElementTree objects is ``getpath(element)``,
which returns a structural, absolute XPath expression to find that element::

  >>> a = etree.Element("a")
  >>> b = etree.SubElement(a, "b")
  >>> c = etree.SubElement(a, "c")
  >>> d1 = etree.SubElement(c, "d")
  >>> d2 = etree.SubElement(c, "d")

  >>> tree = etree.ElementTree(c)
  >>> print tree.getpath(d2)
  /c/d[2]
  >>> tree.xpath(tree.getpath(d2)) == [d2]
  True


XSLT
----

lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
given an ElementTree object to construct an XSLT transformer::

  >>> f = StringIO('''\
  ... <xsl:stylesheet version="1.0"
  ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ... <xsl:template match="/">
  ... <foo><xsl:value-of select="/a/b/text()" /></foo>
  ... </xsl:template>
  ... </xsl:stylesheet>''')
  >>> xslt_doc = etree.parse(f)
  >>> transform = etree.XSLT(xslt_doc)

You can then run the transformation on an ElementTree document by simply
calling it, and this results in another ElementTree object::

  >>> f = StringIO('<a><b>Text</b></a>')
  >>> doc = etree.parse(f)
  >>> result = transform(doc)

The result object can be accessed like a normal ElementTree document::

  >>> result.getroot().text
  'Text'

but, as opposed to normal ElementTree objects, can also be turned into an (XML
or text) string by applying the str() function::

  >>> str(result)
  '<?xml version="1.0"?>\n<foo>Text</foo>\n'

The result is always a plain string, encoded as requested by the
``xsl:output`` element in the stylesheet. If you want a Python unicode string
instead, you should set this encoding to ``UTF-8`` (unless the ``ASCII`` default
is sufficient). This allows you to call the builtin ``unicode()`` function on
the result::

  >>> unicode(result)
  u'<?xml version="1.0"?>\n<foo>Text</foo>\n'

You can use other encodings at the cost of multiple recoding steps. Encodings that
are not supported by Python will result in an error::

  >>> xslt_tree = etree.XML('''\
  ... <xsl:stylesheet version="1.0"
  ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ... <xsl:output encoding="UCS4"/>
  ... <xsl:template match="/">
  ... <foo><xsl:value-of select="/a/b/text()" /></foo>
  ... </xsl:template>
  ... </xsl:stylesheet>''')
  >>> transform = etree.XSLT(xslt_tree)

  >>> result = transform(doc)
  >>> unicode(result)
  Traceback (most recent call last):
    [...]
  LookupError: unknown encoding: UCS4

It is possible to pass parameters, in the form of XPath expressions, to the
XSLT template::

  >>> xslt_tree = etree.XML('''\
  ... <xsl:stylesheet version="1.0"
  ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ... <xsl:template match="/">
  ... <foo><xsl:value-of select="$a" /></foo>
  ... </xsl:template>
  ... </xsl:stylesheet>''')
  >>> transform = etree.XSLT(xslt_tree)
  >>> f = StringIO('<a><b>Text</b></a>')
  >>> doc = etree.parse(f)

The parameters are passed as keyword parameters to the transform call. First
let's try passing in a simple string expression::

  >>> result = transform(doc, a="'A'")
  >>> str(result)
  '<?xml version="1.0"?>\n<foo>A</foo>\n'

Let's try a non-string XPath expression now::

  >>> result = transform(doc, a="/a/b/text()")
  >>> str(result)
  '<?xml version="1.0"?>\n<foo>Text</foo>\n'
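
Because parameters are XPath expressions, plain string values have to be
quoted. Newer lxml releases provide ``etree.XSLT.strparam()`` for this; it is
not part of the version described here, so treat the following Python 3
sketch as an assumption about a later API. The stylesheet declares the
parameter explicitly with ``xsl:param``.

```python
from lxml import etree

transform = etree.XSLT(etree.XML(b'''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="a" />
  <xsl:template match="/">
    <foo><xsl:value-of select="$a" /></foo>
  </xsl:template>
</xsl:stylesheet>'''))
doc = etree.XML(b"<a><b>Text</b></a>")

# strparam() wraps the value so that it is passed as a string literal
# and is not evaluated as an XPath expression.
plain_value = "It's plain text"
result = transform(doc, a=etree.XSLT.strparam(plain_value))
print(result.getroot().text)
```

Values that contain both single and double quotes cannot be passed this way,
which is a limitation of libxslt.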

There's also a convenience method on the tree object for doing XSL
transformations. This is less efficient if you want to apply the same XSL
transformation to multiple documents, but is shorter to write for one-shot
operations, as you do not have to instantiate a stylesheet yourself::

  >>> result = doc.xslt(xslt_tree, a="'A'")
  >>> str(result)
  '<?xml version="1.0"?>\n<foo>A</foo>\n'

By default, XSLT supports all extension functions from libxslt and libexslt as
well as Python regular expressions through EXSLT. Note that some extensions
enable style sheets to read and write files on the local file system. See the
`document loader documentation`_ on how to deal with this.

.. _`document loader documentation`: resolvers.html


RelaxNG
-------

lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can
be given an ElementTree object to construct a Relax NG validator::

  >>> f = StringIO('''\
  ... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  ... <zeroOrMore>
  ... <element name="b">
  ... <text />
  ... </element>
  ... </zeroOrMore>
  ... </element>
  ... ''')
  >>> relaxng_doc = etree.parse(f)
  >>> relaxng = etree.RelaxNG(relaxng_doc)

You can then validate some ElementTree document against the schema. You'll get
back True if the document is valid against the Relax NG schema, and False if
not::

  >>> valid = StringIO('<a><b></b></a>')
  >>> doc = etree.parse(valid)
  >>> relaxng.validate(doc)
  1

  >>> invalid = StringIO('<a><c></c></a>')
  >>> doc2 = etree.parse(invalid)
  >>> relaxng.validate(doc2)
  0

Calling the schema object has the same effect as calling its validate
method. This is sometimes used in conditional statements::

  >>> invalid = StringIO('<a><c></c></a>')
  >>> doc2 = etree.parse(invalid)
  >>> if not relaxng(doc2):
  ...     print "invalid!"
  invalid!

If you prefer getting an exception when validating, you can use the
``assert_`` or ``assertValid`` methods::

  >>> relaxng.assertValid(doc2)
  Traceback (most recent call last):
    [...]
  DocumentInvalid: Document does not comply with schema

  >>> relaxng.assert_(doc2)
  Traceback (most recent call last):
    [...]
  AssertionError: Document does not comply with schema

Since version 0.9, lxml has a simple API to report the errors
generated by libxml2. If you want to find out why the validation failed in the
second case, you can look up the error log of the validation process and check
it for relevant messages::

  >>> log = relaxng.error_log
  >>> print log.last_error
  <string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there

You can see that the error (ERROR) happened during RelaxNG validation
(RELAXNGV). The message then tells you what went wrong. Note that this error
log is local to the RelaxNG object. It only contains log entries that
appeared during the validation. The DocumentInvalid exception raised by the
``assertValid`` method above provides access to the global error log (like all
other lxml exceptions).
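
A sketch of how the local error log might be turned into a list of messages,
in Python 3 syntax. The helper name ``validation_messages`` is hypothetical,
and the message texts depend on the libxml2 version.

```python
from lxml import etree

schema = etree.RelaxNG(etree.XML(b'''\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b"><text/></element>
  </zeroOrMore>
</element>'''))


def validation_messages(validator, doc):
    """Return the validator's log messages, or an empty list if doc is valid."""
    if validator.validate(doc):
        return []
    # The log is local to the validator and filled by the last validation.
    return [entry.message for entry in validator.error_log]


ok = validation_messages(schema, etree.XML(b"<a><b/></a>"))
problems = validation_messages(schema, etree.XML(b"<a><c/></a>"))
print(ok, problems)
```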

Similar to XSLT, there's also a less efficient but easier shortcut method to
do one-shot RelaxNG validation::

  >>> doc.relaxng(relaxng_doc)
  1
  >>> doc2.relaxng(relaxng_doc)
  0


XMLSchema
---------

lxml.etree also has XML Schema (XSD) support, using the class
lxml.etree.XMLSchema. This support is very similar to the Relax NG
support. The class can be given an ElementTree object to construct an
XMLSchema validator::

  >>> f = StringIO('''\
  ... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  ... <xsd:element name="a" type="AType"/>
  ... <xsd:complexType name="AType">
  ... <xsd:sequence>
  ... <xsd:element name="b" type="xsd:string" />
  ... </xsd:sequence>
  ... </xsd:complexType>
  ... </xsd:schema>
  ... ''')
  >>> xmlschema_doc = etree.parse(f)
  >>> xmlschema = etree.XMLSchema(xmlschema_doc)

You can then validate some ElementTree document with this. As with
RelaxNG, you'll get back True if the document is valid against the XML
schema, and False if not::

  >>> valid = StringIO('<a><b></b></a>')
  >>> doc = etree.parse(valid)
  >>> xmlschema.validate(doc)
  1

  >>> invalid = StringIO('<a><c></c></a>')
  >>> doc2 = etree.parse(invalid)
  >>> xmlschema.validate(doc2)
  0

Calling the schema object has the same effect as calling its validate
method. This is sometimes used in conditional statements::

  >>> invalid = StringIO('<a><c></c></a>')
  >>> doc2 = etree.parse(invalid)
  >>> if not xmlschema(doc2):
  ...     print "invalid!"
  invalid!

If you prefer getting an exception when validating, you can use the
``assert_`` or ``assertValid`` methods::

  >>> xmlschema.assertValid(doc2)
  Traceback (most recent call last):
    [...]
  DocumentInvalid: Document does not comply with schema

  >>> xmlschema.assert_(doc2)
  Traceback (most recent call last):
    [...]
  AssertionError: Document does not comply with schema
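
In application code, the exception raised by ``assertValid`` would typically
be caught. A minimal sketch in Python 3 syntax follows; the function name
``check`` is an assumption made for this example.

```python
from lxml import etree

schema = etree.XMLSchema(etree.XML(b'''\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence>
      <xsd:element name="b" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>'''))


def check(doc):
    """Return None if doc is valid, else the text of the DocumentInvalid error."""
    try:
        schema.assertValid(doc)
    except etree.DocumentInvalid as exc:
        return str(exc)
    return None


print(check(etree.XML(b"<a><b/></a>")))
print(check(etree.XML(b"<a><c/></a>")))
```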

Error reporting works the same way as for the RelaxNG class::

  >>> log = xmlschema.error_log
  >>> error = log.last_error
  >>> print error.domain_name
  SCHEMASV
  >>> print error.type_name
  SCHEMAV_ELEMENT_CONTENT

If you were to print this log entry, you would get something like the
following. Note that the error message depends on the libxml2 version in
use::

  <string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).

Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut
method to do XML Schema validation::

  >>> doc.xmlschema(xmlschema_doc)
  1
  >>> doc2.xmlschema(xmlschema_doc)
  0


xinclude
--------

Simple XInclude support exists. You can make xinclude statements in a
document be processed by calling the xinclude() method on a tree::

  >>> data = StringIO('''\
  ... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
  ... <foo/>
  ... <xi:include href="doc/test.xml" />
  ... </doc>''')

  >>> tree = etree.parse(data)
  >>> tree.xinclude()
  >>> etree.tostring(tree.getroot())
  '<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'
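
A self-contained variant of this example, sketched in Python 3 syntax. It
writes both documents to a temporary directory first. The file names
``main.xml`` and ``included.xml`` are made up for the illustration.

```python
import os
import tempfile
from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    included_path = os.path.join(tmp, "included.xml")
    main_path = os.path.join(tmp, "main.xml")
    with open(included_path, "wb") as f:
        f.write(b"<a/>")
    with open(main_path, "wb") as f:
        f.write(b'<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
                b'<foo/><xi:include href="included.xml"/></doc>')

    tree = etree.parse(main_path)
    tree.xinclude()  # replaces the xi:include element in place
    child_tags = [child.tag for child in tree.getroot()]

print(child_tags)
```

The document is parsed from a file path so that the relative ``href`` can be
resolved against the location of the including document.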


write_c14n on ElementTree
-------------------------

The lxml.etree.ElementTree class has a method write_c14n, which takes
one argument: a file object. This file object will receive a UTF-8
representation of the canonicalized form of the XML, following the W3C
C14N recommendation. For example::

  >>> f = StringIO('<a><b/></a>')
  >>> tree = etree.parse(f)
  >>> f2 = StringIO()
  >>> tree.write_c14n(f2)
  >>> f2.getvalue()
  '<a><b></b></a>'
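
Canonicalization also normalizes details such as attribute order. A short
sketch in Python 3 syntax, where the output is written to a binary
``BytesIO`` buffer because ``write_c14n`` emits bytes:

```python
from io import BytesIO
from lxml import etree

tree = etree.parse(BytesIO(b'<a z="1" b="2"><b/></a>'))

out = BytesIO()
tree.write_c14n(out)
canonical = out.getvalue()
print(canonical)
```

In the canonical form, the attributes are sorted and the empty element is
written as a start and end tag pair.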