========================
XPath and XSLT with lxml
========================
lxml supports both XPath and XSLT through libxml2 and libxslt in a
standards-compliant way.
.. contents::
..
1 XPath
1.1 The ``xpath()`` method
1.2 XPath return values
1.3 Generating XPath expressions
1.4 The ``XPath`` class
1.5 The ``XPathEvaluator`` classes
1.6 ``ETXPath``
1.7 Error handling
2 XSLT
2.1 XSLT result objects
2.2 Stylesheet parameters
2.3 The ``xslt()`` tree method
2.4 Profiling
The usual setup procedure::
>>> from lxml import etree
>>> from StringIO import StringIO
XPath
=====
lxml.etree supports the simple path syntax of the `find, findall and
findtext`_ methods on ElementTree and Element, as known from the original
ElementTree library (ElementPath_). As an lxml specific extension, these
classes also provide an ``xpath()`` method that supports expressions in the
complete XPath syntax, as well as `custom extension functions`_.
.. _ElementPath: http://effbot.org/zone/element-xpath.htm
.. _`find, findall and findtext`: http://effbot.org/zone/element.htm#searching-for-subelements
.. _`custom extension functions`: extensions.html
There are also specialized XPath evaluator classes that are more efficient for
frequent evaluation: ``XPath`` and ``XPathEvaluator``. See the `performance
comparison`_ to learn when to use which. Their semantics when used on
Elements and ElementTrees are the same as for the ``xpath()`` method described
here.
.. _`performance comparison`: performance.html#xpath
The ``xpath()`` method
----------------------
For ElementTree, the ``xpath()`` method performs a global XPath query against the
document (if absolute) or against the root node (if relative)::
>>> f = StringIO('<foo><bar></bar></foo>')
>>> tree = etree.parse(f)
>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'
When ``xpath()`` is used on an Element, the XPath expression is evaluated
against the element (if relative) or against the root tree (if absolute)::
>>> root = tree.getroot()
>>> r = root.xpath('bar')
>>> r[0].tag
'bar'
>>> bar = root[0]
>>> r = bar.xpath('/foo/bar')
>>> r[0].tag
'bar'
>>> tree = bar.getroottree()
>>> r = tree.xpath('/foo/bar')
>>> r[0].tag
'bar'
The ``xpath()`` method has support for XPath variables::
>>> expr = "//*[local-name() = $name]"
>>> print root.xpath(expr, name = "foo")[0].tag
foo
>>> print root.xpath(expr, name = "bar")[0].tag
bar
>>> print root.xpath("$text", text = "Hello World!")
Hello World!
Optionally, you can provide a ``namespaces`` keyword argument, which should be
a dictionary mapping the namespace prefixes used in the XPath expression to
namespace URIs::
>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
... xmlns:b="http://codespeak.net/ns/test2">
... <b:bar>Text</b:bar>
... </a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/t:foo/b:bar',
...               namespaces={'t': 'http://codespeak.net/ns/test1',
...                           'b': 'http://codespeak.net/ns/test2'})
>>> len(r)
1
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'
There is also an optional ``extensions`` argument which is used to define
`custom extension functions`_ in Python that are local to this evaluation.
XPath return values
-------------------
The return values of XPath evaluations vary, depending on the XPath expression
used:
* True or False, when the XPath expression has a boolean result
* a float, when the XPath expression has a numeric result (integer or float)
* a (unicode) string, when the XPath expression has a string result
* a list of items, when the XPath expression has a node-set result. The items
  may include elements (also comments and processing instructions), strings
  and tuples. Text nodes and attributes in the result are returned as strings
  (the text node content or attribute value). Namespace declarations are
  returned as tuples of strings: ``(prefix, URI)``.
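
The return types listed above can be observed with a short script. This is a minimal sketch that assumes Python 3 and a reasonably recent lxml; the sample document, the prefix ``p`` and the URI ``urn:example`` are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document; the prefix "p" and its URI are placeholders.
root = etree.XML(
    '<root xmlns:p="urn:example"><a id="1">text</a><a id="2"/></root>'
)

is_pair = root.xpath("count(//a) = 2")     # boolean result -> bool
how_many = root.xpath("count(//a)")        # numeric result -> float
first_text = root.xpath("string(//a[1])")  # string result -> string
elements = root.xpath("//a")               # node-set -> list of elements
ids = root.xpath("//a/@id")                # attributes -> list of strings
texts = root.xpath("//a/text()")           # text nodes -> list of strings
ns_nodes = root.xpath("namespace::*")      # namespace nodes -> (prefix, URI) tuples

print(type(is_pair).__name__, type(how_many).__name__)
print([el.tag for el in elements], list(ids), list(texts))
print(ns_nodes)
```

Note that even an integer-valued expression such as ``count(//a)`` comes back as a float, so callers that need an ``int`` have to convert explicitly.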
Generating XPath expressions
----------------------------
ElementTree objects have a method ``getpath(element)``, which returns a
structural, absolute XPath expression to find that element::
>>> a = etree.Element("a")
>>> b = etree.SubElement(a, "b")
>>> c = etree.SubElement(a, "c")
>>> d1 = etree.SubElement(c, "d")
>>> d2 = etree.SubElement(c, "d")
>>> tree = etree.ElementTree(c)
>>> print tree.getpath(d2)
/c/d[2]
>>> tree.xpath(tree.getpath(d2)) == [d2]
True
The ``XPath`` class
-------------------
The ``XPath`` class compiles an XPath expression into a callable function::
>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> find = etree.XPath("//b")
>>> print find(root)[0].tag
b
The compilation takes as much time as in the ``xpath()`` method, but it is
done only once per class instantiation. This makes it especially efficient
for repeated evaluation of the same XPath expression.
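
As a rough illustration of that reuse, the following sketch compiles one expression and applies it to several documents. It assumes Python 3 and a recent lxml; the three small documents are invented for the example.

```python
from lxml import etree

# Compile once ...
find_b = etree.XPath("//b")

# ... and evaluate against any number of (hypothetical) documents.
documents = [
    etree.XML("<root><a><b/></a><b/></root>"),
    etree.XML("<root><b/></root>"),
    etree.XML("<root/>"),
]
counts = [len(find_b(doc)) for doc in documents]
print(counts)
```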
Just like the ``xpath()`` method, the ``XPath`` class supports XPath
variables::
>>> count_elements = etree.XPath("count(//*[local-name() = $name])")
>>> print count_elements(root, name = "a")
1.0
>>> print count_elements(root, name = "b")
2.0
This supports very efficient evaluation of modified versions of an XPath
expression, as compilation is still only required once.
Prefix-to-namespace mappings can be passed as the ``namespaces`` argument::
>>> root = etree.XML("<root xmlns='NS'><a><b/></a><b/></root>")
>>> find = etree.XPath("//n:b", namespaces={'n':'NS'})
>>> print find(root)[0].tag
{NS}b
By default, ``XPath`` supports regular expressions in the EXSLT_ namespace::
>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
...                    namespaces={'re':regexpNS})
>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print find(root)[0].text
aBc
.. _EXSLT: http://www.exslt.org/
You can disable this with the boolean keyword argument ``regexp`` which
defaults to True.
The ``XPathEvaluator`` classes
------------------------------
lxml.etree provides two other efficient XPath evaluators that work on
ElementTrees or Elements respectively: ``XPathDocumentEvaluator`` and
``XPathElementEvaluator``. They are automatically selected if you use the
XPathEvaluator helper for instantiation::
>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> xpatheval = etree.XPathEvaluator(root)
>>> print isinstance(xpatheval, etree.XPathElementEvaluator)
True
>>> print xpatheval("//b")[0].tag
b
This class provides efficient support for evaluating different XPath
expressions on the same Element or ElementTree.
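
A small sketch of that usage pattern, assuming Python 3 and a recent lxml: one evaluator is created per context and then called with several unrelated expressions. The expressions themselves are arbitrary examples.

```python
from lxml import etree

root = etree.XML("<root><a><b/></a><b/></root>")

# The helper picks the evaluator class from the type of its argument.
element_eval = etree.XPathEvaluator(root)
document_eval = etree.XPathEvaluator(etree.ElementTree(root))

# Evaluate several different expressions against the same element.
expressions = ["count(//b)", "name(/*)", "boolean(//c)"]
results = {expr: element_eval(expr) for expr in expressions}
print(results)
print(type(document_eval).__name__)
```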
``ETXPath``
-----------
ElementTree supports a language named ElementPath_ in its ``find*()`` methods.
One of the main differences between XPath and ElementPath is that the XPath
language requires an indirection through prefixes for namespace support,
whereas ElementTree uses the Clark notation (``{ns}name``) to avoid prefixes
completely. The other major difference regards the capabilities of both path
languages. Where XPath supports various sophisticated ways of restricting the
result set through functions and boolean expressions, ElementPath only
supports pure path traversal without nesting or further conditions. So, while
the ElementPath syntax is self-contained and therefore easier to write and
handle, XPath is much more powerful and expressive.
lxml.etree bridges this gap through the class ``ETXPath``, which accepts XPath
expressions with namespaces in Clark notation. It is identical to the
``XPath`` class, except for the namespace notation. Normally, you would
write::
>>> root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")
>>> find = etree.XPath("//p:b", namespaces={'p' : 'ns'})
>>> print find(root)[0].tag
{ns}b
``ETXPath`` allows you to change this to::
>>> find = etree.ETXPath("//{ns}b")
>>> print find(root)[0].tag
{ns}b
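
To make the comparison concrete, here is a hedged sketch (Python 3, recent lxml assumed) that runs an ElementPath query through ``findall()`` and two ``ETXPath`` expressions on the same tree. The ``count()`` call is the kind of XPath function that ElementPath does not offer.

```python
from lxml import etree

root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")

# ElementPath: pure path traversal, namespaces in Clark notation.
via_findall = root.findall(".//{ns}b")

# ETXPath: full XPath, still written with Clark notation.
count_b = etree.ETXPath("count(//{ns}b)")
nested_b = etree.ETXPath("//{ns}a/{ns}b")

print(len(via_findall), count_b(root), [el.tag for el in nested_b(root)])
```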
Error handling
--------------
lxml.etree raises exceptions when errors occur while parsing or evaluating an
XPath expression::
>>> find = etree.XPath("\\")
Traceback (most recent call last):
...
XPathSyntaxError: Invalid expression
lxml will also try to give you a hint about what went wrong, so if you pass a
more complex expression, you may get a somewhat more specific error::
>>> find = etree.XPath("//*[1.1.1]")
Traceback (most recent call last):
...
XPathSyntaxError: Invalid predicate
During evaluation, lxml will raise an ``XPathEvalError`` on errors::
>>> find = etree.XPath("//ns:a")
>>> find(root)
Traceback (most recent call last):
...
XPathEvalError: Undefined namespace prefix
This distinction between syntax and evaluation errors only applies to the
``XPath`` class. The other evaluators (including the ``xpath()`` method) are
one-shot operations that do parsing and evaluation in one step. They
therefore raise evaluation exceptions in all cases::
>>> root = etree.Element("test")
>>> find = root.xpath("//*[1.1.1]")
Traceback (most recent call last):
...
XPathEvalError: Invalid predicate
>>> find = root.xpath("//ns:a")
Traceback (most recent call last):
...
XPathEvalError: Undefined namespace prefix
>>> find = root.xpath("\\")
Traceback (most recent call last):
...
XPathEvalError: Invalid expression
Note that lxml versions before 1.3 always raised an ``XPathSyntaxError`` for
all errors, including evaluation errors. The best way to support older
versions is to catch the superclass ``XPathError``.
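
Catching the superclass might look like the sketch below. It assumes Python 3 and a recent lxml; ``safe_xpath`` is a hypothetical helper name, not part of lxml.

```python
from lxml import etree


def safe_xpath(element, expression, **variables):
    """Hypothetical helper: evaluate an expression, return None on any XPath error."""
    try:
        return element.xpath(expression, **variables)
    except etree.XPathError:
        # XPathError covers both XPathSyntaxError and XPathEvalError.
        return None


root = etree.Element("test")
good = safe_xpath(root, "//*")
bad_prefix = safe_xpath(root, "//ns:a")   # undefined namespace prefix
bad_syntax = safe_xpath(root, "\\")       # not a valid expression

# The XPath class reports broken expressions at compile time.
try:
    etree.XPath("\\")
    compile_error = None
except etree.XPathError as exc:
    compile_error = exc

print(good, bad_prefix, bad_syntax, type(compile_error).__name__)
```

Returning ``None`` is only one possible policy; code that needs the error message can keep the exception object instead.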
XSLT
====
lxml.etree introduces a new class, ``lxml.etree.XSLT``. The class can be
given an ElementTree object to construct an XSLT transformer::
>>> f = StringIO('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:template match="/">
... <foo><xsl:value-of select="/a/b/text()" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> xslt_doc = etree.parse(f)
>>> transform = etree.XSLT(xslt_doc)
You can then run the transformation on an ElementTree document by simply
calling it, and this results in another ElementTree object::
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result_tree = transform(doc)
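
The same steps, sketched for Python 3 (where ``StringIO`` lives in the ``io`` module) and a recent lxml. Building the transformer once and calling it for each input is the point of the example; the second input document is made up.

```python
from io import StringIO

from lxml import etree

stylesheet = etree.parse(StringIO('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <foo><xsl:value-of select="/a/b/text()" /></foo>
  </xsl:template>
</xsl:stylesheet>'''))

# Build the transformer once ...
transform = etree.XSLT(stylesheet)

# ... then call it for each input document.
inputs = ['<a><b>Text</b></a>', '<a><b>More</b></a>']
outputs = [
    transform(etree.parse(StringIO(xml))).getroot().text
    for xml in inputs
]
print(outputs)
```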
By default, XSLT supports all extension functions from libxslt and libexslt as
well as Python regular expressions through the `EXSLT regexp functions`_.
Also see the documentation on `custom extension functions`_ and `document
resolvers`_. There is a separate section on `controlling access`_ to external
documents and resources.
.. _`EXSLT regexp functions`: http://www.exslt.org/regexp/
.. _`document resolvers`: resolvers.html
.. _`controlling access`: resolvers.html#i-o-access-control-in-xslt
XSLT result objects
-------------------
The result of an XSL transformation can be accessed like a normal ElementTree
document::
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result = transform(doc)
>>> result.getroot().text
'Text'
but, as opposed to normal ElementTree objects, can also be turned into an (XML
or text) string by applying the ``str()`` function::
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
The result is always a plain string, encoded as requested by the
``xsl:output`` element in the stylesheet. If you want a Python unicode string
instead, you should set this encoding to ``UTF-8`` (unless the ``ASCII``
default is sufficient). This allows you to call the builtin ``unicode()``
function on the result::
>>> unicode(result)
u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
You can use other encodings at the cost of recoding the result more than
once. Encodings that are not supported by Python will result in an error::
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:output encoding="UCS4"/>
... <xsl:template match="/">
... <foo><xsl:value-of select="/a/b/text()" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc)
>>> unicode(result)
Traceback (most recent call last):
[...]
LookupError: unknown encoding: UCS4
Stylesheet parameters
---------------------
It is possible to pass parameters, in the form of XPath expressions, to the
XSLT template::
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <xsl:param name="a" />
... <xsl:template match="/">
... <foo><xsl:value-of select="$a" /></foo>
... </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
The parameters are passed as keyword arguments to the transform call. First
let's try passing in a simple string expression::
>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
Let's try a non-string XPath expression now::
>>> result = transform(doc, a="/a/b/text()")
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
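
Because every parameter value is evaluated as an XPath expression, plain strings need quoting. The sketch below (Python 3, recent lxml assumed) contrasts a quoted string, a computed number and a node selection. It also uses ``etree.XSLT.strparam()``, which later lxml releases provide for passing arbitrary strings without manual quoting; ``run`` is a hypothetical helper.

```python
from lxml import etree

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="a" />
  <xsl:template match="/">
    <foo><xsl:value-of select="$a" /></foo>
  </xsl:template>
</xsl:stylesheet>'''))
doc = etree.XML('<a><b>Text</b></a>')


def run(value):
    """Hypothetical helper: apply the stylesheet with parameter ``a``."""
    return transform(doc, a=value).getroot().text


quoted = run("'A'")              # XPath string literal
computed = run("1 + 2")          # XPath arithmetic
selected = run("/a/b/text()")    # node selection in the input document
escaped = run(etree.XSLT.strparam('It\'s "quoted"'))  # raw string, no XPath quoting

print(quoted, computed, selected, escaped)
```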
The ``xslt()`` tree method
--------------------------
There's also a convenience method on ElementTree objects for doing XSL
transformations. This is less efficient if you want to apply the same XSL
transformation to multiple documents, but is shorter to write for one-shot
operations, as you do not have to instantiate a stylesheet yourself::
>>> result = doc.xslt(xslt_tree, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
This is a shortcut for the following code::
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
Profiling
---------
If you want to know how your stylesheet performed, pass the ``profile_run``
keyword to the transform::
>>> result = transform(doc, a="/a/b/text()", profile_run=True)
>>> profile = result.xslt_profile
The value of the ``xslt_profile`` property is an ElementTree with profiling
data about each template, similar to the following::
<profile>
  <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
</profile>
Note that this is a read-only document. You must not move any of its elements
to other documents. Please deep-copy the document if you need to modify it.
If you want to free it from memory, just do::
>>> del result.xslt_profile
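
A possible way to inspect the profile is sketched below, assuming Python 3, a recent lxml and a libxslt build with profiling support. The document is deep-copied first, as recommended above, and only the ``match`` and ``calls`` attributes shown in the sample output are read.

```python
import copy

from lxml import etree

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <foo><xsl:value-of select="/a/b/text()" /></foo>
  </xsl:template>
</xsl:stylesheet>'''))
doc = etree.XML('<a><b>Text</b></a>')

result = transform(doc, profile_run=True)

# Work on a private copy; the original profile document is read-only.
profile = copy.deepcopy(result.xslt_profile)
templates = [
    (template.get("match"), template.get("calls"))
    for template in profile.getroot().iter("template")
]
print(profile.getroot().tag, templates)
```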