=======================
What's new in lxml 2.0?
=======================

.. contents::
..
   1  Changes in etree and objectify
     1.1  Incompatible changes
     1.2  Enhancements
     1.3  Deprecation
   2  New modules
     2.1  lxml.usedoctest
     2.2  lxml.html
     2.3  lxml.cssselect


During the development of the lxml 1.x series, a couple of quirks were
discovered in the design that made the API less obvious and its future
extensions harder than necessary. lxml 2.0 is a soft evolution of lxml 1.x
towards a simpler, more consistent and more powerful API - with some major
extensions. Wherever possible, lxml 1.3 comes close to the semantics of lxml
2.0, so that migrating should be easier for code that currently runs with 1.3.

One of the important internal changes was the switch from the Pyrex_
compiler to Cython_, which provides better optimisation and improved
support for newer Python language features. This allows the code of
lxml to become more Python-like again, while the performance improves
as Cython continues its own development. The code simplification,
which will continue throughout the 2.x series, will hopefully make it
even easier for users to contribute.

.. _Cython: http://www.cython.org/
.. _Pyrex: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/


Changes in etree and objectify
==============================

A move towards a more consistent API cannot go without a certain amount
of incompatible changes. The following is a list of those differences that
applications need to take into account when migrating from lxml 1.x to lxml
2.0.

Incompatible changes
--------------------

* lxml 0.9 introduced a feature called `namespace implementation`_. The
  global ``Namespace`` factory was added to register custom element classes
  and have lxml.etree look them up automatically. However, the later
  development of further class lookup mechanisms made it appear less and less
  adequate to register this mapping at a global level, so lxml 1.1 first
  removed the namespace based lookup from the default setup and lxml 2.0
  finally removes the global namespace registry completely. Like all other
  lookup mechanisms, the namespace lookup is now local to a parser, including
  the registry itself. Applications that use a module-level parser can easily
  map its ``get_namespace()`` method to a global ``Namespace`` function to
  mimic the old behaviour.

.. _`namespace implementation`: element_classes.html#implementing-namespaces
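
The parser-local registry described above might be wired up roughly as
follows. This is only a sketch: it assumes the
``ElementNamespaceClassLookup`` class and the parser method
``set_element_class_lookup()`` as found in released lxml versions, where
``get_namespace()`` is a method of the lookup object that is attached to
the parser. ``HonkElement`` and the namespace URI are made up for
illustration.

.. sourcecode:: python

    from lxml import etree


    class HonkElement(etree.ElementBase):
        """Hypothetical custom element class for the example namespace."""

        @property
        def honking(self):
            return self.get("honking") == "true"


    # The registry lives in a lookup object that is attached to one parser.
    lookup = etree.ElementNamespaceClassLookup()
    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)

    # Module-level alias that mimics the old global Namespace() factory.
    Namespace = lookup.get_namespace
    Namespace("http://hui.de/honk")["honk"] = HonkElement

    # Only documents parsed with this parser use the registered class.
    root = etree.XML('<honk xmlns="http://hui.de/honk" honking="true"/>', parser)

Documents parsed with any other parser are not affected by this
registration, which is the point of moving the registry out of the global
scope.
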

* Some API functions now require passing options as keyword arguments,
  as opposed to positional arguments. This restriction was introduced
  to make the API usage independent of future extensions such as the
  addition of new positional arguments. Users should not rely on the
  position of an optional argument in function signatures and instead
  pass it explicitly by name. This also improves code readability - it
  is common good practice to pass options in a consistent way
  independent of their position, so many people may not even notice
  the change in their code. Another important reason is compatibility
  with cElementTree, which also enforces keyword-only arguments in a
  couple of places.

* XML tag names are validated when creating an Element. This does not
  apply to HTML tags, where only HTML special characters are
  forbidden. The distinction is made by the ``SubElement()`` factory,
  which tests if the tree it works on is an HTML tree, and by the
  ``.makeelement()`` methods of parsers, which behave differently for
  the ``XMLParser()`` and the ``HTMLParser()``.

* XPath now raises exceptions specific to the part of the execution that
  failed: ``XPathSyntaxError`` for parser errors and ``XPathEvalError`` for
  errors that occurred during the evaluation. Note that the distinction only
  works for the ``XPath()`` class. The other two evaluators only have a
  single evaluation call that includes the parsing step, and will therefore
  only raise an ``XPathEvalError``. Applications can catch both exceptions
  through the common base class ``XPathError`` (which also exists in earlier
  lxml versions).

* Network access in parsers is now disabled by default, i.e. the
  ``no_network`` option defaults to True. Due to a somewhat 'interesting'
  implementation in libxml2, this does not affect the first document (i.e. the
  URL that is parsed), but only subsequent documents, such as a DTD when
  parsing with validation. This means that you will have to check the URL you
  pass, instead of relying on lxml to prevent *any* access to external
  resources. As this can be helpful in some use cases, lxml does not work
  around it.

* The type annotations in lxml.objectify (the ``pytype`` attribute) now use
  ``NoneType`` for the None value as this is the correct Python type name.
  Previously, lxml 1.x used a lower case ``none``.

* Another change in objectify regards the way it deals with ambiguous types.
  Previously, setting a value like the string ``"3"`` through normal attribute
  access would let it come back as an integer when reading the object
  attribute. lxml 2.0 prevents this by always setting the ``pytype``
  attribute to the type the user passed in, so ``"3"`` will come back as a
  string, while the number ``3`` will come back as a number. To remove the
  type annotation on serialisation, you can use the ``deannotate()`` function.

* The C-API function ``findOrBuildNodeNs()`` was replaced by the more generic
  ``findOrBuildNodeNsPrefix()`` that accepts an additional default prefix.
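
The split between the XPath exceptions listed above can be illustrated
with a deliberately malformed expression. This is a minimal sketch; the
document and the expression are invented for the example, and the second
``except`` clause uses the common base class so that it does not depend
on which subclass a particular lxml version raises.

.. sourcecode:: python

    from lxml import etree

    root = etree.XML("<root><a>1</a></root>")
    broken = "//a["  # deliberately malformed expression

    # XPath() compiles first, so the syntax problem is reported separately.
    try:
        etree.XPath(broken)
    except etree.XPathSyntaxError as exc:
        compile_error = exc

    # .xpath() parses and evaluates in a single call; catching the common
    # base class XPathError covers whichever subclass is raised.
    try:
        root.xpath(broken)
    except etree.XPathError as exc:
        call_error = exc
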
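
The objectify change for ambiguous values, described in the list above,
can be sketched like this. The element names ``text_value`` and
``int_value`` are arbitrary, and the example assumes Python 3, where the
string type is ``str``.

.. sourcecode:: python

    from lxml import objectify

    root = objectify.Element("root")
    root.text_value = "3"  # a string that merely looks like a number
    root.int_value = 3     # an actual integer

    # The pytype annotation set on assignment preserves the original type.
    string_type = type(root.text_value.pyval)
    integer_type = type(root.int_value.pyval)

    # Strip the type annotations again, e.g. before serialising the tree.
    objectify.deannotate(root)
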


Enhancements
------------

Most of the enhancements of lxml 2.0 were made under the hood. Most people
won't even notice them, but they make the maintenance of lxml easier and thus
facilitate further enhancements and an improved integration between lxml's
features.

* lxml.objectify now has its own implementation of the `E factory`_. It uses
  the built-in type lookup mechanism of lxml.objectify, thus removing the need
  for an additional type registry mechanism (as previously available through
  the ``typemap`` parameter).

* XML entities are supported through the ``Entity()`` factory, an Entity
  element class and a parser option ``resolve_entities`` that keeps
  entities in the element tree when set to False. Also, the parser will now
  report undefined entities as errors if it needs to resolve them (which is
  still the default, as in lxml 1.x).

* A major part of the XPath code was rewritten and can now benefit from a
  bigger overlap with the XSLT code. The main benefits are improved thread
  safety in the XPath evaluators and Python RegExp support in standard XPath.

* The string results of an XPath evaluation have become 'smart' string
  subclasses. Formerly, there was no easy way to find out where a
  string originated from. In lxml 2.0, you can call its
  ``getparent()`` method to `find the Element that carries it`_. This
  works for attributes (``//@attribute``) and for ``text()`` nodes,
  i.e. Element text and tails. Strings that were constructed in the
  path expression, e.g. by the ``string()`` function or extension
  functions, will return None as their parent.

* Setting a ``QName`` object as value of the ``.text`` property or as
  an attribute value will resolve its prefix in the respective context.

* Following ElementTree 1.3, the ``iterfind()`` method supports
  efficient iteration based on XPath-like expressions.
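
As an illustration of the 'smart' string results mentioned in the list
above, the following sketch asks each kind of string result for its
parent. The sample document is invented for the example.

.. sourcecode:: python

    from lxml import etree

    root = etree.XML('<root><a href="x">text</a>tail</root>')

    text, tail = root.xpath("//text()")
    href = root.xpath("//@href")[0]
    built = root.xpath("string(//a)")

    text_parent = text.getparent()    # the <a> element that holds the text
    tail_parent = tail.getparent()    # the <a> element that the tail follows
    href_parent = href.getparent()    # the <a> element carrying the attribute
    built_parent = built.getparent()  # None: constructed inside the expression

Note that the parent of a tail string is the element it follows, not the
element that encloses it.
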

The parsers also received some major enhancements:

* ``iterparse()`` can parse HTML when passing the boolean ``html``
  keyword.

* Parse time XML Schema validation by passing an
  XMLSchema object to the ``schema`` keyword argument of a parser.

* Support for a ``target`` object that implements ElementTree's
  `TreeBuilder interface`_.

* The ``encoding`` keyword allows overriding the document encoding.
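
Parse time schema validation, as listed above, might look like the
following sketch. The one-element schema is made up for the example; with
a schema attached, the parser is expected to reject invalid input with an
``XMLSyntaxError`` instead of returning a tree.

.. sourcecode:: python

    from lxml import etree

    # Hypothetical schema: a single <a> element holding an integer.
    schema = etree.XMLSchema(etree.XML(
        '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
        '<xsd:element name="a" type="xsd:integer"/>'
        '</xsd:schema>'
    ))
    parser = etree.XMLParser(schema=schema)

    # Valid input parses as usual.
    valid = etree.fromstring("<a>5</a>", parser)

    # Invalid input fails while parsing.
    try:
        etree.fromstring("<a>no int</a>", parser)
    except etree.XMLSyntaxError as exc:
        validation_error = exc
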


.. _`E factory`: objectify.html#tree-generation-with-the-e-factory
.. _`find the Element that carries it`: tutorial.html#using-xpath-to-find-text
.. _`TreeBuilder interface`: http://effbot.org/elementtree/elementtree-treebuilder.htm
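
The parser ``target`` support listed among the parser enhancements can be
sketched with a small collector class. ``TagCollector`` is a hypothetical
name; it implements the ``start()``, ``end()``, ``data()`` and ``close()``
methods of the TreeBuilder interface, and the sketch relies on the parse
call returning whatever ``close()`` returns instead of a tree.

.. sourcecode:: python

    from lxml import etree


    class TagCollector:
        """Hypothetical parser target that records start tags only."""

        def __init__(self):
            self.tags = []

        def start(self, tag, attrib):
            self.tags.append(tag)

        def end(self, tag):
            pass

        def data(self, data):
            pass

        def close(self):
            # The value returned here becomes the result of the parse call.
            return self.tags


    parser = etree.XMLParser(target=TagCollector())
    result = etree.XML("<root><a/><b/></root>", parser)
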


Deprecation
-----------

The following functions and methods are now deprecated. They are
still available in lxml 2.0 and will be removed in lxml 2.1:

* The ``tounicode()`` function was replaced by the call
  ``tostring(encoding=unicode)``.

* CamelCaseNamed module functions and methods were renamed to their
  underscore equivalents to follow `PEP 8`_ in naming.

  - ``etree.clearErrorLog()``, use ``etree.clear_error_log()``

  - ``etree.useGlobalPythonLog()``, use
    ``etree.use_global_python_log()``

  - ``etree.ElementClassLookup.setFallback()``, use
    ``etree.ElementClassLookup.set_fallback()``

  - ``etree.getDefaultParser()``, use ``etree.get_default_parser()``

  - ``etree.setDefaultParser()``, use ``etree.set_default_parser()``

  - ``etree.setElementClassLookup()``, use
    ``etree.set_element_class_lookup()``

  - ``XMLParser.setElementClassLookup()``, use ``.set_element_class_lookup()``

  - ``HTMLParser.setElementClassLookup()``, use ``.set_element_class_lookup()``

    Note that ``parser.setElementClassLookup()`` has not been removed
    yet, although ``parser.set_element_class_lookup()`` should be used
    instead.

  - ``xpath_evaluator.registerNamespace()``, use
    ``xpath_evaluator.register_namespace()``

  - ``xpath_evaluator.registerNamespaces()``, use
    ``xpath_evaluator.register_namespaces()``

  - ``objectify.setPytypeAttributeTag``, use
    ``objectify.set_pytype_attribute_tag``

  - ``objectify.setDefaultParser()``, use
    ``objectify.set_default_parser()``

* The ``.getiterator()`` method on Elements and ElementTrees was
  renamed to ``.iter()`` to follow ElementTree 1.3.

.. _`PEP 8`: http://www.python.org/dev/peps/pep-0008/
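
A short sketch of the replacement spellings from the deprecation list
above. It assumes Python 3, where the text type is ``str`` rather than
``unicode``, so the ``tostring()`` call is written with ``encoding=str``.

.. sourcecode:: python

    from lxml import etree

    root = etree.XML("<root><a/><b/></root>")

    # Replacement for tounicode(): ask tostring() for a text string.
    # On Python 3 the built-in text type is str, where Python 2 used unicode.
    as_text = etree.tostring(root, encoding=str)

    # Replacement for getiterator(): iter() walks the tree in document order.
    tags = [element.tag for element in root.iter()]

    # Underscore spelling of a formerly camelCase module function.
    default_parser = etree.get_default_parser()
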


New modules
===========

The most visible changes in lxml 2.0 regard the new modules that were added.


lxml.usedoctest
---------------

A very useful module for doctests based on XML or HTML is
``lxml.doctestcompare``. It provides a relaxed comparison mechanism
for XML and HTML in doctests. Using it for XML comparisons is as
simple as:

.. sourcecode:: pycon

    >>> import lxml.usedoctest

and for HTML comparisons:

.. sourcecode:: pycon

    >>> import lxml.html.usedoctest


lxml.html
---------

The largest new package that was added to lxml 2.0 is `lxml.html`_. It
contains various tools and modules for HTML handling. The major features
include support for cleaning up HTML (removing unwanted content), a readable
HTML diff and various tools for working with links.

.. _`lxml.html`: lxmlhtml.html
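
As a small taste of the link tools, the following sketch rewrites a
relative link and then lists all links in a fragment. The method names
``make_links_absolute()`` and ``iterlinks()`` are taken from the lxml.html
documentation rather than from this text, and the markup and base URL are
invented.

.. sourcecode:: python

    import lxml.html

    doc = lxml.html.fromstring('<div><a href="/docs">Docs</a></div>')

    # Resolve relative links against a (hypothetical) base URL, in place.
    doc.make_links_absolute("http://example.com/")

    # iterlinks() yields (element, attribute, link, pos) tuples.
    links = [link for element, attribute, link, pos in doc.iterlinks()]
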


lxml.cssselect
--------------

The Cascading Style Sheets language (CSS_) has a very short and generic path
language for pointing at elements in XML/HTML trees (`CSS selectors`_). The module
lxml.cssselect_ provides an implementation based on XPath.

.. _lxml.cssselect: cssselect.html
.. _CSS: http://www.w3.org/Style/CSS/
.. _`CSS selectors`: http://www.w3.org/TR/CSS21/selector.html