From 27a9b5da463152b88a171e5565053d6ee2462f11 Mon Sep 17 00:00:00 2001
From: Stefan Behnel
Date: Sun, 15 Oct 2023 12:01:32 +0200
Subject: [PATCH] docs: Prefer "fromstring()" over "parse()" for strings in the parsing documentation and clarify the relation between HTML() and fromstring().

Closes https://bugs.launchpad.net/lxml/+bug/2039353
---
 doc/parsing.txt | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/doc/parsing.txt b/doc/parsing.txt
index a271dc032..e26bc09a3 100644
--- a/doc/parsing.txt
+++ b/doc/parsing.txt
@@ -90,7 +90,7 @@ parsing XML from an in-memory string:
     b''
 
 To read from a file or file-like object, you can use the ``parse()`` function,
-which returns an ``ElementTree`` object:
+which returns an ``ElementTree`` object that wraps the document root:
 
 .. sourcecode:: pycon
 
@@ -109,9 +109,9 @@ efficient) to pass a filename:
 lxml can parse from a local file, an HTTP URL or an FTP URL. It also
 auto-detects and reads gzip-compressed XML files (.gz).
 
-If you want to parse from memory and still provide a base URL for the document
-(e.g. to support relative paths in an XInclude), you can pass the ``base_url``
-keyword argument:
+If you want to parse from a string (bytes or text) and still provide a base URL
+for the document (e.g. to support relative paths in an XInclude), you can pass
+the ``base_url`` keyword argument:
 
 .. sourcecode:: pycon
 
@@ -127,8 +127,8 @@ example is easily extended to clean up namespaces during parsing:
 .. sourcecode:: pycon
 
     >>> parser = etree.XMLParser(ns_clean=True)
-    >>> tree = etree.parse(StringIO(xml), parser)
-    >>> etree.tostring(tree.getroot())
+    >>> xml_root = etree.fromstring(xml, parser)
+    >>> etree.tostring(xml_root)
     b''
 
 The keyword arguments in the constructor are mainly based on the libxml2
@@ -249,9 +249,9 @@ this feature.
 
     >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
     >>> parser = etree.HTMLParser()
-    >>> tree = etree.parse(StringIO(broken_html), parser)
+    >>> html_root = etree.fromstring(broken_html, parser)
 
-    >>> result = etree.tostring(tree.getroot(),
+    >>> result = etree.tostring(html_root,
     ...                         pretty_print=True, method="html")
     >>> print(result)
     <html>
@@ -263,24 +263,20 @@ this feature.
     </body>
     </html>
 
-Lxml has an HTML function, similar to the XML shortcut known from
-ElementTree:
+As a nicer alias for parsing HTML literals, lxml has an ``HTML()`` function,
+similar to the ``XML()`` shortcut known from ElementTree:
 
 .. sourcecode:: pycon
 
-    >>> html = etree.HTML(broken_html)
-    >>> result = etree.tostring(html, pretty_print=True, method="html")
-    >>> print(result)
-    <html>
-      <head>
-        <title>test</title>
-      </head>
-      <body>
-        <h1>page title</h1>
-      </body>
-    </html>
+    >>> html_root = etree.HTML("""
+    ... <html>
+    ...   <body>
+    ...     <h1>page title</h1>
+    ...   </body>
+    ... </html>
+    ... """)
 
-The support for parsing broken HTML depends entirely on libxml2's recovery
+Note: The support for parsing broken HTML depends entirely on libxml2's recovery
 algorithm. It is *not* the fault of lxml if you find documents that are so
 heavily broken that the parser cannot handle them. There is also no guarantee
 that the resulting tree will contain all data from the original document. The