docs: Prefer "fromstring()" over "parse()" for strings in the parsing documentation and clarify the relation between HTML() and fromstring().

Closes https://bugs.launchpad.net/lxml/+bug/2039353
lxml · Oct 15, 2023 · 27a9b5d
1 parent bf6a273
commit 27a9b5d
Showing 1 changed file with 18 additions and 22 deletions.
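
For readers skimming the diff, the gist of the change can be sketched as follows. This is a minimal illustration, not part of the commit. It assumes lxml is installed; the variable names (``root_from_string``, ``html_text`` and so on) are hypothetical and chosen only for this example, and the sample XML is the one already used in the documentation.

```python
from io import BytesIO

from lxml import etree

# Sample document with a redundant namespace declaration, as in the docs.
xml = b'<a xmlns="test"><b xmlns="test"/></a>'

# fromstring() takes the string itself and returns the root Element.
root_from_string = etree.fromstring(xml, etree.XMLParser(ns_clean=True))

# parse() reads from a file-like object (or filename/URL) and returns an
# ElementTree, so the root has to be fetched with getroot().
tree = etree.parse(BytesIO(xml), etree.XMLParser(ns_clean=True))
root_from_tree = tree.getroot()

serialized_from_string = etree.tostring(root_from_string)
serialized_from_tree = etree.tostring(root_from_tree)

# HTML() parses an HTML literal; the explicit spelling passes an HTMLParser
# to fromstring(). The input string here is a hypothetical example.
html_text = "<html><body><h1>page title</h1></body></html>"
html_root = etree.HTML(html_text)
html_root_explicit = etree.fromstring(html_text, etree.HTMLParser())
```

Both XML routes should serialize to the same bytes; the ``fromstring()`` form simply avoids the file-like wrapper and the extra ``getroot()`` call when the input is already in memory.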
diff --git a/doc/parsing.txt b/doc/parsing.txt
@@ -90,7 +90,7 @@ parsing XML from an in-memory string:
   b'<a xmlns="test"><b xmlns="test"/></a>'
 
 To read from a file or file-like object, you can use the ``parse()`` function,
-which returns an ``ElementTree`` object:
+which returns an ``ElementTree`` object that wraps the document root:
 
 .. sourcecode:: pycon
 
@@ -109,9 +109,9 @@ efficient) to pass a filename:
 lxml can parse from a local file, an HTTP URL or an FTP URL.  It also
 auto-detects and reads gzip-compressed XML files (.gz).
 
-If you want to parse from memory and still provide a base URL for the document
-(e.g. to support relative paths in an XInclude), you can pass the ``base_url``
-keyword argument:
+If you want to parse from a string (bytes or text) and still provide a base URL
+for the document (e.g. to support relative paths in an XInclude), you can pass
+the ``base_url`` keyword argument:
 
 .. sourcecode:: pycon
 
@@ -127,8 +127,8 @@ example is easily extended to clean up namespaces during parsing:
 .. sourcecode:: pycon
 
   >>> parser = etree.XMLParser(ns_clean=True)
-  >>> tree   = etree.parse(StringIO(xml), parser)
-  >>> etree.tostring(tree.getroot())
+  >>> xml_root = etree.fromstring(xml, parser)
+  >>> etree.tostring(xml_root)
   b'<a xmlns="test"><b/></a>'
 
 The keyword arguments in the constructor are mainly based on the libxml2
@@ -249,9 +249,9 @@ this feature.
   >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
 
   >>> parser = etree.HTMLParser()
-  >>> tree   = etree.parse(StringIO(broken_html), parser)
+  >>> html_root   = etree.fromstring(broken_html, parser)
 
-  >>> result = etree.tostring(tree.getroot(),
+  >>> result = etree.tostring(html_root,
   ...                         pretty_print=True, method="html")
   >>> print(result)
   <html>
@@ -263,24 +263,20 @@ this feature.
     </body>
   </html>
 
-Lxml has an HTML function, similar to the XML shortcut known from
-ElementTree:
+As a nicer alias for parsing HTML literals, lxml has an ``HTML()`` function,
+similar to the ``XML()`` shortcut known from ElementTree:
 
 .. sourcecode:: pycon
 
-  >>> html = etree.HTML(broken_html)
-  >>> result = etree.tostring(html, pretty_print=True, method="html")
-  >>> print(result)
-  <html>
-    <head>
-      <title>test</title>
-    </head>
-    <body>
-      <h1>page title</h1>
-    </body>
-  </html>
+  >>> html_root = etree.HTML("""
+  ...   <html>
+  ...      <body>
+  ...         <h1>page title</h1>
+  ...     </body>
+  ...   </html>
+  ... """)
 
-The support for parsing broken HTML depends entirely on libxml2's recovery
+Note: The support for parsing broken HTML depends entirely on libxml2's recovery
 algorithm.  It is *not* the fault of lxml if you find documents that are so
 heavily broken that the parser cannot handle them.  There is also no guarantee
 that the resulting tree will contain all data from the original document.  The