Commit

docs: Prefer "fromstring()" over "parse()" for strings in the parsing documentation and clarify the relation between HTML() and fromstring().

Closes https://bugs.launchpad.net/lxml/+bug/2039353
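
To illustrate the preference named in the commit message, here is a minimal sketch comparing the two calls on an in-memory string. The sample document and variable names are hypothetical, and the fallback to the standard library's ElementTree is an assumption made only so the snippet runs where lxml is not installed:

```python
from io import StringIO

try:
    from lxml import etree  # the library whose docs this commit changes
except ImportError:
    # Assumption: the stdlib API is close enough for this comparison.
    import xml.etree.ElementTree as etree

xml = "<a><b/></a>"  # hypothetical sample document

# parse() expects a file name or file-like object and returns an ElementTree,
# so the string has to be wrapped and the root fetched in a second step.
tree = etree.parse(StringIO(xml))
root_via_parse = tree.getroot()

# fromstring() takes the string directly and returns the root element.
root_via_fromstring = etree.fromstring(xml)

print(root_via_parse.tag, root_via_fromstring.tag)
```

Both paths end at an equivalent root element; ``fromstring()`` simply skips the ``StringIO`` wrapper and the ``getroot()`` call.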
scoder committed Oct 15, 2023
1 parent bf6a273 commit 27a9b5d
Showing 1 changed file with 18 additions and 22 deletions.
40 changes: 18 additions & 22 deletions doc/parsing.txt
@@ -90,7 +90,7 @@ parsing XML from an in-memory string:
b'<a xmlns="test"><b xmlns="test"/></a>'

To read from a file or file-like object, you can use the ``parse()`` function,
- which returns an ``ElementTree`` object:
+ which returns an ``ElementTree`` object that wraps the document root:

.. sourcecode:: pycon

@@ -109,9 +109,9 @@ efficient) to pass a filename:
lxml can parse from a local file, an HTTP URL or an FTP URL. It also
auto-detects and reads gzip-compressed XML files (.gz).

- If you want to parse from memory and still provide a base URL for the document
- (e.g. to support relative paths in an XInclude), you can pass the ``base_url``
- keyword argument:
+ If you want to parse from a string (bytes or text) and still provide a base URL
+ for the document (e.g. to support relative paths in an XInclude), you can pass
+ the ``base_url`` keyword argument:

.. sourcecode:: pycon

@@ -127,8 +127,8 @@ example is easily extended to clean up namespaces during parsing:
.. sourcecode:: pycon

>>> parser = etree.XMLParser(ns_clean=True)
- >>> tree = etree.parse(StringIO(xml), parser)
- >>> etree.tostring(tree.getroot())
+ >>> xml_root = etree.fromstring(xml, parser)
+ >>> etree.tostring(xml_root)
b'<a xmlns="test"><b/></a>'

The keyword arguments in the constructor are mainly based on the libxml2
@@ -249,9 +249,9 @@ this feature.
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
- >>> tree = etree.parse(StringIO(broken_html), parser)
+ >>> html_root = etree.fromstring(broken_html, parser)

- >>> result = etree.tostring(tree.getroot(),
+ >>> result = etree.tostring(html_root,
... pretty_print=True, method="html")
>>> print(result)
<html>
@@ -263,24 +263,20 @@ this feature.
</body>
</html>

- Lxml has an HTML function, similar to the XML shortcut known from
- ElementTree:
+ As a nicer alias for parsing HTML literals, lxml has an ``HTML()`` function,
+ similar to the ``XML()`` shortcut known from ElementTree:

.. sourcecode:: pycon

- >>> html = etree.HTML(broken_html)
- >>> result = etree.tostring(html, pretty_print=True, method="html")
- >>> print(result)
- <html>
- <head>
- <title>test</title>
- </head>
- <body>
- <h1>page title</h1>
- </body>
- </html>
+ >>> html_root = etree.HTML("""
+ ... <html>
+ ... <body>
+ ... <h1>page title</h1>
+ ... </body>
+ ... </html>
+ ... """)

- The support for parsing broken HTML depends entirely on libxml2's recovery
+ Note: The support for parsing broken HTML depends entirely on libxml2's recovery
algorithm. It is *not* the fault of lxml if you find documents that are so
heavily broken that the parser cannot handle them. There is also no guarantee
that the resulting tree will contain all data from the original document. The
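
The relation between ``HTML()`` and ``fromstring()`` that the last hunk clarifies can be checked with a short sketch. It reuses the ``broken_html`` sample from the changed documentation; the other variable names are illustrative, and lxml is assumed to be installed:

```python
from lxml import etree  # third-party dependency, assumed to be installed

# Sample input taken from the documentation touched by this commit.
broken_html = "<html><head><title>test<body><h1>page title</h3>"

# Explicit form: fromstring() with an HTML parser passed in.
root_explicit = etree.fromstring(broken_html, etree.HTMLParser())

# Shortcut form: HTML() parses the string with lxml's default HTML parser.
root_shortcut = etree.HTML(broken_html)

# Both calls return the root <html> element of the recovered tree.
for root in (root_explicit, root_shortcut):
    print(root.tag, root.findtext("head/title"), root.findtext("body/h1"))
```

For plain HTML literals the shortcut reads better; the explicit form is the one to reach for when the parser needs non-default options.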
