Update README to encourage installing LXML, as HTML5LIB is failing to parse even basic, well-formed XHTML docs.
commit 18b22c917efef90e21b99833f6db5ff96f2fe695 1 parent e22894e
@lethain authored
Showing with 18 additions and 1 deletion.
  1. +18 −1 README.rst
19 README.rst
@@ -50,7 +50,24 @@ The simplest way to install Extraction is via PyPi::
pip install extraction
-If you want to develop against extraction, you can install from GitHub::
+You'll also have to install a parser for `BeautifulSoup4 <>`,
+and while ``extraction`` already pulls down `html5lib <>`
+through its requirements, I really recommend installing `lxml <>` as well,
+because there are some extremely gnarly issues with ``html5lib``
+failing to parse XHTML pages (for example, PyPi fails to parse entirely
+with html5lib)::
+ >>> bs4.BeautifulSoup(text, ["html5lib"]).find_all("title")
+ []
+ >>> bs4.BeautifulSoup(text, ["lxml"]).find_all("title")
+ [<title>extraction 0.1.3 : Python Package Index</title>]
+You should be able to install `lxml <>` via pip::
+ pip install lxml
+If you want to develop extraction, then after installing ``lxml``,
+you can install from GitHub::
git clone
cd extraction
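
The README change above recommends ``lxml`` over ``html5lib`` as the BeautifulSoup4 parser. A minimal sketch of how calling code might act on that preference is shown here. The helper names ``is_installed`` and ``pick_parser`` are hypothetical and are not part of ``extraction`` or ``bs4``; the fallback to the standard library's ``html.parser`` is likewise an assumption, not something the commit mentions.

```python
import importlib.util

# Parser names in order of preference: lxml first (as the README recommends),
# then html5lib, then the standard library's html.parser as a last resort.
# The ordering of the last two is an assumption made for this sketch.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def is_installed(module_name):
    """Return True if the named module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def pick_parser(preferred=PREFERRED_PARSERS, installed=is_installed):
    """Return the first parser name in `preferred` whose module is installed."""
    for name in preferred:
        if installed(name):
            return name
    raise RuntimeError("no HTML parser available")


if __name__ == "__main__":
    parser = pick_parser()
    print("using parser:", parser)
    # Only touch bs4 if it is actually installed.
    if is_installed("bs4"):
        import bs4

        soup = bs4.BeautifulSoup("<html><head><title>demo</title></head></html>", parser)
        print(soup.find_all("title"))
```

Passing the chosen name as the second argument to ``bs4.BeautifulSoup`` mirrors the calls in the diff, which pass the parser name in a list. The ``installed`` parameter exists only so the selection logic can be exercised without changing what is installed.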