Update README to encourage installing LXML, as HTML5LIB is failing to parse even basic, well-formed XHTML docs.
commit 18b22c917efef90e21b99833f6db5ff96f2fe695 1 parent e22894e
@lethain authored
Showing with 18 additions and 1 deletion.
  1. +18 −1 README.rst
19 README.rst
@@ -50,7 +50,24 @@ The simplest way to install Extraction is via PyPi::
pip install extraction
-If you want to develop against extraction, you can install from GitHub::
+You'll also need to install a parser for `BeautifulSoup4 <http://www.crummy.com/software/BeautifulSoup/>`_,
+and while ``extraction`` already pulls in `html5lib <http://code.google.com/p/html5lib/>`_
+through its requirements, I strongly recommend installing `lxml <http://lxml.de/>`_ as well,
+because ``html5lib`` has some extremely gnarly issues with
+failing to parse XHTML pages. For example, a PyPI page fails to parse entirely
+with ``html5lib``::
+
+ >>> bs4.BeautifulSoup(text, ["html5lib"]).find_all("title")
+ []
+ >>> bs4.BeautifulSoup(text, ["lxml"]).find_all("title")
+ [<title>extraction 0.1.3 : Python Package Index</title>]
+
+You should be able to install `lxml <http://lxml.de/>`_ via pip::
+
+ pip install lxml
+
+If you want to develop extraction, then after installing ``lxml``,
+you can install from GitHub::
git clone
cd extraction
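
The README change above recommends ``lxml`` over ``html5lib``. A minimal sketch of how calling code might act on that recommendation, assuming you want to pick whichever parser is actually installed. The helper name ``pick_parser`` is hypothetical and is not part of ``extraction``; ``html.parser`` is the parser BeautifulSoup4 can use from the Python standard library.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser name in `preferred` whose module is importable.

    Hypothetical helper for illustration; not part of extraction's API.
    Falls back to "html.parser", which needs no third-party package.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"
```

The result can be passed as the parser argument, for example ``bs4.BeautifulSoup(text, pick_parser())``. Note that this only checks whether a parser is installed; it does not detect the case where ``html5lib`` is present but fails on a particular XHTML page.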