fast python port of arc90's readability tool, updated to match latest readability.js!
This branch is 31 commits ahead, 62 commits behind buriy:master.
Latest commit 7dc373e (Apr 21, 2012, @mitechie): Add the title and the short title to the metadata set.
- Tested for perf. hit, 100 iterations add .03s total time.
- Added the -m flag to the cmd line client to get all metadata output.
- Added test for making sure title/short title come back as well.

readability_lxml

This is a Python port of a Ruby port of arc90's readability project.

Given an HTML document, it pulls out the main body text and cleans it up. It can also clean up the title, based on the latest readability.js code.

Inspiration

Try it out!

You can try out the parser by entering your test URLs on the following test service:

http://readable.bmark.us

Installation

$ easy_install readability-lxml
# or
$ pip install readability-lxml

Usage

Command Line Client

$ readability http://pypi.python.org/pypi/readability-lxml
$ readability /home/rharding/sampledoc.html

As a Library

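from readability.readability import Document
import urllib  # Python 2 API; on Python 3, urlopen lives in urllib.request

url = 'http://pypi.python.org/pypi/readability-lxml'  # any page URL
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
readable_title = Document(html).short_title()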

You can also use the summary_with_metadata method to get back other metadata, such as the confidence score computed while processing the input.

doc = Document(html).summary_with_metadata()
print doc.html
print doc.confidence
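
The confidence score can be used to decide whether to trust the extracted article. Below is a minimal sketch, assuming only that the result exposes the `html` and `confidence` attributes used in the snippet above. The `usable_html` helper, the `0.5` threshold, and the `FakeSummary` stand-in are hypothetical and not part of the library. The scale of the score is not documented here, so a real threshold would need to be chosen empirically.

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by summary_with_metadata();
# only the two attributes shown in the README (html, confidence) are assumed.
FakeSummary = namedtuple("FakeSummary", ["html", "confidence"])


def usable_html(summary, min_confidence=0.5):
    """Return summary.html if the confidence clears the threshold, else None.

    min_confidence is an arbitrary illustrative value, not a library default.
    """
    if summary.confidence >= min_confidence:
        return summary.html
    return None


good = FakeSummary(html="<div>article</div>", confidence=0.9)
poor = FakeSummary(html="<div>nav</div>", confidence=0.1)
print(usable_html(good))  # the article markup
print(usable_html(poor))  # None
```

With the real library, the value returned by `Document(html).summary_with_metadata()` would take the place of the stand-in.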

Optional Document keyword arguments:

  • attributes:
  • debug: output debug messages
  • min_text_length:
  • retry_length:
  • url: will allow adjusting links to be absolute
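
To illustrate what making links absolute means for the `url` argument, here is a sketch using only the standard library's `urljoin`. It shows the general idea of resolving relative links against a base URL. It is not the library's own implementation, and the `absolutize` helper and the example URLs are made up for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolutize(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


# Illustrative values only.
base = "http://example.com/blog/post.html"
links = ["/about", "images/pic.png", "http://other.example.org/x"]
for link in absolutize(base, links):
    print(link)
```

Root-relative links resolve against the host, path-relative links resolve against the directory of the base document, and links that are already absolute are left alone.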

Test and Build Status

Tests are run against the package at:

http://build.bmark.us/job/readability-lxml/

You can view it for build history and test status.

History

  • 0.2.5 Update setup.py for uploading .tar.gz to pypi