Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
Misja Hoebe authored

python-boilerpipe

A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies: jpype, charade

The boilerpipe jar files are fetched and included automatically when the package is built.

Usage

Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
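
A quick way to check this setting before importing the wrapper is sketched below. The helper name java_home_is_set is hypothetical and not part of python-boilerpipe; it only tests that the variable is non-empty and points at an existing directory, which is an assumption about what a proper setting means and does not prove that a usable JVM lives there.

```python
import os


def java_home_is_set(environ=os.environ):
    """Return True if JAVA_HOME is non-empty and names an existing directory.

    Hypothetical helper, not part of python-boilerpipe. It does not verify
    that the directory actually contains a working JVM.
    """
    java_home = environ.get("JAVA_HOME", "")
    return bool(java_home) and os.path.isdir(java_home)


if __name__ == "__main__":
    if not java_home_is_set():
        print("JAVA_HOME is missing or does not point to a directory")
```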

The constructor takes a keyword argument extractor, which must be one of the available boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed, the DefaultExtractor is used. The input is supplied through one of two additional keyword arguments: html for HTML text, or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()
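
To catch a misspelled extractor name or a missing input before the JVM is started, the arguments can be checked first. The following is a minimal sketch: build_extractor_kwargs and extract_text are hypothetical helpers, not part of python-boilerpipe, and the rule that exactly one of html or url must be given is an assumption made for this example.

```python
# Extractor type names as listed in this README.
EXTRACTOR_TYPES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def build_extractor_kwargs(extractor="DefaultExtractor", html=None, url=None):
    """Validate the arguments and return the keyword dict for Extractor().

    Hypothetical helper. Requiring exactly one of html/url is an assumption
    of this sketch, not documented behaviour of the wrapper.
    """
    if extractor not in EXTRACTOR_TYPES:
        raise ValueError("unknown extractor type: %r" % (extractor,))
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html or url")
    kwargs = {"extractor": extractor}
    if html is not None:
        kwargs["html"] = html
    else:
        kwargs["url"] = url
    return kwargs


def extract_text(**options):
    """Validate options, then run the extraction and return the text."""
    # Imported lazily because the import needs jpype and a JVM (JAVA_HOME).
    from boilerpipe.extract import Extractor
    return Extractor(**build_extractor_kwargs(**options)).getText()
```

With these helpers, extract_text(extractor='ArticleExtractor', html=page_source) would behave like the calls above, where page_source stands for an HTML string you already hold.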