Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Latest commit 9c50d4315a, authored by misja: Changing chardet to charade as dependency.


A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.


Dependencies: jpype, charade

The boilerpipe jar files will get fetched and included automatically when building the package.


Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
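As a quick sanity check before importing the package, the JAVA_HOME setting can be verified from Python. This is a minimal sketch using only the standard library; check_java_home is a name invented for this example and is not part of boilerpipe or jpype. It only confirms that the variable is set and points to a directory, not that the directory holds a usable JVM.

```python
import os


def check_java_home(environ=os.environ):
    """Hypothetical helper: return JAVA_HOME, or raise if it is unusable."""
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        raise RuntimeError("JAVA_HOME is not set")
    if not os.path.isdir(java_home):
        raise RuntimeError(
            "JAVA_HOME does not point to a directory: %s" % java_home
        )
    return java_home
```

Calling check_java_home() with no arguments inspects the current environment; passing a dict makes the check easy to exercise in isolation.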

The constructor takes a keyword argument extractor, which must be one of the available boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed, the DefaultExtractor is used. The input is given through one of two additional keyword arguments: html for HTML text, or url.

from boilerpipe.extract import Extractor
# your_url is a placeholder for the address of the page to process
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()
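
The example above uses the url keyword. As a hedged sketch of the html keyword, the helper below checks the extractor name against the list in this README before building the Extractor from an HTML string. The names extract_text and KNOWN_EXTRACTORS are invented for this example and are not part of the package. The sketch assumes the package, jpype and a working JAVA_HOME are in place; the import is deferred so that the name check itself runs without them.

```python
# Extractor type names as listed in this README.
KNOWN_EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extract_text(html, extractor_name="DefaultExtractor"):
    """Hypothetical helper: validate the name, then extract text from HTML."""
    if extractor_name not in KNOWN_EXTRACTORS:
        raise ValueError("unknown extractor: %r" % (extractor_name,))
    # Deferred import: needs the boilerpipe package, jpype and JAVA_HOME.
    from boilerpipe.extract import Extractor
    extractor = Extractor(extractor=extractor_name, html=html)
    return extractor.getText()
```

With everything installed, extract_text(page_source, "ArticleExtractor") would return the extracted text for an HTML string held in page_source (a placeholder name).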