Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages


Changing chardet to charade as dependency.
latest commit 9c50d4315a
misja authored

python-boilerpipe

A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies: jpype, charade

The boilerpipe jar files are fetched and included automatically when the package is built.

Usage

Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
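A quick way to verify this before importing the package is to inspect the environment from Python. The following is a minimal sketch, not part of python-boilerpipe: the helper name check_java_home is hypothetical, and it only checks that the variable exists and points to a directory, not that the directory holds a usable Java installation.

```python
import os


def check_java_home(environ=os.environ):
    """Hypothetical helper: report whether JAVA_HOME looks usable.

    Returns a (ok, detail) tuple. Only the presence of the variable and
    the existence of the directory are checked.
    """
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, "JAVA_HOME does not point to a directory: %s" % java_home
    return True, java_home


if __name__ == "__main__":
    ok, detail = check_java_home()
    print("OK: %s" % detail if ok else "Problem: %s" % detail)
```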

The constructor takes a keyword argument extractor, which must be one of the available boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed, the DefaultExtractor is used. The input is supplied through one further keyword argument: either html, for HTML text, or url.

from boilerpipe.extract import Extractor
# your_url is a placeholder for the address of the page to process
extractor = Extractor(extractor='ArticleExtractor', url=your_url)
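The html keyword can be used the same way when the markup is already in memory. The sketch below wraps the argument rules described above in a small helper; the names EXTRACTOR_TYPES, extractor_kwargs and extract_text are hypothetical and not part of python-boilerpipe, and the rule that exactly one of html or url must be given is an assumption of this sketch rather than documented library behaviour.

```python
# Extractor type names as listed in this README.
EXTRACTOR_TYPES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def extractor_kwargs(extractor="DefaultExtractor", html=None, url=None):
    """Hypothetical helper: validate arguments before calling Extractor."""
    if extractor not in EXTRACTOR_TYPES:
        raise ValueError("unknown extractor type: %r" % (extractor,))
    # Assumption of this sketch: require exactly one input source.
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html or url")
    kwargs = {"extractor": extractor}
    if html is not None:
        kwargs["html"] = html
    else:
        kwargs["url"] = url
    return kwargs


def extract_text(**options):
    """Build an Extractor from validated options and return its text."""
    # Imported lazily: boilerpipe needs jpype and a working Java setup.
    from boilerpipe.extract import Extractor
    return Extractor(**extractor_kwargs(**options)).getText()
```

With boilerpipe installed, a call such as extract_text(extractor='ArticleExtractor', html=page_source) would then return the extracted text, where page_source is a placeholder for an HTML string.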

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()