Skip to content

misja/python-boilerpipe

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

python-boilerpipe

A python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies:

  • jpype
  • chardet

The boilerpipe jar files will get fetched and included automatically when building the package.

Installation

Checkout the code:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe

virtualenv

virtualenv env
source env/bin/activate
pip install -r requirements.txt
python setup.py install

Fedora

sudo dnf install -y python2-jpype
sudo python setup.py install

Usage

Be sure to have set JAVA_HOME properly since jpype depends on this setting.

The constructor takes a keyword argument extractor, being one of the available boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed the DefaultExtractor will be used by default. Additional keyword arguments are either html for HTML text or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()

For KeepEverythingWithMinKWordsExtractor we have to specify kMin parameter, which defaults to 1 for now:

extractor = Extractor(extractor='KeepEverythingWithMinKWordsExtractor', url=your_url, kMin=20)

About

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages