My fork of the boilerpipe code base
Java
branch: master

javadoc
lib
src
.gitignore
INSTALL.txt
LICENSE.txt
NOTICE.txt
README.md
boilerpipe.pom
build.xml

README.md

Summary

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example, news article extraction) and can also be easily extended to individual problem settings.

Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
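To make "needs only the input document" more concrete, here is a minimal sketch of how a page can be filtered using nothing but features local to each text block. It is a toy illustration, not boilerpipe's actual algorithm or API: the class `BoilerplateSketch`, the `Block` type, and the thresholds `MIN_WORDS` and `MAX_LINK_DENSITY` are all hypothetical names and values chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from boilerpipe.
final class BoilerplateSketch {

    // Assumed thresholds, picked for the example rather than tuned on data.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** One block of page text plus the number of its words inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    /** A block counts as content if it is long enough and not mostly links. */
    static boolean isContent(Block b) {
        return b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the blocks classified as content, one per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text.trim());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sport Weather", 4)); // navigation bar
        page.add(new Block("The city council voted on Tuesday to extend the tram "
                + "line to the harbour district after months of debate.", 0));
        page.add(new Block("Copyright 2014 Example Media", 0)); // footer
        System.out.println(extract(page));
    }
}
```

Each decision here looks only at the block itself, so no knowledge of other pages or of the site's template is needed. For real use, rely on the extractors shipped with the library rather than on a heuristic like this one.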

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

Note

The original boilerpipe repository is hosted on Google Code at http://code.google.com/p/boilerpipe/. This is just my copy of the code base, which I'm using to understand how it was designed.
