My fork of the boilerpipe code base

Summary

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
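To make the idea of per-document extraction concrete, here is a minimal, self-contained sketch. It is not boilerpipe's code or API: the class `BoilerplateSketch`, its method names, and both thresholds are hypothetical and chosen only for illustration. It assumes a simple heuristic in which short blocks and blocks made up mostly of link text count as clutter, and it uses nothing but the input HTML string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy heuristic, NOT boilerpipe's actual algorithm or API.
class BoilerplateSketch {
    // Assumed thresholds, picked for this example rather than taken from the library.
    static final int MIN_WORDS = 5;
    static final double MAX_LINK_DENSITY = 0.33;

    // Opening or closing tags of common block-level elements act as block boundaries.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(?:p|div|li|ul|ol|td|tr|table|h[1-6]|br|body|html|head|title)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Counts whitespace-separated words after stripping any remaining tags.
    static int countWords(String markup) {
        String text = ANY_TAG.matcher(markup).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    // Fraction of a block's words that sit inside <a>...</a> elements.
    static double linkDensity(String block) {
        int total = countWords(block);
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(block);
        while (m.find()) {
            linked += countWords(m.group(1));
        }
        return (double) linked / total;
    }

    // A block counts as content if it is long enough and not dominated by links.
    static boolean isContent(String block) {
        return countWords(block) >= MIN_WORDS && linkDensity(block) <= MAX_LINK_DENSITY;
    }

    // Returns the text of all blocks classified as content, one block per line.
    static String extract(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            if (isContent(block)) {
                String text = ANY_TAG.matcher(block).replaceAll(" ").trim();
                kept.add(text.replaceAll("\\s+", " "));
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html =
            "<div><a href='/'>Home</a> <a href='/news'>News</a> <a href='/about'>About us</a></div>"
            + "<p>The city council approved the new budget on Tuesday after a long public debate.</p>"
            + "<div>Copyright 2013</div>";
        // The navigation bar and the footer are dropped; only the paragraph is printed.
        System.out.println(extract(html));
    }
}
```

The sketch looks at each block in isolation, which is why it needs no site-level information. A real extractor would use a proper HTML parser and a richer set of features than these two.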

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

Note

The original boilerpipe repository is hosted on Google Code at http://code.google.com/p/boilerpipe/. This is only my copy of the code base, which I am using to understand how the library is designed.