Interface to the boilerpipe Java library by Christian Kohlschütter (http://code.google.com/p/boilerpipe/)

boilerpipeR

boilerpipeR is an R package that provides an interface to boilerpipe, a Java library written by Christian Kohlschütter [1]. It extracts the main text content from HTML documents, stripping boilerplate such as ads, sidebars and headers. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.

Install

To install the latest release from CRAN, run:

install.packages("boilerpipeR")

Using the devtools package, you can install the latest development version of boilerpipeR from GitHub:

library(devtools)
install_github("mannau/boilerpipeR")

Windows users need to use the following command to install from GitHub:

library(devtools)
install_github("mannau/boilerpipeR", args = "--no-multiarch")

Usage

To download a page and extract its main text, for example from the RStudio blog, use the following commands:

library(boilerpipeR)

url <- "http://blog.rstudio.org/2014/05/09/reshape2-1-4/"
maintext <- ArticleExtractor(url, asText = FALSE)
cat(maintext)
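
If the HTML is already in memory, it can be passed to the extractor directly instead of a URL. The following is a minimal sketch, assuming that `asText = TRUE` makes `ArticleExtractor` treat its first argument as markup rather than as a URL (the counterpart of the `asText = FALSE` call above). The `html` string is a made-up document for illustration only, and the exact text returned depends on the boilerpipe heuristics.

```r
library(boilerpipeR)

# Hypothetical HTML document built in memory (not part of the package).
# The repeated sentence gives the content block enough text to stand out
# from the navigation and footer blocks.
article <- paste(rep("This is the main article text of the page.", 20),
                 collapse = " ")
html <- paste0(
  "<html><head><title>Example</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='content'><p>", article, "</p></div>",
  "<div id='footer'>Copyright notice</div>",
  "</body></html>"
)

# asText = TRUE (assumed here): `html` is the markup itself, not a URL.
maintext <- ArticleExtractor(html, asText = TRUE)
cat(maintext)
```

This form is useful when pages have been downloaded separately, for example to reuse an existing HTTP session or to process files from disk.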

References

[1] Christian Kohlschütter, Exploiting Links and Text Structure on the Web — A Quantitative Approach to Improving Search Quality, PhD Thesis

License

boilerpipe and boilerpipeR are both released under the Apache License, Version 2.0.
