
preText

An R package to assess the consequences of text preprocessing decisions.

To get started, see the getting started with preText vignette.

The procedure is detailed in the following paper:

  • Matthew J. Denny and Arthur Spirling (2017). "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It".


The easiest way to install preText is from CRAN, via the standard install.packages command:
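
As a minimal sketch, assuming the package is still published on CRAN under the name preText, the call would look like this:

```r
# Install the released version of preText from CRAN.
# Assumption: the package is available on CRAN under this exact name.
install.packages("preText")
```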


If you want to get the latest version from GitHub, start by checking out the Requirements for using C++ code with R section in the following tutorial: Using C++ and R code Together with Rcpp. You will likely need to install either Xcode or Rtools depending on whether you are using a Mac or Windows machine before you can install the preText package via GitHub, since it makes use of C++ code.


Now we can install from GitHub using the following line:
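
A minimal sketch of this step, assuming the devtools package is used for the install. The repository path matthewjdenny/preText is not stated in this README, so treat it as an assumption and substitute the actual path if it differs:

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Build and install preText from source. This compiles C++ code, so it
# needs a working toolchain (Xcode on a Mac, Rtools on Windows).
# Assumption: "matthewjdenny/preText" is the repository path.
devtools::install_github("matthewjdenny/preText")
```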


Once the preText package is installed, you may access its functionality as you would any other package by calling:
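
For example, assuming the installation above succeeded:

```r
# Attach the package so its functions are available in the current session.
library(preText)
```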


If all went well, you should be able to replicate the steps in the vignette("getting_started").

Basic Usage

The basic functionality of this package is detailed in the getting started vignette. Beyond this basic functionality, the package includes a number of additional utility and analysis functions for exploring and comparing multiple document-term matrices.
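
The workflow might be sketched as follows. This is an illustration, not a substitute for the vignette: it assumes quanteda is installed and ships the data_corpus_inaugural example corpus, and that factorial_preprocessing(), preText() and preText_score_plot() accept the arguments shown. Argument names and defaults may differ between package versions, so check the help pages before relying on them.

```r
library(preText)
library(quanteda)

# Example data: U.S. presidential inaugural speeches bundled with quanteda.
# Only the first 10 documents are used to keep the run time short.
# Depending on the quanteda version this subset is a corpus or a character
# vector; the sketch assumes factorial_preprocessing() accepts either.
documents <- data_corpus_inaugural[1:10]

# Build one document-term matrix per combination of preprocessing choices.
# n-grams are switched off here to reduce the number of combinations.
preprocessed_documents <- factorial_preprocessing(
  documents,
  use_ngrams = FALSE,
  infrequent_term_threshold = 0.2,
  verbose = FALSE
)

# Inspect which preprocessing choices each combination corresponds to.
head(preprocessed_documents$choices)

# Score how unusual the results of each preprocessing combination are.
preText_results <- preText(
  preprocessed_documents,
  dataset_name = "Inaugural Speeches",
  distance_method = "cosine",
  num_comparisons = 20,
  verbose = FALSE
)

# Plot the preText score for every preprocessing combination.
preText_score_plot(preText_results)
```

Raising num_comparisons or enabling n-grams gives a fuller picture at the cost of a longer run.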

Bug Reporting