# r-cmscu

An R source package implementing the Count-Min-Sketch with conservative updating.

## Source Distribution

The tarball `cmscu.tar.gz` in this repository contains the latest packaged source. It can be installed via RStudio, or with `R CMD INSTALL cmscu.tar.gz` from the command line. Note: the `Rcpp` package is required before installation, e.g. `install.packages('Rcpp')`.

Further note: the parallel features require OpenMP support in your build environment. In practice, this means the parallel functions typically will not run in parallel unless the package is built on Linux. Even so, it is still considerably faster to run `dict$query(list_of_strings)` than `sapply(list_of_strings, dict$query)` for large lists, because the single vectorized call avoids per-element overhead in R.

## Sample Usage

```r
require(cmscu);
require(tm);
require(rjson);

# read in the official Yelp dataset review file
# (the connection stays open, so these calls pull the first and second
#  thousand reviews, respectively)
yelp <- file("yelp_academic_dataset_review.json", "r", FALSE);
training <- sapply(readLines(con = yelp, n = 1000), function(review) fromJSON(review)$text, USE.NAMES = FALSE);
testing  <- sapply(readLines(con = yelp, n = 1000), function(review) fromJSON(review)$text, USE.NAMES = FALSE);
close(yelp);

# use the tm package to clean up some text from the Yelp review dataset
clean_data <- function(data) {
    tcorpus <- VCorpus(VectorSource(data));
    tcorpus <- tm_map(tcorpus, stripWhitespace);
    tcorpus <- tm_map(tcorpus, content_transformer(tolower));
    tcorpus <- tm_map(tcorpus, removePunctuation);
    tcorpus <- tm_map(tcorpus, stemDocument);
    return(tcorpus);
}

# instead of using a DocumentTermMatrix from tm, we use the cmscu package
dict <- new(FrequencyDictionary, 4, 2 ^ 20); # 4 hash tables x 2^20 counters: about 16MB of RAM
invisible(sapply(clean_data(training), function(review) {
    words <- unlist(strsplit(as.character(review), " ", fixed = TRUE), use.names = FALSE);
    bigrams <- do.call(paste, list(head(words, -1), tail(words, -1)));
    dict$store(bigrams);
}));
```
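As an aside, the `head`/`tail` idiom used above is a compact way to form adjacent bigrams from a token vector, by pasting the vector against itself shifted by one:

```r
# zip each word with its successor to build adjacent word pairs
words <- c("the", "cat", "sat", "down");
bigrams <- do.call(paste, list(head(words, -1), tail(words, -1)));
print(bigrams);  # "the cat" "cat sat" "sat down"
```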

```r
# some diagnostics:
#   we hope dict$uncertainty is small (less than 1, or whatever your application requires)
#   this is an upper bound on our over-estimation, dict$confidence% of the time.
if (dict$uncertainty >= 1) {
    print("Consider increasing the number of entries per hash table in the dictionary");
}
print(c(dict$uncertainty, dict$confidence));

# finally, let's look at the distribution of mean information density across 1000 reviews
normalization <- log2(dict$entries + 0.5 * dict$unique_entries);
info <- sapply(clean_data(testing), function(review) {
    words <- unlist(strsplit(as.character(review), " ", fixed = TRUE), use.names = FALSE);
    bigrams <- do.call(paste, list(head(words, -1), tail(words, -1)));

    # use Dirichlet smoothing (add 0.5) to avoid divide-by-zero problems
    return(-mean(log2(dict$query(bigrams) + 0.5) - normalization));
}, USE.NAMES = FALSE);

# plot the histogram
hist(info);
```
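To build intuition for what `FrequencyDictionary` does internally, here is a minimal base-R sketch of a Count-Min sketch with conservative updating. The hashing scheme and layout below are purely illustrative (the package itself is C++ and uses MurmurHash), but the update rule is the same idea: only counters sitting at the current minimum are raised, which tightens over-estimates compared to the plain Count-Min update.

```r
# A toy Count-Min sketch: d rows of w counters, one hash function per row.
cms_new <- function(d, w) {
  list(d = d, w = w,
       table = matrix(0L, nrow = d, ncol = w),
       seeds = seq_len(d) * 7919L)  # illustrative per-row seeds
}

# simple polynomial rolling hash per row, reduced modulo w
cms_hash <- function(sketch, key) {
  codes <- utf8ToInt(key)
  vapply(sketch$seeds, function(seed) {
    h <- 0
    for (ch in codes) h <- (h * seed + ch) %% 2147483647
    as.integer(h %% sketch$w) + 1L
  }, integer(1))
}

cms_store <- function(sketch, key) {
  idx <- cms_hash(sketch, key)
  cells <- cbind(seq_len(sketch$d), idx)
  counts <- sketch$table[cells]
  # conservative update: raise only the counters equal to the current minimum
  sketch$table[cells] <- pmax(counts, min(counts) + 1L)
  sketch
}

cms_query <- function(sketch, key) {
  idx <- cms_hash(sketch, key)
  min(sketch$table[cbind(seq_len(sketch$d), idx)])
}

# usage: counts are never under-estimated; collisions can only inflate them
s <- cms_new(4, 1024)
for (w in c("the cat", "the cat", "the cat", "cat sat")) s <- cms_store(s, w)
print(cms_query(s, "the cat"))  # 3, barring hash collisions (never less)
```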

## Extras

The file `mkn.R` contains an implementation of the modified Kneser-Ney smoothing algorithm powered by the cmscu library. It is still under development and not fully documented, so use at your own risk! With that said:

```r
# initialize the smoothing for ngrams of up to order 3
# (the remaining arguments are passed to each of the underlying cmscu
#  objects, of which there are 2*n, where n is the order of the ngram)
smoother <- mkn(3, 4, 2^20);
# a_sentence is a character vector of words
train(smoother, a_sentence);
# train can be called many, many times
train(smoother, another_sentence);
# finalize when there is no more data
smoother <- finalize(smoother);
# smoother is now a function which takes a character vector
# of trigram tokens and returns their probabilities
# (note: this includes all suffixing-ngram probabilities)
```
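For reference, the standard formulation of modified Kneser-Ney smoothing (Chen & Goodman, 1998), which an implementation like `mkn.R` follows (possibly differing in details), interpolates a discounted highest-order estimate with a recursively smoothed lower-order distribution:

$$
P_{\mathrm{MKN}}(w_i \mid w_{i-n+1}^{i-1}) \;=\; \frac{c(w_{i-n+1}^{i}) - D\!\left(c(w_{i-n+1}^{i})\right)}{\sum_{w'} c(w_{i-n+1}^{i-1} w')} \;+\; \gamma(w_{i-n+1}^{i-1})\, P_{\mathrm{MKN}}(w_i \mid w_{i-n+2}^{i-1})
$$

where the discount $D(c)$ takes distinct values $D_1$, $D_2$, $D_{3+}$ for ngrams seen once, twice, or three or more times ($D(0)=0$), and $\gamma$ is chosen so the probabilities sum to one; lower orders use continuation counts rather than raw counts.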

## Acknowledgments

The authors of this software gratefully acknowledge funding from NSF grant #1344279. We further thank the author of MurmurHash for the public-domain implementation of the hash function used in this software.