Capstone n-grams: how much processing power is required?
Students in the Johns Hopkins University Data Science Specialization Capstone course typically struggle with the course project because of the amount of memory consumed by the objects needed to analyze text. To reduce the guesswork about memory utilization, here is a table that illustrates the amount of RAM consumed by the objects required to analyze the files for the SwiftKey-sponsored capstone: predicting text.
To assess our ability to process the complete corpus of data, we used an HP Omen laptop with the following specifications.
|HP Omen laptop||
All text processing was completed with the quanteda package. The three input files for blogs, news, and Twitter data were read as character strings and combined into a single object that was used as input to the corpus() function. The total number of texts processed across the combined file is 4,269,678.
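As a rough illustration of the workflow just described, here is a minimal sketch on three tiny stand-in strings. The variable names and sample texts are hypothetical, and tokens() and tokens_ngrams() are the function names used by recent quanteda releases, which may differ from the version used for the timings in this article.

```r
library(quanteda)

# Hypothetical stand-ins for lines read from the blogs, news, and twitter files.
blogs   <- c("The cat sat on the mat.")
news    <- c("Markets rose sharply on Monday.")
twitter <- c("so happy it is friday")

# Combine the three character vectors and build a single corpus.
all_texts <- c(blogs, news, twitter)
capstone_corpus <- corpus(all_texts)
print(ndoc(capstone_corpus))   # number of texts in the corpus

# Tokenize, then build 2-grams (function names from recent quanteda releases).
toks    <- tokens(capstone_corpus, remove_punct = TRUE)
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
print(head(as.list(bigrams), 1))
```

On the full data set the same calls would operate on millions of texts, which is where the memory figures below come from.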
Note that due to the size of the objects, a machine with a minimum of 16Gb of RAM is required to process the entire data set. The tokenized texts consume about 5.1Gb of RAM and must remain in memory to serve as input to the quanteda::ngrams() function. Therefore, at least two objects are in memory at any time: the tokenized texts and the output ngrams object. Since the object output by ngrams() is also over 5Gb, one must be judicious about deleting objects that are no longer needed before progressing to subsequent steps, to avoid running out of memory even on a machine with 16Gb of RAM.
|Activity||Memory Used||Processing Time|
|Load data from the three raw data files into a corpus||1.0Gb||37 seconds|
|Tokenize corpus using ||1.3Gb||509 seconds|
|Build 2-grams||6.3Gb||619 seconds|
|Build 3-grams||6.5Gb||894 seconds|
|Build 4-grams||6.5Gb||925 seconds|
|Build 5-grams||6.3Gb||930 seconds|
|Build 6-grams||6.1Gb||1,007 seconds|
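The advice about deleting objects that are no longer needed can be followed with base R tools. The sketch below uses a small stand-in vector. The names tokenized_texts and report_size are hypothetical, and the sizes it prints are unrelated to the figures in the table.

```r
# Stand-in for a large intermediate object (hypothetical name and data).
tokenized_texts <- rep(c("the", "quick", "brown", "fox"), times = 250000)

# Hypothetical helper that formats an object's size in megabytes.
report_size <- function(x) format(object.size(x), units = "Mb")

print(report_size(tokenized_texts))

# Once the n-grams built from this object have been saved, remove it and
# trigger garbage collection so the memory can be reused.
rm(tokenized_texts)
invisible(gc())

print(exists("tokenized_texts"))
```

Checking object sizes before each step makes it easier to decide which objects to remove before the next n-gram size is built.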
Processing with Less Memory
Most students do not have 16Gb of RAM on the computers they use for the Capstone project. In this situation, students have two options for processing the data: sampling, and iterative processing.
The sampling approach is relatively straightforward: take a random sample of the documents, and perform subsequent steps against the sampled documents.
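A minimal sketch of the sampling step in base R follows. The all_texts vector is a hypothetical stand-in for the combined input lines, and the seed value and 25% fraction are illustrative choices.

```r
set.seed(20170812)  # fixed seed so the sample is reproducible

# Stand-in for the combined blogs, news, and twitter lines (hypothetical data).
all_texts <- sprintf("document number %d", 1:1000)

# Draw a simple random sample of documents without replacement.
sample_fraction <- 0.25
keep <- sample(length(all_texts),
               size = floor(sample_fraction * length(all_texts)))
sampled_texts <- all_texts[keep]

print(length(sampled_texts))
```

All subsequent steps (corpus, tokenization, n-grams) would then run against sampled_texts instead of the full set of documents.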
The iterative approach is more complex because one must complete the following steps in sequence to process the data.
1. Break the incoming documents into n groups, each of which is small enough to process within the RAM limits of the computer used for the analysis.
2. For each group, complete the following steps:
   - Build the corpus
   - Tokenize the corpus
   - Generate n-grams of varying sizes
3. Assemble the subcomponent files by n-gram size, and break the n-grams into base and predicted words.
4. Aggregate to summarize each n-gram file into frequencies by base.
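The iterative steps above can be sketched in base R as follows. This is only an illustration on four hypothetical documents. The make_ngrams() and count_group() helpers are stand-ins written for this example, not quanteda functions, and a real run would write each group's results to disk rather than keep them in a list.

```r
# Hypothetical stand-in documents; the real input would be the corpus lines.
docs <- c("the cat sat on the mat",
          "the cat ate the fish",
          "a dog sat on the mat",
          "the dog ate the bone")

# Build word n-grams for one document with base R (stand-in for quanteda).
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Break the documents into groups small enough for the available RAM.
chunk_size <- 2
groups <- split(docs, ceiling(seq_along(docs) / chunk_size))

# Process each group separately, keeping only its n-gram counts.
count_group <- function(texts, n) {
  grams <- unlist(lapply(texts, make_ngrams, n = n))
  as.data.frame(table(ngram = grams), responseName = "freq",
                stringsAsFactors = FALSE)
}
partial <- lapply(groups, count_group, n = 3)

# Assemble the partial results and split base words from the predicted word.
combined <- do.call(rbind, partial)
combined$base <- sub(" [^ ]+$", "", combined$ngram)
combined$predicted <- sub("^.* ", "", combined$ngram)

# Aggregate frequencies across groups by base and predicted word.
freqs <- aggregate(freq ~ base + predicted, data = combined, FUN = sum)
freqs <- freqs[order(freqs$base, -freqs$freq), ]
print(freqs)
```

Because only the counts from each group are retained, peak memory is governed by chunk_size rather than by the size of the full corpus.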
Depending on the RAM available on one's computer, this approach can take a long time. The total clock time also increases as the available RAM decreases: a machine with 2Gb of RAM requires smaller processing chunks, and therefore more clock time, than a machine with 4Gb or 8Gb of RAM.
Example: Sampling Approach on MacBook Pro
The performance timings in this section were taken on a MacBook Pro with the following configuration.
|Apple Macbook Pro||
As stated above, we need to sample at a level where the combined size of the tokenized corpus and the resulting n-gram object is less than the amount of RAM available on the machine. We selected a 25% sample, resulting in a tokenized words object of 1.4Gb in size. Since we expect the resulting n-gram objects to be 1.5 to 3 times the size of the tokenized words object, a 25% sample will process within the 8Gb of RAM on the MacBook Pro we used to generate n-grams. Summarizing in the same manner as we did with the analysis on the HP Omen, here are the object sizes and timings for the 25% sample.
|Activity||Memory Used||Processing Time|
|Load data from the three raw data files into a corpus||265 Mb||6 seconds|
|Tokenize corpus using ||1.4Gb||59 seconds|
|Build 2-grams||2.0Gb||79 seconds|
|Build 3-grams||2.9Gb||162 seconds|
|Build 4-grams||3.6Gb||420 seconds|
|Build 5-grams||3.9Gb||339 seconds|
|Build 6-grams||4.0Gb||343 seconds|
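The reasoning behind the 25% sample can be checked with a little arithmetic. The sketch below assumes that peak memory is roughly the tokenized object plus one n-gram object, and it ignores memory used by the operating system and by R itself. The function name peak_memory_gb is hypothetical.

```r
# Estimated peak memory (in Gb) when the tokenized object and one n-gram
# object are in memory together. ngram_multiplier is the assumed ratio of
# n-gram object size to tokenized object size (1.5 to 3 in the text above).
peak_memory_gb <- function(tokens_gb, ngram_multiplier) {
  tokens_gb * (1 + ngram_multiplier)
}

tokens_gb_25pct <- 1.4   # tokenized size reported for the 25% sample
ram_gb <- 8              # RAM on the MacBook Pro

low  <- peak_memory_gb(tokens_gb_25pct, 1.5)
high <- peak_memory_gb(tokens_gb_25pct, 3)
cat(sprintf("Estimated peak: %.1f to %.1f Gb (RAM: %.0f Gb)\n",
            low, high, ram_gb))
```

Under these assumptions the estimate ranges from 3.5 to 5.6Gb, which leaves headroom within 8Gb. The same function can be used to test other sample fractions before committing to a long run.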
Appendix: Choosing a Text Analysis Package for the Capstone
Given the diversity of R packages (over 9,000 available as of May 2017) and the popularity of natural language processing as a domain for data science, students have a wide variety of R packages from which to choose for the project.
There are two key considerations for selecting a package to use during the Capstone project: features and performance. First, does a particular package have the features one needs to complete the required tasks? Feature-rich packages allow students to spend more time understanding the data instead of manually coding algorithms in R. Second, how fast does the package complete the work, given the amount of data to be analyzed? For the Capstone project, the data includes a total of 4,269,678 texts, as stated earlier in the article.
R conducts all of its processing in memory (rather than on disk), so the text algorithms must be able to fit the data in memory in order to process them. Text mining packages that use memory efficiently will handle larger problems than those that use memory less efficiently. In practical terms, R packages that use C/C++ tend to be more efficient, handle larger problems, and run faster than those that use Java.
The CRAN Task View for Natural Language Processing provides a comprehensive list of packages that can be used for textual analysis with R. Some of the packages used by students during the Capstone course include:
Each package has its strengths and weaknesses. For example, ngram is fast but its capability is limited solely to the production of ngrams. tm has a broader set of text mining features, but has significantly slower performance and does not scale well to a large corpus such as the one we must use for the Capstone project.
Why use quanteda?
quanteda provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis. Quoting Kenneth Benoit from the quanteda GitHub README:
"Built for efficiency and speed. All of the functions in quanteda are built for maximum performance and scale while still being as R-based as possible. The package makes use of three efficient architectural elements: the stringi package for text processing, the Matrix package for sparse matrix objects, and the data.table package for indexing large documents efficiently. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)"
The aspect of quanteda being "R like" is very useful, in contrast to packages like ngram. Also, since quanteda relies on data.table, it's particularly well suited to use for the Capstone. Why? data.table has features to index a data table so students can retrieve values by index rather than having to sequentially process an entire data frame to extract a small number of rows. Since the final deliverable for the Capstone project is a text prediction app written in Shiny, students will find data.table is an effective and efficient mechanism to use with a text prediction algorithm.
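To make the indexing point concrete, here is a minimal sketch of a keyed lookup with data.table. The table contents, the column names, and the predict_next() helper are hypothetical. A real prediction algorithm would also need a strategy for bases that do not appear in the table.

```r
library(data.table)

# Hypothetical n-gram frequency table: base words, predicted word, count.
ngram_freqs <- data.table(
  base      = c("sat on", "sat on", "on the", "on the", "ate the"),
  predicted = c("the", "a", "mat", "floor", "fish"),
  freq      = c(12L, 3L, 7L, 2L, 5L)
)

# Setting a key sorts the table by base and enables binary-search lookups.
setkey(ngram_freqs, base)

# Hypothetical helper: return the most frequent predicted words for a base.
predict_next <- function(dt, base_text, n = 3) {
  hits <- dt[.(base_text), nomatch = NULL]   # keyed lookup, no full scan
  head(hits[order(-freq)], n)$predicted
}

print(predict_next(ngram_freqs, "on the"))
```

Because the lookup uses the key rather than scanning every row, response time stays low even when the frequency table holds millions of n-grams, which matters for an interactive Shiny app.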
last updated: 12 August 2017