
Capstone n-grams: how much processing power is required?

Students in the Johns Hopkins University Data Science Specialization Capstone course often struggle with the course project because of the amount of memory consumed by the objects needed to analyze text. To reduce the guesswork about memory utilization, the tables below show the amount of RAM consumed by the objects required to analyze the files for the SwiftKey-sponsored capstone project: predicting text.

To assess our ability to process the complete corpus of data, we used an HP Omen laptop with the following specifications.

Computer Configuration
HP Omen laptop
  • Operating system: Microsoft Windows 10, 64-bit
  • Processor: Intel i7-4710HQ at 2.5 GHz, turbo up to 3.5 GHz, four cores with two threads each
  • Memory: 16 gigabytes
  • Disk: 512 gigabytes, solid state drive
  • Date built: December 2013

All text processing was completed with the quanteda package. The three input files for blogs, news, and twitter data were read as character strings and combined into a single object that was used as input to the corpus() function. The total number of texts processed across the combined file is 4,269,678.

Note that due to the size of the objects, a machine with a minimum of 16 GB of RAM is required to process the entire data set. The tokenized texts consume about 5.1 GB of RAM and must remain in memory so they can be used as input to the quanteda::ngrams() function. Therefore, at least two objects are in memory at any time: the tokenized texts and the output n-grams object. Since the object output by ngrams() is also over 5 GB, one must be judicious about deleting objects that are no longer needed before progressing to subsequent steps, to avoid running out of memory even on a machine with 16 GB of RAM.

| Activity | Memory Used | Processing Time |
|---|---|---|
| Load data from the three raw data files into a corpus | 1.0 GB | 37 seconds |
| Tokenize corpus using quanteda::tokenize() | 1.3 GB | 509 seconds |
| Build 2-grams | 6.3 GB | 619 seconds |
| Build 3-grams | 6.5 GB | 894 seconds |
| Build 4-grams | 6.5 GB | 925 seconds |
| Build 5-grams | 6.3 GB | 930 seconds |
| Build 6-grams | 6.1 GB | 1,007 seconds |

Processing with Less Memory

Most students do not have 16 GB of RAM on the computers they use for the Capstone project. In this situation, students have two options for processing the data: sampling and iterative processing.

The sampling approach is relatively straightforward: take a random sample of the documents, and perform subsequent steps against the sampled documents.
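A minimal sketch of the sampling step, written in Python for brevity rather than in R. The function name `sample_lines` and the fixed seed are illustrative assumptions, not part of the original analysis. Each document is kept with probability `fraction`, so the sample can be drawn in a single pass over the input.

```python
import random

def sample_lines(lines, fraction, seed=12345):
    """Keep each document with probability `fraction` (Bernoulli sampling)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Hypothetical stand-in for the combined blogs, news, and twitter texts.
documents = [f"document {i}" for i in range(10_000)]
sample = sample_lines(documents, 0.25)
print(len(sample))  # close to 2,500; the exact count depends on the seed
```

Fixing the seed matters in practice: rerunning the pipeline then produces the same sample, so object sizes and timings remain comparable between runs.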

The iterative approach is more complex because one must complete the following steps in sequence to process the data.

  1. Break the incoming documents into n groups, each of which is small enough to process within the RAM limits of the computer used for the analysis.

  2. For each group, complete the following steps:

    • Build the corpus
    • Tokenize the corpus
    • Generate n-grams of varying sizes

  3. Assemble the subcomponent files by n-gram size, and break the n-grams into base and predicted words

  4. Aggregate to summarize each n-gram file into frequencies by base
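
The steps above can be sketched as follows. This is a minimal illustration in Python, assuming a whitespace tokenizer and in-memory counters. The helper names (`chunked`, `ngrams`, `count_ngrams_in_chunks`) are hypothetical; a real run would use quanteda and would likely write each group's n-grams to disk before aggregating.

```python
from collections import Counter, defaultdict

def chunked(items, size):
    """Step 1: yield successive groups of at most `size` documents."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams_in_chunks(documents, n, chunk_size):
    """Steps 2-4: process one group at a time and aggregate by base."""
    by_base = defaultdict(Counter)
    for group in chunked(documents, chunk_size):
        for text in group:
            tokens = text.lower().split()  # simplistic tokenizer (assumption)
            for gram in ngrams(tokens, n):
                # Split each n-gram into its base words and the predicted word.
                base, predicted = gram[:-1], gram[-1]
                by_base[base][predicted] += 1
        # Only the running counts are carried over between groups.
    return by_base

docs = ["I love data science", "I love R", "you love data"]
counts = count_ngrams_in_chunks(docs, n=2, chunk_size=2)
print(counts[("love",)].most_common())
```

Because the counts are additive, the result does not depend on the group size; only peak memory and clock time do.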

Depending on the RAM available on one's computer, this approach can take a long time. The total clock time also increases as the available RAM decreases: a machine with 2 GB of RAM requires smaller processing chunks, and therefore more clock time, than a machine with 4 GB or 8 GB of RAM.

Example: Sampling Approach on MacBook Pro

The performance timings in this section were taken on a MacBook Pro with the following configuration.

Computer Configuration
Apple MacBook Pro
  • Operating system: macOS Sierra 10.12.6 (16G29)
  • Processor: Intel i5 at 2.6 GHz, turbo up to 3.3 GHz, two cores with two threads each
  • Memory: 8 gigabytes
  • Disk: 512 gigabytes, solid state drive
  • Date built: April 2013

As stated above, we need to sample at a level where the combined size of the tokenized corpus and the resulting n-gram object is less than the amount of RAM available on the machine. We selected a 25% sample, resulting in a tokenized words object of 1.4 GB. Since we expect the resulting n-gram objects to be 1.5 to 3 times the size of the tokenized words object, a 25% sample will process within the 8 GB of RAM on the MacBook Pro we used to generate n-grams. Summarizing in the same manner as for the HP Omen analysis, here are the object sizes and timings for the 25% sample.

| Activity | Memory Used | Processing Time |
|---|---|---|
| Load data from the three raw data files into a corpus | 265 MB | 6 seconds |
| Tokenize corpus using quanteda::tokenize() | 1.4 GB | 59 seconds |
| Build 2-grams | 2.0 GB | 79 seconds |
| Build 3-grams | 2.9 GB | 162 seconds |
| Build 4-grams | 3.6 GB | 420 seconds |
| Build 5-grams | 3.9 GB | 339 seconds |
| Build 6-grams | 4.0 GB | 343 seconds |
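
The sizing argument for the 25% sample can be checked with a short sketch (in Python, for illustration only). The function name `peak_memory_gb` is hypothetical, and the estimate ignores R's own overhead and any other objects in the session, which is a simplifying assumption.

```python
def peak_memory_gb(tokens_gb, ngram_ratio):
    """Tokenized texts and the n-gram output must coexist in RAM."""
    return tokens_gb + tokens_gb * ngram_ratio

tokens_gb = 1.4   # tokenized 25% sample, as reported in this section
ram_gb = 8.0      # memory on the MacBook Pro

for ratio in (1.5, 3.0):  # expected n-gram size relative to the tokens
    peak = peak_memory_gb(tokens_gb, ratio)
    print(f"ratio {ratio}: peak {peak:.1f} GB, fits in {ram_gb:.0f} GB: {peak < ram_gb}")
```

At the upper end of the expected range the estimate is 1.4 + 3 × 1.4 = 5.6 GB, which leaves headroom within 8 GB for the operating system and the R session itself.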

Appendix: Choosing a Text Analysis Package for the Capstone

Given the diversity of R packages (over 9,000 available as of May 2017) and the popularity of natural language processing as a domain for data science, students have a wide variety of R packages from which to choose for the project.

Key Considerations

There are two key considerations for selecting a package to use during the Capstone project: features and performance. First, does a particular package have the features one needs to complete the required tasks? Feature-rich packages allow students to spend more time understanding the data instead of manually coding algorithms in R. Second, how fast does the package complete the work, given the amount of data to be analyzed? For the Capstone project, the data includes a total of 4,269,678 texts, as stated earlier in the article.

R conducts all of its processing in memory (versus disk), so the text algorithms must be able to fit the data in memory in order to process them. Text mining packages that use memory efficiently will handle larger problems than those that use memory less efficiently. In practical terms, R packages that use C/C++ will generally be more efficient, handle larger problems, and run faster than those that use Java.

The CRAN Task View for Natural Language Processing provides a comprehensive list of packages that can be used for textual analysis with R. Some of the packages used by students during the Capstone course include:

Each package has its strengths and weaknesses. For example, ngram is fast, but its capability is limited solely to the production of n-grams. RWeka and tm have a broader set of text mining features, but they are significantly slower and do not scale well to a large corpus such as the one we must use for the Capstone project.

Why use quanteda?

quanteda provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis. Quoting Kenneth Benoit from the quanteda GitHub README:

Built for efficiency and speed. All of the functions in quanteda are built for maximum performance and scale while still being as R-based as possible. The package makes use of three efficient architectural elements: the stringi package for text processing, the Matrix package for sparse matrix objects, and the data.table package for indexing large documents efficiently. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)

The "R like" character of quanteda is very useful, in contrast to packages like ngram. Also, since quanteda relies on data.table, it is particularly well suited to the Capstone. Why? data.table can index a table so that students retrieve values by index rather than sequentially processing an entire data frame to extract a small number of rows. Since the final deliverable for the Capstone project is a text prediction app written in Shiny, students will find data.table an effective and efficient mechanism to use with a text prediction algorithm.
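
To make the indexing point concrete, here is a minimal sketch in Python rather than R: a dictionary keyed by the base words plays the role that an indexed data.table would play in a Shiny app. The table contents and the `predict_next` helper are invented for illustration.

```python
# Hypothetical n-gram frequency table: (base words, predicted word, frequency).
rows = [
    (("thanks", "for"), "the", 120),
    (("thanks", "for"), "your", 95),
    (("for", "the"), "first", 60),
    (("for", "the"), "follow", 42),
]

# Build the index once: base -> predictions sorted by descending frequency.
index = {}
for base, predicted, freq in rows:
    index.setdefault(base, []).append((predicted, freq))
for candidates in index.values():
    candidates.sort(key=lambda pair: pair[1], reverse=True)

def predict_next(words, top=3):
    """Look up the last two words directly instead of scanning every row."""
    base = tuple(words[-2:])
    return [predicted for predicted, _ in index.get(base, [])[:top]]

print(predict_next(["thanks", "for"]))
```

The index is built once when the app starts; each prediction is then a single keyed lookup, which is what keeps response times short in an interactive app.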

last updated: 12 August 2017