HTRC Book Models

Within-book topic modeling on HTRC feature extraction files.

This is an early release. If you are interested in trying it yourself and the instructions are insufficient, please email me or tweet @POrg.



Install the above dependencies from their respective sites.

Clone this repository.

git clone
cd htrc-book-models

Download the R packages required for this project and set MALLET_HOME if it is not already set.

Rscript R/install-packages.r
export MALLET_HOME=/path/to/mallet

Step-by-step: building and visualizing a book model


$ rsync -v*/*t75t3gh25.json.bz2 texts/
$ python texts/uc2.ark+\=13960\=t75t3gh25.json.bz2
$ ./ ThescarletletterByNathanielHawthorne 15
$ Rscript graph-topics.r "The Scarlet Letter" images/scarlet.png

The Scarlet Letter topics

1. Download a book from the HTRC feature extraction dataset

The current feature extraction dataset (Oct 2014) is built on HathiTrust's non-Google-digitized public domain works, which are also the basis of the HathiTrust Research Center's sandbox system. You can therefore find books to visualize by using the sandbox system's search engine.

This time, let's find Nathaniel Hawthorne's The Scarlet Letter. There are a few copies, but this one should do fine.

The HTRC feature extraction dataset gives you page-level feature information, such as part-of-speech tagged term counts, for 250k works. It can be downloaded as a large export, or individual files can be downloaded through rsync. We can use the latter approach to download just the features for The Scarlet Letter.

On our search page, we see that the Volume ID is "uc2.ark:/13960/t75t3gh25". We can find the matching file with the last string, t75t3gh25.

$ rsync -v*/*t75t3gh25.json.bz2 texts/
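The downloaded file name is a cleaned-up form of the Volume ID. As a minimal sketch, the helper below reproduces the mapping seen in this example (":" becomes "+" and "/" becomes "="). The function name is hypothetical, and only the two substitutions visible in this walkthrough are handled; other IDs may need additional rules.

```python
def volume_id_to_filename(volume_id: str) -> str:
    """Map a Volume ID to the feature file name seen in this walkthrough.

    Hypothetical helper: it only applies the two substitutions that the
    Scarlet Letter example shows (":" -> "+", "/" -> "=").
    """
    # Separate the library prefix (e.g. "uc2") from the rest of the ID.
    prefix, rest = volume_id.split(".", 1)
    # Replace the characters that are awkward in file names.
    cleaned = rest.replace(":", "+").replace("/", "=")
    return f"{prefix}.{cleaned}.json.bz2"


if __name__ == "__main__":
    print(volume_id_to_filename("uc2.ark:/13960/t75t3gh25"))
```

On the command line the "=" characters may need escaping, which is why the file appears as uc2.ark+\=13960\=t75t3gh25.json.bz2 in the commands in this document.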

2. Prepare document for topic modeling

We'll be using Mallet for topic modeling using LDA, training models from individual pages and then inferring topics for a sliding scale of pages. To do so, the Feature pages need to be processed into a form that Mallet can read. This is done using a python script,

usage: [-h] [-f FRAME_SIZE] [-o OUTPATH] input

This script uses the HTRC Feature Reader library. Let's process the Scarlet Letter:

$ python texts/uc2.ark+\=13960\=t75t3gh25.json.bz2

Note that you can input a list of input documents if you are processing more than one.

Now, there are two files in tmp/: train-ThescarletletterByNathanielHawthorne.txt and infer-ThescarletletterByNathanielHawthorne.txt.
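To illustrate the idea behind the two files, here is a rough sketch of how per-page term counts could be merged into sliding frames of pages and written in the one-instance-per-line "name label text" layout that Mallet can import. This is not the project's script: the function names, the placeholder label, and the toy page data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def sliding_frames(
    pages: List[Dict[str, int]], frame_size: int
) -> Iterator[Tuple[int, Counter]]:
    """Yield (start_index, merged_counts) for every window of frame_size pages."""
    if frame_size < 1:
        raise ValueError("frame_size must be at least 1")
    # A book shorter than one frame yields no windows.
    for start in range(max(len(pages) - frame_size + 1, 0)):
        merged: Counter = Counter()
        for page in pages[start:start + frame_size]:
            merged.update(page)  # add this page's term counts to the frame
        yield start, merged


def to_mallet_line(name: str, counts: Dict[str, int], label: str = "x") -> str:
    """Format one instance as 'name label text', repeating each term by its count."""
    tokens = [term for term, n in sorted(counts.items()) for _ in range(n)]
    return f"{name} {label} {' '.join(tokens)}"


if __name__ == "__main__":
    # Toy per-page term counts (made up, not taken from the dataset).
    pages = [
        {"scarlet": 2, "letter": 1},
        {"letter": 1, "hester": 3},
        {"pearl": 2},
    ]
    for start, counts in sliding_frames(pages, frame_size=2):
        print(to_mallet_line(f"frame-{start}", counts))
```

In this sketch, training instances would be single pages (a frame size of 1), while inference instances would use a larger frame, so that topics are inferred over overlapping runs of pages.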

3. Build topic model, then inference against sliding frame

A shell script does all the hard work with Mallet.


Here, NAME is the string used in the temporary files from the previous step (/tmp/train-{{NAME}}.txt). You may want to edit the script for further Mallet customizations. The script specifies a high number of iterations; if it is too slow, you can reduce this.

$ ./ ThescarletletterByNathanielHawthorne 15

Inferred topics are saved to tmp/inferred-pageframe-topics.txt.
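For orientation, the sketch below builds the kind of Mallet command sequence such a script might run: import the training file, train a model while saving an inferencer, import the inference file through the same pipe, then infer topics. It is not the repository's shell script. The function name, the intermediate file names, the iteration count, and the reading of the second argument as the number of topics are assumptions; the Mallet subcommands and flags themselves are standard ones.

```python
import os
from typing import List, Optional


def mallet_commands(
    name: str,
    num_topics: int,
    iterations: int = 1000,  # assumed value; lower it if training is too slow
    tmp: str = "tmp",
    mallet_home: Optional[str] = None,
) -> List[List[str]]:
    """Build (but do not run) a train-then-infer Mallet command sequence."""
    home = mallet_home or os.environ.get("MALLET_HOME", ".")
    mallet = os.path.join(home, "bin", "mallet")
    train_txt = f"{tmp}/train-{name}.txt"
    infer_txt = f"{tmp}/infer-{name}.txt"
    # Intermediate file names below are hypothetical.
    train_bin = f"{tmp}/train-{name}.mallet"
    infer_bin = f"{tmp}/infer-{name}.mallet"
    inferencer = f"{tmp}/{name}.inferencer"
    return [
        # 1. Convert the per-page training text to Mallet's binary format.
        [mallet, "import-file", "--input", train_txt,
         "--output", train_bin, "--keep-sequence"],
        # 2. Train an LDA model and save an inferencer for later use.
        [mallet, "train-topics", "--input", train_bin,
         "--num-topics", str(num_topics),
         "--num-iterations", str(iterations),
         "--inferencer-filename", inferencer],
        # 3. Import the page frames with the same pipe (vocabulary) as training.
        [mallet, "import-file", "--input", infer_txt,
         "--output", infer_bin, "--keep-sequence",
         "--use-pipe-from", train_bin],
        # 4. Infer topic proportions for each page frame.
        [mallet, "infer-topics", "--input", infer_bin,
         "--inferencer", inferencer,
         "--output-doc-topics", f"{tmp}/inferred-pageframe-topics.txt"],
    ]


if __name__ == "__main__":
    for cmd in mallet_commands("ThescarletletterByNathanielHawthorne", 15):
        print(" ".join(cmd))
```

Each inner list can be passed to subprocess.run(cmd, check=True) to execute the step. Printing the commands first is a cheap way to check paths before starting a long training run.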

4. Visualize topics

Visualization is done through R. On the command line, you can supply a work name, for the chart title, and an output path. You'll be alerted if any libraries are missing; all the dependencies are available through CRAN.

$ Rscript graph-topics.r "The Scarlet Letter" images/scarlet.png

