Prototype text analysis against the EEBO collection

EEBO TCP Workset Browser

The EEBO-TCP Workset Browser is a suite of software designed to support "distant reading" against the corpus called the Early English Books Online - Text Creation Partnership corpus. [1] Using the Browser it is possible to: 1) search a "catalog" of the corpus's metadata, 2) create a list of identifiers representing a subset of content for study, and 3) feed the identifiers to a set of scripts which will mirror the content locally, index it, and do some rudimentary analysis, outputting a set of HTML files, structured data, and graphs. The reader is then expected to examine the output more "closely" (all puns intended) using their favorite Web browser, text editor, spreadsheet, database, or statistical application. The purpose and functionality of this suite is very similar to the purpose and functionality of the HathiTrust Research Center Workset Browser. [2]


The Browser is designed to work on Linux and Macintosh computers with Python and the Bash Shell installed. At the very least, the Browser also requires wget as well as a Python library called libxml2. [3, 4] The dependency on wget can (and probably will) be removed. Installing libxml2 can be a real pain. I know. I'm sorry. This too can (and probably will) be removed.
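
Before building a collection, it may help to confirm that the prerequisites are present. The following is a minimal sketch and not part of the Browser itself. It assumes a Python 3 interpreter and that the libxml2 bindings are importable under the module name libxml2; the function name missing_prerequisites is made up for illustration. If the Browser's scripts run under a different Python than the one running this check, the libxml2 result may not carry over.

```python
import importlib.util
import shutil


def missing_prerequisites(commands=("bash", "wget"), modules=("libxml2",)):
    """Return the names of commands and Python modules that cannot be found."""
    # shutil.which() returns None when a command is not on the PATH
    missing = [name for name in commands if shutil.which(name) is None]
    # find_spec() returns None when a top-level module cannot be imported
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    gaps = missing_prerequisites()
    print("missing: " + ", ".join(gaps) if gaps else "all prerequisites found")
```

An empty result only means the names were found, not that the installed versions are the ones the Browser expects.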

Quick start

Here's a set of quick start instructions. First, download the software and uncompress it, if necessary. Second, open your terminal and change directories to the distribution. Third, create a collection with the following command:

cat ./etc/catalog-lute.txt | ./bin/ lute

If Bash, Python, wget, and libxml2 are installed, then the command will create a directory named lute, mirror bunches o' XML/TEI files, index them, create a "catalog" of them, do the rudimentary analysis, and generate data for further analysis. Of special interest is the file named lute/catalog.db. Import this file into your favorite spreadsheet or database program to search, sort, and browse characteristics of the collection. The next report of special interest is lute/about.html. It provides an overall narrative describing the collection.
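
If no spreadsheet is handy, Python's csv module can do the same sorting and browsing. The sketch below assumes that lute/catalog.db is tab-delimited text with a header row, which these instructions do not confirm, so adjust the delimiter to whatever the file actually uses. The function names and the "words" column are hypothetical.

```python
import csv


def read_catalog(handle, delimiter="\t"):
    """Read delimited rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def sort_by(rows, column, numeric=False):
    """Return the rows sorted by one column, as numbers or as strings."""
    if numeric:
        return sorted(rows, key=lambda row: float(row[column]))
    return sorted(rows, key=lambda row: row[column])


# Example (the file layout and the column name are assumptions):
# with open("lute/catalog.db", newline="") as handle:
#     rows = read_catalog(handle)
# for row in sort_by(rows, "words", numeric=True):
#     print(row)
```

Passing numeric=True matters for columns such as counts, because sorting them as strings would place "1000" before "900".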

If you saved the Browser to a file system accessible via HTTP, then edit ./bin/ changing the value of ROOT found near the top of the file to the root URL where the Browser software can be found. Then recreate the catalog's db and html files:

./bin/ lute

If you have Java installed, then you can transform the mirrored XML/TEI into pretty HTML. To do so, you need to get a copy of Saxon (HE). [5] Download Saxon, uncompress the archive, and then edit ./bin/ Specifically, change the value of JAR found near the top of the file to point to the Saxon (HE) .jar file. You can now do the transformations with the following command, and the results will be found in lute/html/:

./bin/ lute

If you have R installed, then you can create quite a number of graphs literally illustrating characteristics of the collection, but you will also need a few R libraries, specifically: NLP, tm, RColorBrewer, and wordcloud. Once these libraries are installed, generate the graphs with the following command, and the results will be saved in lute/graphs/:

./bin/ lute

Extra credit

If you have gotten this far, then create a collection of Shakespeare's works with the following commands:

cat etc/catalog-shakespeare.txt | ./bin/ shakespeare

./bin/ shakespeare

./bin/ shakespeare &>/dev/null &

You are now ready for extra credit. Edit ./bin/ and uncomment the following lines found towards the beginning and end of the file:

     #echo "transforming TEI to HTML"
     #./bin/ $NAME 

     #echo "making graphs"
     #./bin/ $NAME

You can now create a collection with a single command:

cat etc/catalog-mather.txt | ./bin/ mather

Creating your own collections

Creating your own collections is a matter of creating a set of EEBO-TCP identifiers and feeding them to ./bin/ As of this writing, there is a complete collection of EEBO-TCP identifiers and their associated pieces of metadata found in the file named eebo.db. For example, try some of the following commands to get a feel for how the word lute is used in the entire corpus's metadata:

./bin/ lute

./bin/ lute facets

./bin/ lute facets | less

./bin/ lute facets | ./bin/

./bin/ lute facets | ./bin/ | less

./bin/ lute facets | ./bin/ lute | less

./bin/ lute facets | ./bin/ lute > lute.html

./bin/ lute facets | ./bin/ new-lute

Please be forewarned: ./bin/ can only search for single words. Phrase searching is not supported.
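
Since a collection is built from nothing more than a list of identifiers, one possible workaround for phrases is to filter the metadata yourself and pass along the identifiers that match. The following is only a sketch under stated assumptions: it supposes one metadata record per line, tab-delimited, with the identifier in the first column, which may not be how eebo.db is actually laid out. The function name matching_identifiers is hypothetical.

```python
import sys


def matching_identifiers(lines, phrase, delimiter="\t", id_column=0):
    """Yield the identifier of each record whose text contains the phrase."""
    needle = phrase.lower()
    for line in lines:
        record = line.rstrip("\n")
        # skip blank lines; match the phrase without regard to case
        if record and needle in record.lower():
            yield record.split(delimiter)[id_column]


if __name__ == "__main__" and len(sys.argv) == 2:
    # reads records from standard input, prints one identifier per line
    for identifier in matching_identifiers(sys.stdin, sys.argv[1]):
        print(identifier)
```

The match is a plain substring test, so "lute book" will also match "flute books"; a stricter version would need word boundaries.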

Power-readers will want to peruse the indexes found in ./etc/eebo-index-*.idx. Search, sort, and browse these files using your favorite spreadsheet, database, or text editor application. Power-readers will also want to import eebo.db into the same applications. Believe me, the results will be eye-opening!


  1. EEBO-TCP -
  2. HTRC Workset Browser -
  3. wget -
  4. libxml2 -
  5. Saxon (HE) -


Eric Lease Morgan <>

June 23, 2015