Prototype text analysis tool

HathiTrust Research Center Workset Browser

README - An introductory description to this, the HathiTrust Research Center Workset Browser

draft Draft Draft! DRAFT! DRAFT!!! D R A F T ! ! !

This, the HathiTrust Research Center Workset Browser ("Browser"), is a suite of software used to do "distant reading" against sets of digitized items from the HathiTrust. With the advent of globally networked computers, the availability of textual items in the public domain has increased significantly. Moreover, the ability to turn the power of computing against textual items has made it easier to "read" said items in a manner never (really) possible before. The Browser is a tool for doing such "reading".

The current home page of the Browser is

An example of its current output is:

Quick start

1. Download the software, putting the bin and etc directories in the same directory.

2. Change to the directory where the bin and etc directories have been saved.

3. Build a collection by issuing the following command:

     ./bin/ thoreau etc/

If all goes well, the Browser will create a new directory named thoreau, rsync a bunch o' JSON files from the HathiTrust to your computer, index the JSON files, do some textual analysis against the corpus, create a simple database ("catalog"), and create a few more reports.
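As a rough illustration of the harvesting step only, the sketch below assembles an rsync command that reads a list of files to fetch. It is not the Browser's own code. The function name, the file list path, the source address, and the destination are all hypothetical placeholders; only the rsync options themselves (-a, -v, --files-from) are standard.

```python
import shlex


def build_rsync_command(file_list, source, destination):
    """Return an rsync argument list that fetches the files named in file_list."""
    return [
        "rsync",
        "-a",                          # archive mode: recurse and preserve attributes
        "-v",                          # verbose: list each file as it is transferred
        "--files-from=" + file_list,   # read the names of the files to fetch from a file
        source,
        destination,
    ]


# Hypothetical values for illustration; substitute a real file list and source.
command = build_rsync_command("etc/example.txt", "example.org::corpus/", "thoreau/")

# Show the command as it would be typed at a shell prompt.
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would actually start the transfer, assuming rsync is installed and the source is reachable.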

If you have R installed, as well as some of R's friends, then you can optionally run the following command to generate some charts:

     ./bin/ thoreau

You can then peruse the files in the newly created thoreau directory. Of special interest is the file named about.html because it summarizes the collection. If you created the charts, then there ought to be no broken links in the file. If the file is being served via a Web server, then the search interface will work. If not, then you can still search the collection with the following command:

     ./bin/ love thoreau | less

If you got this far, then repeat the process for the other rsync files found in the etc directory.


The Browser requires a computer running a "sane" version of a Linux-ish operating system coupled with the Python scripting language. Few, if any, specialized pieces of software are needed, with the possible exception of an application called rsync. Similarly, few, if any, special Python modules are required, with the exception of the Natural Language Toolkit, and this requirement may fall by the wayside since all it currently does is supply lists of stop words. All that said, the Browser ought to run on most out-of-the-box Internet network servers and Macintosh computers. Since the ultimate goal is to implement the Browser as a Web-based service, no special effort has been made to accommodate Windows.
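To show what a list of stop words is for, here is a minimal sketch that counts words in a sentence while skipping the stop words. It is an illustration, not the Browser's implementation: the count_words function, the tiny STOP_WORDS set, and the sample sentence are all made up for the example. With NLTK installed and its stopwords corpus downloaded, nltk.corpus.stopwords.words("english") could supply a fuller list.

```python
import re
from collections import Counter


def count_words(text, stop_words):
    """Count the words in text, ignoring any word found in stop_words."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# A tiny stand-in list, assumed for this example only.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "i", "it"}

sample = "I went to the woods because I wished to live deliberately"
frequencies = count_words(sample, STOP_WORDS)

# Print the three most frequent remaining words with their counts.
print(frequencies.most_common(3))
```

Because the stop words are passed in as an argument, the same function works with any list, which is one reason a dependency that only supplies such lists is easy to replace.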


This software is distributed under the GNU General Public License.


[Describe how to install the Browser here.]


Here is an outline of how the Browser is intended to be used:

0. Articulate a research question such as "What are some of the
   characteristics of early American literature, and how might some of
   its authors be compared & contrasted?" or "What are some of the
   definitions of a 'great' man, and how have these definitions changed
   over time?"

1. Use the HathiTrust Research Center Portal to create or refine a
   corpus of works that may help you answer the research question.

2. Use the Research Center's Algorithm interface to download an rsync
   file listing the items in your corpus.

3. Feed the rsync file to the Browser, and it will harvest your
   corpus locally, create a catalog of the corpus, index the full text
   of the corpus, create sets of files against the corpus intended to
   be used by other (operating system independent) applications, and
   generate a report summarizing the content of the corpus.

4. Use and understand your corpus -- try to answer your research
   question -- by reading the Browser-generated report, searching the
   corpus's content with the supplied search interface, or applying
   other applications/tools (such as Excel, Solr, MySQL, R's tm module,
   or Python's NLTK library) to the Browser's underlying output files.

5. Go to Step #1.

The current version of the Browser is akin to a cartoon as drawn by a Renaissance artist or a blueprint as drawn by an architect. As such, it is an outline of possible things to come, and its implementation is -- by design -- simple and foundational. You have been warned.


[Here goes the documentation of the scripts.]


[This is a list of the ways the scripts can be chained together to do "kewl" stuff.]

Eric Lease Morgan, Librarian, University of Notre Dame

June 2, 2015