Paper Machines is an open-source extension for the Zotero bibliographic management software. Its purpose is to allow individual researchers to generate analyses and visualizations of user-provided corpora, without requiring extensive computational resources or technical knowledge.
To run Paper Machines, you will need the following:
- a corpus of documents (preferably with high-quality metadata)
- Python (download for Windows)
- Java (download)
To begin, right-click (or control-click on a Mac) on the collection you wish to analyze and select "Extract Texts for Paper Machines." Once the extraction process is complete, the same right-click menu will offer several processes that may be run on a collection, each with an accompanying visualization.
Generates a word cloud, in which each word is displayed at a size proportional to its frequency. An oft-maligned, but still arguably useful, way to get a quick impression of the most common words in your collection. After the cloud is generated, it will appear in the Tags pane of Zotero.
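The idea of sizing words by frequency can be sketched in a few lines of Python. This is only an illustration, not the code Paper Machines itself runs: the stopword list, the function names `word_frequencies` and `font_sizes`, and the linear scaling between a minimum and maximum font size are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def word_frequencies(text, top_n=10):
    """Return the top_n most common non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

def font_sizes(freqs, min_size=10, max_size=48):
    """Map counts linearly onto a font-size range (an assumed scaling)."""
    if not freqs:
        return {}
    highest = max(count for _, count in freqs)
    lowest = min(count for _, count in freqs)
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in freqs
    }

if __name__ == "__main__":
    freqs = word_frequencies("The cat and the dog. The cat sat.", top_n=3)
    for word, size in font_sizes(freqs).items():
        print(word, size)
```

Other scalings (logarithmic, for instance) are common in word clouds; the linear one is used here only because it is the easiest to read.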
Finds phrases that follow a certain pattern, such as "x and y," and displays the most common pairings as the largest words. This method is derived from a Many Eyes visualization.
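As a rough sketch of the pattern matching described above, the following Python counts every "x and y" pairing in a text. The function name `phrase_pairs`, the single-word `connector` parameter, and the simple word tokenization are assumptions for illustration; the actual extension may tokenize and match differently.

```python
import re
from collections import Counter

def phrase_pairs(text, connector="and"):
    """Count (x, y) pairs wherever the tokens read 'x <connector> y'."""
    tokens = re.findall(r"\w+", text.lower())
    pairs = Counter()
    # Look at every interior token; a match needs a word on both sides.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == connector:
            pairs[(tokens[i - 1], tokens[i + 1])] += 1
    return pairs

if __name__ == "__main__":
    sample = "Salt and pepper, salt and pepper, bread and butter."
    for (x, y), count in phrase_pairs(sample).most_common():
        print(x, "and", y, count)
```

Scanning tokens rather than using one regular expression over the raw text means chained phrases such as "a and b and c" yield both (a, b) and (b, c). The sketch ignores sentence boundaries, which a more careful version would respect.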
Generates a map linking texts to the places they mention, filtered by time. This uses Yahoo!'s Placemaker service, and is limited to the first 50k of each file (approximately 10,000 words per text).
Annotates files using the DBpedia Spotlight service, providing a look at which named entities (people, places, organizations, etc.) are mentioned in the text. Entities that are mentioned more often are displayed larger.
Shows the proportional prevalence of different "topics" (collections of words likely to co-occur) in the corpus, by time or by subcollection. This uses the MALLET package to perform latent Dirichlet allocation, and by default displays the 5 most "coherent" topics, based on a metric devised by Mimno et al.
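To make "proportional prevalence by time" concrete, here is a minimal Python sketch that turns per-document topic weights into per-year topic shares. The input format (a year paired with a list of topic weights) and the function name `topic_shares_by_year` are hypothetical; they do not reflect MALLET's output files or the extension's internal data structures.

```python
def topic_shares_by_year(docs):
    """docs: iterable of (year, [weight per topic]).

    Returns {year: [share per topic]}, where each year's shares sum to 1.
    """
    totals = {}
    for year, weights in docs:
        acc = totals.setdefault(year, [0.0] * len(weights))
        for i, weight in enumerate(weights):
            acc[i] += weight
    shares = {}
    for year, acc in totals.items():
        year_sum = sum(acc)
        # Leave an all-zero year untouched rather than divide by zero.
        shares[year] = [v / year_sum for v in acc] if year_sum else acc
    return shares

if __name__ == "__main__":
    docs = [
        (1900, [0.5, 0.5]),
        (1900, [1.0, 0.0]),
        (1910, [0.25, 0.75]),
    ]
    for year, row in sorted(topic_shares_by_year(docs).items()):
        print(year, row)
```

The same aggregation would work by subcollection if the year were swapped for a subcollection name as the grouping key.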
This allows you to train the computer to infer the common features of the documents under each subcollection; subsequently, a set of texts in a different folder can be sorted automatically based on this training. At the moment, the probability distribution for each text is given in plain text; a visualization of this data is forthcoming.
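The train-then-sort workflow can be illustrated with a small naive Bayes classifier in Python. Naive Bayes is used here only as a familiar stand-in; this README does not say which algorithm the extension uses, and the function names `train` and `classify`, the tokenizer, and the add-one smoothing are all assumptions for the sketch. Like the current output of the extension, it returns a probability distribution over subcollections for each text.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_texts):
    """labeled_texts: iterable of (label, text), one label per subcollection."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def classify(model, text):
    """Return {label: probability} for text, using add-one smoothing."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # skip words never seen in training
                score += math.log((counts[token] + 1) / denom)
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long texts.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}

if __name__ == "__main__":
    model = train([
        ("letters", "dear friend yours sincerely"),
        ("reports", "annual budget revenue report"),
    ])
    print(classify(model, "dear friend"))
```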
Thanks to Google Summer of Code for funding this work, and to Matthew Battles and Jo Guldi for overseeing it. My gratitude also to the creators of all the open-source projects and services upon which this work relies: