A Zotero extension for analysis and visualization in the digital humanities.

Paper Machines


Paper Machines is an open-source extension for the Zotero bibliographic management software. Its purpose is to allow individual researchers to generate analyses and visualizations of user-provided corpora, without requiring extensive computational resources or technical knowledge.

This project is a collaboration between historian Jo Guldi and digital ethnomusicologist Chris Johnson-Roberson, graciously supported by Google Summer of Code, the William F. Milton Fund, and metaLAB @ Harvard.


In order to run Paper Machines, you will need the following (Python and Java should be installed automatically on Mac OS X 10.6 and above):

  • Zotero with PDF indexing tools installed (see the Search pane of Zotero's Preferences)
  • a corpus of documents with full text PDF/HTML and high-quality metadata (recommended: at least 1,000 for topic modeling purposes)
  • Python (download page)
  • Java (download page)


Paper Machines should work either in Zotero for Firefox or Zotero Standalone. To install, you must download the XPI file. If you wish to use the extension in the Standalone version, right-click on the link and save the XPI file in your Downloads folder. Then, in Zotero Standalone, go to the Tools menu -> Add-Ons. Select the gear icon at the right, then "Install Add-On From File." Navigate to your Downloads folder (or wherever you have saved the XPI file) and open it.


To begin, right-click (control-click for Mac) on the collection you wish to analyze and select "Extract Texts for Paper Machines." Once the extraction process is complete, this right-click menu will offer several different processes that may be run on a collection, each with an accompanying visualization. Once these processes have been run, selecting "Export Output of Paper Machines..." will allow you to choose which visualizations to export.

Word Cloud

Displays words scaled according to the frequency of their occurrence. This is an oft-maligned but still arguably useful way to get a quick impression of the most common words in your collection. You can generate a basic word cloud, a word cloud with tf-idf filtering to remove unimportant words, or multiple word clouds (divided up by subcollection or by time interval, specified in days). The multiple word clouds can be filtered using tf-idf, Dunning's log-likelihood, or Mann-Whitney U tests, each of which will give different results depending on the data. By default, a basic word cloud will appear in the Tags pane of Zotero once text has been extracted.
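As a rough illustration of what tf-idf filtering does, the sketch below scores each word by its frequency within a document, discounted by how many documents contain it, so that words appearing everywhere drop out. This is not the extension's own code: the function name `tfidf_filter`, the `keep` parameter, and the sample sentences are all invented for the example.

```python
import math
from collections import Counter

def tfidf_filter(docs, keep=3):
    """Return up to `keep` words with the highest summed tf-idf score.

    `docs` is a list of tokenized documents (lists of words).
    Hypothetical helper for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)              # term frequency in this document
            idf = math.log(n_docs / df[word])  # zero when a word is in every document
            scores[word] += tf * idf
    # Words found in every document score exactly 0 and are dropped.
    return [word for word, score in scores.most_common(keep) if score > 0]

# Invented sample corpus.
docs = [
    "the railway and the canal".split(),
    "the canal and the river".split(),
    "the railway and the station".split(),
]
top_words = tfidf_filter(docs)
```

Here "the" and "and" occur in all three documents, so they receive a score of zero and never reach the cloud, while words confined to a single document rank highest.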

Phrase Net

Finds phrases that follow a certain pattern, such as "x and y," and displays the most common pairings. This method is derived from a Many Eyes visualization.
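The "x and y" pattern matching can be sketched with a regular expression that captures the word on each side of a connector and counts the resulting pairs. This is a minimal sketch, not the extension's implementation; `phrase_pairs` and the sample text are made up for the example.

```python
import re
from collections import Counter

def phrase_pairs(text, connector="and"):
    """Count (x, y) pairs for phrases of the form 'x <connector> y'.

    Hypothetical helper. Matches do not overlap, so in 'a and b and c'
    only the pair (a, b) is counted.
    """
    pattern = re.compile(
        r"\b(\w+)\s+" + re.escape(connector) + r"\s+(\w+)\b",
        re.IGNORECASE,
    )
    return Counter((x.lower(), y.lower()) for x, y in pattern.findall(text))

# Invented sample text.
text = (
    "Bread and butter. Salt and pepper. "
    "Bread and butter again, then salt and vinegar."
)
pairs = phrase_pairs(text)
most_common_pair = pairs.most_common(1)[0]
```

Passing a different `connector` (for example "of" or "at") would look for a different phrase pattern in the same way.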


Flight Paths

Generates a map linking texts from their places of publication to the places they mention, filtered by time.


A related heat map visualization generates a map showing regions of relative intensity for mentions in the text. It is the same as the flight path visualization without the link data, and may be more usable on large datasets.

Export Geodata to CSV

Creates a CSV file with place name, latitude/longitude, the Zotero item ID number, and some context around the mention.
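Once exported, a file like this can be processed with any CSV-aware tool. The sketch below counts mentions per place using Python's standard `csv` module. The column names (`place`, `lat`, `lng`, `itemID`, `context`) and the sample rows are assumptions made for illustration; check the header of your own export before adapting it.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for an exported file; the column names
# are assumptions, not the extension's documented header.
sample = io.StringIO(
    "place,lat,lng,itemID,context\n"
    'London,51.5074,-0.1278,101,"arrived in London by rail"\n'
    'Paris,48.8566,2.3522,101,"the road to Paris"\n'
    'London,51.5074,-0.1278,102,"the London docks"\n'
)

def mentions_per_place(handle, column="place"):
    """Count how many rows mention each place name."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)

counts = mentions_per_place(sample)
```

With a real export, replace `sample` with `open("your-export.csv", newline="", encoding="utf-8")`, where the filename is whatever you chose when exporting.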

DBpedia Annotation

Annotates files using the DBpedia Spotlight service, providing a look at what named entities (people, places, organizations, etc.) are mentioned in the texts. Entities are scaled according to the frequency of their occurrence.

Topic Modeling

Shows the proportional prevalence of different "topics" (collections of words likely to co-occur) in the corpus, by time or by subcollection. This uses the MALLET package to perform latent Dirichlet allocation, and by default displays the 5 most "coherent" topics, based on a metric devised by Mimno et al. A variety of topic model parameters can be specified before the model is created. The default values should be suitable for general purpose use, but they may be adjusted to produce a better model.
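To make "proportional prevalence by time" concrete: a topic model assigns each document a set of topic weights that sum to 1, and averaging those weights over the documents from each year gives one point per topic on the resulting graph. The sketch below shows that averaging step only. It does not call MALLET, and the function name, weights, and years are invented for the example.

```python
from collections import defaultdict

def prevalence_by_year(doc_topics, doc_years):
    """Average per-document topic weights within each year.

    `doc_topics[i]` is the list of topic weights for document i and
    `doc_years[i]` is its year. Hypothetical helper for illustration.
    """
    grouped = defaultdict(list)
    for weights, year in zip(doc_topics, doc_years):
        grouped[year].append(weights)
    result = {}
    for year, rows in grouped.items():
        n_topics = len(rows[0])
        result[year] = [
            sum(row[k] for row in rows) / len(rows) for k in range(n_topics)
        ]
    return result

# Invented weights for three documents and two topics.
doc_topics = [[0.75, 0.25], [0.25, 0.75], [1.0, 0.0]]
doc_years = [1900, 1900, 1901]
by_year = prevalence_by_year(doc_topics, doc_years)
```

Grouping by subcollection instead of by year works the same way, with a collection name in place of the year.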

After the model is generated, clicking "Save" in the display will open a new window showing the graph free of interactive controls; this window may be saved as an SVG file or captured via screenshot. Clicking "Save" also preserves, in the original window, the current selection of topics, search terms, and time scale as a permalink; please bookmark this if you wish to return to a specific view with interactive controls intact.

JSTOR Data For Research

The topic model can be supplemented with datasets from JSTOR Data For Research. You must first register for an account, after which you may search for additional articles based on keywords, years of publication, specific journals, and so on. Once the search is to your liking, go to the Dataset Requests menu at the upper right and click "Submit New Request." Check the "Citations" and "Word Counts" boxes, select CSV output format, and enter a short job title that describes your query. Once you click "Submit Job", you will be taken to a history of your submitted requests. You will be e-mailed once the dataset is complete. Click "Download (#### docs)" in the Full Dataset column, and a zip file timestamped with the request time will be downloaded. This file (or several files with related queries) may then be incorporated into a model by selecting "By Time (With JSTOR DFR)" in the Topic Modeling submenu of Paper Machines. Multiple dataset zips will be merged and duplicates discarded before analysis begins; be warned that this step may take a considerable amount of time (~15-30 minutes) before it begins to show progress.
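The merge-and-deduplicate step can be pictured as reading the citation list out of each zip and keeping only the first occurrence of each document ID. The sketch below is not the extension's code and rests on assumptions: that each zip contains a member named `citations.CSV` and that it has an `id` column. Both names, the helper functions, and the sample rows are illustrative, and the zips are built in memory so the example runs on its own.

```python
import csv
import io
import zipfile

def merge_citations(archives, member="citations.CSV", key="id"):
    """Merge citation rows from several zips, skipping repeated IDs.

    `member` and `key` are assumed names; adjust them to match the
    actual contents of your dataset zips.
    """
    seen = set()
    merged = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            with zf.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    if row[key] not in seen:
                        seen.add(row[key])
                        merged.append(row)
    return merged

def make_zip(rows):
    """Build an in-memory zip with invented citation rows (demo only)."""
    buf = io.BytesIO()
    body = "id,title\n" + "".join(f"{doc_id},{title}\n" for doc_id, title in rows)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("citations.CSV", body)
    buf.seek(0)
    return buf

first = make_zip([("a1", "Canals"), ("a2", "Railways")])
second = make_zip([("a2", "Railways"), ("a3", "Roads")])
merged = merge_citations([first, second])
merged_ids = [row["id"] for row in merged]
```

The document with ID `a2` appears in both zips but only once in `merged`, which is why overlapping queries can be combined safely.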


This allows you to train the computer to infer the common features of the documents under each subcollection; subsequently, a set of texts in a different folder can be sorted automatically based on this training. At the moment, the probability distribution for each text is given in plain text; the ability to automatically generate a new collection according to this sorting is forthcoming.


Currently, the language stoplist in use, types of data to extract, default parameters for topic modeling, and an experimental periodical import feature (intended for PDFs with OCR and correct metadata) may be adjusted in the preference pane.


Special thanks to Matthew Battles for providing space, guidance, and support for me at metaLAB. My gratitude also to the creators of all the open-source projects and services upon which this project relies:
