# Paper Machines

## Overview

Paper Machines is an open-source extension for the Zotero bibliographic management software. Its purpose is to allow individual researchers to generate analyses and visualizations of user-provided corpora, without requiring extensive computational resources or technical knowledge.

## Prerequisites

In order to run Paper Machines, you will need the following:

## Usage

To begin, right-click (control-click) on the collection you wish to analyze and select "Extract Texts for Paper Machines." Once the extraction process is complete, this right-click menu will offer several different processes that may be run on a collection, each with an accompanying visualization.

### Word Cloud

Displays the words in your collection sized according to their frequency. An oft-maligned but still arguably useful way to get a quick impression of the most common words in a collection. Once generated, the word cloud appears in the Tags pane of Zotero.
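
The mapping from counts to display size is straightforward; as a rough sketch (not Paper Machines' actual code, and the size range here is an arbitrary choice), counting word frequencies and scaling them to font sizes might look like this in Python:

```python
import re
from collections import Counter

def word_frequencies(text, top_n=100):
    """Count the most frequent words in a text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

def scaled_sizes(frequencies, min_size=10, max_size=72):
    """Map raw counts to font sizes, so more frequent words display larger."""
    counts = [count for _, count in frequencies]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1
    return {
        word: min_size + (count - lo) * (max_size - min_size) / span
        for word, count in frequencies
    }
```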

### Phrase Net

Finds phrases that follow a certain pattern, such as "x and y," and displays the words in those phrases, with the most common pairings shown largest. This method is derived from a Many Eyes visualization.
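
As an illustration of the kind of pattern matching involved (a hedged sketch, not the extension's implementation; the connector word and regular expression are illustrative), extracting and counting "x and y" pairs could look like:

```python
import re
from collections import Counter

def phrase_net_pairs(text, connector="and", top_n=50):
    """Find word pairs joined by a connector (e.g. "x and y") and count them."""
    pattern = re.compile(r"\b(\w+)\s+%s\s+(\w+)\b" % re.escape(connector),
                         re.IGNORECASE)
    pairs = Counter((x.lower(), y.lower()) for x, y in pattern.findall(text))
    return pairs.most_common(top_n)
```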

### Geoparser

Generates a map linking texts to the places they mention, filtered by time. This uses Yahoo!'s Placemaker service and is limited to the first 50 KB of each file (approximately 10,000 words per text).
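
Because of that cap, only the beginning of long texts is geoparsed. A minimal sketch of the truncation step, assuming a 50 KB byte limit and leaving the actual Placemaker request out entirely:

```python
MAX_BYTES = 50 * 1024  # only roughly the first 50 KB of each file is geoparsed

def truncate_for_geoparsing(path):
    """Read a text file and keep only its first 50 KB (about 10,000 words)."""
    with open(path, "rb") as f:
        data = f.read(MAX_BYTES)
    # Decode leniently; a multi-byte character cut off at the boundary is dropped.
    return data.decode("utf-8", errors="ignore")
```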

### DBpedia Annotation

Annotates files using the DBpedia Spotlight service, providing a look at which named entities (people, places, organizations, etc.) are mentioned in the text. More frequently mentioned entities are displayed larger.
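
For reference, a hedged sketch of querying a DBpedia Spotlight annotation endpoint; the URL, parameters, and response fields below follow Spotlight's publicly documented REST API rather than Paper Machines' own code, so treat them as assumptions:

```python
import requests
from collections import Counter

# Public Spotlight endpoint (assumption; the service URL Paper Machines calls may differ).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Ask DBpedia Spotlight which named entities appear in a text."""
    response = requests.post(
        SPOTLIGHT_URL,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    # Count how often each DBpedia entity is mentioned, so frequently
    # mentioned entities can be drawn larger in the visualization.
    return Counter(r["@URI"] for r in resources)
```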

### Topic Modeling

Shows the proportional prevalence of different "topics" (collections of words likely to co-occur) in the corpus, by time or by subcollection. This uses the MALLET package to perform latent Dirichlet allocation, and by default displays the 5 most "coherent" topics, based on a metric devised by Mimno et al.
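
Under the hood this is a standard MALLET workflow. The sketch below shows an equivalent invocation driven from Python; the launcher path, topic count, and output file names are placeholders, since Paper Machines configures MALLET itself:

```python
import subprocess

MALLET = "mallet"  # path to the MALLET launcher script (placeholder)

def train_topics(text_dir, num_topics=50):
    """Import a directory of plain-text files and run LDA over them with MALLET."""
    subprocess.run(
        [MALLET, "import-dir", "--input", text_dir, "--output", "corpus.mallet",
         "--keep-sequence", "--remove-stopwords"],
        check=True,
    )
    subprocess.run(
        [MALLET, "train-topics", "--input", "corpus.mallet",
         "--num-topics", str(num_topics), "--optimize-interval", "10",
         "--output-topic-keys", "topic_keys.txt",
         "--output-doc-topics", "doc_topics.txt"],
        check=True,
    )
```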

### Classification

This allows you to train the computer to infer the common features of the documents under each subcollection; subsequently, a set of texts in a different folder can be sorted automatically based on this training. At the moment, the probability distribution for each text is given in plain text; a visualization of this data is forthcoming.
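
As a rough sketch of the idea, here is a small text classifier built with scikit-learn (which Paper Machines does not necessarily use); the vectorizer, model choice, and function signature are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_and_classify(labeled_texts, labels, unlabeled_texts):
    """Fit a classifier on texts labeled by subcollection, then return a
    probability distribution over subcollections for each new text."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(labeled_texts, labels)
    return model.predict_proba(unlabeled_texts)
```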

## Acknowledgements

Thanks to Google Summer of Code for funding this work, and to Matthew Battles and Jo Guldi for overseeing it. My gratitude also to the creators of all the open-source projects and services upon which this work relies:
