An R package for estimating category proportions in an unlabeled set of documents given a labeled set, by implementing the method described in Jerzak, King, and Strezhnev (2019). This method is meant to improve on the ideas in Hopkins and King (2010), which introduced a quantification algorithm to estimate category proportions without directly classifying individual observations. This version of the software refines the original method by implementing a technique for selecitng optimal textual features in order to minimize the error of the estimated category proportions. Automatic differentiation, stochastic gradient descent, and batch re-normalization are used to carry out the optimization. Other pre-processing functions are available, as well as an interface to the earlier version of the algorithm for comparison. The package also provides users with the ability to extract the generated features for use in other tasks.
(Here's the abstract from our paper: Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category -- using either parametric "classify-and-count" methods or "direct" nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 data sets. We also offer easy-to-use software software that implements all ideas discussed herein.)
The most recent version of
readme2 can be installed directly from the repository using the
readme2 depends on the
FNN packages which can be installed directly from CRAN
It also utilizes
tensorflow, the R interface to the TensorFlow API -- an open source machine learning library. To install
tensorflow follow the instructions available at the RStudio page on R and TensorFlow.
First, install the R package via github
Then, install TensorFlow itself via the R function
readme uses pre-trained dictionaries of word vectors as part of the process of translating the words in documents to a numerical representation. We suggest a default dictionary that can be downloaded and installed to the
readme install directory using the
download_wordvecs() function. Be aware that it may be worthwhile to change these if you are using a different language than English or texts with very unique forms of language use.
In this section, we provide a step-by-step vignette illustrating how to use
readme2 to estimate category proportions in an unlabeled set of documents. To begin, we assume that the user has a set of labeled documents, with every document assigned a single, mutually exclusive category label (such as via manual coding). We observe an unlabeled set of documents and seek to estimate the proportion of documents with each category label.
The central intuition of the original
readme is that for any individual feature S in both the labeled and unlabeled set, we know that the average value of that feature S is equal to the sum of the conditional expectation of S in each category multiplied by the share of documents in that category. While we observe the average of S in the unlabeled set, we do not observe the conditional expectations of S. We estimate these conditional expectations using the labeled set conditional frequency and solve for the unlabeled set proportions via standard linear programming methods.
There are many possible features S that can be extracted from the text. The main contribution of
readme2 is to develop a way for selecting optimal sets of features from a large space of potential document summaries, morphing the space in which the readme regression is run to yield the best possible estimates of the category proportions. We begin by converting each document into a vector of summaries based on the word vector representations of each term in the document.
Processing the text documents
We illustrate the method using the provided
clinton dataset of a subset of handcoded blogposts from the original Hopkins and King (2010) paper. This dataset is comprised of 1,676 documents coded into 6 mutually exclusive categories (
library(readme) data(clinton, package="readme")
The first task is to convert the raw text for each document (
TEXT) into a document-feature matrix using the word vector summaries; to load the word vector summaries into a table we use the
undergrad() function (named in honor of those who have do so much coding by hand!).
Prior to using
undergrad() it is necessary to obtain a word vector dictionary that maps text tokens to a vector representation. We recommend using an off-the-shelf pre-trained word vector set based on a well-known language corpus. We include a function,
download_wordvecs(), which when run will download our suggested default - a Stanford GloVe: 'Global Vectors for Word Representation' pre-trained dataset - to the directory in which
readme is installed.
undergrad() function takes as input the raw document texts and the word vector dictionary and returns a set of feature summaries for each document in the dataset.
cleanme() pre-processes the text. Setting
NULL will cause the function to search for the default word vector file called
glove.6B.200d.txt in the
readme installation directory (which is the default file downloaded by
download_wordvecs()). If this file is found, it will read it directly and use it as the word vector dictionary.
## Generate a word vector summary for each document wordVec_summaries = undergrad(documentText = cleanme(clinton$TEXT), wordVecs = NULL)
Estimating topic proportions with
With the topic, training set labels and features we can start estimating the model.
# Estimate category proportions set.seed(2138) # Set a seed if you choose readme.estimates <- readme(dfm = wordVec_summaries , labeledIndicator = clinton$TRAININGSET, categoryVec = clinton$TRUTH)
We can compare the output with the true category codings
# Output proportions estimate readme.estimates$point_readme # Compare to the truth table(clinton$TRUTH[clinton$TRAININGSET == 0])/sum(table((clinton$TRUTH[clinton$TRAININGSET == 0])))
Creative Commons Attribution-Noncommercial-No Derivative Works 4.0, for academic use only.
Our thanks to Neal Beck, Aykut Firat, Ying Lu, and Jonathan Chiang for data and helpful comments.