## How we do things with words: Analyzing text as social and cultural data

*Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters*

"Choices regarding how to operationalize and analyze these concepts (such as hate speech in social media) can raise serioius concerns about conceptual validity and may lead to shallow or obvious conclusions, rather than findings that reflect the depth of the questions we seek to address."

Reading Framework:

* identification of research questions;
* data selection;
* conceptualization and operationalization;
* analysis and interpretation.

Goals of the reading:

* "shed light on thorny issues not always at the forefront of discussions about computational text analysis;
* provide a set of best practices for working with thick social and cultural concepts;
* and to help promote interdisciplinary collaborations."

### RESEARCH QUESTIONS

* How has some phenomenon in language changed over time?
* How do you set the boundaries that define a particular phenomenon?
* prediction analysis vs. perfect labeling
* "Dual use" concerns: a tool created to analyze a problem of social concern could then be co-opted by users with malicious intentions
* Engaging reviews versus dialogue


### DATA

##### Data Acquisition
* Consent, and similar concerns about "dual use" and/or malicious uses of "born-digital" data. Big social networks have clamped down on their APIs as a result, and are even reticent about academic researchers' access.
* Provenance and contextualisation (reference data feminism on the tendency for marginalized groups' data and perspectives to be excluded from the data set creation process)
* Limited access to "black box" APIs that generate data sets (such as search results): the biases that exist in the API cannot be examined.

##### Compiling Data
* Involves making sense of data "cleaning" processes. Are we simply removing metadata, duplicates, or non-study-specific data, or are we affecting interpretation by removing data that might be relevant but is difficult to use because it is noisy and/or needs a lot of individualized attention to make it usable?
* Examining metadata can illuminate potential inconsistencies and biases in data sets. For example: does the data set have a particular weighted focus, or are particular time periods emphasized over others?
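
One way to check for this kind of skew is to tally documents by a metadata field. The sketch below assumes each record is a dictionary with a hypothetical `year` key; the records and field names are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def documents_per_year(records):
    """Count how many documents in a corpus fall in each year."""
    return Counter(record["year"] for record in records)


# Hypothetical metadata records (assumption: every record has a "year" field).
corpus_metadata = [
    {"id": "a", "year": 1901},
    {"id": "b", "year": 1901},
    {"id": "c", "year": 1901},
    {"id": "d", "year": 1950},
]

counts = documents_per_year(corpus_metadata)
total = sum(counts.values())

# Share of the corpus contributed by each year, in chronological order.
shares = {year: n / total for year, n in sorted(counts.items())}

for year, share in shares.items():
    print(year, f"{share:.0%}")
```

A heavily lopsided distribution does not invalidate a study, but it is worth reporting, since findings may mostly reflect the over-represented period.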

### CONCEPTUALIZATION

* "translating social and cultural concepts into measurable quantities;"
* Need to define domain experts and look at previous approaches to analysis;
* "Background Concept" – full and diverse set of meanings that might be associated with a particular term. Past research and definitions can help determine what is the most appropriate definition to be used for the study;
* "Systematized Concept" is the arrived at formulation for the study. And frames the study in a particular context and does not presume any absolute truthfulness in the chosen path. This pushes against ideas of a "ground truth" or "gold standard" supposedly arrived at in many a machine learning model;

### OPERATIONALIZATION

Labeling and scoring processes

#### Modeling considerations

* Variable types and definitions – categories and boundaries;
* Categorization schemes – issues with binaries and other limitations;
* Supervised vs. unsupervised – supervised for when we know what we're looking for vs. unsupervised for when we are looking to build topic models;
* Units of interest – how to break down the text object: by story, sentence, phrase, bi-gram, n-gram of one (unigram), etc.;
* Interpretability – models that can easily communicate their results.

#### Annotation

* Human coders are used to train an annotation model that is then applied to larger data sets for analysis;
* Annotation choices are thought of as the "codebook;"
* Who are the annotators and what skill level is required for the process;
* Disagreement between annotators can signal weaknesses in the codebook or illuminate a need for a more meaningful approach to future analysis.
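
Disagreement between two annotators is commonly summarized with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa as one such measure; the notes above do not name a specific statistic, so this choice, along with the label names and example annotations, is an assumption made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four posts by two coders.
annotator_1 = ["offensive", "offensive", "not", "not"]
annotator_2 = ["offensive", "not", "not", "not"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

A low value is a prompt to go back to the items where coders disagreed and ask whether the codebook is ambiguous, rather than a number to be optimized on its own.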

#### Data pre-processing

* important to document whatever processes were undertaken to prepare and re-structure data before the analysis is applied;
* OCR errors as an example – they can vary within a corpus and over time as the quality of the OCR toolset has evolved.
* Tokenization processes don't do well with emoji, creative orthography (sh!t, U$A), and missing spaces.
* Lots of other processes to consider – lowercasing, removing punctuation, stemming (removing suffixes), lemmatization, normalization (groupings of similar words and/or abbreviations);
* Stop word lists;
* What guides the choices in these processing steps...
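
To make the effect of these choices concrete, the sketch below applies a few of the steps listed above as explicit, switchable options. The function name, the whitespace tokenizer, and the tiny stop word list are all assumptions made for illustration; real projects typically rely on more careful tokenizers.

```python
import re

# Illustrative stop word list (assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "of"}


def preprocess(text, lowercase=True, strip_punctuation=True, remove_stop_words=True):
    """Split text on whitespace and apply optional pre-processing steps."""
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if strip_punctuation:
        # Remove every character that is not a letter, digit, or underscore.
        tokens = [re.sub(r"[^\w\s]", "", token) for token in tokens]
        tokens = [token for token in tokens if token]
    if remove_stop_words:
        tokens = [token for token in tokens if token.lower() not in STOP_WORDS]
    return tokens


sentence = "The U$A is great!"

raw_tokens = preprocess(sentence, lowercase=False, strip_punctuation=False,
                        remove_stop_words=False)
cleaned_tokens = preprocess(sentence)

print(raw_tokens)
print(cleaned_tokens)
```

With all steps enabled, the creative spelling `U$A` is reduced to `ua`, which shows how a seemingly routine step can erase exactly the signal a study cares about. Recording which switches were used is a simple form of the documentation recommended above.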

#### Dictionary-based approaches

* word lists used for scoring
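
A minimal sketch of word-list scoring, assuming the score is simply the share of tokens that appear in the list. The word list and the example tokens are hypothetical, and published dictionaries often use weighted or category-specific scoring instead.

```python
def dictionary_score(tokens, word_list):
    """Return the fraction of tokens that appear in the word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in word_list)
    return hits / len(tokens)


# Hypothetical word list and tokenized document.
positive_words = {"love", "great", "wonderful"}
document_tokens = ["i", "love", "this", "awful", "movie"]

score = dictionary_score(document_tokens, positive_words)
print(score)
```

Because the score depends entirely on which words are in the list, the list itself is an operationalization choice and needs the same scrutiny as any other.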

#### Supervised models

* An example is to create a classifier based on a small set of annotations and then apply it to a larger set;
* Definition and label types are of great consequence;
* Features of the model are the "abilities" of the model – content-based single words, sequences of words, insight models, word embeddings;
* Concerns include issues of how the model might interact with particular datasets that are not well-balanced, or have noisy annotations;
* Spurious features and the need for interpretability – survey data that needs to be interpreted based on the unbalanced response rates;
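
The "train on a small annotated set, apply to a larger set" workflow can be sketched with a very small classifier. The example below uses a multinomial Naive Bayes model with add-one smoothing over single-word features; this model choice, the function names, and the toy annotations are assumptions for illustration and are not prescribed by the paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect label and per-label word counts from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_docs)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical small annotated set.
annotated = [
    (["great", "film"], "positive"),
    (["loved", "it"], "positive"),
    (["terrible", "film"], "negative"),
    (["hated", "it"], "negative"),
]
model = train_naive_bayes(annotated)

# Hypothetical larger, unlabeled set that the classifier is applied to.
unlabeled = [["great", "story"], ["terrible", "plot"]]
predictions = [predict(model, tokens) for tokens in unlabeled]
print(predictions)
```

Inspecting the per-label word counts is one simple way to spot spurious features, since a word that happens to co-occur with one label in a small annotated set will drive predictions on the larger one.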

#### Topic modeling

* usually unsupervised;
* "creates a set of probability distributions over the vocabulary of the collection, which, when combined toghether in different proportions, best match the contnet of the collection;"
* the probabilities of words then can give a sense of what the topic is "about;"
* there may be a need to manage stopword lists if there is a need to improve language filtering.
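
The quoted description can be illustrated without fitting a model. In the sketch below the topics are hand-written probability distributions (hypothetical values, not inferred from any corpus); the code only shows how mixing them in given proportions yields a word distribution for one document, and how the highest-probability words suggest what a topic is "about".

```python
def mix_topics(topics, proportions):
    """Combine topic word distributions using per-document topic proportions."""
    words = set().union(*(topic.keys() for topic in topics.values()))
    return {
        word: sum(proportions[name] * topics[name].get(word, 0.0) for name in topics)
        for word in words
    }


def top_words(topic, n=3):
    """Return the n most probable words of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# Hand-written topics over a tiny vocabulary (assumed values for illustration).
topics = {
    "topic_0": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "topic_1": {"vote": 0.6, "party": 0.3, "game": 0.1},
}

# A hypothetical document that is one quarter topic_0 and three quarters topic_1.
proportions = {"topic_0": 0.25, "topic_1": 0.75}

document_distribution = mix_topics(topics, proportions)
print(top_words(topics["topic_1"], n=2))
print(top_words(document_distribution, n=2))
```

An actual topic model works in the opposite direction: it starts from the documents and searches for the topics and proportions that best reproduce them.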

#### Validation

* How good is our "scoring" system – is it validly measuring what it's supposed to measure;
* Comparing machine-generated results to human-annotated examples;
* Accuracy and measures of precision, such as F-scores, are sometimes used, but these are often better measures of accuracy than of validity;
* Good to have additional forms of validation, such as close readings and basic observations of the 'sensibility' of the results;
* Validation through comparison to other approaches to the same concept;
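
For the comparison against human annotations, precision, recall, and the F1 score can be computed directly from the two label lists. The sketch below assumes a single label of interest and uses hypothetical labels; as noted above, a high score shows agreement with the annotations, not that the annotations capture the concept.

```python
def precision_recall_f1(gold, predicted, positive_label):
    """Compute precision, recall, and F1 for one label of interest."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive_label and p == positive_label for g, p in pairs)
    false_pos = sum(g != positive_label and p == positive_label for g, p in pairs)
    false_neg = sum(g == positive_label and p != positive_label for g, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical human annotations and machine-generated labels.
human_labels = ["pos", "pos", "neg", "neg"]
machine_labels = ["pos", "neg", "pos", "neg"]

precision, recall, f1 = precision_recall_f1(human_labels, machine_labels, "pos")
print(precision, recall, f1)
```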

### ANALYSIS

* Using our models to explore answers to our research questions;
* "Errors" may provide insights into future studies;
