# Lecture 1: Introduction to Text as Data

## Overview

### Text is High Dimensional
- Suppose we have a sample of documents, each $n_L$ words long and drawn from a vocabulary of $n_V$ words
- Then a unique representation of each document has dimensionality (**in information theory, the number of possible states or the size of the data space**) ${n_V}^{n_L}$
- For example, a representation of a 30-word Twitter message using only the one thousand most common words has dimensionality ${1000}^{30} = 10^{90}$
- Therefore, due to the curse of dimensionality (dimensionality grows exponentially with the number of features, e.g. the word count of a document), raw text is hard to use as an input variable for prediction or causal inference.
- Instead, some procedures are needed to retain useful information from the text and compress its dimensionality
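
To make the arithmetic above concrete, here is a minimal sketch that computes the size of the sequence space for the Twitter example and contrasts it with a simple count-based compression (one count per vocabulary word). The variable names are illustrative, and the count-vector comparison is an assumption added for contrast, not part of the original example.

```python
import math

n_V = 1000  # vocabulary size (the one thousand most common words)
n_L = 30    # document length in words

# Number of distinct word sequences of length n_L over n_V words.
n_sequences = n_V ** n_L

# Order of magnitude: log10(n_V ** n_L) = n_L * log10(n_V).
log10_sequences = n_L * math.log10(n_V)

# Assumed compression for contrast: one count per vocabulary word,
# which discards word order.
count_vector_length = n_V

print(f"distinct sequences: about 10^{log10_sequences:.0f}")
print(f"count-vector length: {count_vector_length}")
```

The gap between the two numbers is the reason a compression step comes before any prediction or inference.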

### Common Procedure of Text Analysis
1. Convert raw text $D$ to a numerical array $C$ (e.g. the elements of $C$ can be counts over tokens or binary indicators of the positions of tokens)
2. Map $C$ to predicted values $\hat{V}$ of unknown outcomes $V$ (the information that is assumed to be "useful")
3. $\hat{V}$ is learned through supervised ML when labeled pairs $(C_i, V_i)$ are available, or through unsupervised ML for unlabeled $C_i$ (e.g. learning topics or principal dimensions)
4. Finally use $\hat{V}$ for subsequent descriptive or causal analysis
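
The steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the three documents are made up, tokenization is a plain whitespace split, and the mapping from $C$ to $\hat{V}$ is a hand-picked word-list share standing in for a fitted supervised or unsupervised model.

```python
from collections import Counter

# Hypothetical toy corpus D; real applications would load actual documents.
documents = [
    "the court finds the claim valid",
    "the claim is not valid",
    "taxes fund public health",
]

# Step 1: convert raw text D to a numerical array C (token counts).
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def count_vector(tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


C = [count_vector(tokens) for tokens in tokenized]

# Steps 2-3 (stand-in): map each row of C to a predicted value V_hat.
# ASSUMPTION: V_hat is the share of tokens in a hand-picked "legal" word
# list; a real pipeline would fit a model here instead.
legal_words = {"court", "claim", "valid"}
legal_columns = [i for i, word in enumerate(vocabulary) if word in legal_words]
V_hat = [sum(row[i] for i in legal_columns) / sum(row) for row in C]

# Step 4: V_hat is now an ordinary numeric variable for later analysis.
for doc, value in zip(documents, V_hat):
    print(f"{value:.2f}  {doc}")
```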

### Basic Concepts of Text Analysis
- **Document:** a sequence of characters, the fundamental unit of analysis;
    - what counts as a document, however, depends on the research question;
    - for example, we may treat a paragraph or a whole article as a document
- **Corpus:** the set of documents
- **Tokens:** words or phrases
- **Unstructured data:** data such that useful information is mixed with lots of useless information
    - text data is unstructured

### Relating Text to Metadata
- Text data (documents) are not that meaningful by themselves; they are useful only if related to metadata.
- For example, measuring positive-negative sentiment $Y_i$ in each political speech is not that meaningful by itself.
- But we can relate sentiment $Y_{ijkt}$ to information about politician $j$, topic $k$, and time $t$; these are metadata (data that does not come from the text).
- Then we can ask interesting questions such as: does sentiment vary over time $t$, or does politician $j$ express more negative sentiment toward topic $k$?
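
As a rough sketch of what relating text measures to metadata looks like in practice, the snippet below averages a sentiment score over metadata groups. All records, field names, and scores are hypothetical and exist only to show the grouping step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one sentiment score per speech, plus metadata.
speeches = [
    {"politician": "A", "topic": "trade", "year": 2019, "sentiment": 0.5},
    {"politician": "A", "topic": "trade", "year": 2020, "sentiment": -0.5},
    {"politician": "A", "topic": "health", "year": 2020, "sentiment": 0.25},
    {"politician": "B", "topic": "trade", "year": 2019, "sentiment": -0.25},
    {"politician": "B", "topic": "trade", "year": 2020, "sentiment": -0.75},
]


def mean_sentiment_by(records, *keys):
    """Average the sentiment score within each combination of metadata keys."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["sentiment"])
    return {group: mean(scores) for group, scores in groups.items()}


# Does sentiment vary over time?
by_year = mean_sentiment_by(speeches, "year")

# Does a politician express more negative sentiment toward a topic?
by_politician_topic = mean_sentiment_by(speeches, "politician", "topic")

print(by_year)
print(by_politician_topic)
```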

### Text Analysis ML Methods
There are many methods that can be used to map raw text (documents) to some more compressed objects and retain "useful" information from them:
- Dictionaries, Tokenization, and Document Distance
- Topic Models and ML with text, Word Embeddings and Linguistic Parsing
- Transformers and LLMs

## Quantity (Text Length) as Data

### Judge Age and Writing Style
The paper is written by Ash, Goessmann, and MacLeod (2022):
- They count average word length and sentence length from documents written by different judges
- They regress average word length and sentence length on age of judges.
- There is a negative correlation between average word length and judge age (older judges tend to use simpler words)
- There is a positive correlation between average sentence length and judge age (older judges tend to use more complex sentences)
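
The two style measures can be computed with a few lines of code. The sketch below assumes a naive sentence split on `.`, `!`, and `?` and treats words as runs of letters; the function name and sample text are hypothetical, and real legal opinions (with citations and abbreviations) would need a more careful splitter.

```python
import re


def style_measures(text):
    """Return (average word length in characters, average sentence length in words)."""
    # Naive sentence split; abbreviations and citations would break this.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of ASCII letters; digits and punctuation are ignored.
    words = re.findall(r"[A-Za-z]+", text)
    avg_word_length = sum(len(w) for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return avg_word_length, avg_sentence_length


opinion = "The appeal is denied. The lower court ruled correctly."
avg_word, avg_sentence = style_measures(opinion)
print(f"average word length: {avg_word:.2f} characters")
print(f"average sentence length: {avg_sentence:.1f} words")
```

Each document's pair of measures would then serve as the outcome in a regression on judge age.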

### Optimal Legal Complexity
The paper is written by Katz and Bommarito (2014):
- **Motivation:**
    - The more detail is added to laws, the more precisely they can specify rules and target incentives to particular activities and groups
    - But there are costs to understanding, following, and maintaining complex laws
    - Given the trade-off between complexity and readability, there should exist an optimal legal complexity
- **Methodology:**
    - Katz and Bommarito measure legal complexity using the number of words and also word entropy (the diversity of the vocabulary in the law, or how unpredictable the writing is)
    - Word entropy:
        - for a corpus treated as a set of sequences of words, let $V$ denote the vocabulary
        - let $p(w)$ be the empirical probability of observing word $w$ in a document: $$ p(w)= \frac{\text{count}(w)}{\sum_{v \in V}{\text{count}(v)}} $$
        - $\text{count}(v)$ is the number of occurrences of word $v$ in the document
        - word entropy (for each document) is defined as: $$ H=-\sum_{w \in V}{p(w)\log_{2}(p(w))} $$
        - this measure gets larger when many words each appear with small probability (higher diversity)
- **Results:**
    - They rank law titles by word count and word entropy
    - They find that Public Health and Welfare, Conservation, and Commerce and Trade rank high on both word count and entropy (most complex)
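
The word entropy formula from the methodology can be checked on small inputs. This is a minimal sketch, assuming whitespace-tokenized text and summing only over words that actually occur (so $p(w) > 0$); the function name and toy inputs are illustrative.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy (in bits) of the empirical word distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Only observed words enter the sum, so p(w) > 0 and log2 is defined.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = "a b c d".split()      # four words, each with probability 1/4
repetitive = "a a a b".split()   # one word dominates

print(word_entropy(uniform))     # 2.0
print(word_entropy(repetitive))
```

The evenly spread vocabulary reaches the maximum for four word types ($\log_2 4 = 2$ bits), while the repetitive text scores lower, which matches the reading of entropy as vocabulary diversity.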

## Dictionary-Based Methods

### Overview of Dictionary-Based Methods
- Dictionary-based methods use a dictionary (a pre-selected list of words or phrases) to analyze a corpus (mapping documents to more compressed objects while retaining useful information)
- These methods identify patterns or counts defined by the dictionary using regular expressions
- These methods are corpus-specific, that is, they identify sets of words or phrases across documents of a corpus in the same way

### Example 1: Measuring Uncertainty in Macroeconomy
The paper is written by Baker, Bloom, and Davis (2016):
- Methodology:
    - Filter each newspaper on each day since 1985 by:
        - Article contains “uncertain” or “uncertainty”
        - Article contains “economic” or “economy”
        - Article contains “congress” or “deficit” or “federal reserve” or “legislation” or “regulation” or “white house”
    - Normalize resulting article counts by total newspaper articles that month to create a news-based economic policy uncertainty index
- Result:
    - They find that the uncertainty index spikes during major historical economic crises
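
The filter and normalization described in the methodology can be sketched with regular expressions. This is not the authors' code: it assumes an article must satisfy all three conditions, that matching is case-insensitive on whole words, and that the index for a month is simply the share of matching articles. The function names and sample articles are hypothetical.

```python
import re
from collections import Counter

# Term sets taken from the filter described above.
UNCERTAINTY = ["uncertain", "uncertainty"]
ECONOMY = ["economic", "economy"]
POLICY = ["congress", "deficit", "federal reserve",
          "legislation", "regulation", "white house"]


def _pattern(terms):
    """Compile a whole-word, case-insensitive pattern matching any term."""
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


PATTERNS = [_pattern(terms) for terms in (UNCERTAINTY, ECONOMY, POLICY)]


def is_epu_article(text):
    """True if the text matches at least one term from each of the three sets."""
    return all(pattern.search(text) for pattern in PATTERNS)


def epu_share_by_month(articles):
    """articles: iterable of (month, text). Returns {month: share matching}."""
    totals, hits = Counter(), Counter()
    for month, text in articles:
        totals[month] += 1
        hits[month] += is_epu_article(text)
    return {month: hits[month] / totals[month] for month in totals}


# Hypothetical sample articles.
sample = [
    ("1985-01", "Economic uncertainty rises as Congress debates the deficit."),
    ("1985-01", "The economy grew last quarter."),
    ("1985-02", "Fans uncertain about the team's future."),
]
print(epu_share_by_month(sample))
```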

### Example 2: Identifying Race-Related Research in Economics
- Motivation:
    - How does economics compare to other social sciences in study of race-related issues?
- Methodology:
    - Corpus:
        - Considered all journals that JSTOR characterizes as comprising the disciplines of economics, sociology, and political science
        - Created a corpus of publications from 1960 to 2020: 224,855 publications from 231 economics journals, 138,188 publications from 185 sociology journals, and 110,835 publications from 213 political science journals
    - Dictionary:
        - The list of keywords is created along two dimensions: (i) the racial or ethnic group being studied; and (ii) the issue being studied
        - Examples of keywords along the group dimension are race, african-american, person of color, and ethnicity
        - Examples of issue keywords include discrimination, prejudice, and stereotype
    - Identifying method:
        - A publication is identified as race-related if: 
            - (i) at least one group keyword is in the title
            - (ii) at least one group keyword and at least one issue keyword are mentioned in the title or abstract
        - For rule (ii) they drop the last sentence of the abstract to avoid false positives from research that only mentions race parenthetically (e.g. as a robustness check rather than as the primary focus of the study)
    - Further categorization:
        - Band 0 consists of generic keywords denoting racial and ethnic groups (e.g. race, ethnic, under represented minority)
        - Band 1 adds group keywords relating to the main minority groups in the U.S. (e.g. African American, Latinos and Native Americans)
        - Band 2 adds less salient group keywords (e.g. White, South Asian, Indian American, Japanese American) and other minorities based on religious beliefs (e.g. Muslim, Jewish).
        - Words and phrases are also broadly split across five broader topics: discrimination, inequality, diversity, identity, and historical issues
- Results:
    - The share of race-related publications in economics has been far behind that in political science and sociology for decades
    - The number of race-related publications weighted by journal quality shows the same ordering: sociology first, then political science, and economics last.
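
The identification rules in the methodology can be sketched as follows. Several things here are assumptions: the keyword lists are only the examples quoted in these notes, the sentence splitter is naive, and the notes do not state whether rules (i) and (ii) combine with "or" or "and", so the sketch exposes that choice as a hypothetical `require_both` parameter.

```python
import re

# Illustrative subsets only (the examples quoted in these notes);
# the full dictionary is longer.
GROUP_KEYWORDS = ["race", "african-american", "person of color", "ethnicity"]
ISSUE_KEYWORDS = ["discrimination", "prejudice", "stereotype"]


def contains_any(text, keywords):
    """True if any keyword appears as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(k) + r"\b", text, re.IGNORECASE)
        for k in keywords
    )


def drop_last_sentence(abstract):
    """Remove the final sentence, using a naive split after ., ! or ?."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return " ".join(sentences[:-1])


def rule_i(title):
    """Rule (i): at least one group keyword in the title."""
    return contains_any(title, GROUP_KEYWORDS)


def rule_ii(title, abstract):
    """Rule (ii): group and issue keywords in title or trimmed abstract."""
    text = title + " " + drop_last_sentence(abstract)
    return contains_any(text, GROUP_KEYWORDS) and contains_any(text, ISSUE_KEYWORDS)


def is_race_related(title, abstract, require_both=False):
    # ASSUMPTION: by default either rule suffices; set require_both=True
    # if the rules are meant to hold jointly.
    first, second = rule_i(title), rule_ii(title, abstract)
    return (first and second) if require_both else (first or second)
```

For example, an abstract whose only mention of race sits in its final sentence is not flagged by rule (ii), which is the false-positive case the last-sentence rule is meant to handle.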

### General Dictionaries
Researchers can use either a self-defined dictionary or a dictionary defined by others; below are some common pre-defined dictionaries:
- **WordNet:**
    - It is an English word database with 118K nouns, 12K verbs, 22K adjectives, 5K adverbs
    - It contains:
        - Synonym sets (synsets): a group of near-synonyms, plus a gloss (definition), e.g. good - great
        - Antonyms (opposites): e.g. good - bad
        - Holonyms/meronyms (part-whole): e.g. leaf - tree
        - And different meanings (senses) of the word
    - Nouns are organized in categorical hierarchy (that is why it is called WordNet):
        - “hypernym”: the higher category that a word is a member of, e.g. motion
        - “hyponyms”: members of the category identified by a word, e.g. walking, flying, swimming
- **Stopwords:**
    - Also called function words, these are words that do not contain much information
    - Include words such as for, rather and than
    - Removing them can help us to get at non-topical dimensions (include more information)
- **Linguistic Inquiry and Word Counts (LIWC):**
    - $>$ 10000 words from $>$ 100 lists of category-relevant words
    - Categories include: “emotion”, “cognition”, “work”, “family”, “positive”, “negative” , etc.
- **Emotional words:**
    - Mohammad and Turney (2011) coded 10,000 words along four emotional dimensions
        - These are joy–sadness, anger–fear, trust–disgust, anticipation–surprise
    - Warriner et al. (2013) coded 14,000 words along three emotional dimensions
        - These are valence, arousal, dominance

## Sentiment Analysis

### Overview of Sentiment Analysis
- The aim is to extract a “tone” dimension from the document, such as positive, negative, or neutral
- The standard approach is lexicon (dictionary) based:
    - e.g. if "good" appears, the text is scored as positive
    - but such rules fail easily: e.g., “good” versus “not good” versus "not very good"
- Alternative approaches are transformer-based sentiment models (e.g. those available through HuggingFace)
    - However, these methods can still introduce bias if the model is trained on biased corpora
        - For example, if the transformer model is trained on online writing (informal)
        - It may not work for legal text (formal)
    - Besides, there are ethical concerns over supervised NLP models:
        - Sometimes specific names or countries can induce a change in the sentiment score (e.g. a typically white name raises the score, while a typically black name lowers it)
        - This is because sentiment models that are trained on annotated datasets also learn from the correlated non-sentiment information
        - For example, in the training set maybe people complain about Mexican food more often than Italian food because Italian restaurants tend to be more upscale
        - This will make the model confounded by the correlation between "Mexican" and negative sentiment, though "Mexican" has no real implication on sentiment score
        - (supervised models learn features that are correlated with the label being annotated)
    - Dictionary methods, while having other limitations, mitigate this problem
        - By choosing the word list, the researcher intentionally “regularizes” out features that are only spuriously correlated with the targeted language dimension
        - This helps explain why economists often still use dictionary methods
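
The failure mode of the lexicon approach mentioned above (“good” versus “not good”) is easy to reproduce. The sketch below uses tiny hypothetical word lists and a plain word-count score; it is meant to show the limitation, not to serve as a usable sentiment scorer.

```python
# Hypothetical miniature lexicons; real dictionaries contain thousands of words.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "poor"}


def lexicon_sentiment(text):
    """Label text by counting positive minus negative lexicon words."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Word counting ignores negation, so all three phrases get the same label.
for phrase in ["good", "not good", "not very good"]:
    print(f"{phrase!r} -> {lexicon_sentiment(phrase)}")
```

All three phrases are labeled positive because the scorer never looks at word order or context.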