# Text preprocessing for flexible analysis

DS 7800

Raf Alvarado

29 February 2024

# Da's Fallacy

![](nanda.png)

> [Nan Da 2019](https://www.journals.uchicago.edu/doi/full/10.1086/702594?journalCode=ci)

This is incorrect.

In CLS data work, the most important decisions are about **structure**.

Structure is **geometry**.

Counting without structure is meaningless.

# Two Approaches

1. Work with **text collections**
    - Requires prior existence
    - Allows for many analytic pathways
2. Work with **text extracts**
    - Can be created in the process of doing research
    - Allows for restricted analysis

# Data Models

- In both cases, there are target data models
- **Text Collections**
    - LIB
    - TOKEN
    - VOCAB
- **Extracts**
    - LIB
    - EXTRACT
    - ANNOTATION

See [Lecture](https://docs.google.com/presentation/d/1JqEMMAygGLuvZGl_SSMgCdIkipTtJlgmiq8qCAhc_AY/edit?usp=sharing)

# Steps

1. **Collect** sources
2. **Learn** structure
3. **Parse** into tables
4. **Annotate** for linguistic features
5. **Vectorize** into analytic tables
6. **Model** and **Visualize**

# 1. Collect Sources

- Curated digital collections
- Web scraping
- API
- By hand

# Example Collection Sites

- [Faulkner](https://faulkner.drupal.shanti.virginia.edu/)
- [Multepal](https://multepal.spanitalport.virginia.edu/)

# 2. Learn Structure

- Structure representation varies by text type:
    - Plain Text -- use of line breaks, punctuation, etc.
    - XML -- DTD, Schema
    - HTML -- Browser inspector
- In each case, the goal is to discover the OHCO
    - Ordered Hierarchy of Content Objects

# Example: Project Gutenberg


<div style="float:left;">
<pre>
- BOOK
    - Chapter 1 &larr; Single lines with the word "Chapter"
        - Paragraph 1 &larr; Double line breaks
            - Sentence 1 &larr; Punctuation
            - Sentence 2
            - ...
        - Paragraph 2
        - ...
    - Chapter 2
        - ...
</pre>
</div>
<img src="sample-gutenberg.png" width="500" style="float:right;">            

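The outline above can be turned into a first-pass parser. The following is a minimal sketch, assuming the three cues shown in the outline (lines beginning with "Chapter", double line breaks, and sentence-final punctuation); the sample text and the function name `parse_ohco` are hypothetical, and real Gutenberg files usually need front and back matter removed first.

```python
import re

# Hypothetical sample standing in for a Project Gutenberg plain-text file.
SAMPLE = """Chapter 1

It was a bright day. The clocks were striking.

Nobody noticed.

Chapter 2

Then it rained. Everyone went home.
"""

def parse_ohco(text):
    """Return rows of (chap_num, para_num, sent_num, sentence)."""
    rows = []
    # Chapters: split on single lines that begin with the word "Chapter".
    # The first piece is whatever precedes the first chapter, so drop it.
    chapters = re.split(r"(?m)^Chapter\b.*$", text)[1:]
    for chap_num, chapter in enumerate(chapters, start=1):
        # Paragraphs: separated by double line breaks.
        paras = [p.strip() for p in re.split(r"\n\s*\n", chapter) if p.strip()]
        for para_num, para in enumerate(paras, start=1):
            # Join wrapped lines, then split naively after ., ! or ?.
            para = " ".join(para.split())
            sents = [s for s in re.split(r"(?<=[.!?])\s+", para) if s]
            for sent_num, sent in enumerate(sents, start=1):
                rows.append((chap_num, para_num, sent_num, sent))
    return rows

rows = parse_ohco(SAMPLE)
for row in rows:
    print(row)
```

Each row carries its full position in the hierarchy, so the OHCO becomes a compound key. The punctuation rule is deliberately naive and will mis-split abbreviations such as "Mr."; a sentence tokenizer from an NLP library is one likely replacement.
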
# 3. Parse into tables

- After learning the structure, choose tools and approach
    - Every collection is different
- Goal is to parse into tables:
    - `LIBRARY`: A table with info about each text, e.g. title, date, author, etc.
    - `DOC`: A table with a row for each parsed chunk of text.
    - Or: `TOKEN`: A table with a row for each token.
    - `VOCAB`: A table with a row for each unique word.

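To make the target tables concrete, here is a small sketch of how `TOKEN` and `VOCAB` might be derived from `DOC`. Plain lists and dicts stand in for data frames so the example has no dependencies; the column names (`book_id`, `para_str`, `term_str`, `n`) and the sample rows are assumptions made for illustration, not a fixed schema.

```python
import re
from collections import Counter

# Hypothetical two-book library; ids, titles, and text are made up.
LIBRARY = [
    {"book_id": 1, "title": "Book A", "author": "Author A"},
    {"book_id": 2, "title": "Book B", "author": "Author B"},
]

# DOC: one row per parsed chunk (here a paragraph), keyed by its position.
DOC = [
    {"book_id": 1, "chap_num": 1, "para_num": 1,
     "para_str": "The sea was calm. The sea was grey."},
    {"book_id": 2, "chap_num": 1, "para_num": 1,
     "para_str": "A calm morning."},
]

# TOKEN: one row per token, inheriting the key of the paragraph it came from.
TOKEN = []
for doc in DOC:
    words = re.findall(r"\w+", doc["para_str"])
    for token_num, token_str in enumerate(words, start=1):
        TOKEN.append({
            "book_id": doc["book_id"],
            "chap_num": doc["chap_num"],
            "para_num": doc["para_num"],
            "token_num": token_num,
            "token_str": token_str,
            "term_str": token_str.lower(),  # normalized form used in VOCAB
        })

# VOCAB: one row per unique term, with its count across the corpus.
counts = Counter(row["term_str"] for row in TOKEN)
VOCAB = [{"term_str": term, "n": n} for term, n in sorted(counts.items())]

print(len(TOKEN), "tokens;", len(VOCAB), "terms")
```

Because every `TOKEN` row keeps `book_id`, it can be joined back to `LIBRARY` for metadata, and regrouped at the chapter or paragraph level without re-parsing the source.
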
# Example

![](M02TextModels.png)

Example: [Importing _Persuasion_](https://github.com/ontoligent/DS5001-2024-01-R/blob/main/lessons/M02_TextModels/M02_01_Importing-Persuasion.ipynb)

# 4. Annotate

- Use NLP libraries (such as NLTK) to provide grammatical information
    - Part-of-Speech for each token
    - Most frequent part-of-speech for each term
    - Whether or not a term is a stopword
    - Corpus frequency of a word
    - Named Entities
- Examples: 
    - [Annotating with NLTK](https://github.com/ontoligent/DS5001-2024-01-R/blob/main/lessons/M04_NLP/M04_00_NLTK_Intro.ipynb)
    - [Annotating a small corpus](https://github.com/ontoligent/DS5001-2024-01-R/blob/main/lessons/M04_NLP/M04_01_Pipeline.ipynb)

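Several of the features listed above are simple aggregations once tokens carry part-of-speech tags. The sketch below assumes the tags already exist (in practice they might come from a tagger such as NLTK's `nltk.pos_tag`); the token rows, the tiny stopword set, and the column names `n`, `p`, `max_pos`, and `stop` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical TOKEN rows as (term, part-of-speech tag) pairs.
# The tags are written by hand here rather than produced by a tagger.
TOKEN = [
    ("the", "DT"), ("run", "NN"), ("was", "VBD"), ("long", "JJ"),
    ("they", "PRP"), ("run", "VBP"), ("the", "DT"), ("race", "NN"),
    ("we", "PRP"), ("run", "VBP"),
]

# Tiny illustrative stopword set, not a real stopword list.
STOPWORDS = {"the", "was", "they", "we"}

# Count each term, and count the tags seen for each term.
term_counts = Counter(term for term, _ in TOKEN)
pos_counts = defaultdict(Counter)
for term, pos in TOKEN:
    pos_counts[term][pos] += 1

# VOCAB: one annotated entry per unique term.
VOCAB = {}
for term, n in term_counts.items():
    VOCAB[term] = {
        "n": n,                                            # corpus frequency
        "p": n / len(TOKEN),                               # relative frequency
        "max_pos": pos_counts[term].most_common(1)[0][0],  # most frequent tag
        "stop": term in STOPWORDS,                         # stopword flag
    }

for term, info in sorted(VOCAB.items()):
    print(term, info)
```

Keeping the token-level tag in `TOKEN` and the summary tag in `VOCAB` preserves both views: "run" is tagged as a noun once here, but its most frequent tag is a verb.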

# 5. Vectorize

- Convert DOC or TOKEN table into a "bag-of-words" representation (BOW)
- Convert the BOW into a document-term matrix (DTM)
- Convert into other vector spaces (e.g. time-token)
- Apply TF-IDF to compute the significance of words in documents and in the corpus
- Examples: 
    - [BOW to TFIDF](https://github.com/ontoligent/DS5001-2024-01-R/blob/main/lessons/M05_VectorSpaceModels/M05_01_BOW_TFIDF.ipynb)
    - [Time Token Matrix](https://github.com/ontoligent/DS5001-2024-01-R/blob/main/lessons/M05_VectorSpaceModels/M05_02_TimeTokenMatrices.ipynb)

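The BOW, DTM, and TF-IDF steps above can be sketched in a few lines. TF-IDF has many variants; this example assumes one common form, with term frequency as count divided by document length and inverse document frequency as log2(N / df). The documents and names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents keyed by document id.
DOCS = {
    "d1": ["sea", "calm", "sea", "grey"],
    "d2": ["calm", "morning"],
    "d3": ["sea", "storm"],
}

# BOW: one (doc_id, term) -> count entry for each term a document contains.
BOW = {
    (doc_id, term): n
    for doc_id, tokens in DOCS.items()
    for term, n in Counter(tokens).items()
}

# DTM: documents as rows, vocabulary terms as columns, zeros filled in.
terms = sorted({term for _, term in BOW})
DTM = {doc_id: [BOW.get((doc_id, term), 0) for term in terms] for doc_id in DOCS}

# Document frequency: the number of documents each term appears in.
df = Counter(term for _, term in BOW)
n_docs = len(DOCS)

# TF-IDF (assumed variant): tf = count / doc length, idf = log2(N / df).
TFIDF = {}
for (doc_id, term), n in BOW.items():
    tf = n / len(DOCS[doc_id])
    idf = math.log2(n_docs / df[term])
    TFIDF[(doc_id, term)] = tf * idf

print(terms)
for doc_id, row in DTM.items():
    print(doc_id, row)
```

In this toy corpus "grey" outranks "sea" in `d1` even though "sea" occurs twice, because "grey" appears in only one document. The BOW is the sparse (long) form and the DTM the dense (wide) form of the same counts.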
# 6. Model and Visualize

- Once your text is in a vector space format, you can apply various models to it
- Since each row is a vector in word space, you can measure distances between documents and words
- These distance measures can be used to "cluster" documents and words in various ways
- You can also apply methods like PCA (Principal Component Analysis) to the vector space to explore the deeper semantics of the corpus
- Examples:
    - [Hierarchical Clustering]()
    - [PCA]()

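As an illustration of measuring distances between documents, the sketch below computes pairwise cosine distance over the rows of a small document-term matrix. Cosine distance is only one possible measure, chosen here as an assumption; the matrix values and names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical document-term matrix: rows are documents, columns are terms.
TERMS = ["calm", "grey", "sea", "storm"]
DTM = {
    "d1": [1, 1, 2, 0],
    "d2": [1, 0, 1, 0],
    "d3": [0, 0, 1, 2],
}

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Pairwise distances between all documents.
PAIRS = {
    (a, b): cosine_distance(DTM[a], DTM[b])
    for a, b in combinations(DTM, 2)
}

# The closest pair is the natural first merge in agglomerative clustering.
closest = min(PAIRS, key=PAIRS.get)

for pair, dist in sorted(PAIRS.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

Transposing the matrix so that terms become rows gives distances between words instead of documents. A table of pairwise distances like `PAIRS` is the kind of input that hierarchical clustering routines typically work from.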
